A vocabulary for data sources

Tue Feb 3 13:57:04 CET 2026

Dear Colleagues,

Since the College Park interop, I have been promising to produce a
proposal for how we can mark up data collections that do not (directly)
represent physical observations.  That is important as we pull more
and more "theoretical" data, including simulated observations, into
the VO, and we will want to have some means of alerting users that a
given artefact isn't "real".  This will hopefully be part of
VODataService 1.3.

There is prior art for that: SSAP has had a data source metadata item
since the late 2000s.  Here's what this looks like in SimpleDALRegExt:
<https://ivoa.net/documents/SimpleDALRegExt/20220222/REC-SimpleDALRegExt-1.2.html#tth_sEc3.3>
(look for dataSource).

Once you start writing definitions for what is what, it becomes
surprisingly difficult to draw the lines between the various data
sources.  This is one reason why I'd like to go for a full vocabulary
(rather than just a few strings in the schema).  Another reason is
that I can see that one day we might want to re-introduce SSAP's
distinction between survey and pointed, both of which are narrower
than the current observation.

That said, here's what I've come up with, in Turtle format:

<> a owl:Ontology;
    dc:created "2026-02-03";
    dc:creator [ foaf:name "Demleitner, M." ],
    [ foaf:name "Tody, D." ];
    dc:license <http://creativecommons.org/publicdomain/zero/1.0/>;
    rdfs:label "Data Sources"@en;
    dc:title "Data Sources"@en;
    dc:description """A rough classification of the processes that have produced data
artefacts.  The classic use case for this is to distinguish between
artificial data not directly going back to messengers coming from
real objects and data resulting from actual observations, typically of
the sky.  This goes back to SSAP's Data Source meta item and is
used, for instance, in VODataService's dataSource element.""";
    ivoasem:vocflavour "RDF Class".

<#artificial> a rdfs:Class;
  rdfs:label "Artificial";
  rdfs:comment "The production of data artefacts with the goal of predicting observations of specific individual world objects, usually taking into account properties of certain instruments or instrument classes.  An example would be images generated from catalogues for exercising pipelines or reduction software.".

<#observation> a rdfs:Class;
  rdfs:label "Observation";
  rdfs:comment "The use of instruments to directly measure or observe the actual world.  Subsequent reduction steps are allowed, even if they are significant (e.g., remapping), but a direct path must link most bits in resulting data to a sensor reading or messenger particle.".

<#theory> a rdfs:Class;
  rdfs:label "Theory";
  rdfs:comment "The production of data artefacts from physical principles with the goal of modelling the world without reference to specific individual world objects.  Examples would include common cosmological simulations or model spectra of atmospheres.".
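
To illustrate the point about narrower terms: if we later wanted
SSAP's survey and pointed back, I would expect something roughly like
the following.  This is only a sketch: #survey and #pointed are
hypothetical terms that are not part of the PR, their definitions are
deliberately left out, and I am assuming that narrower terms would be
declared with rdfs:subClassOf, as is usual in the RDF Class flavour.

```turtle
# Sketch only; assumes the same base IRI as the vocabulary above.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Hypothetical term, not part of the current proposal.
<#survey> a rdfs:Class;
  rdfs:subClassOf <#observation>;
  rdfs:label "Survey".

# Hypothetical term, not part of the current proposal.
<#pointed> a rdfs:Class;
  rdfs:subClassOf <#observation>;
  rdfs:label "Pointed".
```

With a plain list of strings in the schema, there would be no
comparable way to say that both are still kinds of #observation.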

All this is also in a PR against Vocabularies:

https://github.com/ivoa-std/Vocabularies/pull/47

So, what does everyone think?  Are there better definitions?  Is
distinguishing #theory and #artificial even worth it?  Is all this
overengineering, and should we make do with just a few terms in the
schema?

I think I'd rather discuss matters on-list, but if you prefer GitHub
issues, that'd work for me, too.

Thanks,

        Markus