From msdemlei at ari.uni-heidelberg.de Tue Feb 3 13:57:04 2026
From: msdemlei at ari.uni-heidelberg.de (Markus Demleitner)
Date: Tue, 3 Feb 2026 13:57:04 +0100
Subject: A vocabulary for data sources
Message-ID:

Dear Colleagues,

Since the College Park interop, I have been promising to produce a
proposal for how we can mark up data collections that do not (directly)
represent physical observations. That is important as we pull more and
more "theoretical" data, including simulated observations, into the VO,
and we will want to have some means of alerting users that a given
artefact isn't "real". This will hopefully be part of VODataService 1.3.

There is prior art for that: SSAP has had a data source metadata item
since the late 2000s. Here's what this looks like in SimpleDALRegExt:
(look for dataSource).

Once you start writing definitions for what is what, it becomes
surprisingly difficult to draw the lines between the various data
sources. This is one reason why I'd like to go for a full vocabulary
(rather than just a few strings in the schema). Another reason is that
I can see that one day we might want to re-introduce SSAP's distinction
between survey and pointed, both of which are narrower than the current
observation.

That said, here's what I've come up with, in Turtle format:

  <> a owl:Ontology;
    dc:created "2026-02-03";
    dc:creator [ foaf:name "Demleitner, M." ],
      [ foaf:name "Tody, D." ];
    dc:license ;
    rdfs:label "Data Sources"@en;
    dc:title "Data Sources"@en;
    dc:description """A rough classification of the processes that
      have produced data artefacts. The classic use case for this is
      to distinguish between artificial data not directly going back
      to messengers coming from real objects and data resulting from
      actual observations, typically of the sky. This is going back
      to SSAP's Data Source meta item and is used, for instance, in
      VODataService's dataSource element.""";
    ivoasem:vocflavour "RDF Class".

  <#artificial> a rdfs:Class;
    rdfs:label "Artificial";
    rdfs:comment "The production of data artefacts with the goal of predicting observations of specific individual world objects, usually taking into account properties of certain instruments or instrument classes. An example would be images generated from catalogues for exercising pipelines or reduction software.".

  <#observation> a rdfs:Class;
    rdfs:label "Observation";
    rdfs:comment "The use of instruments to directly measure or observe the actual world. Subsequent reduction steps are allowed, even if they are significant (e.g., remapping), but a direct path must link most bits in resulting data to a sensor reading or messenger particle.".

  <#theory> a rdfs:Class;
    rdfs:label "Theory";
    rdfs:comment "The production of data artefacts from physical principles with the goal of modelling the world without reference to specific individual world objects. Examples would include common cosmological simulations or model spectra of atmospheres.".

All this is also in a PR against Vocabularies:
https://github.com/ivoa-std/Vocabularies/pull/47

So, what does everyone think? Are there better definitions? Is
distinguishing #theory and #artificial even worth it? Is the whole
thing overengineering, and should we make do with just a few terms in
the schema?

I think I'd rather discuss matters on-list, but if you prefer GitHub
bugs, that'd work for me, too.

Thanks,

Markus
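
For illustration only, here is a rough sketch of how the survey/pointed distinction mentioned above might be added later as narrower terms. The term names #survey and #pointed, their labels, and the placeholder comments are assumptions made for this sketch; none of them is part of the proposal or the PR. It also assumes that narrower terms in an "RDF Class" vocabulary would be linked with rdfs:subClassOf.

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Hypothetical narrower terms; names and wording are placeholders,
# not taken from the proposal or from PR 47.
<#survey> a rdfs:Class;
  rdfs:subClassOf <#observation>;
  rdfs:label "Survey";
  rdfs:comment "Placeholder: definition still to be agreed.".

<#pointed> a rdfs:Class;
  rdfs:subClassOf <#observation>;
  rdfs:label "Pointed";
  rdfs:comment "Placeholder: definition still to be agreed.".
```

With such a hierarchy, a resource marked as #survey or #pointed would still match a query for #observation in clients that expand terms to their narrower ones, which is one thing a full vocabulary offers over a fixed list of strings in the schema.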