[VEP-0001] DataLink semantics vocabulary enhancement proposal

Tue Oct 22 18:23:32 CEST 2019

Hi Markus, all,

On 22/10/2019 at 10:53, Markus Demleitner wrote:
> Hi DAL,
>
> On Mon, Oct 21, 2019 at 05:38:32PM +0200, Petr Skoda wrote:
>>> So, a question to all (including Carlos, who's posted on voevent@):
>>> Which of these terms do you actually need *now* (or at least for data
>>> that you will want to publish in the safely foreseeable future)?  And
>>> can you see a clear scenario for a *machine* to have to understand
>>> matters at that level of detail (for humans, there's always the
>>> description in datalink)?
>> I would like to point out that the list suggested by Francois is still not
>> sufficient for much archival data (e.g. VizieR) and data from future surveys.
> Well, you see, that is the question: sufficient for exactly *what*?
> These terms are directed at machines, and for humans there's still
> description (and a lot more channels).  So, the question is: how much
> distinction do you have to convey to a machine client?
>
> As far as I can see, there are two use cases in general for datalink
> semantics:
>
> (a) link filtering: The client, based on the semantics, selects a
> subset of the links provided to present to its users -- for instance,
> calibration data will not be shown outside of a debugging session.
> Or they're just used for grouping.  This was, I think, the original
> use case that triggered the introduction of the semantics column.
>
> (b) figure out what to do with a link: When Aladin implemented
> datalink, they found that based on what's in a datalink row, they
> didn't know how to deal with a link: they'd like to send spectra to
> clients listening to spectrum.load.ssa-generic, images to those
> listening to image.load.fits and so forth.  The datalink content_type
> column isn't quite sufficient for this, because
> application/x-votable+xml can be a spectrum or an object catalog,
> whereas image/fits might be some kind of cube or a plain image (or an
> IRAF spectrum, or still something else).  That's the "SAMP sending
> use case" that, I think, was largely missed when we wrote datalink.
Well, that's strange, because from the beginning some of us (authors) had
something like that in mind. Not exactly SAMP, but the more general
question: what will the client do with this link? Will it try to handle
the link itself and do something appropriate, or send it to some other
tool or a web browser? So I am in favor of extended-(b). And indeed
that's how Aladin uses semantics as far as possible.
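
To make extended-(b) a bit more concrete, here is a rough sketch of how a
client might pick an action from a datalink row. It is only an illustration
under assumptions: the row is a plain dict, the product type is taken to be
the part of the semantics term after the last '-', and the function name and
mapping table are hypothetical. The two mtypes are the ones Markus names
above.

```python
# Hypothetical mapping from a product-type suffix to a SAMP mtype.
# The mtypes are those named in the quoted mail; the suffixes are illustrative.
MTYPE_FOR_PRODUCT = {
    "spectrum": "spectrum.load.ssa-generic",
    "image": "image.load.fits",
}


def plan_action(row):
    """Decide what a client might do with one datalink row (a dict)."""
    semantics = row.get("semantics", "")
    # Assumption: the product type is whatever follows the last '-',
    # e.g. '#associated-image' -> 'image'.
    product = semantics.lstrip("#").rpartition("-")[2]
    mtype = MTYPE_FOR_PRODUCT.get(product)
    if mtype is not None:
        # Hand the link over to another tool via SAMP.
        return ("samp", mtype)
    if row.get("content_type", "").startswith("text/html"):
        # Plain web pages go to a browser.
        return ("browser", None)
    # Otherwise the client tries to handle the link itself.
    return ("self", None)
```

A term without a known product suffix simply falls through to the browser
or to the client itself, so unknown terms do no harm.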
>
> Does anyone have more use cases for Datalink semantics?  If so, this
> would be the perfect moment to bring them forward, in particular so
> we can put them into Datalink 1.1.
>
>
> Having established this much, after a mail from Ada I had another of
> my dangerous epiphanies.  That is, if we really want to deal with use
> case (b) in semantics, we'll end up reproducing the distinction that
> VEP-0001 proposes in every branch: not only will we have
>
> #associated-cube #associated-image #associated-radialvelocitycurve ...
>
> but also
>
> #derivation-cube #derivation-image #derivation-radialvelocitycurve ...
>
> and (we've already seen use cases for that)
>
> #progenitor-cube #progenitor-image #progenitor-radialvelocitycurve ...
OK. This means that we are dealing with the three branches where the link
targets are datasets or dataset excerpts.
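
Just to show the size of what Markus describes, a small sketch that builds
the combined terms as the product of branches and product types. The lists
contain only the names spelled out in the quoted mail (the "..." there is
left open), and the function name is made up for the illustration.

```python
from itertools import product

BRANCHES = ["associated", "derivation", "progenitor"]
# Only the product types spelled out in the quoted mail.
PRODUCT_TYPES = ["cube", "image", "radialvelocitycurve"]


def combined_terms(branches, product_types):
    """Return one '#branch-type' term for every branch/type pair."""
    return [f"#{b}-{p}" for b, p in product(branches, product_types)]


terms = combined_terms(BRANCHES, PRODUCT_TYPES)
print(len(terms))  # 3 branches x 3 product types
```

Every product type added later costs one new term per branch, which is the
growth Markus is worried about.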
> We *could* do this.  But if we go there, we should be aware of what
> ugly thing we're doing.  And I'd suggest we think about alternatives
> first.
>
> First off: I think #associated-data as such is a good term, although
> we may want to try to make the distinction from the existing #auxiliary a
> bit clearer.  Essentially, if we model provenance as a tree, then
> #progenitor is an ancestor of the current item, #derivation a
> descendant, and #associated-data a sibling.  I like it, and I can see
> why this fits into use case (a).  Also, we have Gaia DR2, where this
> can be immediately applied.
>
> I'm still unhappy about putting #auxiliary against #associated-data;
> the fact that the description of the former is just "auxiliary
> resources" may underline the importance of trying hard to come up
> with helpful descriptions.  But that's for another day.
>
> Let's look at use case (b).  Really, what we'd like to have is a
> mapping of "something" to the SAMP mtypes
> (https://wiki.ivoa.net/twiki/bin/view/IVOA/SampMTypes).  I suppose
> we're doing our adopters a favour if we start from obscore
> dataproduct_types, because they'll have to deal with them anyway.
> I think François' intent has been pretty much that in the proposed
> vocabulary, which largely takes up 3.3.1 of obscore, except for
> the attempt to additionally describe the nature of cube axes in that
> scheme (which we could discuss separately).
>
> If we accept this, the question transforms into: "Where can we
> communicate an obscore dataproduct_type in datalink?".
>
> I can see three options:
>
> (1) The semantics column -- the consequences I've described above.
> No disaster, but certainly ugly.
My preference. See below.
>
> (2) The datalink content_type column.  As said above, media types
> don't quite work out of the box, because dataproduct types and media
> types don't really map onto each other.  However, RFC 6838 media
> types have structure: You can add parameters.  We already exploit
> this in datalink to say that datalink documents should come with a
> media type of application/x-votable+xml;content=datalink.
>
> What if we just said, in datalink: "Wherever possible, the
> content_type should indicate the dataproduct type communicated, using
> a content parameter taken from the vocabulary associated with obscore
> dataproduct_type.  For instance, a spectrum in a VOTable would have
> application/x-votable+xml;content=spectrum, whereas some kind of cube
> in a FITS serialisation would be application/fits;content=cube."
>
> We can immediately start doing this; there are strings attached,
> though, in that I doubt too many clients parse media types at this
> point, and these might become confused if we did this.
Mmmm. We would probably have to reproduce the hierarchy in the content
parameter then. And I think it's better to let content-type manage formats
(format refinement is also possible there). I find it better to use
semantics, because with associated-timeseries-lightcurve we are saying
"this link is an association to a lightcurve related to the current item
in my main table", which looks like a semantic thing.
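
For comparison, this is roughly what option (2) would ask of every client:
splitting the media type from its parameters before looking at it. The
sketch below is a deliberately simplified, hand-written parser (it does not
handle quoted parameter values containing ';' or '='), and the function name
is hypothetical.

```python
def split_media_type(value):
    """Split 'type/subtype;key=val' into (media type, parameter dict).

    Simplified: quoted strings containing ';' or '=' are not handled.
    """
    parts = [p.strip() for p in value.split(";")]
    media_type = parts[0].lower()
    params = {}
    for part in parts[1:]:
        key, sep, val = part.partition("=")
        if sep:
            # Parameter names are case-insensitive; strip optional quotes.
            params[key.strip().lower()] = val.strip().strip('"')
    return media_type, params


media_type, params = split_media_type(
    "application/x-votable+xml;content=spectrum")
```

A client that compares the content_type string as a whole against
"application/x-votable+xml" would no longer match here, which is the
confusion Markus mentions.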
>
> (3) Adding a dataproduct_type column in datalink.  If we started from
> scratch, this is probably what I'd do.  As things are now... don't
> know.  As for (2), this can start immediately (because datalink lets
> you add extra columns), and it would even have the advantage that
> clients that don't parse media types would still understand
> content_type.
Well, some other people (Alberto, for example) have asked for this. I'm
reluctant because for most links this column would be unused (most
link use cases are not "dataproducts" at all). In general I think
we should avoid adding columns to the DataLink response and should
try to keep it simple, especially when these columns come from another
spec (Obscore). Of course this is not a rigid position: if people think
they absolutely need an extra column and absolutely cannot do without it,
let's consider it. But here I don't think we are outside the scope of
semantics: we are qualifying the link with the help of the Obscore vocabulary.

Cheers
François
>
> Any opinions or preferences from datalink adopters or authors?
>
>
> Coming back to the vocabulary as such -- Petr's mail IMHO admirably
> makes clear that the full problem is probably beyond the means of a
> single term from a vocabulary and thus underlines my appeal to try
> and solve problems we have right now and know can be solved with
> simple terms.  See:
>
>> E.g. what is missing is the associated link to timeseries where the
>> horizontal axis is not time but circular phase associated with given
>> frequency in a periodogram and the associated periodogram itself.
>>
> [...]
>> If you want the example of timeseries of spectra
>> there is so called dynamical spectrum (e.g. in my old pictures
> [...]
>> There are of course better examples of quick time resolved spectroscopy etc
> [...]
>> Also I can imagine the time series of datacubes (in ALMA, radio) ...
>>
>> And lastly, what about the gravitational wave associated information
>> (strain/frequency - I have asked people from the GW community for detailed
>> examples ...
>> and it seems that the common "timeseries" they use is
>> either strain/time   or power density of strain/frequency
>> (strain is relative displacement/baseline of mirrors)
> [...]
>
>> As something more understandable for optical astronomers we should think
>> about folded curves as well as so-called phase portraits of those curves
>> (important for analysis of deterministic chaos - which some sources may be
>> driven by)
> [...]
>
>> If I go into details - even a single-order spectrum has an associated 2D
>> image of the spectrum (e.g. the rainbow) on a CCD chip as a strip of light, and
>> in echelle - still not properly handled even by SSAP - it is even more complicated
>> ... perhaps the cutout of the whole echellogram for a given spectral order is a
>> good approximation for the proposed "associated image"
> (I've elided a few more cases of stuff we would have to annotate if
> we wanted to machine-readably label all possible kinds of data
> products). Which is why I like Petr's conclusions:
>
>> IMHO we should have an easily extensible vocabulary and let the client
>> developers decide how they will use the information.
>> The people publishing a certain product at the datalink end will have a clear vision of
>> what they want to show and the new clients will be able to display this ....
>>
>>
>> But in practice I think that what differs most between clients is the
>> dimension - e.g. timeseries as light curves, folded light curves (in phases),
>> spectra, power spectra, gravitational waves etc ... are just the same task:
>> to display a 1D vector - and all the "semantics" is given by the description of
>> axes - units, variables...
>>
>> This is what we wanted to express in our IVOA note - SPLAT is a tool for
>> displaying 1D vectors. No semantics needed. That's why we could use it for
>> time series immediately by changing a few lines of code ;-)
>>
>> The image is the domain of Aladin and we need 3D viewers for data cubes ...
>> That's all - the number of axes determines the product and the client to use.
> So -- I'd now say #associated-data is enough to satisfy the filtering
> use case (a).  Whereas the SAMP sending use case (b) is probably
> better solved by something else.
>
>               -- Markus