[VEP-0001] DataLink semantics vocabulary enhancement proposal

Tue Oct 22 10:53:23 CEST 2019

Hi DAL,

On Mon, Oct 21, 2019 at 05:38:32PM +0200, Petr Skoda wrote:
> > So, a question to all (including Carlos, who's posted on voevent@):
> > Which of these terms do you actually need *now* (or at least for data
> > that you will want to publish in the safely forseeable future)?  And
> > can you see a clear scenario for a *machine* to have to understand
> > matters at that level of detail (for human, there's always the
> > description in datalink)?
> 
> I would like to point out that the list suggested by Francois is still not
> sufficient for many archival (Vizier e.g.) and future surveys data.

Well, you see, that is the question: sufficient for exactly *what*?
These terms are directed at machines, and for humans there's still
description (and a lot more channels).  So, the question is: how much
distinction do you have to convey to a machine client?

As far as I can see, there are two use cases in general for datalink
semantics:

(a) link filtering: The client, based on the semantics, selects a
subset of the links provided to present to its users -- for instance,
calibration data will not be shown outside of a debugging session.
Or they're just used for grouping.  This was, I think, the original
use case that triggered the introduction of the semantics column.
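
To make use case (a) concrete, here is a minimal sketch of such a filter.
The rows, the URLs and the visible_links helper are made up for
illustration; the only assumption about the vocabulary is that
calibration data is marked with a #calibration term.

```python
# Hypothetical datalink rows, reduced to the two columns that matter here.
rows = [
    {"access_url": "http://example.org/spec.fits", "semantics": "#this"},
    {"access_url": "http://example.org/flat.fits", "semantics": "#calibration"},
    {"access_url": "http://example.org/preview.png", "semantics": "#preview"},
]


def visible_links(rows, debugging=False):
    """Return the rows a client would present to its user."""
    # Outside of a debugging session, calibration data is hidden.
    hidden = set() if debugging else {"#calibration"}
    return [row for row in rows if row["semantics"] not in hidden]


shown = [row["semantics"] for row in visible_links(rows)]
```

Grouping would work the same way, only keyed on the semantics value
instead of dropping rows.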

(b) figure out what to do with a link: When Aladin implemented
datalink, they found that based on what's in a datalink row, they
didn't know how to deal with a link: they'd like to send spectra to
clients listening to spectrum.load.ssa-generic, images to those
listening to image.load.fits and so forth.  The datalink content_type
column isn't quite sufficient for this, because
application/x-votable+xml can be a spectrum or an object catalog,
whereas image/fits might be some kind of cube or a plain image (or an
IRAF spectrum, or still something else).  That's the "SAMP sending
use case" that, I think, was largely missed when we wrote datalink.

Does anyone have more use cases for Datalink semantics?  If so, this
would be the perfect moment to bring them forward, in particular so
we can put them into Datalink 1.1.

Having established this much, after a mail from Ada I had another of
my dangerous epiphanies.  That is, if we really want to deal with use
case (b) in semantics, we'll end up reproducing the distinction that
VEP-0001 proposes in every branch: not only will we have

#associated-cube #associated-image #associated-radialvelocitycurve ...

but also

#derivation-cube #derivation-image #derivation-radialvelocitycurve ...

and (we've already seen use cases for that)

#progenitor-cube #progenitor-image #progenitor-radialvelocitycurve ...

We *could* do this.  But if we go there, we should be aware of what
ugly thing we're doing.  And I'd suggest we think about alternatives
first.
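
To see how quickly this grows, here is a small sketch that just
multiplies out the branches and product types from the three lists
above.  Both lists are truncated to what I wrote there; the real ones
would be longer, and the term spelling is only the pattern used above.

```python
from itertools import product

# Branches and product types as they appear in the lists above (truncated).
branches = ["associated", "derivation", "progenitor"]
product_types = ["cube", "image", "radialvelocitycurve"]

# One term per (branch, product type) pair.
terms = [f"#{branch}-{ptype}" for branch, ptype in product(branches, product_types)]
```

Every new product type adds one term per branch, and every new branch
one term per product type.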

First off: I think #associated-data as such is a good term, although
we may want to try to make the distinction from the existing
#auxiliary a bit clearer.  Essentially, if we model provenance as a
tree, then
#progenitor is an ancestor of the current item, #derivation a
descendant, and #associated-data a sibling.  I like it, and I can see
why this fits into use case (a).  Also, we have Gaia DR2, where this
can be immediately applied.
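
As a toy illustration of the tree picture (all dataset names are made
up, and a real provenance graph need not be a tree at all), the three
terms fall out of a simple parent lookup:

```python
# Toy provenance tree, stored as child -> parent.  Names are hypothetical.
parent = {"spectrum": "raw_frame", "lightcurve": "raw_frame", "fit": "spectrum"}


def ancestors(item):
    """Return all ancestors of item, nearest first."""
    found = []
    while item in parent:
        item = parent[item]
        found.append(item)
    return found


def relation(current, other):
    """Return the term describing other as seen from current, or None."""
    if other in ancestors(current):
        return "#progenitor"
    if current in ancestors(other):
        return "#derivation"
    if current != other and parent.get(current) is not None \
            and parent.get(current) == parent.get(other):
        return "#associated-data"
    return None
```

Here, relation("spectrum", "lightcurve") is a sibling relation, while
relation("fit", "lightcurve") is none of the three.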

I'm still unhappy about putting #auxiliary against #associated-data;
the fact that the description of the former is just "auxiliary
resources" may underline the importance of trying hard to come up
with helpful descriptions.  But that's for another day.

Let's look at use case (b).  Really, what we'd like to have is a
mapping of "something" to the SAMP mtypes
(https://wiki.ivoa.net/twiki/bin/view/IVOA/SampMTypes).  I suppose
we're doing our adopters a favour if we start from obscore
dataproduct_types, because they'll have to deal with them anyway.
I think François' intent has been pretty much that in the proposed
vocabulary, which largely takes up 3.3.1 of obscore, except for
the attempt to additionally describe the nature of cube axes in that
scheme (which we could discuss separately).

If we accept this, the question transforms into: "Where can we
communicate an obscore dataproduct_type in datalink?".

I can see three options:

(1) The semantics column -- the consequences I've described above.
No disaster, but certainly ugly.

(2) The datalink content_type column.  As said above, media types
don't quite work out of the box, because dataproduct types and media
types don't really map onto each other.  However, RFC 6838 media
types have structure: You can add parameters.  We already exploit
this in datalink to say that datalink documents should come with a
media type of application/x-votable+xml;content=datalink.

What if we just said, in datalink: "Wherever possible, the
content_type should indicate the dataproduct type communicated, using
a content parameter taken from the vocabulary associated with obscore
dataproduct_type.  For instance, a spectrum in a VOTable would have
application/x-votable+xml;content=spectrum, whereas some kind of cube
in a FITS serialisation would be application/fits;content=cube."

We can immediately start doing this; there are strings attached,
though, in that I doubt too many clients parse media types at this
point, and these might become confused if we did this.
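
For what it's worth, parsing the parameter is not much work.  Here is a
minimal sketch using only the Python standard library; the function
name split_media_type is made up, and the content parameter is the
convention proposed above, not something datalink prescribes today.

```python
from email.message import Message


def split_media_type(content_type):
    """Return (bare media type, value of the content parameter or None)."""
    # The stdlib header parser already understands media type parameters.
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_type(), msg.get_param("content")


bare, product = split_media_type("application/x-votable+xml;content=spectrum")
```

The confusion I'm worried about is the client that compares the whole
content_type string against "application/x-votable+xml": that
comparison fails as soon as a parameter is present.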

(3) Adding a dataproduct_type column in datalink.  If we started from
scratch, this is probably what I'd do.  As things are now... don't
know.  As with (2), this can start immediately (because datalink lets
you add extra columns), and it would even have the advantage that
clients that don't parse media types would still understand
content_type.
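
A sketch of what a client could do with such a column for the SAMP
sending use case.  The column name dataproduct_type and the lookup
table are assumptions; the two mtypes are the ones mentioned under (b),
and the table is deliberately incomplete.

```python
# Assumed mapping from obscore dataproduct_type to SAMP mtype (incomplete).
MTYPE_FOR_PRODUCT = {
    "spectrum": "spectrum.load.ssa-generic",
    "image": "image.load.fits",
}


def samp_mtype(row):
    """Pick a SAMP mtype for a datalink row, or None if we cannot tell."""
    # Rows from services without the extra column simply yield None.
    return MTYPE_FOR_PRODUCT.get(row.get("dataproduct_type"))


# Hypothetical rows: one with the extra column, one without.
with_column = {"content_type": "application/x-votable+xml",
               "dataproduct_type": "spectrum"}
without_column = {"content_type": "application/x-votable+xml"}
```

Clients that know nothing about the new column keep working from
content_type exactly as before.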

Any opinions or preferences from datalink adopters or authors?

Coming back to the vocabulary as such -- Petr's mail IMHO admirably
makes clear that the full problem is probably beyond the means of a
single term from a vocabulary and thus underlines my appeal to try
and solve problems we have right now and know can be solved with
simple terms.  See:

> E.g. what is missing is the associated link to timeseries where the
> horizontal axis is not time but circular phase associated with given
> frequency in a periodogram and the associated periodogram itself.
> 
[...]
> If you want the example of timeseries of spectra
> there is so called dynamical spectrum (e.g. in my old pictures
[...]
> There are of course better examples of quick time resolved spectroscopy etc
[...]
> Also I can imagine the time series of datacubes (in ALMA, radio) ...
> 
> And lastly , what about the gravity wave associated information
> (strain/frequency - I a have asked people from GW community for detailed
> examples ...
> and it seems that the common "timeseries" they use is
> either strain/time   or power density of strain/frequency
> (strain is relative displacement/baseline of mirrors)
[...]

> As something more understandable for optical astronomers we should think
> about folded curves as well as so called phase portaits of those curves
> (important for analysis of deterministic chaos - which some sources may be
> driven by)
[...]

> If I go to details - even the single order specrum has associated the 2D
> image of spectrum (e.g. the rainbow) on a CCD chip as a strip of light and
> in echelle - still not properly handled even by SSAP it is even complicated
> ... perhaps the cutout of whole echellogram of a given spectral order is a
> good approximation for proposed "associated image"

(I've elided a few more cases of stuff we would have to annotate if
we wanted to machine-readably label all possible kinds of data
products). Which is why I like Petr's conclusions:

> IMHO we should have easily extensible vocabulary and let the client
> developers to decide how they will use the information
> The people publishing certain product at datalink end will have clear vision
> what they want to show and the new clients will be able to display this ....
> 
> 
> But in practice I think that the most different part of clients is the
> dimension - e.g. timeseries as light curves, folded light curves (in phases)
> , spectra, power spectra , gravitation waves etc ... are just the same task
> to display as 1D vector - and all "semantics": is given by description of
> axes - units, variables...
> 
> This is what we wanted to express in our IVOA note - SPLAT is tool for
> displaying 1D vectors. No semantics needed. Thats why we could use it to
> time series immediately with changing a few lines of code ;-)
> 
> The image is domain of Aladin and we need a 3D viewers for data cubes ...
> Thats all - number of axes determines the product and client to use.

So -- I'd now say #associated-data is enough to satisfy the filtering
use case (a).  Whereas the SAMP sending use case (b) is probably
better solved by something else.

             -- Markus