DataLink target meaning : "observation results of a source" use case

François Bonnarel francois.bonnarel at astro.unistra.fr
Thu Feb 13 19:13:49 CET 2020


Hi all,

Trying to go further on this point (also related to Markus's VEP003 email 
yesterday).

2 things...

I ) After discussing with a couple of people, I think the product type for 
these associated datasets can be set in the content parameter of the 
media type value of the DataLink content_type, e.g.

application/fits;content=timeseries;subtype=lightcurve

This will probably require a new change proposal for the DataLink spec 
itself. It's not a big one. I will prepare it tomorrow.
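
To illustrate what this would mean on the client side, here is a minimal 
sketch in Python of how the content and subtype parameters could be read 
from a content_type value. The function name parse_content_type is only 
illustrative, and the naive split on ";" assumes that parameter values 
never contain a quoted ";".

```python
def parse_content_type(value):
    """Split a media type string into (type/subtype, {parameter: value})."""
    parts = [p.strip() for p in value.split(";")]
    media_type = parts[0].lower()
    params = {}
    for part in parts[1:]:
        if not part:
            continue  # tolerate a trailing ';'
        name, _, val = part.partition("=")
        # Parameter names are case-insensitive; strip optional quotes.
        params[name.strip().lower()] = val.strip().strip('"')
    return media_type, params


media_type, params = parse_content_type(
    "application/fits;content=timeseries;subtype=lightcurve"
)
print(media_type)              # format of the target
print(params.get("content"))   # product type, if the server gives one
print(params.get("subtype"))   # optional refinement of the product type
```

A client that does not know these parameters can simply ignore them and 
still use the format part.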

II ) For the semantics term we need to relate a dataset to a source in a 
catalog, I think the most general thing we can say about it is that it 
is a "cross-correlation". I propose the term "CrossedDataset". This can be 
the head term for #sibling, #contains, #followup, etc...
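
As a rough sketch of what "head term" would mean for a client, assume 
the vocabulary is available as a simple narrower-to-broader mapping. The 
BROADER table, the "#CrossedDataset" spelling and the helper 
is_narrower_or_equal are illustrative names only, not part of any 
existing vocabulary or library.

```python
# Hypothetical narrower -> broader mapping for the proposed terms.
BROADER = {
    "#sibling": "#CrossedDataset",
    "#contains": "#CrossedDataset",
    "#followup": "#CrossedDataset",
}


def is_narrower_or_equal(term, head):
    """Return True if `term` is `head` or sits below it in the mapping."""
    while term is not None:
        if term == head:
            return True
        term = BROADER.get(term)  # None once the top of the tree is reached
    return False


# Keep every link row whose semantics falls under the head term.
rows = ["#this", "#followup", "#preview", "#CrossedDataset"]
crossed = [r for r in rows if is_narrower_or_equal(r, "#CrossedDataset")]
print(crossed)
```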

Cheers
François

On 07/01/2020 at 23:04, Patrick Dowler wrote:
>
> First, although DataLink was conceived with an implicit "resource is a 
> dataset" that leaked into the terminology and examples, I agree that 
> there is no reason that it cannot be used for other kinds of entities. 
> Using that particular word does conjure up provenance, but datalink 
> and provenance are already related (#progenitor) conceptually.
>
> The way I am still seeing this, dataproduct_type (from ObsCore) says 
> what something *is* and that is not a relationship per se. Aside: on 
> the issue of subtype, I would prefer/like to make dataproduct_type a 
> vocabulary so people could extend it rather than using a two-level 
> type/subtype mechanism -- but only if we can figure out a sane/nice 
> way to query vocabulary terms via TAP that actually works.
>
> I can think of several relationships from a source in a catalogue to a 
> dataset and I still feel that the concept behind 
> "Observation_Result_of_source" is eluding me. The relation could be:
>
> #progenitor : some/all source properties were measured in that dataset
> #derivation : the dataset was created from the source properties
>
> other possible relationships:
>
> contains : the dataset contains the source (seems like this is a 
> top-level very general and vague statement; I would interpret this to 
> also mean "and not progenitor")
>
> followup : the existence/discovery of the source caused a new 
> observation to occur (child of contains, causal relation)
>
> So, for someone with a source (catalogue) and a related 
> image|spectrum|lightcurve, is that data one of these or is it some 
> other concept?
>
>
> --
> Patrick Dowler
> Canadian Astronomy Data Centre
> Victoria, BC, Canada
>
>
> On Fri, 20 Dec 2019 at 07:46, François Bonnarel 
> <francois.bonnarel at astro.unistra.fr> wrote:
>
>     This email was sent yesterday in another thread.
>
>     Following Markus' recommendation I open now a new thread for this
>     discussion of the "astronomical source observation results" use cases.
>
>     Cheers
>
>     François
>
>     Dear all,
>
>       * When I proposed VEP0001 immediately after Groningen Interop I
>         could not imagine that such a controversial discussion would occur.
>           o Before considering the use case we have I would like to go
>             back to the current usages of DataLink I know.
>           o Then go back to the "new" use case
>           o And then check some of the proposed solutions on this list
>           o And then argue for my preference
>       * According to DataLink 1.0
>           o the semantics field contains a "Term from a controlled
>             vocabulary describing the link" as stated in Table 1 and
>           o section 3.2.6 reads :
>           o "The semantics column contains a single term from an
>             external RDF vocabulary that describes the meaning of this
>             linked resource relative to the identified dataset. The
>             semantics column is intended to be machine-readable and
>             assist automating data retrieval and processing."
>           o Let's call the initial thing we are starting from and to
>             which we want to link resources "Main" and the various
>             linked resources "Target".
>               + Two remarks  :
>                   # The text in section 3.2.6, consistently with the
>                     use cases described in the introduction considers
>                     that the "Main" is a dataset
>                   # The  semantics field describes globally what the
>                     target is "with respect to the main"
>               + More classical is the group of columns access_URL ,
>                 content_type, content_length which references and
>                 describes the "Target" itself (independently from the
>                 "Main")
>               + Now I tried to look a little bit at the current usage
>                 of DataLink using Aladin DeskTop as a client and the
>                 three following SIAP2 servers
>                   # CADC :
>                       * In the example I found, the DataLink service
>                         had "this" in semantics for the full retrieval
>                         of the dataset,
>                       *  "cutout" for a SODA service
>                       * and a couple of "auxiliary" rows for
>                         additional data such as PSF images, etc...
>                       *  "cutout" is tied to the fact that the target
>                         is a service, described by a "service descriptor".
>                         Aladin opens a specific menu in that case,
>                         while in the other cases it downloads the
>                         datasets because their "content_type"
>                         is application/fits
>                   # GAVO :
>                       * In the example I found, the DataLink service
>                         had "this" in semantics,  and also "preview",
>                         "proc" and "science".
>                       *  "this" and "preview" are self-explanatory.
>                       * "proc" is actually related to a SODA service
>                         (should be "cutout" maybe ?)
>                       * and "science" is a new term proposed by Markus
>                         to indicate that the target is related
>                         science data
>                   # CASDA :
>                       *  In the example I found,  "Main" was a cube.
>                         It had in semantics several "this", a "cutout"
>                         and a "proc".
>                       *   Each "this" row allowed the retrieval of the
>                         full dataset from different servers sometimes
>                         in synchronous mode and sometimes in
>                         asynchronous mode.
>                       *  The "cutout" row is related to a SODA service.
>                       * The "proc" row links to a SODA-like service
>                         extracting a single integrated spectrum from
>                         the data cube.
>               + This shows that semantics is not only there in
>                 DataLink for selection among rows in the {links}
>                 response table but also helps the client to figure out
>                 what to do with the target in combination with
>                 content_type, content_length and service descriptor
>                 (if any is defined).
>               + This also shows that semantics terms work like a flat
>                 vocabulary despite their tree presentation in the rdf
>                 document.
>                   # Auxiliary is a head term for bias, dark, flat but
>                     can also be used on its own for non registered cases.
>                   # Same for proc and cutout.
>                   # The tree structure of the vocabulary is actually
>                     only descriptive. It's not functional at the time
>                     of writing.
>       * New use cases:
>           o Shortly after DataLink became an official IVOA
>             recommendation, some data providers were interested in
>             using the DataLink functionalities for use cases where the
>             "Main" was a source in a catalogue.
>           o  This can work, of course, and proposals are currently
>             being discussed to integrate these use cases within the
>             scope of DataLink-1.1, but no suitable semantics terms
>             describing this kind of relationship between the "Main"
>             and the "Target" were available in the previous vocabulary.
>           o Often  the "Target" related to the source "Main" is the
>             result of an observation of the source, actually a dataset
>             (image, spectrum, lightcurve, etc..)
>               +  In VizieR we had a similar situation for what we call
>                 "associated data" of catalogue "rows".
>               + these "associated data" can indeed be images,
>                 TimeSeries, cubes, spectra...
>           o  Hence the VEP0001 proposal as it was presented on
>             October 15th
>               + An associated_image is actually "an image of main"
>                 which is a source.
>               +  An associated_lightcurve is similarly " a light curve
>                 of Main"   which is a source.
>           o  It should be stressed that this term informs the client
>             both that the target is an image or a light curve and that
>             it is an observation result of the source.
>           o The proposal to define an item in the associated branch
>             for each value of dataproduct_type and even more for each
>             subtype of TimeSeries introduced the idea to combine
>             associated_data with the ObsCore vocabulary.
>               +  It was pointed out (by Markus) that other head terms
>                 such as "progenitor" or "derived" could need this too
>                 and this could lead to a combinatorial explosion.
>           o By the way, the term "associated_data" itself has been
>             criticized as a description of the concept of observation
>             result of a source.
>       * The 4 concepts proposal
>           o Ada proposed to separate the description of the links in 4
>             different concepts
>               + "4 independent levels or categories:
>               + Level 0 - Data-format (fits, VOTable, PDF, png, …)
>               + Level 1 - Data-type (tabular, image, spectrum, cube,
>                 text, …)
>               + Level 2 - Data-information (Documentation,
>                 Calibration, Log, Preview, …)
>               + Level 3 - Data-relation (Derived from, Progenitor of,
>                 Sibling of, ...)"
>           o I think this introduces an effort towards a real data
>             modelling of DataLink. It would obviously be a major
>             improvement in the way we link resources. But it may take
>             some time to achieve.
>           o At the moment I don't see a clear distinction between
>             level 2 and level 3 because the "information" we have in
>             the "Target"  is always "relative" to a "Main", so not
>             that far from level 3. At the very least, it may sometimes
>             be difficult to know into which "level" a given
>             category value falls
>           o On the other hand, for links to dynamical services I am not
>             sure to which category their characterization belongs. Is
>             that a fifth level to add ? Data-type in the context of
>             DataLink should have a much wider scope than ObsCore
>             "dataproduct_type" because there are targets which are not
>             data products: various metadata, auxiliary data, texts,
>             plots, etc... If dataproduct_type is standardized, what
>             about the other stuff ?
>           o To me it looks like the levels proposed by Ada (and maybe a
>             few others) are more like a matrix description than a flat one.
>           o Taking all of the above into account, I think the levelling
>             of the categories can be a project for DataLink 2, which will
>             be really interesting. If we want to have a quick solution
>             I think we have to consider more modest solutions.
>       * Among different Proposals :
>           o I see two possible simple solutions to tackle the use case
>               + go back to a simplified version of VEP001.
>                   # Instead of reproducing the full ObsCore
>                     "dataproduct_type" variability we only define the
>                     terms we currently need  and we will see in the
>                     future if we need more.
>                   # At the same time I get rid of both the
>                     "associated_data" and "sibling" head terms and
>                     choose to use "Observation_Result_of_source"
>                   # ESO and SVO use cases :   "image_of_source",
>                     "Spectrum_of_source"
>                   # TimeDomain/Gaia use cases :
>                     "LightCurve_Of_Source",
>                     "RadialVelocityCurve_Of_Source",
>                     "Movie_Of_Source", "SpectroChronogram_Of_Source"
>                       * "TimeSeries_Of_Source" may be used as a head
>                         term for the four above, or when we don't know
>                         exactly what is varying in time.
>               + adopt the proposal made by Pat Dowler. Use the media type
>                 in content_type to give the type or product type using
>                 the parameter "content="
>                   # application/fits;content=image
>                   # application/fits;content=spectrum
>                   #  application/fits;content=lightcurve or
>                     application/fits;content=timeseries;subtype=lightcurve
>                   # application/fits;content=movie or
>                     application/fits;content=timeseries;subtype=movie
>                   # etc ...
>                   # the standard structure of media types allows us to
>                     reuse the current "dataproduct_type" vocabulary as
>                     the value of the content parameter and then to use
>                     an additional "subtype" parameter, or
>                     alternatively  to directly use the timeseries
>                     subtype in "content=".
>                   # a variant would be to create a new
>                     dataproduct_type parameter in the media type when
>                     appropriate
>                   #  If we adopt that, semantics will only be
>                     "Observation_Result_of_source" in parallel for all
>                     these possibilities
>               +  In the first solution we directly introduce some kind
>                 of datatype in the "meaning of target relative to the
>                 main" semantics field, which I think is fine except
>                 that it doesn't explicitly reuse ObsCore dataproduct_type.
>               + In the second solution clients will have to parse the
>                 media type to discover not only the format of the
>                 target but also its content. We still have to decide
>                 how to do subtype.
>                   # This has probably to be explicitly explained in
>                     the next DataLink-1.1 version
>           o What do implementers / service providers prefer ?
>
>
>     I wish you all happy holidays for the coming days
>
>     Cheers
>
>     François
>
