DataLink target meaning : "observation results of a source" use case

François Bonnarel francois.bonnarel at astro.unistra.fr
Fri Dec 20 16:46:34 CET 2019


This email was sent yesterday in another thread.

Following Markus' recommendation, I am now opening a new thread for the
discussion of the "astronomical source observation results" use cases.

Cheers

François

Dear all,

  * When I proposed VEP0001 immediately after the Groningen Interop I
    could not have imagined that it would trigger such a controversial
    discussion.
      o Before considering the use case at hand, I would like to go back
        to the current usages of DataLink that I know of.
      o Then go back to the "new" use case
      o And then check some of the solutions proposed on this list
      o And then argue for my preference
  * According to DataLink 1.0
      o the semantics field contains a "Term from a controlled
        vocabulary describing the link" as stated in Table 1 and
      o section 3.2.6 reads :
      o "The semantics column contains a single term from an external
        RDF vocabulary that describes the meaning of this linked
        resource relative to the identified dataset. The semantics
        column is intended to be machine-readable and assist automating
        data retrieval and processing."
      o Let's call the initial thing we are starting from, and to which
        we want to link resources, "Main", and the various linked
        resources "Target".
          + Two remarks:
              # The text in section 3.2.6, consistently with the use
                cases described in the introduction, considers that the
                "Main" is a dataset
              # The semantics field describes globally what the target
                is "with respect to the main"
          + More classical is the group of columns access_url,
            content_type, content_length, which references and describes
            the "Target" itself (independently of the "Main")
          + Now I tried to look a little at the current usage of
            DataLink, using Aladin Desktop as a client and the following
            three SIAP2 servers
              # CADC :
                  * In the example I found, the DataLink service had
                    "this" in semantics for the full retrieval of the
                    dataset,
                  * "cutout" for a SODA service
                  * and a couple of "auxiliary" rows for additional data
                    such as PSF images, etc.
                  * The "cutout" row points to a service, described by a
                    "service descriptor". Aladin opens a specific menu
                    in that case, while in the other cases it downloads
                    the dataset because its "content_type" is
                    application/fits
              # GAVO :
                  * In the example I found, the DataLink service had
                    "this" in semantics, and also "preview", "proc" and
                    "science".
                  * "this" and "preview" are self-explanatory.
                  * "proc" is actually related to a SODA service (should
                    it be "cutout", maybe?)
                  * and "science" is a new term proposed by Markus to
                    indicate that the target is related science data
              # CASDA :
                  * In the example I found, "Main" was a cube. It had
                    in semantics several "this", a "cutout" and a "proc".
                  * Each "this" row allowed the retrieval of the full
                    dataset from different servers, sometimes in
                    synchronous mode and sometimes in asynchronous mode.
                  *   The "cutout" row is related to a SODA service.
                  * The "proc" row links to a SODA-like service
                    extracting a single integrated spectrum from the
                    data cube.
          + This shows that semantics is there in DataLink not only for
            selecting among rows in the {links} response table but also
            to help the client figure out what to do with the target, in
            combination with content_type, content_length and the
            service descriptor (if any is defined).
          + This also shows that semantics terms work like a flat
            vocabulary despite their tree presentation in the RDF document.
              # "auxiliary" is a head term for bias, dark, flat but can
                also be used on its own for unregistered cases.
              # Same for proc and cutout.
              # The tree structure of the vocabulary is actually only
                descriptive. It's not functional at the time of writing.
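
To make the point about combining columns concrete, here is a minimal
sketch in Python of how a client could pick an action for one row of
the {links} table. It is not how Aladin is implemented: the function
name, the action names and the use of a plain dict per row are all
assumptions made for illustration. Only the column names (semantics,
content_type, service_def) come from DataLink.

```python
def choose_action(row):
    """Pick a (hypothetical) client action for one row of a {links} table.

    `row` is a plain dict keyed by DataLink column names; the returned
    action names are invented for this sketch.
    """
    # Accept both the "#cutout" and the "cutout" spelling of a term.
    semantics = (row.get("semantics") or "").lstrip("#")
    # Ignore any media type parameters for this decision.
    content_type = (row.get("content_type") or "").split(";")[0].strip().lower()

    if row.get("service_def"):
        # The row refers to a service descriptor (e.g. a SODA service).
        return "open_service_menu"
    if semantics == "preview":
        return "show_preview"
    if content_type == "application/fits":
        return "download_and_load"
    # Anything else is simply offered to the user as a link.
    return "offer_link"
```

For example, a "cutout" row with a service_def would open the service
menu, while a "this" row with content_type application/fits would be
downloaded and loaded.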
  * New use cases:
      o Shortly after DataLink became an official IVOA recommendation,
        some data providers were interested in using the DataLink
        functionalities for use cases where the "Main" was a source in a
        catalogue.
      o This can work, of course, and proposals are currently being
        discussed to integrate these use cases within the scope of
        DataLink-1.1, but no suitable semantics terms describing this
        kind of relationship between the "Main" and the "Target" were
        available in the previous vocabulary.
      o Often the "Target" related to the source "Main" is the result
        of an observation of the source, actually a dataset (image,
        spectrum, lightcurve, etc.)
          + In VizieR we had a similar situation for what we call
            "associated data" to catalogue "rows".
          + These "associated data" can indeed be images, TimeSeries,
            cubes, spectra...
      o Hence the VEP0001 proposal as it was presented on October 15th
          + An associated_image is actually "an image of Main", which is
            a source.
          + An associated_lightcurve is similarly "a light curve of
            Main", which is a source.
      o It should be highlighted that such a term informs the client
        both that the target is an image or a light curve and that it is
        an observation result of the source.
      o The proposal to define an item in the associated branch for each
        value of dataproduct_type, and even more for each subtype of
        TimeSeries, introduced the idea of combining associated_data with
        the ObsCore vocabulary.
          + It was pointed out (by Markus) that other head terms such
            as "progenitor" or "derived" could need this too, and that
            this could lead to a combinatorial explosion.
      o By the way, the term "associated_data" itself has been
        criticized as a description of the concept of observation result
        of a source.
  * The 4 concepts proposal
      o Ada proposed to separate the description of the links in 4
        different concepts
          + "4 independent levels or categories:
          + Level 0 - Data-format (fits, VOTable, PDF, png, …)
          + Level 1 - Data-type (tabular, image, spectrum, cube, text, …)
          + Level 2 - Data-information (Documentation, Calibration, Log,
            Preview, …)
          + Level 3 - Data-relation (Derived from, Progenitor of,
            Sibling of, ...)"
      o I think this introduces an effort towards a real data modelling
        of DataLink. It would obviously be a major improvement in the
        way we link resources. But it may take some time to achieve.
      o At the moment I don't see a clear distinction between level 2
        and level 3, because the "information" we have in the "Target"
        is always "relative" to a "Main", so not that far from level 3.
        At least it may sometimes be difficult to know in which "level"
        a given category value falls
      o On the other hand, for links to dynamic services I am not sure
        to which category their characterization belongs. Is that a
        fifth level to add? Data-type in the context of DataLink should
        have a much wider scope than ObsCore "dataproduct_type", because
        there are targets which are not data products: various metadata,
        auxiliary data, texts, plots, etc. If dataproduct_type is
        standardized, what about the other stuff?
      o To me it looks like the levels proposed by Ada (and maybe a few
        others) are more like a matrix description than a flat one.
      o Taking all of the above into account, I think the levelling of
        the categories can be a project for DataLink 2, which will be
        really interesting. If we want to have a quick solution I think
        we have to consider more modest solutions.
  * Among the different proposals:
      o I see two possible simple solutions to tackle the use case
          + go back to a simplified version of VEP0001.
              # Instead of reproducing the full ObsCore
                "dataproduct_type" variability, we only define the terms
                we currently need, and we will see in the future if we
                need more.
              # At the same time I get rid of both the "associated_data"
                and "sibling" head terms and choose to use
                "Observation_Result_of_source"
              # ESO and SVO use cases : "image_of_source",
                "Spectrum_of_source"
              # TimeDomain/Gaia use cases :  "LightCurve_Of_Source",
                "RadialVelocityCurve_Of_Source", "Movie_Of_Source",
                "SpectroChronogram_Of_Source"
                  * "TimeSeries_Of_Source" may be used as a head term
                    for the four above, or when we don't know exactly
                    what is varying in time.
          + adopt the proposal made by Pat Dowler: use the media type in
            content_type to give the type or product type, using the
            parameter "content="
              # application/fits;content=image
              # application/fits;content=spectrum
              # application/fits;content=lightcurve or
                application/fits;content=timeseries;subtype=lightcurve
              # application/fits;content=movie or
                application/fits;content=timeseries;subtype=movie
              # etc ...
              # the standard structure of media types allows us to reuse
                the current "dataproduct_type" vocabulary as the value
                of the content parameter and then to use an additional
                "subtype" parameter, or alternatively to directly use
                the timeseries subtype in "content=".
              # a variant would be to create a new dataproduct_type
                parameter in the media type when appropriate
              # If we adopt that, semantics will simply be
                "Observation_Result_of_source" for all these
                possibilities
          + In the first solution we directly introduce some kind of
            data type in the "meaning of target relative to the main"
            semantics field, which I think is fine except that it
            doesn't explicitly reuse ObsCore dataproduct_type.
          + In the second solution clients will have to parse the media
            type to discover not only the format of the target but also
            its content. We still have to decide how to handle subtypes.
              # This probably has to be explicitly explained in the next
                DataLink-1.1 version
      o What do implementers / service providers prefer ?
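
To give implementers an idea of the client-side cost of the second
solution, here is a minimal sketch in Python of the media type parsing
it would require. The "content" and "subtype" parameters are only the
ones proposed in this thread, not part of any standard, and the
function name is hypothetical. Quoted parameter values containing ";"
are not handled.

```python
def parse_content_type(value):
    """Split a media type string into (media_type, parameters).

    Parameter names and values are lower-cased; surrounding double
    quotes on values are removed.
    """
    parts = [p.strip() for p in value.split(";")]
    media_type = parts[0].lower()
    params = {}
    for part in parts[1:]:
        key, sep, val = part.partition("=")
        if sep:  # skip empty or malformed fragments without "="
            params[key.strip().lower()] = val.strip().strip('"').lower()
    return media_type, params


media_type, params = parse_content_type(
    "application/fits;content=timeseries;subtype=lightcurve"
)
# Format of the target, then its (proposed) content and subtype.
print(media_type, params.get("content"), params.get("subtype"))
```

A client that only cares about the format can keep using the first
element and ignore the parameters, so existing behaviour would not be
affected by the extra information.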


I wish you all happy holidays for the coming days

Cheers

François