Datalink data access services

Fri Dec 13 07:13:51 PST 2013

Hi Markus,
Hi all,
     I won't go into all the details, but I want to make some short general
comments on the overall DAL landscape you propose, compared to what
emerged from the Hawaii interop.
The final decision concerning the landscape to build was taken at a
side meeting held during ADASS (I was personally absent, but this has
been explained by Severin on the list, if I remember correctly). This
landscape is clearly described in the new version of the SIAV2 WD posted
by Pat a couple of weeks ago. There is a nice diagram in § 1.3 of this draft.
      - accessData services (or functionalities) are identified there,
as well as dataLink services or functionalities. A new "full metadata"
or "get-gory-details" functionality also appears in the landscape for
image/cube services.
      - accessData services will be given the importance they deserve by
benefiting from a fully separate recommendation document.
      - In your B) below you propose to replace the item "free services"
by "accessData services" in the DataLink document. This seems to imply
that there is no room left for free services with other
functionalities (think, for example, of a service providing some kind of
model fitting to spectra; the use case exists for DataLink). But maybe
you don't mean this, so sorry in that case.
      - In C) you emphasize that you propose to replace
DataLink service descriptors by AccessData service descriptors in DAL
discovery query responses. So I don't understand where DataLink
services bound to a DAL query response will be described, because
apparently you don't like the possibility of having AccessData and DataLink
services described in the same response. Maybe you explained this in one of
your other mails this week, or maybe I didn't catch your point. Sorry
again in that case.
      - The SIAV2 draft of this month clearly states that AccessData and
GetFullMetadata services can be described using the same recipe described
by Pat in the DataLink WD for DataLink services. I think this approach is
fully usable and fully flexible, and I don't see why your use cases could
not work with this approach.

     In a next email (probably available on Monday) I will try to
explain why DataDiscovery, DataLink and AccessData are all not only
useful but independent service concepts, which have
to be defined independently for flexibility. This doesn't mean that they
cannot be combined in various ways for different use cases.

Best regards
François

On 12/12/2013 14:04, Markus Demleitner wrote:
> Dear DAL folks,
>
> It's Christmas time!  As a special present, I'm bringing you a longer
> piece on... well, read on.  It's something like 400 lines, and
> contains quite a bit of standardese, so you may want to reserve a quiet
> moment for this.  I ask for your patience, though, as I believe this
> will let us save a complete standards text, which would be far more work
> overall.
>
> As I've already tried to convey in Waikoloa, I'm convinced datalink
> doesn't need terribly much to just work(TM) as a generic server-side
> processing service -- this has been discussed as "accessData" in
> Waikoloa --, pretty much in the way the proposed (and implemented) SSA
> getData extension already works (e.g.,
> http://docs.g-vo.org/talks/2012-urbana-ssaevo.pdf; note that we're
> withdrawing SSA getData for the more general datalink).
>
> In this mail I'm proposing changes to WD-DataLink-1.0-20131022 that,
> based on our experiences with SSA getData, should allow this.  I've not
> tried formatting the changes, but I believe the document shouldn't grow
> by more than two pages (excluding example data).
>
> (Almost) everything described here is already implemented in DaCHS;
> we're going to provide proof-of-concept client-side support for spectral
> manipulations in SPLAT RSN.
>
> I'll present these changes in chapters, enumerated with uppercase
> letters for easier references in praise or flame.
>
>
> (A) Promise a bit more in the introduction
> ------------------------------------------
>
> I propose to replace section "1.2.5 Free or Custom Services" with:
>
>    1.2.5 Data Access Services
>
>    In many data access scenarios, server-side processing of data is
>    highly desirable, typically to reduce the amount of data to be
>    transferred.  Examples for such operations are cutouts, slicing of
>    cubes, and rebinning to a coarser grid.  Other examples for server-side
>    operations include on-the-fly format conversion or recalibration.  For
>    the purpose of this specification, we call such services data access
>    services.  Datalink lets servers declare such data access services in a
>    way that a generic client can discover what operations are supported,
>    their semantics, and the domains of the operations' parameters.  This
>    lets clients operate multiple independent data access services behind
>    a common user interface, allowing scenarios like "give me all voxels
>    around positions X in wavelength range Y of all spectral cubes from
>    services Z_1, Z_2, and Z_9".
>
>
> (B) Forward Reference in Data Discovery Section
> ------------------------------------------------
>
> Section 3.2 can become much shorter when there's a chapter on how to
> describe services (see (E)).  I'd therefore propose the following text:
>
>    3.2 Data Access Services in Discovery Responses
>
>    \label{sect:dl-in-discovery}
>
>    To communicate the capabilities of a data access service, a
>    DALI-compliant discovery service embeds one or more datalink data
>    access service resources (see section \ref{sect:dasr}) after the
>    VOTable RESOURCE of type "results".  This data access service MUST
>    support a parameter with the name ID and the UCD meta.id;meta.main.
>    In it, the client MUST pass a discovered identifier.
>
>    To enable the client to decide which column of the discovery result
>    table contains the appropriate identifier, the PARAM element describing
>    the ID parameter MUST contain one LINK element with
>    content-role="ddl:id-source", the value of which is a relative URI to
>    the FIELD element in the discovery result table (i.e., its id preceded
>    by a hash).
>
>    For instance, a result of an ObsCore query could contain
>
>      <FIELD name="obs_publisher_did" ID="datalinkID"
>        utype="obscore:Curation.PublisherDID"
>        ucd="meta.ref.url;meta.curation"
>        xtype="adql:VARCHAR" datatype="char" arraysize="256*" />
>
>    A data access service accepting identifiers from the corresponding
>    column would declare its ID parameter in the inputParams GROUP like
>    this:
>
>      <PARAM name="ID" arraysize="*" datatype="char" ucd="meta.id;meta.main"
>        value="">
>         <DESCRIPTION>The publisher DID of the dataset of interest</DESCRIPTION>
>         <LINK content-role="ddl:id-source" value="#datalinkID"/>
>      </PARAM>
>
>    The ID value datalinkID is of course arbitrary.
>
>    As in datalink documents themselves, multiple service resources can be
>    present in a single discovery response.  See section \ref{sect:dasr}
>    for more details.
>
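To make the lookup described in the proposed section 3.2 concrete, here is a minimal Python sketch of how a client might follow the ddl:id-source LINK from the ID PARAM to the right column and collect the identifiers to pass on. It is only an illustration under stated assumptions: the VOTable fragment is hypothetical and namespace-free for brevity (real responses carry the VOTable namespace), the identifiers are made up, and the function name is mine.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free discovery response; identifiers are invented.
RESPONSE = """
<VOTABLE>
  <RESOURCE type="results">
    <TABLE>
      <FIELD name="target_name" datatype="char" arraysize="*"/>
      <FIELD name="obs_publisher_did" ID="datalinkID"
        datatype="char" arraysize="256*"/>
      <DATA><TABLEDATA>
        <TR><TD>15 Mon</TD><TD>ivo://example.org/~?spec/a.fits</TD></TR>
        <TR><TD>15 Mon</TD><TD>ivo://example.org/~?spec/b.fits</TD></TR>
      </TABLEDATA></DATA>
    </TABLE>
  </RESOURCE>
  <RESOURCE type="service">
    <GROUP name="inputParams">
      <PARAM name="ID" arraysize="*" datatype="char"
        ucd="meta.id;meta.main" value="">
        <LINK content-role="ddl:id-source" value="#datalinkID"/>
      </PARAM>
    </GROUP>
  </RESOURCE>
</VOTABLE>
"""


def discovered_identifiers(root):
    """Return the values of the column named by the ID PARAM's id-source LINK."""
    target_id = None
    for param in root.iter("PARAM"):
        if param.get("name") != "ID":
            continue
        for link in param.findall("LINK"):
            if link.get("content-role") == "ddl:id-source":
                # The LINK value is a relative URI: the FIELD's ID after a hash.
                target_id = link.get("value", "").lstrip("#")
    if target_id is None:
        return []
    for table in root.iter("TABLE"):
        for index, field in enumerate(table.findall("FIELD")):
            if field.get("ID") == target_id:
                return [tr.findall("TD")[index].text for tr in table.iter("TR")]
    return []


identifiers = discovered_identifiers(ET.fromstring(RESPONSE))
print(identifiers)
```

Each returned string is what the client would then pass in the ID parameter of the data access service.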
> Incidentally, I believe the document would work better if Section 3 were
> shifted towards the back of the document so people already know what
> we're talking about when reading this.
>
> I've suggested the LINK-based technique to link (pun intended) the PARAM
> and the FIELD with the identifier earlier this week and Pat appeared
> underwhelmed.  I'm still convinced the LINK technique is preferable to
> the current WD technique (that, for the example, would look somewhat
> like this:
>
>    <GROUP>
>      <PARAM name="ID" arraysize="*" datatype="char" ucd="meta.id;meta.main"
>        value="">
>         <DESCRIPTION>The publisher DID of the dataset of interest</DESCRIPTION>
>      </PARAM>
>      <FIELDref ref="datalinkID"/>
>    </GROUP>
>
> on grounds that):
>
> (a) it's more explicit; sibling elements within an (unadorned) GROUP
>      have no explicit semantics, a LINK child of a PARAM does
> (b) it's easier to relate parents and children in DOMs than siblings
>      in my experience (try writing xpath expressions for picking out
>      the FIELD id for both solutions)
> (c) GREAT TAGS SAVINGS -- SAME CONTENT FOR NOT 10, NOT 20, NO,
>      A SENSATIONAL 30% LESS!
>
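Point (b) can be tried out with a small Python sketch that extracts the FIELD id from both encodings. This is only an illustration, assuming namespace-free fragments and helper names of my own; it makes no claim about how any existing client does it.

```python
import xml.etree.ElementTree as ET

# The LINK-based encoding: the reference is a child of the ID PARAM.
LINK_VARIANT = ET.fromstring("""
<GROUP name="inputParams">
  <PARAM name="ID" arraysize="*" datatype="char"
    ucd="meta.id;meta.main" value="">
    <LINK content-role="ddl:id-source" value="#datalinkID"/>
  </PARAM>
</GROUP>
""")

# The sibling-based encoding: the reference sits next to the ID PARAM.
FIELDREF_VARIANT = ET.fromstring("""
<GROUP>
  <PARAM name="ID" arraysize="*" datatype="char"
    ucd="meta.id;meta.main" value=""/>
  <FIELDref ref="datalinkID"/>
</GROUP>
""")


def id_source_from_link(group):
    """Parent-to-child lookup: find the ID PARAM, then its LINK child."""
    param = group.find("PARAM[@name='ID']")
    if param is None:
        return None
    for link in param.findall("LINK"):
        if link.get("content-role") == "ddl:id-source":
            return link.get("value", "").lstrip("#")
    return None


def id_source_from_fieldref(group):
    """Sibling lookup: the GROUP must hold an ID PARAM *and* a FIELDref."""
    if group.find("PARAM[@name='ID']") is None:
        return None
    fieldref = group.find("FIELDref")
    return None if fieldref is None else fieldref.get("ref")


print(id_source_from_link(LINK_VARIANT))
print(id_source_from_fieldref(FIELDREF_VARIANT))
```

In the sibling case the association between PARAM and FIELDref is only implied by their sharing a GROUP, which is the lack of explicit semantics that (a) refers to.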
> I'd like the FIELDref if we had a proper data model that had a
> reference-valued field "source of the id parameter".  Then, a <FIELDref
> utype="dl:id-param-source" ref="datalinkID"/>  would do the trick.  Alas,
> enthusiasm for proper data modelling in Datalink is lacking, and I've
> given in to trying for that only in Datalink Deluxe, too -- the case in
> Datalink is less compelling anyway than in some other standards I could
> *cough* SCS *cough* mention.
>
>
> (C) Allow Links to Metadata Pages?
> ----------------------------------
>
> You may not have noticed it yet, but in (B) I've sneaked in that the
> discovery protocols have descriptors not for datalink services (as in
> the WD) but for data access services (i.e., containing parameter
> definitions that let clients directly retrieve cutouts and such rather
> than have to roundtrip to the datalink metadata services in between).
> The rationale is that in multi-service queries it gets fairly tedious
> to have to query the servers again to retrieve the datalink metadata.
>
> The price to pay here is a fairly close coupling between the discovery
> service and the datalink service.  I am fairly convinced from my
> implementation practice that you'll have this coupling anyway, as the
> discovery service at least needs to know what identifiers the datalink
> service knows about; conversely, when determining ranges of
> parameters, it's very convenient to be able to use the discovery result
> to compute those (e.g., the range of LAMBDA_X is immediately obtainable
> from an SSA result set), so stuffing them into the discovery response
> immediately (potentially) saves you more database queries.
>
> After this consideration, I'm proposing to link to the access services
> (meaning: with the full argument set) rather than datalink services
> (i.e., the metadata producing ones).  I grant you the coupling thing is
> serious though, so I'm open to argument whether these should be datalink
> services or whether both should be allowed (the worst solution IMHO).
>
>
> (D) Links to Services
> ---------------------
>
> As services are now described in a RESOURCE rather than by IVORN, the
> serviceType column in the link list has to change.
>
> I would propose, in the table:
>
> serviceDef  reference to the description of a service at accessURL     no
>
> And then
>
>    4.4 serviceDef
>
>    If serviceDef is non-NULL, accessURL points to a service that, in
>    general, requires additional parameters to yield a useful result.
>    Note that serviceDef can and should be NULL even if accessURL points
>    to some resource generated on the fly if accessURL is intended to be
>    used with no additional parameters (e.g. direct download links or
>    links to dynamic content where all parameters are already included in
>    the link, such as on-the-fly preview generation).
>
>    Typically, serviceDef contains a relative URI to a data access service
>    descriptor (see \ref{sect:dasr}), i.e., the RESOURCE's id prepended
>    with a hash sign.
>
>    [I'd like to strike the following paragraph; even for standard services,
>    I'd rather just use a descriptor with its standardId PARAM set]
>    The serviceDef column can also contain IVO standardIDs for
>    standard IVOA capabilities to indicate that the accessURL points to a
>    service complying with that standard (e.g., SSAP), and a standard client
>    can be used to operate the service.  The details of how to operate
>    such a service with a datalink identifier are defined in the service
>    standard [Should we specify it for SIAP and SSAP here?].
>
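The serviceDef rule in the proposed section 4.4 could be handled by a client roughly as in the following Python sketch. It assumes a hypothetical, namespace-free datalink document and an invented access URL, and it only handles the relative-URI form of serviceDef; treating anything else as an error is my simplification, not part of the proposal.

```python
import xml.etree.ElementTree as ET

# Hypothetical datalink document with one service descriptor.
DATALINK_DOC = ET.fromstring("""
<VOTABLE>
  <RESOURCE type="results"/>
  <RESOURCE type="service" ID="cutout">
    <PARAM name="accessURL" datatype="char" arraysize="*"
      value="http://example.org/dl/get"/>
  </RESOURCE>
</VOTABLE>
""")


def resolve_service_def(root, service_def):
    """Return the descriptor RESOURCE for a serviceDef value.

    None means serviceDef is NULL, i.e. accessURL can be used as it is.
    """
    if not service_def:
        return None
    if not service_def.startswith("#"):
        # Assumption of this sketch: only relative URIs are supported.
        raise ValueError("unsupported serviceDef: %r" % service_def)
    wanted = service_def[1:]
    for resource in root.iter("RESOURCE"):
        if resource.get("type") == "service" and resource.get("ID") == wanted:
            return resource
    raise LookupError("no service descriptor with ID %r" % wanted)


descriptor = resolve_service_def(DATALINK_DOC, "#cutout")
print(descriptor.find("PARAM[@name='accessURL']").get("value"))
```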
>
> (E) Data Access Service Descriptors
> -----------------------------------
>
> This would be a new section, probably a subsection of the current
> section 5.  Here's some prose I'd suggest:
>
>    5.3 Data Access Service Descriptors
>
>    Data access services are described in VOTable RESOURCE elements with a
>    type="service" attribute.  In datalink responses, they must also have an
>    ID, as they are referenced from the links table.  In data discovery
>    responses, their ID parameter must reference the FIELD in the
>    type="result" RESOURCE describing the column containing the dataset
>    identifiers (see sect.~\ref{sect:dl-in-discovery}).
>
>    The resource content consists of PARAM elements containing general
>    service metadata, a GROUP describing the input parameters of the
>    service, and optionally further metadata.
>
>    The following PARAMs, identified by their names, are defined by this
>    standard:
>
>    name            description                    required
>    accessURL       URL to invoke the capability   yes
>    standardID      URI for the capability         no
>    IVORN           IVOA registry identifier       no
>
>    The access URL may contain GET-type arguments; clients must parse the
>    URL to decide how to add arguments.  Absent a standardID, data access
>    services use the GET HTTP method.
>
>    The standardID, if present, contains an IVORN referring to a service
>    standard.  This allows supporting more complex service contracts if
>    necessary. It also can be used to refer to S*AP services that allow
>    retrieval of the data sets.
>
>    IVORN, if present, is the identifier of the service in the registry.
>    [Is there a scenario where that could become useful?  Much as I am a
>    Registry aficionado, I can't really see what to do with this IVORN
>    here, and I still don't really believe in registering datalink
>    services.]
>
>    The GROUP describing the input parameters is identified by having the
>    name inputParams.  For each input parameter, there is a VOTable PARAM
>    element, the name of which gives the name of the HTTP parameters.
>    Implementors SHOULD make sure to give as much metadata as possible here,
>    in particular as regards UCDs, units, descriptions useful also to users
>    not familiar with the underlying data, as well as ranges of valid values
>    or enumerations of the values accepted by each parameter.  A non-empty
>    value attribute on a VOTable PARAM should be used as a default for the
>    parameter by a client, in particular in user interfaces to the
>    service.
>
>    Parameters that have roles in known data models SHOULD be marked up as
>    recommended by either [VOTable], the data model documents, or any
>    forthcoming IVOA recommendation.  This is particularly important for
>    parameters that are part of Space-Time-Coordinates.
>
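The requirement that clients parse the access URL before adding arguments can be sketched with the Python standard library as follows. The function name and the example URL are mine and purely illustrative; the sketch covers only the plain GET case that applies when no standardID is given.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_request_url(access_url, arguments):
    """Append arguments to access_url, keeping any query it already has."""
    parts = urlsplit(access_url)
    # Arguments already embedded in the accessURL come first, unchanged.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.extend(arguments.items())
    return urlunsplit(parts._replace(query=urlencode(query)))


url = build_request_url(
    "http://example.org/dl/get?mode=cube",
    {"ID": "ivo://example.org/~?spec/a.fits", "LAMBDA_MIN": "4e-7"},
)
print(url)
```

Naively appending "?ID=..." would produce a second question mark for an accessURL like the one above, which is why the URL is split and reassembled.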
>
> (F) Guidelines for Data Access Services
> ---------------------------------------
>
> Here's where the meat is, finally.  This would become another toplevel
> section.  My hope is that this is enough to fulfill our cube access use
> case together with all other "simple" cut-out and recalibration use
> cases.
>
>    6 Guidelines for Data Access Services and their Use
>
>    Data Access Services will in general be called by a discovery client
>    with minimal human intervention.  It is therefore important to enable
>    clients to infer the semantics of the parameters offered and to enable
>    robust operation.
>
>    On errors, data access services MUST NOT return 200 status codes, as
>    clients have in general no way to tell error messages from useful
>    content in a data access context.  Instead, they SHOULD use one of 400
>    (bad syntax), 404 (not found), 422 (semantics problems), or 500 (server
>    problems).  Clients SHOULD be prepared to handle other HTTP status codes
>    like redirects (301 or 303) and authentication requests (401).
>    Error messages MUST be text/plain and should be rendered without
>    reflowing.
>
>    Data access services SHOULD raise an error when they receive a parameter
>    unknown to them, and clients SHOULD never pass a parameter to a service
>    unless it has declared support for it in its data access service
>    resource.  The rationale for this recommendation is that ignored
>    parameters could mean downloads of gigabytes rather than kilobytes in a
>    data access context; this should not happen due to a client error.
>
>    Data access services SHOULD return the unmodified dataset when
>    passed the ID only.
>
>    If a specific set of arguments yields the empty data set, Data Access
>    Services SHOULD return an empty table, image, or the like, rather than
>    an error message.  Datalink clients SHOULD NOT render such data sets or
>    give modal feedback for them but give their users some way to diagnose
>    that empty datasets have been returned in a nonmodal way (e.g., in a
>    separate log window).
>
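On the client side, the guidelines above amount to two small checks, sketched here in Python. The function names are hypothetical, and treating only 200 as a data response is an assumption of this sketch; the status code meanings are the ones listed in the proposed text.

```python
# Meanings as given in the proposed section 6.
ERROR_MEANINGS = {
    400: "bad syntax",
    404: "not found",
    422: "semantics problems",
    500: "server problems",
}


def undeclared_arguments(declared_names, arguments):
    """Return argument names the service has not declared in inputParams."""
    return sorted(set(arguments) - set(declared_names))


def classify_response(status):
    """Map an HTTP status code to what a client should do with the response."""
    if status == 200:
        return "data"
    if status in (301, 303):
        return "redirect"
    if status == 401:
        return "authentication required"
    if status in ERROR_MEANINGS:
        # The body is text/plain and should be shown without reflowing.
        return "error: " + ERROR_MEANINGS[status]
    return "unexpected status"


# A misspelt parameter name is caught before any request is sent.
print(undeclared_arguments(["ID", "LAMBDA_MIN", "LAMBDA_MAX"],
                           {"ID": "x", "LAMDA_MAX": "5e-7"}))
print(classify_response(422))
```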
>
>    6.1 Common Parameters
>
>    Many data access services share the axes along which cutouts are to be
>    performed or ways to manipulate the data.  It is highly desirable that
>    clients can drive them in the same way to achieve multi-service
>    operation.  They SHOULD use the following parameter characteristics if
>    appropriate, and they MUST NOT use them if the corresponding physics is
>    different.
>
>    In general, data access services will accept a dataset identifier under
>    the ID parameter used by the datalink service.  Clients MUST pass in
>    the dataset identifier(s) given in the links table if the service
>    supports ID.  Implementors may choose to encode the dataset identifier
>    in the service's accessURL in some other way, but they are mildly
>    discouraged from doing so.
>
>    The parameter characteristics consist of parameter name, ucd, and unit.
>    Clients SHOULD only assume the semantics given here if all three
>    characteristics match; while distinguished in the table, neither
>    clients nor servers should distinguish between empty and missing
>    unit attributes on the PARAM elements.
>
>    parameter name  UCD                                unit     remarks
>    ID              meta.id;meta.main                  (none)
>    DEC_MIN         param.min;pos.eq.dec               deg      (1)
>    DEC_MAX         param.max;pos.eq.dec               deg      (1)
>    RA_MIN          param.min;pos.eq.ra                deg      (1,2)
>    RA_MAX          param.max;pos.eq.ra                deg      (1,2)
>    LAT_MIN         param.min;(see remark)             deg      (1)
>    LAT_MAX         param.max;(see remark)             deg      (1)
>    LON_MIN         param.min;(see remark)             deg      (2,3)
>    LON_MAX         param.max;(see remark)             deg      (2,3)
>    LAMBDA_MIN      param.min;em.wl                    m        (4)
>    LAMBDA_MAX      param.max;em.wl                    m        (4)
>    FORMAT          meta.code.mime                     (none)   (5)
>    SPECRP          spect.resolution                   (empty)  (6)
>    FLUXCALIB       phot.calib                         (none)   (7)
>    PIX(n)_MIN      param.min;pos.pixel.ax(n)          pixel    (8)
>    PIX(n)_MAX      param.max;pos.pixel.ax(n)          pixel    (8)
>    KIND            meta.code                          (none)   (9)
>
>    Remarks
>    (1) ICRS coordinates; use LAT, LON for other systems
>    (2) Stitching line is assumed at 360 degrees; both _MIN and _MAX are
>        always positive, _MAX may assume values > 360
>    (3) LAT and LON can refer to all kinds of spherical coordinate systems,
>        which will determine the UCD; do declare VOTable STC metadata for these
>        parameters.  UCD fragments legal here include pos.eq.ra,
>        pos.galactic.lon, pos.supergalactic.lon, pos.ecliptic.lon,
>        pos.spher.lon for LON, and their lat counterparts for LAT.
>    (4) Even if a specific community uses frequencies or energy, spectral
>        cutouts should be possible by wavelength. Additional specification
>        according to community practices is, of course, allowed, but the
>        preferred solution is conversion on the presentation layer (i.e.,
>        community-targeted UIs presenting input options according to
>        community practices).
>    (5) This is for format conversion; the permitted values must be
>        enumerated in the PARAM's VALUES child
>    (6) This is for on-the-fly rebinning along spectral coordinates
>    (7) This is for on-the-fly recalibration; the calibrations understood
>        by the service must be enumerated in the PARAM's VALUES child.
>    (8) Here, n is an axis index, 1-based in accordance with FITS usage.
>        These are intended for pixel-wise cutout, in particular after the
>        client has obtained dataset metadata.
>    (9) KIND admits an open enumeration of "things" to ask for.  Values
>        supported by a service must be enumerated in the PARAM's VALUES
>        child.  Predefined values include HEADER (metadata; for FITS images,
>        the primary header), HEADERn (the n-th header, 1-based, for
>        compound data), DATAn (the n-th data-metadata combination -- e.g.,
>        HDU in FITS files -- in compound data).  Please use the predefined
>        values if appropriate for your data; do not use them if these concepts
>        do not match your data.
>
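Remark (2) on the stitching line can be illustrated with a short Python sketch. This is my reading of the remark, not normative text: a window straddling RA = 0 is expressed with RA_MIN just below 360 and RA_MAX above 360. The function names are hypothetical, and the window is assumed to be narrower than 360 degrees.

```python
def ra_window(center, half_width):
    """Return (RA_MIN, RA_MAX) in degrees for a window around center.

    RA_MIN is folded into [0, 360); RA_MAX may exceed 360 when the window
    crosses the stitching line.
    """
    ra_min = (center - half_width) % 360.0
    return ra_min, ra_min + 2.0 * half_width


def in_ra_window(ra, ra_min, ra_max):
    """Tell whether ra (degrees) lies in a window built by ra_window."""
    ra = ra % 360.0
    return ra_min <= ra <= ra_max or ra_min <= ra + 360.0 <= ra_max


ra_min, ra_max = ra_window(1.0, 2.0)
print(ra_min, ra_max)
print(in_ra_window(0.5, ra_min, ra_max))
```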
>    [Instead of _MAX and _MIN, there's something to be said for _VAL and
>    _SIZE, in particular less hassle with spherical coordinates.  Choose one
>    and stick with it, I'd say; PLEASE help contribute the parameters you
>    want here.]
>
>    Additional parameter characteristics can be added to this table in minor
>    revisions of this document.
>
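The matching rule for parameter characteristics (name, UCD and unit must all agree, with empty and missing units treated alike) could be implemented along these lines. This Python sketch uses only a subset of the table in 6.1, and the dictionary layout and function name are my own choices.

```python
# Subset of the common parameters: name -> (UCD, unit); "" means no unit.
COMMON_PARAMETERS = {
    "ID": ("meta.id;meta.main", ""),
    "DEC_MIN": ("param.min;pos.eq.dec", "deg"),
    "DEC_MAX": ("param.max;pos.eq.dec", "deg"),
    "RA_MIN": ("param.min;pos.eq.ra", "deg"),
    "RA_MAX": ("param.max;pos.eq.ra", "deg"),
    "LAMBDA_MIN": ("param.min;em.wl", "m"),
    "FORMAT": ("meta.code.mime", ""),
}


def has_common_semantics(param_attributes):
    """True if a PARAM's name, ucd and unit all match a table entry."""
    expected = COMMON_PARAMETERS.get(param_attributes.get("name"))
    if expected is None:
        return False
    ucd, unit = expected
    # A missing unit attribute and an empty one are deliberately equivalent.
    declared_unit = param_attributes.get("unit") or ""
    return param_attributes.get("ucd") == ucd and declared_unit == unit


print(has_common_semantics(
    {"name": "LAMBDA_MIN", "ucd": "param.min;em.wl", "unit": "m"}))
print(has_common_semantics(
    {"name": "LAMBDA_MIN", "ucd": "param.min;em.wl", "unit": "Angstrom"}))
```

A client that finds a mismatch would fall back to presenting the parameter generically from its own metadata rather than assuming the common semantics.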
>
> (G) Example documents
> ---------------------
>
> Everyone loves examples.  Validators are even better, but they're more
> work, too.  So, I propose two non-normative appendices with one example
> document each for the datalink and data-discovery case.  I'm giving
> links to live documents here.  I suspect it'd be a good idea to
> hand-edit to cut some crap and maybe add some contrived and
> "interesting" features once we know what these could be.
>
> This stuff hasn't been validated, and no client has been implemented
> against it yet. If there's disagreement between the document content and
> the proposed text, the document content probably is wrong.
>
> Maybe these examples should not be part of the document text but rather
> stored somewhere else and just linked?  They'll badly blow up the
> document otherwise, and the forests of the world will hate us.
>
>    Appendix A: Datalink Example Document (non-normative)
>
> could contain, e.g., the output of:
>
> curl -FID="ivo://org.gavo.dc/~?califa/data/V500/reduced_v1.3c/NGC6310.V500.rscube.fits" "http://dc.zah.uni-heidelberg.de/califa/q/dl/dlmeta" | xmlstarlet fo
>
> I'm happy to clean that up as required.
>
>     Appendix B: Data Discovery Example Document (non-normative)
>
> could contain, e.g., the output of
>
> curl -FTARGETNAME="15 Mon" -FREQUEST=queryData http://dc.zah.uni-heidelberg.de/feros/q/ssa/ssap.xml | less
>
> -- here, all the SSAP verbosity should pretty certainly be cut away.  The
> output right now contains one datalink service descriptor, pointing to
> the metadata service as envisioned by the current WD, and one data
> access service descriptor, pointing to the actual service as  proposed
> here.  As mentioned above, I'm very sure one of them should disappear;
> I'm keeping both at the moment in case people want to try both
> alternatives.
>
>
> Finally, here's a list of new UCDs we should try and get into the
> "big list": param.min, param.max, pos.spher.(lon|lat), pos.pixel.ax1..ax7
>
> (I have my doubts about the ax1..ax7, too, but it would be symmetric
> with what's there for other pos.* UCDs).
>
> Cheers,
>
>               Markus