SSA working draft

Doug Tody dtody at nrao.edu
Tue Nov 21 19:42:42 PST 2006


Hi Alberto -

> About 1.1 Architecture
> ----------------------
>
> I share Inga's position that SSA should not force
> - and I'm sure that is not the intent - a data provider to only
> answer an SSA query with an on-the-fly generated, so-called "virtual"
> dataset.  It is completely up to the data provider to come up with
> the best scheme to comply with the VO, and that could very well be
> a "real" static and fully compliant dataset.
> Couldn't that second sentence in 1.1 be removed from the document
> without causing any damage?

It is true that for "archival" data (no subsetting or filtering),
the data returned does not necessarily have to be generated on the
fly; it could be precomputed and cached, and we can revise the text
to be more precise in describing the role of virtual data in the
interface.

Nonetheless this static file business is a rather limited approach
to things, and many aspects of SSA require on-demand generation of
the data products.  In addition to cases like cutouts or spectral
extraction, a good service should be able to return data in any of
several formats, and should be designed to be easily updated when a new
version of the interface is released.  In general when a new version of
SSA is deployed, one will want to keep at least one old version around
for a while, and this could require duplication of both the service
and the entire data collection if a static file approach is taken.
As you say, it is up to the data provider how to manage all this,
but all these cases would be much easier to manage if these rather
small datasets were generated on the fly.
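
To make this concrete, consider a hypothetical query (the endpoint
and parameter values here are invented; the parameter names are those
used in the draft):

    http://example.org/ssa?REQUEST=queryData&POS=181.3,-0.76&SIZE=0.1
        &BAND=3e-7/1e-6&FORMAT=fits

Whether the access reference returned for a matching dataset points
to a static file, a cached product, or a cutout computed on demand is
invisible to the client; the interface is identical either way.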


> 1.3.2 Parameters
> ----------------
>
> "If the same parameter appears multiple times in a request
> the operation is undefined."
>
> This basically excludes the ability to provide a logical "OR",
> which is normally implemented using the "multiple choices" mechanism
> (as in the "multiple select" or "check buttons" web form elements).
>
> Isn't that a pity? How would SSAP support a logical OR in a query?
> Do we have to wait until ADQL is in place to see that implemented?

To some extent the "or" mechanism can be provided by a list-structured
parameter, which defines a set of acceptable values.  The basic "and"
mechanism is already provided by having a set of parameter constraints.
This satisfies most simple queries.  If it gets much more complicated,
an expression-based interface (ADQL) is the way to go.

That said, we can do anything we want with multiple instances of
a parameter so long as it is well defined what to do in this case.
What generally happens currently (where this is undefined) is that the
service either returns an error, or silently overrides the earlier
parameter value with the later one, in effect providing a mechanism
to override a default value.  Both of these actions are as valid as
defining multiple values to imply an "or".  Another possibility would
be to have the semantics defined on a per-parameter basis.

I don't have any strong opinion on this one, so long as the semantics
are logical and well defined.  A multiple-instance mechanism which
translates into an equivalent list-structured value would be possible,
for example.  (This is another semantic detail which we carried over
from OpenGIS/WMS.)
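
For example (using the draft's range-list notation; the values are
invented), a single list-structured parameter already provides an
"or" over bandpasses:

    BAND=1e-7/3e-7,5e-7/8e-7         ("or" of two bandpass ranges)

whereas repeating the parameter is what is currently left undefined:

    BAND=1e-7/3e-7&BAND=5e-7/8e-7    (undefined: error, override, or "or"?)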


> CreationType
> ------------
>
> The end user wants to know what kind of processing was applied
> to the data; hence the user should be told if the data were binned
> or mosaic'd, etc.
> What is not clear to me is why the SSA service should describe only
> the part of the processing it is responsible for, as the word "Typically"
> indicates in the second sentence of 2.4.2, and as indicated in the very last
> sentence in 2.4.2 (which actually contradicts that initial sentence
> by forcing the creationtype to express ONLY operations happening
> during the VO access).
>
> Wouldn't it be better to describe the entire end-to-end process that
> brought the data to the state they are in when they reach the user's disk?
> Otherwise, what is the value of such information?
>
> Unless the intention is to notify the VO user that the same data
> *in different form* exist somewhere else, in case s/he is not happy
> with it. If that is the case, then I would suggest a simpler "original"
> as opposed to "reprocessed" keyword, and forget all the quite artificial
> distinctions.

This is one of the more difficult points of SSA (as is the next
one below), and I am not yet certain either what the best approach is.

One point here is that often the user does know something about
the original data product, and may want to know what the service
has done to produce the data product which is actually delivered.
A use case I had in mind here was access to complex data, e.g., a
spectral data cube.  It is useful to know if a spectrum was produced
by on-demand extraction from a spectral data cube, as opposed to,
for example, returning an entire dataset from some well-known spectral
data collection (the "archival" case).  In this case we have one well
defined "original" data product (the survey cube) and we can view
it in multiple ways: via 2-D or 3-D cutouts, via reprojection or a
general slice specified in 2-D, via filtering by spectral bandpass,
via extraction of a 1-D spectrum, and so forth.  A good scheme which
describes the creation of data from a source data product can deal
with all these cases (this is more general than just SSA, but that is
the point, as SIA V2 is next up).

Another important case is where we have a well defined data collection
which has already been carefully processed - the usual survey or
instrument data collection for example - and the service generates
a virtual data product from this by either cutting out a subset, or
for example, reprojecting the data onto a standard coordinate system
(changing the spectral dispersion in this case).  It is quite
important to know which was done: do we have the original
pixels/samples painstakingly generated by the well-known survey data
collection, or is the service filtering or interpolating the data
samples, and thus degrading them, to better represent what we asked
for?  (SIA V1 already addresses this in a rudimentary fashion, by
the way.)

On the other hand, I agree that in the most general case where the
original data (as defined by the DataID metadata in SSA) is not well
known, or we are doing a large scale automated analysis where knowledge
of well-known data collections cannot easily be used, what one wants
to know is something about the overall processing done to get to
the data actually returned by the service.  Of course, this can get
quite complex to describe, and if it gets too complex, it won't happen
and we fail.  We can hope to describe what the service does, but we
are not yet able to describe all the prior processing as well.

I don't have a perfect solution to this problem yet either.  The scheme
proposed is more or less adequate to describe data access operations
upon well defined data collections, and hence may be a good starting
point; however, I agree that we have not yet fully addressed this
problem.
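
To illustrate the sort of scheme proposed (the value names here
merely reflect the draft discussion and are not final), a service
might characterize its own role as one of:

    CreationType = archival            (whole dataset returned as-is)
    CreationType = cutout              (subset of the samples, no resampling)
    CreationType = filtered            (samples removed or flagged)
    CreationType = spectralExtraction  (1-D spectrum extracted from a cube)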


> 3.3.1 Input Parameters
> ----------------------
>
> I think the following two sentences contradict each other, or are
> at least confusing to the reader (me!).
>
> Early in the text:
>
> A. "if a given parameter is not specified or is not supported by the service,
> a logical value of "all" is generally assumed."
>
> At the bottom of 3.3.1:
>
> B. "where a specific value is specified for an attribute which is undefined
>   for a given data collection, the service should respond by finding
>   no matching data."
>
> Apart from the contradiction, I like B, and do not like A.
> Returning too many results is much worse than telling the user:
> please refine your query because our service does not support
> the input parameter you used.
> Also, "A." covers two very different cases:
> A1. a parameter is not specified
> A2. a parameter is not supported
>
> In the A1 case, I would agree that "all" is generally assumed.
> In the A2 case, the service would do better to return a warning to
> the user.

I agree this is a pretty subtle point, but I don't think this is a
contradiction.  The key point is that in case B,

     1) the parameter is explicitly specified,
     2) the parameter is supported by the service,
     3) the value is *known to be undefined* for the data.

Hence for theory data (for example), where time of observation is
undefined, a query specifying an explicit time of observation should
find nothing - the time value is "known" for this data (it is known
to be undefined) and does not match the query (except in the case
where the theory data simulate a given actual observation time or
epoch).

This is different from the case where query by time is merely not
supported by the service: in this case the service does not know the
time, or does not support query by time, and hence simply cannot apply
the query constraint.  It therefore matches data ignoring the
constraint, leaving it to further processing on the client side to
resolve the matter, possibly by rejecting results or by refining and
resubmitting the query.

The problem with a service aborting when given a query constraint it
does not support is this: an essential design requirement for DAL
queries, needed to support global multiband data discovery and access,
is that we can pose the *same* query to multiple services and expect
it to work.  Further query refinement, if required, can occur on the
client side, where much greater knowledge of the problem to be solved
is available, and where further examination of the returned metadata
is possible.  The alternative would require that the service metadata
for every service be examined, and that the capabilities of each
service be understood well enough by the client to tailor the query
for each service, which is unworkable.
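
To spell out the distinction, here is a minimal Python sketch of the
intended semantics for a single query constraint (illustrative only,
not part of the specification):

    def constraint_matches(query_value, data_value, param_supported):
        # query_value:     value given in the query, or None if omitted
        # data_value:      dataset attribute value, or None if the
        #                  attribute is known to be undefined
        # param_supported: True if the service supports this constraint
        if query_value is None:      # A1: parameter not specified
            return True              # logical "all" is assumed
        if not param_supported:      # A2: parameter not supported
            return True              # constraint ignored; the client
                                     # refines the results itself
        if data_value is None:       # B: attribute known to be undefined
            return False             # an explicit value cannot match it
        return query_value == data_value  # normal comparison (range
                                          # overlap for range values)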


> SPECRES
> ----------
>
> I think SPEC_RP is the new suggested keyword for a lambda/d(lambda),
> which is called resolving power, and not resolution.
>
> Maybe a way out is to let the data provider choose whether
> a FWHM (SPECRES) or an L/dL (SPEC_RP) suits her data better?

Resolving power is the more correct term here, although I think
spectral resolution in RP units is also commonly used.  Anyway, you
are right; we should probably call it the spectral resolving power.
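
(For reference: resolving power is the dimensionless ratio
R = lambda/d(lambda), so for example a FWHM of 1 Angstrom at a
wavelength of 5000 Angstroms corresponds to R = 5000.)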

As in other cases we always want to simplify the interface rather
than add more features, so having two ways to specify essentially
the same thing is probably not justified.


> Units
> ---------
>
> In various tables (e.g. 3.3.3) the unit DDEG is mentioned,
> to mean decimal degrees. I do not think that is an agreed standard.
> To avoid troubles, I suggest you change that into "deg".

OK; I guess we should not confuse units and format.


> Inconsistency about fully compliant services
> ---------------------------------------------
> The initial sentence of 3.3.3 states that "all" the "should or may"
> parameters are required for a fully compliant service.  I think that
> is wrong and does not match 1.4.1.
> Only the "should" parameters are required for a fully compliant
> service, aren't they?

Yes; this is stated incorrectly ("fully compliant" probably included
the optional parameters in an early version, but that is now thought
to be too stringent).


> Ranges
> ----------
>
> 1. Why not allow ranges in all (at least numerical) fields?

While in theory this might be nice and consistent, it may not make
sense for all parameters, and supporting a range list complicates
the interface (e.g., the range list in BAND is currently an optional
capability).

Supporting a single value where a range is permitted can also be
useful in its own right, e.g., to specify a point within a bandpass,
rather than an explicit bandpass ("give me anything which contains
this value" vs "give me anything where this range intersects the
actual data range").


> 2. Apparently there is no mandatory order when specifying a range;
> in many examples throughout the entire document one can find
> both:
>
> 1E-6/3E-7   (that is, max/min)
>
> or
>
> 1.3/3.0   (that is, min/max)
>
>
> But when an open-ended range comes along (e.g. /5) that implies
> a very specific order:  <= 5 (ie min/max).
>
> Minor point, but if one uses the max/min order all the time,
> s/he could get too used to that, and use /5 to indicate >= 5.

Good point.  This issue needs to be clarified, and has come up before;
we should have gotten it into this version of the document.
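
Assuming we adopt the min/max order implied by the open-ended form,
the examples would read:

    BAND=3e-7/1e-6    (closed range, min/max)
    BAND=/5e-7        (open-ended: anything <= 5e-7)
    BAND=3e-7/        (open-ended: anything >= 3e-7)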


> 3.3.3.15 COMPRESS
> -----------------
>
> Unclear: Is that paragraph saying that
> even if a client asks for compression, the server could return an
> uncompressed file?

Yes.  The client just says it is prepared to accept compressed data,
and please use compression if it is worthwhile; whether a given dataset
is compressed is up to the service.
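
For example, a client might add COMPRESS to its query (illustrative
syntax only):

    ...&FORMAT=fits&COMPRESS=true

and the service is then free to return either a gzip-compressed FITS
file or a plain one, whichever it judges worthwhile.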

 	- Doug


