[ObsCoreRFC] Minutes of the telco Monday June 6
Douglas Tody
dtody at nrao.edu
Mon Jul 4 13:53:12 PDT 2011
On Thu, 9 Jun 2011, Arnold Rots wrote:
> 3. dataproduct_type, dataproduct_subtype, access_format
> I still think the scheme that is proposed is incomplete since it is
> ill-suited (as currently defined) to accommodate datasets (i.e.,
> collections of files).
> I would like to suggest that it would be good to add a
> dataproduct_type "package" (or some such thing) that indicates that
> the client will be receiving not just a single file. However, the
> client will still want to know what is in the package, so maybe the
> subtype should contain a list of the science file data types?
> In access format we are running into a somewhat similar problem:
> it's nice (and necessary) to know that a tar file is coming, but it is
> equally important to know what kinds of formats are hidden inside that
> tar file: if it is, say, Cobol code, I am not interested. Should it be
> a comma separated list? Or something like "tar(fits,pdf,txt)"?
Complex datasets are handled by the scheme. It is true that we don't
really have a way to define what is inside a tar, zip, FITS MEF,
directory, etc.; that would be quite complex to attempt. However,
support for this use case is provided in two ways.
First, the subtype may be used to define what the data object is in
collection- or archive-specific terms. For example, if the data object is
a tar file containing all the files comprising a ROSAT observation, the
data provider can define a subtype for this type of data. It is up to
the client to understand what the content of the proprietary data
product is, but if they are able to deal with such instrument-specific
data they probably already know what it is.
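For illustration, here is a minimal sketch of how a client might query
for such a package, using pyvo's TAPService against the ivoa.ObsCore
table. The TAP endpoint and the subtype value 'rosat-obs-package' are
hypothetical; the subtype is whatever term the data provider chose to
publish.

    # Minimal sketch; the endpoint URL and subtype value are hypothetical.
    from pyvo.dal import TAPService

    tap = TAPService("http://archive.example.org/tap")

    # Find tar packages containing complete ROSAT observations. The
    # dataproduct_subtype value is a provider-defined term, not a
    # standard one.
    query = """
    SELECT obs_id, dataproduct_type, dataproduct_subtype,
           access_format, access_url
    FROM ivoa.ObsCore
    WHERE obs_collection = 'ROSAT'
      AND dataproduct_subtype = 'rosat-obs-package'
      AND access_format = 'application/x-tar'
    """
    for row in tap.search(query):
        print(row["obs_id"], row["access_url"])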
Second, it is possible to expose the individual files comprising the
complex dataset. Then all the metadata can be specified separately
for each data product, allowing a full description. All data products
would share the same obs_id, hence they are still associated as a
complex dataset.
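A sketch of a query following this second approach (again with a
hypothetical endpoint and a hypothetical obs_id value): every file of
the complex dataset appears as its own ObsCore record, and selecting on
the shared obs_id retrieves the whole association.

    # Minimal sketch; the endpoint URL and obs_id value are hypothetical.
    from pyvo.dal import TAPService

    tap = TAPService("http://archive.example.org/tap")

    query = """
    SELECT obs_id, dataproduct_type, dataproduct_subtype,
           access_format, access_url
    FROM ivoa.ObsCore
    WHERE obs_id = 'rp123456'
    ORDER BY dataproduct_type
    """
    for row in tap.search(query):
        # Each row is one member of the dataset, fully described on its
        # own (e.g. an image, spectrum, or event list), so the client can
        # fetch only the products it cares about.
        print(row["dataproduct_type"], row["access_format"],
              row["access_url"])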
Which approach is better probably depends upon how one expects the data
to be used. If the client will almost always want to get all the data
elements at once (e.g., for custom reprocessing or analysis of
instrument-specific data), then the first approach is probably
preferable. If they are more likely to want only a higher-level derived
data product such as an image or spectrum, the second approach might be
preferred. Combinations of the two approaches are also possible, since
obs_id can link multiple associated data products of any type.
On Thu, 9 Jun 2011, Arnold Rots wrote:
> Are you saying that it is unwise to include optional columns in a
> query, because it may cause them to error out?
> Then why do we bother with optional items?
> It seems to me that their use is discouraged. By not specifying how
> servers should handle them we render them useless, don't we?
Not at all. The optional columns are ignored by a generic query without
error, but are still useful to more fully describe the data to the client
or user. Also, a subsequent query directed to the specific service that
provides this extra metadata can reference the custom columns and still
be a well-formed query. In this way the general mechanism can be used to
pose more precise, archive-specific queries, while the ability to pose
generic queries to a number of services is not compromised.
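As an illustration, a minimal sketch of the two kinds of queries (the
service URLs and the optional column name 'proposal_pi' are
hypothetical; only the mandatory ObsCore columns are guaranteed to exist
everywhere):

    # Minimal sketch; the service URLs and the optional column name are
    # hypothetical.
    from pyvo.dal import TAPService

    # A generic query references only mandatory ObsCore columns, so it
    # works unchanged against any ObsCore service.
    generic_query = """
    SELECT obs_id, dataproduct_type, access_url
    FROM ivoa.ObsCore
    WHERE dataproduct_type = 'image'
    """
    for url in ("http://archive-a.example.org/tap",
                "http://archive-b.example.org/tap"):
        print(url, len(TAPService(url).search(generic_query)))

    # A follow-up query to the one service known to publish the extra
    # (optional) column can reference it and still be well-formed ADQL.
    specific_query = """
    SELECT obs_id, dataproduct_type, access_url, proposal_pi
    FROM ivoa.ObsCore
    WHERE dataproduct_type = 'image'
      AND proposal_pi = 'Smith'
    """
    print(len(TAPService("http://archive-a.example.org/tap")
              .search(specific_query)))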
- Doug