Spectrum data model
Doug Tody
dtody at nrao.edu
Wed Sep 13 22:41:27 PDT 2006
Hi All -
After reading Anita's careful review of Spectrum (thanks Anita!) and
Jonathan's thoughtful replies I think the issues below are the most
important, so some further elaboration follows.
- Doug
Required/optional vs must/should/may
The advantage of must/should/may is that it allows us to differentiate
between "minimal compliance" (all the "must"s) and "full compliance"
(all the "must"s plus all the "should"s). This is useful as we want
minimal compliance to be as low a bar as is reasonable, but we would
really prefer that most services implement at least the "should"s.
To reward service implementors for doing more, we could, for example,
flag fully compliant services in the registry. Hence I tend to agree
that the must/should/may distinction is useful.
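As a minimal sketch of how a registry might tell the two levels apart:
the field lists below are illustrative assumptions only, not the actual
must/should assignments of the Spectrum document, and the function name
is made up.

```python
from typing import Iterable

# Illustrative assignments only; the Spectrum spec decides the real levels.
MUST_FIELDS = frozenset({"Coverage.Location", "DataID.Title"})
SHOULD_FIELDS = frozenset({"Target.Name", "DataID.CreatorDID"})


def compliance_level(provided: Iterable[str]) -> str:
    """Classify a service by the set of metadata fields it fills in."""
    have = set(provided)
    if not MUST_FIELDS <= have:
        return "non-compliant"  # at least one "must" is missing
    if not SHOULD_FIELDS <= have:
        return "minimal"        # all "must"s, but not all "should"s
    return "full"               # all "must"s and all "should"s
```

A registry could then flag only those services for which this returns
"full", which is the kind of reward for implementors mentioned above.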
In general what is required or optional depends upon how a general
data model is used - it might be different in different circumstances.
For Spectrum the priorities are probably pretty clear, but for
something more general like Char it will really depend upon the
application (hence it is not clear how much this should be specified
at the level of the Char spec).
Coordinate systems other than just ra/dec
For the 2nd generation DAL interfaces it is probably too restrictive
to limit ourselves to only ICRS/J2000, as for SIA. For example, we
already have folks trying to use DAL for solar data. A reasonable
compromise is to default coordinates to ICRS as in SIA, but provide a
means to optionally specify a different coordinate system; whether or
not other coordinate systems are supported would be a service-specific
capability.
The above refers mainly to the query interface and standard
parameters. To describe the actual data we probably want to
permit the native coordinate systems of the data to be used.
This is already done in SIA 1.0, where the WCS information allows
the coordinate system to be specified rather than requiring that a
new WCS be computed to publish the data.
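The query-side compromise (default to ICRS, optionally accept another
frame if the service supports it) could be sketched roughly as follows.
The parameter names "pos" and "frame", and the frame label "SOLAR-HPC",
are hypothetical and not taken from any DAL specification.

```python
from dataclasses import dataclass

DEFAULT_FRAME = "ICRS"  # the default frame, as in SIA


class UnsupportedFrame(ValueError):
    """Raised when a query asks for a frame the service does not offer."""


@dataclass(frozen=True)
class PositionQuery:
    lon: float
    lat: float
    frame: str


def parse_position(params: dict,
                   supported: frozenset = frozenset({DEFAULT_FRAME})) -> PositionQuery:
    """Parse a position query, falling back to ICRS when no frame is given.

    `supported` stands in for the service-specific capability: which
    coordinate systems this particular service is able to handle.
    """
    frame = params.get("frame", DEFAULT_FRAME).upper()
    if frame not in supported:
        raise UnsupportedFrame(f"frame {frame!r} not supported by this service")
    lon, lat = (float(x) for x in params["pos"].split(","))
    return PositionQuery(lon, lat, frame)
```

A client that never mentions a frame keeps getting SIA-like behaviour,
while a solar data service could advertise and accept an extra frame.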
Should Coverage.Location (or whatever) be a MUST
I agree with Jonathan that fundamental metadata such as this is a
"must". Anita is correct that it may not be appropriate for all
data, e.g., theory data, but we should at least require it where it
is appropriate for the data. Rather than define what "appropriate"
means, it might be better to define values such as "not applicable"
or "undefined", and still require such a value to be specified even
for data where the value is not applicable. This would allow more
rigorous queries to be performed. The problem is that, for numeric
values, this may only be possible in a text-based serialization.
(I saw something like this elsewhere recently, possibly in VOEvent.)
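To illustrate the serialization problem, here is a small sketch under
the assumption that a binary serialization stores the value as a plain
float. The names Missing, to_text, from_text and to_binary_float are
invented for this example.

```python
import math
from enum import Enum
from typing import Union


class Missing(Enum):
    NOT_APPLICABLE = "not applicable"
    UNDEFINED = "undefined"


Value = Union[float, Missing]


def to_text(value: Value) -> str:
    """A text serialization can carry the marker itself."""
    return value.value if isinstance(value, Missing) else repr(value)


def from_text(text: str) -> Value:
    """Recover either the marker or the number from its text form."""
    for marker in Missing:
        if text == marker.value:
            return marker
    return float(text)


def to_binary_float(value: Value) -> float:
    """A bare float field has only NaN, so both markers collapse to it."""
    return math.nan if isinstance(value, Missing) else value
```

The text form round-trips "not applicable" and "undefined" as distinct
values, whereas the float form cannot tell them apart afterwards.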
Mediation to a standard data model vs pass-through of native data
This is an essential feature of SSA. There is no standard
astronomical format for spectra, and at the scale of the VO, where
a client application may access spectra from dozens of archives,
it becomes impractical for each client application to know how to
deal with spectral data from dozens of different projects (sure,
a few applications do this now for a few archives, but that is not
good enough, and such a scheme will break whenever anything changes).
What we want to make possible is for each SSA service to return data
conforming to the SSA data models (Spectrum in this case), so that
the mediation occurs once in the service rather than hundreds of
times in remote applications. A pass-through for "native" format
data is also important: in part for on-the-cheap services that can't
perform the data conversion, but more importantly to give clients
with intimate knowledge of a specific data collection direct access
to the native data, so that they can take advantage of
project-specific features of the data. Both approaches are important.
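The two paths could be sketched like this. Everything here is a
hypothetical illustration: the request value "native", the column names
WAVE and FLUX, and the output keys are made up, not part of SSA.

```python
from typing import Callable, Optional


def serve_dataset(native: dict,
                  requested: str,
                  mediator: Optional[Callable[[dict], dict]] = None) -> dict:
    """Return either the native record or its mediated Spectrum form."""
    if requested == "native":
        # Pass-through: hand the project-specific data back untouched.
        return {"model": "native", "data": native}
    if mediator is None:
        # An on-the-cheap service that cannot do the conversion.
        raise NotImplementedError("only native pass-through is offered")
    # Mediation happens once, here in the service, not in every client.
    return {"model": "Spectrum", "data": mediator(native)}


def example_mediator(native: dict) -> dict:
    """Map made-up project column names onto made-up model field names."""
    return {"spectral_axis": native["WAVE"], "flux_axis": native["FLUX"]}
```

A generic client would ask for the mediated form, while a client that
knows the collection well could request "native" and read the extra
project-specific content directly.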
Target.Name vs dataset IDs, collection, etc.
Target.Name is just the name of the observed object (if any), such
as one might pass to a name resolver. (Title is the more important
version of this, since it always applies and is broader.)
Collection is the data collection (ShortName), e.g., "SDSS-DR4".
DataID.CreatorDID is the dataset ID (URI) assigned
to the dataset (spectrum) by its creator, e.g., the survey project
or observatory which created the data collection. The CreatorDID
does not change if the data is replicated. Curation.PublisherDID is
the dataset ID assigned by the publisher, and will be different for
each publisher.
It is possible that the published dataset returned by the service may
differ significantly from the "parent" (Creator's) dataset, e.g., in
the case of virtual or derived data. This can be indicated with the
CreationType attribute. For example, if we extract a spectrum from a
data cube, CreatorDID identifies the cube, PublisherDID the extracted
spectrum, and CreationType is something like "extracted spectrum".
This is a primitive form of a provenance model. If a completely new
collection is formed by analysis then a new Creator resource is
required to describe it.
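A rough sketch of how these identifiers relate, assuming a simple
record type. The class and function names, the "archival" default and
the ivo://...example URIs used with it are invented for illustration.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class DatasetIdentity:
    title: str                  # always applies
    collection: str             # ShortName of the data collection
    creator_did: str            # assigned by the creator, never changes
    publisher_did: str          # assigned by each publisher
    creation_type: str = "archival"     # hypothetical default value
    target_name: Optional[str] = None   # observed object, if any


def replicate(ds: DatasetIdentity, new_publisher_did: str) -> DatasetIdentity:
    """Another publisher serves the same dataset: only PublisherDID changes."""
    return replace(ds, publisher_did=new_publisher_did)


def extract_spectrum(cube: DatasetIdentity, publisher_did: str,
                     title: str) -> DatasetIdentity:
    """Derived data: CreatorDID still points at the parent cube."""
    return replace(cube, publisher_did=publisher_did, title=title,
                   creation_type="extracted spectrum")
```

For example, replicating a cube with CreatorDID
"ivo://example.org/cubes#42" to a second archive yields a new
PublisherDID but the same CreatorDID, and a spectrum extracted from it
keeps that CreatorDID while its CreationType records how it was made.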