[ObsCoreRFC]Minutes of the telco Monday June 6

Douglas Tody dtody at NRAO.EDU
Wed Jul 6 08:54:58 PDT 2011


On Wed, 6 Jul 2011, Arnold Rots wrote:

> I think I am beginning to realize what it is that makes me so
> uncomfortable with ObsTAP and what makes it so hard to grasp the
> correct way to implement it: its ambivalence.
>
> It is primarily intended (I think) as a data discovery interface.
> The problem is that it also doubles as a data access tool.
> I think it is the intertwining of these two functions that makes it murky.
> And I wish these two functions had been separated into separate interfaces.
> I know this is not an issue for some observatories (say, the ones that
> only produce simple 2-D images), but it makes life difficult for more
> complicated datasets.
>
> As a data discovery tool, I would have expected its purpose to be:
> - find available observations that fall within certain constraints in
>  time, space, frequency, etc.
> - tell me what kind of data products are available for each
>
> For a data access tool:
> - Give me the URL to a specific (set of) type(s) of data product for a
>  specific (set of) observation(s)
> For all I know, this role could be played by SIAP, SSAP, SCS, or
> whatever protocols are already in existence.

ObsTAP is intended mainly to provide uniform global data discovery; it
can find any type of data, even non-VO data formats.  The data access
capabilities provided at this level are very limited, but can be used to
retrieve static archive data files (the data product could actually be
generated on the fly if desired, but the description at least is
static).

As you suggest, the idea is that for any non-trivial data access the
typed interfaces would be used (SIA, SSA, etc.).  So for example one
could do global data discovery using ObsTAP and then follow up with one
of the typed interfaces to get more complete object-specific metadata
and do the actual data access, which for a typed/OO interface will often
involve virtual data generation (subsetting, filtering, transforming,
output format specification, etc.).  Of course if just retrieving the
static archive file is enough then that can be done with just the acref
returned by ObsTAP.
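
The discovery step described above can be sketched as an ADQL query
against the ObsCore table. This is only an illustration under stated
assumptions: the helper name build_obscore_query and the coordinate and
time values are made up for the example, and the "acref" mentioned above
is assumed to correspond to the access_url column in ObsCore naming.

```python
def build_obscore_query(ra_deg, dec_deg, radius_deg, t_start_mjd, t_stop_mjd):
    """Build an ADQL string for a cone search plus a time-overlap constraint.

    Hypothetical helper: the column names follow ObsCore conventions
    (s_ra, s_dec, t_min, t_max, access_url), everything else is illustrative.
    """
    return (
        "SELECT obs_id, dataproduct_type, access_format, access_url, "
        "obs_publisher_did "
        "FROM ivoa.ObsCore "
        "WHERE CONTAINS(POINT('ICRS', s_ra, s_dec), "
        f"CIRCLE('ICRS', {ra_deg}, {dec_deg}, {radius_deg})) = 1 "
        # Keep observations whose time span overlaps the requested interval.
        f"AND t_max >= {t_start_mjd} AND t_min <= {t_stop_mjd}"
    )


# Example values are arbitrary; times are in MJD.
query = build_obscore_query(83.63, 22.01, 0.1, 55000.0, 55700.0)
print(query)
```

The resulting string would be submitted to a TAP service. The access_url
in each returned row is enough to fetch a static file, while
obs_publisher_did is what one would carry over to a typed service (SIA,
SSA, etc.) for anything more involved.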

> The trouble is that for Chandra data, the intertwining of the two
> functions requires us to duplicate each ObsCore record six times to
> enumerate, laboriously, the different data types we can provide.
> When it comes to proper data discovery, it makes much more sense to
> return a single record with the ObsCore parameters and a list of
> available data product types (event lists, images, light curves,
> spectra, tarfiles with all of the above, etc.).

True, but this is necessary to be consistent with the relational model
and to provide a simple mechanism.  For a Chandra observation one might
return a set of records with the same obs_id, one being a tar.gz of the
full instrumental dataset, the others being static images, spectra, etc.
derived from that data.  A query for a specific obs_id would thus
describe all the data products available for the observation.  As you
note it is necessary to duplicate some of the metadata in associated
records, but much of the metadata will differ for each data product as
well.
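
To make the "one record per data product" layout concrete, here is a
small sketch using an in-memory SQLite table. The table holds only a
handful of ObsCore-style columns, and the obs_id values, subtype string,
and URLs are invented for illustration; they are not actual Chandra
archive values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE obscore ("
    "obs_id TEXT, dataproduct_type TEXT, dataproduct_subtype TEXT, "
    "access_format TEXT, access_url TEXT)"
)

# Hypothetical rows: one observation exposed as four data products,
# plus an unrelated observation to show that the filter works.
base = "http://archive.example.org"
records = [
    ("1234", None, "example.package", "application/x-tar-gzip", f"{base}/1234/all.tar.gz"),
    ("1234", "event", None, "application/fits", f"{base}/1234/evt.fits"),
    ("1234", "image", None, "application/fits", f"{base}/1234/img.fits"),
    ("1234", "spectrum", None, "application/fits", f"{base}/1234/spec.fits"),
    ("5678", "image", None, "application/fits", f"{base}/5678/img.fits"),
]
conn.executemany("INSERT INTO obscore VALUES (?, ?, ?, ?, ?)", records)

# A query on a single obs_id describes every product for that observation.
products = conn.execute(
    "SELECT dataproduct_type, dataproduct_subtype, access_url "
    "FROM obscore WHERE obs_id = ? ORDER BY access_url",
    ("1234",),
).fetchall()

for product in products:
    print(product)
```

The full-dataset tarball is represented here with an empty
dataproduct_type and a provider-defined subtype, which is one possible
reading of how instrument-specific packages can be exposed.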

So far as the archive goes one would probably want to autogenerate the
ObsTAP table from more fundamental, fully normalized database tables.
Any updates would be done only on the underlying tables (auto-updating
the ObsTAP "view" after each such update).  Then there should be no
problem with the redundant metadata in the ObsTAP index table becoming
inconsistent.
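
One way to picture this is a plain SQL view over normalized tables,
sketched below with SQLite. The table names (observation, product) and
all values are assumptions made up for the example. A real archive might
instead regenerate a materialized table after each update; an ordinary
view is used here only because it stays consistent without any extra
step.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Normalized tables: per-observation metadata is stored once.
    CREATE TABLE observation (
        obs_id TEXT PRIMARY KEY, s_ra REAL, s_dec REAL, t_min REAL, t_max REAL
    );
    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY,
        obs_id TEXT REFERENCES observation(obs_id),
        dataproduct_type TEXT,
        access_url TEXT
    );
    -- ObsTAP-style view: one row per product, observation metadata repeated.
    CREATE VIEW obscore AS
        SELECT o.obs_id, p.dataproduct_type, p.access_url,
               o.s_ra, o.s_dec, o.t_min, o.t_max
        FROM observation AS o JOIN product AS p ON p.obs_id = o.obs_id;
    """
)

conn.execute("INSERT INTO observation VALUES ('1234', 83.63, 22.01, 55000.0, 55000.5)")
conn.executemany(
    "INSERT INTO product (obs_id, dataproduct_type, access_url) VALUES (?, ?, ?)",
    [
        ("1234", "image", "http://archive.example.org/1234/img.fits"),
        ("1234", "spectrum", "http://archive.example.org/1234/spec.fits"),
    ],
)

# Update only the underlying table; every row of the view reflects the change.
conn.execute("UPDATE observation SET s_ra = 83.64 WHERE obs_id = '1234'")
view_rows = conn.execute("SELECT obs_id, dataproduct_type, s_ra FROM obscore").fetchall()
ra_values = {row[2] for row in view_rows}
print(view_rows)
```

Because the repeated columns are derived from a single observation row,
the duplicated metadata cannot drift apart between products.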

In addition to a few static images or spectra providing standard views
of an observation one would ideally provide SIA, SSA, etc. services
capable of accessing the event data and computing custom virtual data
products on the fly.  In the future the proposed data linking facilities
would be able to point directly to such services.  At present one would
have to do a registry query to find the service and then use the
publisher DID from the ObsTAP query to access the desired dataset.

> Btw, Use Case 1.6 misquotes MJD as Mean Julian Date. Should be
> Modified Julian Day.
>
> I hope you don't mind these ruminations, but these are things that I
> am discovering as we are trying to implement this - and it is hard.

Not at all; it is useful to have these discussions in the record for
others later as well.

 	- Doug


> Cheers,
>
>  - Arnold
>
>
> Douglas Tody wrote:
>> On Tue, 5 Jul 2011, Arnold Rots wrote:
>>
>>>> First, the subtype may be used to define what the data object is in
>>>> collection or archive specific terms.  For example if the data object is
>>>> a tar file containing all the files comprising a ROSAT observation the
>>>> data provider can define a subtype for this type of data.  It is up to
>>>> the client to understand what the content of the proprietary data
>>>> product is, but if they are able to deal with such instrument-specific
>>>> data they probably do know what it is.
>>>
>>> This is precisely the case I was trying to solve: a tarfile containing
>>> a mix of data types: images, spectra, event lists.
>>> The way I would like to solve it is to allow "package" (or something
>>> similar) for the data type and enumerate the data files contained in
>>> the tarfile in the data subtype.
>>>
>>> It still leaves a similar issue for the access format: that would be
>>> tar, but it would be nice to be able to enumerate the formats of the
>>> files in the tarfile in a similar format subtype - that also would
>>> allow one to indicate whether or not the content of the tarfile is
>>> gzipped (as opposed to gzipping the tarfile itself).
>>>
>>> I realize that this constitutes a use of subtypes that is different
>>> from the original intent (at least, I think so), but it does seem a
>>> useful mechanism.
>>
>> Arnold - I agree that in principle it would be useful to have this extra
>> information.  However we had to argue for quite a while to get support
>> for instrumental data at this level included at all.  One *can* expose
>> this data with ObsTAP 1.0 as outlined in my earlier email; in particular
>> exposing the individual data products separately allows them to be
>> described if the data provider wants to do so.  Even exposing only the
>> tar/zip/MEF etc.  file works so long as the client recognizes the
>> subtype.
>>
>> To attempt to describe the contents of arbitrarily complex
>> instrumental datasets is out of scope for ObsTAP, at least 1.0.  Perhaps
>> we can address this issue in the next phase of development where we
>> prototype related mechanisms such as data linking.
>>
>>> However, there is also the reverse problem: what do we do with data
>>> products based on multiple observations? Do we allow ObsId to be a
>>> list of ObsIds?
>>
>> This was addressed in the document as I recall.  In the case of complex
>> data products which are derived from multiple inputs (e.g.  multiple
>> observations) we essentially have a new "software observation", and a
>> new obs_id should be assigned.  To say more about the derivation of a
>> particular data product is complex and gets into the general issue of
>> provenance which is being addressed separately.  Furthermore obs_id is a
>> database key used to uniquely identify specific "observations" (usable
>> as a foreign key in other tables for example) hence we cannot turn it
>> into a list of obs_ids.
>>
>>  	- Doug
>>
> --------------------------------------------------------------------------
> Arnold H. Rots                                Chandra X-ray Science Center
> Smithsonian Astrophysical Observatory                tel:  +1 617 496 7701
> 60 Garden Street, MS 67                              fax:  +1 617 495 7356
> Cambridge, MA 02138                             arots at head.cfa.harvard.edu
> USA                                     http://hea-www.harvard.edu/~arots/
> --------------------------------------------------------------------------
>

