IV ) How are metadata related to data and Data access layer...
bonnarel at alinda.u-strasbg.fr
Thu Oct 30 13:34:25 PDT 2008
Follow-up of the OBS DM task group discussion.
My introduction mail from last Sunday:
<The team had a mailing list, some teleconfs, and partial side-meetings
<in Garching and Trieste.
<Recently we had a very hot discussion on several aspects, which I try
<to tidy up here.
<The DM1 session (and part of DM2) of this interop will feature various
<presentations showing where we are on these questions.
<
< I ) Models "dissemination", usage and suitability with respect to the
<user/developer/data provider needs. Are the models too complex?
<
<
< II ) Formats and vocabularies
< Do we need utypes and UCDs? Is a model transportable in VOTable?
<Is JSON an alternative or complementary to XML?
<
< III ) Units and coordinate systems
< Do we force data providers to use only one, or do we allow various
<systems by providing accurate descriptions using STC and a (coming)
<Units data model...
<
<
< IV ) How are metadata related to data and the Data Access Layer...
<
<I will now post 4 emails with the best parts of the discussion on
<these subjects...
-----------------------------------------------------------------------------
*Step 1 FB answers to Fabien
-----------------------------------------------------------------------------
<1.4 Standardize only what can be without ambiguity. The worst thing
<that can happen is that two implementations use the same concept with
<slightly different meanings. This also means that there is a need for
<an extension mechanism (already planned by the characterization
<document) that data providers can use to add whatever we couldn't
<agree on unambiguously.
<This approach is pragmatic, and will encourage usage of
<characterization, even if some parts of the data don't have a standard
<way to be described. My hope is that this approach will create "de
<facto" standards by the pioneer users.
<The part of the fields which are standardized could be perceived as a
<base class from which other specialized classes can derive (add extra
<fields). A client application relying only on the standardized
<elements can then safely assume that they are correct.
--> I may agree, but you should give examples of what you have in mind.
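[Editor's note: a minimal Python sketch of this base-class/extension
idea; the class and field names are illustrative only, not part of any
IVOA schema.]

from dataclasses import dataclass, field

@dataclass
class Characterisation:
    """Standardized base: clients may rely on these fields being correct."""
    axis_name: str
    unit: str
    calibration_status: str

@dataclass
class EsoCharacterisation(Characterisation):
    """Hypothetical provider extension carrying non-standard descriptors."""
    extras: dict = field(default_factory=dict)  # e.g. {"ambientSeeing": 0.8}

# A generic client reads only the base fields and safely ignores .extras
c = EsoCharacterisation("Sky", "deg", "CALIBRATED", {"ambientSeeing": 0.8})
print(c.axis_name, c.extras)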
<For point 2, I propose the following:
<2.1 Always think of characterization as part of a larger dataset data
<model, or observation data model.
---> I think everybody agrees. That's why we are working on the
Observation container and Provenance data models. You will see a few
examples soon.
<Practically, in the JSON serialization
<of the observation data model, characterization would be just a
<subsection. Doing so would allow integrating other metadata in the
<file smoothly and finally having a single unified file format for many
<different kinds of metadata.
---> OK, agreed, but this is not specific to a JSON serialization;
it is how the model is designed. And the example I have in XML can
easily be translated to JSON. Whatever format we take, we have to
decide where we put what. Ascendant (upward) compatibility with the
current status of the standards is very important if we want to
progress. As I said, SSA is already relying on (a small part of) the
current characterization, and some people are building SIA2 the same
way. We cannot restart from scratch.
--------------------------------------------------------------------------------
*Step 2 Fabien answers to François
-------------------------------------------------------------------------------
>> 1.4 Standardize only what can be without ambiguity. The worst thing
>> that can happen is that two implementations use the same concept
>> with slightly different meanings. This also means that there is a
>> need for an extension mechanism (already planned by the
>> characterization document) that data providers can use to add
>> whatever we couldn't agree on unambiguously.
>> This approach is pragmatic, and will encourage usage of
>> characterization, even if some parts of the data don't have a
>> standard way to be described. My hope is that this approach will
>> create "de facto" standards by the pioneer users.
>> The part of the fields which are standardized could be perceived as
>> a base class from which other specialized classes can derive (add
>> extra fields). A client application relying only on the standardized
>> elements can then safely assume that they are correct.
> --> I may agree, but you should give examples of what you have in mind.
I attached a JSON file to this email, which is an example of the
characterization of a (dummy) dataset. You will notice that there is an
ESO section. I put it there as an example of non-standardisable
descriptors. Only ESO people and tools need them.
>> For point 2, I propose the following:
>>
>> 2.1 Always think of characterization as part of a larger dataset
>> data model, or observation data model.
> ---> I think everybody agrees. That's why we are working on the
> Observation container and Provenance data models. You will see a few
> examples soon.
Good. In my attached example I also added descriptors coming from what
is called provenance in various IVOA documents.
>> Practically, in the JSON serialization of the observation data
>> model, characterization would be just a subsection. Doing so would
>> allow integrating other metadata in the file smoothly and finally
>> having a single unified file format for many different kinds of
>> metadata.
> ---> OK, agreed, but this is not specific to a JSON serialization;
> it is how the model is designed. And the example I have in XML can
> easily be translated to JSON. Whatever format we take, we have to
> decide where we put what.
Yes, as you say, what matters is how the model is designed, but so far
I have never seen a single model unifying all the other ones.
> Ascendant compatibility with the current status of the
> standards is very important if we want to progress. As I said, SSA is
> already relying on (a small part of) the current characterization,
> and some people are building SIA2 the same way. We cannot restart
> from scratch.
I am not speaking about SIA/SSA. I am speaking here about a new
standard for serializing general dataset information, which includes
characterization but not only that. I don't think such a standard is
defined at the moment.
(example already given in compilation II)
-------------------------------------------------------------------------------
*Step 3 François again on these Observation metadata
--------------------------------------------------------------------------------
Anita, Fabien,
Thank you for the interesting discussion...
Give me 24 more hours and I will send you examples for the Observation
container and then for Char2 and Provenance, which I was working on
when Fabien sent his first mail.
They provide ascendant compatibility with the IVOA Characterisation
recommendation and Spectrum/SSA.
The format (I will give examples in XML and VOTable) is another
issue. JSON versions of the modelling I will propose could be built
easily (maybe somebody will provide one before we start the Baltimore
meeting).
--------------------------------------------------------------------------------
*Step 4 FB sends his Observation container example
--------------------------------------------------------------------------------
You can find attached here an Observation metadata document.
It contains:
a Curation section similar to what can be found in Spectrum
a DataID section similar to what can be found in Spectrum again
an Access section very similar to what is in SSA
a Characterisation structure which is pure IVOA Char
a place where a Provenance instance could be hooked
The Char section encapsulates an STC footprint....
This example will be rewritten in VOTable (and will be very
similar to SIA2 in that case) and may also be rewritten in JSON....
Format is not a real issue in the data modeling activity.
THIS EXAMPLE ILLUSTRATES WHAT ASCENDANT COMPATIBILITY IS.
It was written for the NVO Footprint service team and allows hooking a
footprint (in STC-X) to a DataID or other observation metadata.....
Another use case may come from the VOSpace people. They need dataset
metadata to associate with stored datasets and to make some intelligent
retrieval possible.....
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Observation xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns:stc="http://www.ivoa.net/xml/STC/stc-v1.30.xsd"
  xmlns:cha="http://www.ivoa.net/xml/Characterisation/Characterisation-v1.11.xsd"
  xmlns:xlink="http://www.w3.org/1999/xlink"
  xmlns="http://www.ivoa.net/xml/Observation/Observation.xsd"
  xsi:schemaLocation="http://www.ivoa.net/xml/Observation/Observation.xsd Observation2.xsd">
<!-- Curation as in Spectrum -->
<Curation>
<Publisher>SAO</Publisher>
<PublisherID>ivo://cfa.harvard.edu</PublisherID>
<Contact>
<Name>Gretchen Greene/Tamas Budavari</Name>
<Email>jcm at cfa.harvard.edu</Email>
</Contact>
</Curation>
<!-- Data ID section -->
<DataID>
<Title>Arp 220 Image</Title>
<Creator>STScI/JHU</Creator>
<DatasetID>ivo://stsci.edu/mast#10314</DatasetID>
<Date>2003-12-31T14:00:02Z</Date>
<Version>1</Version>
<Instrument>BCS</Instrument>
<Logo>http://stsci.edu/nvo/sdsslogo.jpg</Logo>
</DataID>
<!-- Access to the actual data -->
<Access>
<acref>http://sdss.jhu.edu/images/sdss/10314.fits</acref>
<format>application/fits</format>
</Access>
<!-- Characterisation -->
<char>
<cha:characterisationAxis>
<cha:axisName>Sky</cha:axisName>
<cha:ucd>pos.eq</cha:ucd>
<cha:unit>deg</cha:unit>
<cha:coordsystem id="TT-ICRS-TOPO" xlink:type="simple"
xlink:href="ivo://STClib/CoordSys#TT-ICRS-TOPO"/>
<cha:independentAxis>true</cha:independentAxis>
<cha:calibrationStatus>CALIBRATED</cha:calibrationStatus>
<cha:numBins2>
<cha:I1>500</cha:I1>
<cha:I2>500</cha:I2>
</cha:numBins2>
<cha:undersamplingStatus>false</cha:undersamplingStatus>
<cha:regularsamplingStatus>true</cha:regularsamplingStatus>
<cha:coverage>
<cha:location>
<cha:coord coord_system_id="TT-ICRS-TOPO">
<stc:Position2D>
<stc:Name1>RA</stc:Name1>
<stc:Name2>Dec</stc:Name2>
<stc:Value2>
<stc:C1>132.4210</stc:C1>
<stc:C2>12.1232</stc:C2>
</stc:Value2>
</stc:Position2D>
</cha:coord>
</cha:location>
<cha:bounds>
<cha:unit>arcsec</cha:unit>
<cha:Extent>20</cha:Extent>
<cha:limits coord_system_id="TT-ICRS-TOPO">
<cha:Coord2VecInterval/>
</cha:limits>
</cha:bounds>
<!-- The spatial support is actually the footprint -->
<cha:support>
<cha:coordsystem id="RegionCoordSys">
<stc:SpaceFrame>
<stc:Cart2DRefFrame projection="TAN" ref_frame_id="TT-ICRS-TOPO">
<stc:Transform2 unit="deg">
<stc:C1>1.0</stc:C1>
<stc:C2>1.0</stc:C2>
<stc:PosAngle xsi:nil="true" />
</stc:Transform2>
</stc:Cart2DRefFrame>
<stc:CoordRefPos>
<stc:Position2D>
<stc:Value2>
<stc:C1>132.4210</stc:C1>
<stc:C2>12.1232</stc:C2>
</stc:Value2>
</stc:Position2D>
</stc:CoordRefPos>
<stc:SPHERICAL coord_naxes="2"/>
</stc:SpaceFrame>
</cha:coordsystem>
<cha:Area coord_system_id="RegionCoordSys">
<stc:Polygon coord_system_id="RegionCoordSys" unit="deg">
<stc:Vertex>
<stc:Position>
<stc:C1>0.2</stc:C1>
<stc:C2>-0.1</stc:C2>
</stc:Position>
</stc:Vertex>
<stc:Vertex>
<stc:Position>
<stc:C1>-0.2</stc:C1>
<stc:C2>-0.1</stc:C2>
</stc:Position>
</stc:Vertex>
<stc:Vertex>
<stc:Position>
<stc:C1>-0.2</stc:C1>
<stc:C2>0.1</stc:C2>
</stc:Position>
</stc:Vertex>
<stc:Vertex>
<stc:Position>
<stc:C1>0.2</stc:C1>
<stc:C2>0.1</stc:C2>
</stc:Position>
</stc:Vertex>
</stc:Polygon>
</cha:Area>
<cha:AreaType>Polygon set</cha:AreaType>
</cha:support>
</cha:coverage>
</cha:characterisationAxis>
<cha:characterisationAxis>
<cha:axisName>Time</cha:axisName>
<cha:ucd>time</cha:ucd>
<cha:unit>d</cha:unit>
<cha:coordsystem idref="TT-ICRS-TOPO"/>
<cha:calibrationStatus>CALIBRATED</cha:calibrationStatus>
<cha:numBins1>1</cha:numBins1>
<cha:coverage>
<cha:location>
<cha:coord coord_system_id="TT-ICRS-TOPO">
<stc:Time>
<stc:TimeInstant>
<stc:MJDTime>52148.3252</stc:MJDTime>
</stc:TimeInstant>
</stc:Time>
</cha:coord>
</cha:location>
<cha:bounds>
<cha:Extent>1500.0</cha:Extent>
<cha:limits coord_system_id="TT-ICRS-TOPO">
<cha:Coord2VecInterval></cha:Coord2VecInterval>
</cha:limits>
</cha:bounds>
</cha:coverage>
</cha:characterisationAxis>
<cha:characterisationAxis>
<cha:axisName>spectral</cha:axisName>
<cha:ucd>em.wl</cha:ucd>
<cha:unit>m</cha:unit>
<cha:coordsystem idref="TT-ICRS-TOPO"/>
<cha:calibrationStatus>CALIBRATED</cha:calibrationStatus>
<cha:numBins1>1</cha:numBins1>
<cha:coverage>
<cha:location>
<cha:coord coord_system_id="TT-ICRS-TOPO"></cha:coord>
</cha:location>
<cha:bounds>
<cha:Extent>3000.0</cha:Extent>
<cha:limits coord_system_id="TT-ICRS-TOPO">
<cha:CoordScalarInterval></cha:CoordScalarInterval>
</cha:limits>
</cha:bounds>
</cha:coverage>
</cha:characterisationAxis>
<cha:characterisationAxis>
<cha:axisName>"Flux density"</cha:axisName>
<cha:ucd>"phot.flux.density;em.wavelength"</cha:ucd>
<cha:unit>"erg cm**(-2) s**(-1) Angstrom**(-1)"</cha:unit>
<cha:coordsystem idref="TT-ICRS-TOPO"/>
<cha:accuracy>
<cha:sysError>
<cha:flavor>systematic</cha:flavor>
<cha:ErrorRefVal>
<stc:Error>0.05</stc:Error>
</cha:ErrorRefVal>
</cha:sysError>
</cha:accuracy>
<cha:calibrationStatus>CALIBRATED</cha:calibrationStatus>
<cha:numBins1>1</cha:numBins1>
<cha:coverage>
<cha:location>
<cha:coord coord_system_id="TT-ICRS-TOPO"></cha:coord>
</cha:location>
</cha:coverage>
</cha:characterisationAxis>
<!-- <prov> -->
<!-- ..... -->
<!-- ..... -->
</char>
<!-- <prov> -->
</Observation>
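[Editor's note: a minimal sketch of reading a few fields from this
instance with Python's standard ElementTree, assuming the document
above is saved as obs.xml; the namespace URIs are the ones declared in
its header.]

import xml.etree.ElementTree as ET

NS = {
    "obs": "http://www.ivoa.net/xml/Observation/Observation.xsd",
    "cha": "http://www.ivoa.net/xml/Characterisation/Characterisation-v1.11.xsd",
    "stc": "http://www.ivoa.net/xml/STC/stc-v1.30.xsd",
}

root = ET.parse("obs.xml").getroot()
print(root.find("obs:DataID/obs:Title", NS).text)   # Arp 220 Image
print(root.find("obs:Access/obs:acref", NS).text)   # URL of the actual data
# Walk the four characterisation axes (Sky, Time, spectral, Flux density)
for axis in root.findall("obs:char/cha:characterisationAxis", NS):
    print(axis.find("cha:axisName", NS).text)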
--------------------------------------------------------------------------------
*Step 5 Fabien's answer
-------------------------------------------------------------------------------
Hi Francois,
Just for comparison, I converted your file roughly following the
guidelines that I explained in my previous emails. It is attached there.
Just for testing, take the 2 files and show them to a developer, a
scientist and an archive technician, and ask them which one they would
prefer to work with.
{
"title": "Arp 220 Image",
"creationDate": "2003-12-31T14:00:02Z",
"datasetID": "ivo://stsci.edu/mast#10314",
"logo": "http://stsci.edu/nvo/sdsslogo.jpg",
"creator":
{
"shortName": "STScI/JHU",
},
"publisher":
{
"shortName": "SAO",
"id": "ivo://cfa.harvard.edu",
"contact":
{
"name": "Gretchen Greene/Tamas Budavari",
"email": "jcm at cfa.harvard.edu"
}
},
"instrumentSetup":
{
"facility": "STScI"
"instrument": "BCS",
},
"characterization":
{
// in ICRS (deg)
"spaceAxis":
{
"centralPos": [132.4210, 12.1232],
"footprint":
{
"worldCoords": [[[132.2210, 12.0232], [132.6210, 12.0232],
[132.6210, 12.2232], [132.2210, 12.2232]]]
}
},
// in heliocentric standard of rest (m)
"wavelengthAxis":
{
"centralPos": 1235.5e-9,
"boundingBox": [[1235.5e-9, 4235.5e-9]]
},
// in TT (s)
"timeAxis":
{
"centralPos": 52148.3252,
"boundingBox": [[52898.3252, 54498.3252]]
},
// (erg/cm2/s/Angstrom)
"fluxDensityAxis":
{
// SNR has to be properly defined
"SNR": 3
}
},
// The list of all the data files (including calibration) associated
// to this data set
"dataSources":
{
"mainDataSet":
{
"format": "application/fits",
"FITShdu": 1,
"url": "http://sdss.jhu.edu/images/sdss/10314.fits",
"shape": [500, 500],
"typestr": "f",
"calibrationStatus": "calibrated"
}
}
}
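[Editor's note: the // comments above are not strict JSON; a sketch of
one way to load such a file in Python (assuming it is saved as
obs.json) is to drop whole-line comments before parsing.]

import json
import re

def load_jsonc(path):
    # Remove whole-line // comments, then parse as ordinary JSON
    with open(path) as f:
        text = re.sub(r"^\s*//.*$", "", f.read(), flags=re.M)
    return json.loads(text)

obs = load_jsonc("obs.json")
print(obs["title"])                                        # Arp 220 Image
print(obs["characterization"]["spaceAxis"]["centralPos"])  # [132.421, 12.1232]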
-------------------------------------------------------------------------------
*Step 6 Anita test results
--------------------------------------------------------------------------------
> Hi Francois,
> Just for comparison, I converted your file roughly following the
> guidelines that I explained in my previous emails. It is attached
> there.
>
> Just for testing, take the 2 files and show them to a developer, a
> scientist and an archive technician, and ask them which one they
> would prefer to work with.
One astronomer's view:
At first glance:
To read in an email: Fabien's
To read in an xml interpreter: Francois'
To write (using an xml tool): Francois', because I know that the
grammar is clearly defined and I can extract and view the structure in
different ways
To use: ???
The point is that I find Java very difficult, C(++) even worse, whereas
xml is easy to understand even if sometimes tedious.
Moreover, the primary question is the use... no good having something
which looks nice if it doesn't do what we want... over to the experts
for testing?
-----------------------------------------------------------------------------
*Step 7 FB's comment on Fabien's example
-----------------------------------------------------------------------------
Good. In detail, I don't yet understand why you distort the
structure a little bit. But anyway... we will discuss this in Baltimore.
Model and FORMAT are not the same thing.
Contrary to Igor, I have nothing against a JSON format for a given data
model. But it's definitely too early (at least) to say that it should
replace XML.
Why should we not just add JSON as a new serialisation BESIDE the
previous ones?
(I feel like I am repeating what I already wrote, OK....)
-------------------------------------------------------------------------------
*Step 8 Fabien to FB
-------------------------------------------------------------------------------
I fully agree; in my example I just used JSON because it makes the
file easier to read, but I could as well convert it to XML without
losses.
The main debate is not XML versus JSON; it's not even what the model
itself should be, but rather what should and should not be included
in an *interoperable astronomical exchange file format*.
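[Editor's note: a sketch of the "convert without losses" point -- a
naive recursive converter from a parsed JSON object to XML elements;
illustrative only, it handles neither attributes nor lists.]

import xml.etree.ElementTree as ET

def to_xml(tag, value):
    # Map a nested dict to an element tree; leaves become text nodes
    el = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            el.append(to_xml(key, child))
    else:
        el.text = str(value)
    return el

doc = to_xml("Observation", {"title": "Arp 220 Image",
                             "publisher": {"shortName": "SAO"}})
print(ET.tostring(doc, encoding="unicode"))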
-------------------------------------------------------------------------------
*Step 9 Gretchen and Tamas (private) to FB
-------------------------------------------------------------------------------
good news francois,
We've been successful working with the observation and characterization
data models containing STC polygons.
The xsd code generation ran error-free for C#/.NET (a good
sign).
There is a small problem with the xml sample obs.xml instance you
distributed. I'm sending the modified version that shifts the polygon
from the Area element to the areaType element.
This was the only way the deserialization of the file works with a
populated polygon (I've attached the modified
obs_2.xml).
With this change the programmatic handling works with the JHU spherical
library and we are set up nicely now to perform region intersects, etc.
The region area calculations work after Tamas added in a polygon
regionType (we were previously only demonstrating convex type use).
So, all in a couple of days' work... we've made good strides.
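[Editor's note: for readers without the JHU spherical library, a toy
Python sketch of the kind of footprint test involved, using the polygon
vertices from the XML example above in their tangent-plane coordinates;
real services work in spherical geometry.]

def point_in_polygon(x, y, vertices):
    # Even-odd ray casting in the tangent plane (planar approximation)
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        if (y1 > y) != (y2 > y) and x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
            inside = not inside
    return inside

# Vertices of the example footprint polygon, in degrees
footprint = [(0.2, -0.1), (-0.2, -0.1), (-0.2, 0.1), (0.2, 0.1)]
print(point_in_polygon(0.0, 0.0, footprint))  # True: the centre is covered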
-------------------------------------------------------------------------------
*Step 10 Anita questions FB
-------------------------------------------------------------------------------
How do you see the Observation model being used?
One thing I was thinking of is that, in general, data published to the
VO should have had the instrumental signature removed; i.e.,
Observation should be for reference only... but of course, sometimes
you need to recalibrate etc. Hence you need to find out what the state
of the instrument was when the data were taken.
I recall that Andreas and Alberto have explained that in the context of
the ESO archive; and for e.g. the VLA you would want to know what
configuration the array was in; for ALMA you might want the water
vapour radiometry records for that day... I think that this is getting
too much to model, especially as you will almost certainly be using a
specialised dedicated package to handle the information.
So I propose that we add somewhere a field for a link or links to
ObservationalConditions (or a better name), probably under DataID.
Or is this dealt with somewhere else?
--------------------------------------------------------------------------------
*Step 11 FB answers to Anita
-------------------------------------------------------------------------------
The basic usage for Observation is linking Charac, DataID, Curation and
Access to the real dataset ID.
For the provenance part, my answer will be an example which I am trying
to complete before leaving.... But basically there is a minimum we can
do, and beyond that I try to hook to provider-specific metadata and
documentation.
-------------------------------------------------------------------------------
*Step 12 Igor's two cents
-------------------------------------------------------------------------------
First of all -- Francois, thanks for this example.
> How do you see the Observation model being used?
>
> One thing I was thinking of, is that in general, data published to
> the VO should have had the instrumental signature removed; i.e.,
> Observation should
This is what everybody's talking about, but this is, unfortunately, an
idealisation. It cannot be done for real datasets, only for
simulated ones. One can't, say, "remove the instrumental effects" from
direct images by increasing the spatial resolution to a
delta-function PSF, converting the filter transparency curve into the
reference one, etc. The same applies to spectroscopic data and to
anything else, meaning that "removal of the instrumental signature" is
simply unachievable. And I would add that it's absolutely unnecessary.
Therefore, the only solution is to give a thorough description of all
the instrumental effects in sufficient detail to do science with the
data. This description can then be applied to the models which are used
to interpret the observations. As far as I know, it is very common in
X-ray and gamma-ray observations that one has to apply the response
function to the model and not to "remove" it from the data.
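[Editor's note: a toy sketch of this forward-folding approach; the
grid, the power-law model and the response matrix are all made up for
illustration.]

import numpy as np

# Apply the instrument response to the model instead of trying to
# "remove" it from the data, then compare predicted and observed counts.
energies = np.linspace(0.5, 10.0, 200)    # keV grid (illustrative)
model_flux = energies ** -1.7             # hypothetical power-law model
# Toy redistribution matrix: mostly diagonal with some smearing
rmf = (0.8 * np.eye(200)
       + 0.1 * np.eye(200, k=1)
       + 0.1 * np.eye(200, k=-1))
predicted_counts = rmf @ model_flux       # this is what gets fit to the data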
> be for reference only... but of course, sometimes you need to
> recalibrate etc. Hence you need to find out what was the state of
> the instrument when the data were taken.
Therefore, I don't think that your conclusion about "the reference
only" is correct. The "observation" metadata is absolutely required for
the data analysis.
> I recall that Andreas and Alberto have explained that in the context
> of the ESO archive, and for e.g. the VLA you would want to know what
> configuration the array was in; for ALMA you might want the water
> vapour radiometry records for that day... I think that this is
> getting too much to model, especially as you will almost certainly be
> using a specialised dedicated package to handle the information.
Most of the things you're mentioning here belong to the "provenance".
However, there are other things which one should be able to learn from
it. For example, what was the proposal (link to its abstract, perhaps)
and who was the PI, what instrument was used, how the data were reduced
etc. These things go into components other than Char or Prov.
------------------------------------------------------------------------------
Follow up of what started there between Igor and Anita will be under
topic "Provenance"
-------------------------------------------------------------------------------
*Step 13. Where the DAL enters this discussion. This mail from Doug
comes from another private discussion including Jesus, Doug and me
------------------------------------------------------------------------------
<<I am still not convinced we even need an Observation DM, or what its
<<scope would be. If we take all the "component data models" we have
<<now (Char is one), and add a few more, perhaps we already have this
<<Observation DM.
< First I disagree, but second we may agree ;-) !!!!!
< The Observation DM is the concatenation of all standardized metadata
<for a dataset of any type. It is made "a la Spectrum/SSA" with reuse of
<the dataset DataID, Curation, Access and Characterization packages...
< What is currently under development in the DM group is the additional
<Provenance data model.
< In order to do that, what is needed is
< a ) an Observation container very similar to the Spectrum data model
<(apart from the data section, of course, which may be replaced by a
<simple Access block pointing to the actual dataset, and the optional
<presence of a Provenance package)
< b ) to develop a first simple Provenance package.
< Some drafts will be shown in Baltimore.
< So we may agree that little really specific has to be developed
<apart from Provenance, and that a lot of existing stuff is to be
<reused.
This sounds an awful lot like what we have been calling the generic
Dataset model. Are Observation and Dataset the same thing? That is,
something much like SSA, except that we omit the Data section. As
you say a Provenance model needs to be added. There is a pointer to
the archival dataset. All of this sounds exactly like what you describe
above.
So I suspect they are the same concept, if we just change the name
"Observation" to "Dataset".
----------------------------------------------------------------------------
*Step 14 FB comment on Doug's point
----------------------------------------------------------------------------
Definitely it is the same concept.
What the DAL architecture Note (published last Monday by Doug) calls
the generic dataset (a 3-year-old discussion in DAL) will be filled by
the Observation DM concepts.
Actually, the generic dataset query response will be a generalization
of the SSA query response, in the same way the Observation container
example is a generalization of an IVOA Spectrum serialization....
--------------------------------------------------------------------------
*Step 15 Dave Morris private comment on the example
----------------------------------------------------------------------------
Thanks for the example, I will study it.
Unfortunately I'm not at the Baltimore conference, so I won't be able
to join in the discussions.
At the moment VOSpace does not have any specific requirements.
I am still just exploring what plans and ideas people have for
handling characterization and provenance.
It is more that the VOSpace team need to be aware of what plans you
have so that we can design the next generation of VOSpace to be able
to support it.
One day it would be good for VOSpace to be able to tell the user where
a data object came from and what processing was done to it, but I
don't think we will have that for some time yet.
I have no plans to use a JSON characterization schema in VOSpace.
I was interested to hear what Fabien was trying to do with the data in
the ESO archives.