I ) Models "dissemination"
bonnarel at alinda.u-strasbg.fr
Sun Oct 26 16:28:58 PDT 2008
*Step1* I start with the summary I (F. Bonnarel) gave to the team last August
-----------------------------------------------------------------------
About DISSEMINATION of the Characterisation data model, which we
discussed in Garching and Trieste last spring.
A large part of what we discussed dealt with dissemination, i.e. how to
get IVOA partners (users as well as data providers and application
developers) to make use of characterization.
Concerns fall into different categories:
- a) As a provider, how do I decide which values I have to put in which
characterization DM attributes?
- b) Is characterization able to deal with my favorite data? E.g. 3D
spectroscopy, long-slit spectra, visibility data, polarimetry data,
X-ray or gamma-ray data, etc.
- c) Once I know that, how do I publish it in the VO?
- d) How can my users use these characterization data for different
use cases? Are there tools for that?
For a), it is possible to use Alberto Micol's tutorial document (about
2D images). I also wrote some slides for the last Euro-VO DCA workshop,
based on ACS image examples. Igor wrote a couple of examples based on
ASPID-SR data or, more recently, Giraffe... These are all pragmatic "by
example" approaches.
On the other hand, Fabien would like all the char stuff to be firmly
based on theoretical grounds. I will discuss his proposal elsewhere, in
relation to the Characterisation Level 4 material.
For b), we know from Igor's work that Char is perfectly able to handle
3D spectroscopy data.
Discussions are going on with Catherine Boisson about gamma-ray data.
Anita studied the "complex visibility" data case as well as
polarimetric data. I think she has shown how we can do it, but we have
to build a full example (this will be done this week; she sent me
example files).
Igor has also shown that it is possible to give some description of
snapshots from simulations in Char terms (see also elsewhere how the
Char concepts apply to simulated data).
As for c), a number of techniques have now been explored to format and
serialize characterisation data and publish it.
For utype-based (VOTable) formats we can tie the Level 1+2 attributes
to SIA1 query responses (as in SaadaDB, or the VOTable output of the
Aladin image server); a small sketch follows below. A new version of
the XMM images with characterization using SaadaDB will be released
after Baltimore. SSA does it naturally through its characterization
utypes. The hope is to get the same for images and cubes in SIA2 (one
of the top priorities for pushing this protocol forward).
The DM mapper and MEX also allow utypes to be matched to the outputs of
VO services.
For both XML and utype outputs, CDS developed CAMEA to build and check
valid characterization documents interactively, and more recently a
MappingGenerator mapper, allowing FITS keywords and combinations of
them to be mapped into char files...
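To make the utype route a bit more concrete, here is a minimal sketch
(the VOTable fragment, field names and utype strings are illustrative
only; the normative utype list is in the document linked below) of how
a client could pick out Characterisation attributes from a query
response by looking at the FIELD utypes:

    # Minimal sketch: find Characterisation utypes in a (hypothetical)
    # VOTable query response, using only the Python standard library.
    import xml.etree.ElementTree as ET

    votable = """<VOTABLE><RESOURCE><TABLE>
      <FIELD name="s_ra"   datatype="double" unit="deg"
             utype="char:SpatialAxis.Coverage.Location.Value"/>
      <FIELD name="em_min" datatype="double" unit="m"
             utype="char:SpectralAxis.Coverage.Bounds.Start"/>
      <FIELD name="obs_id" datatype="char" arraysize="*"/>
    </TABLE></RESOURCE></VOTABLE>"""

    root = ET.fromstring(votable)
    for field in root.iter("FIELD"):
        utype = field.get("utype", "")
        if utype.startswith("char:"):      # a Characterisation attribute
            print(field.get("name"), "->", utype)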
d) is probably the most critical part, because no VO scientific
application fully makes use of characterization.
In utype mode, Aladin only displays metadata with their utypes, but
cannot filter fields according to utype values as it does for UCDs.
TOPCAT can plot VOTable query response records against char fields, but
utypes are not directly visible on the plot axes (this was discussed
with Mark Taylor during the Cambridge AIDA meeting).
SaadaQL and Igor's ASPID-SR interface, as well as Alberto's
implementation of the ISO image log, allow queries based on utype
constraints, all using SQL-oriented queries, as extensions of ADQL will
probably do in the near future. But none of these usages is fully
standard at the moment.
Direct use of the XML structure has been implemented in the ASPID-SR
search engine, and is being extensively tested at CDS in connection
with workflow developments. A demo was given during the recent AIDA
meeting in Cambridge and will be improved for the Baltimore meeting in
the APPLICATIONS session!
Experiments with E. Auden on Observation/VOEvent matching have been
attempted... We are building a science case with Catherine Boisson.
But we need a fully working science case along these lines for
characterization to be fully understood by the community (this is what
Alberto has been saying for some time already, and it is also a
conclusion of the last char exercise at the Euro-VO DCA tutorial in
Garching).
Documents and links
Char Implementations (including links to ASPID-SR, Aladin server and
CAMEA):
http://www.ivoa.net/Documents/latest/ImplementationCharacterisation.html
Char Utype list:
http://www.ivoa.net/Documents/latest/UtypeListCharacterisationDM.html
CAMEA (Characterization Editing Tool):
http://wiki.eurovotech.org/twiki/bin/view/VOTech/CharacEditorTool
Alberto's user guide/tutorial:
http://www.ivoa.net/internal/IVOA/IvoaDataModel/ivoa_char_2d_image_tutorial_1.0.pdf
Bulgarian DCA Info Day, January 2008:
http://cds.u-strasbg.fr/twikiDCA/pub/EuroVODCA/Sofia_workshop_jan08/UCDandCharUtypes.pdf
EuroVO DCA workshop June 2008 material :
http://cds.u-strasbg.fr/twikiDCA/pub/EuroVODCA/DcaJune2008CDSMetadata/CharMetadataAndUtypes.pdf
Workflow validation using characterization :
http://wiki.eurovotech.org/twiki/pub/VOTech/DS6PlanningStage08/CDSDS6S7.pdf
-----------------------------------------------------------------------------------------------
*Step2* The discussion is launched by Fabien
--------------------------------------------------------------------------------
I would like to take advantage of the discussion initiated by Francois
to add a few points, and I hope that you will comment on them. I will
also give a presentation about this during the Baltimore meeting.
These are general comments about dissemination, which are another way
of expressing the concerns in Francois' document.
It occurred to me that the main reasons why only a few people use
characterization now are that:
1- it is too complicated: complicated for humans to read, for
programmers to implement, and for data providers to understand (STC,
units, ucd/utype);
2- it is not integrated with the other VO data models and protocols.
For point 1, I propose the following solutions:
1.1 To increase readability, allow a serialization in the JSON format.
JSON is very easy to parse, and also easy for humans to read. Note that
it is possible to convert JSON to XML if needed. See json.org for more
info.
1.2 Stop using UCDs and utypes. What is needed is a unique,
straightforward identifier. UCDs and utypes are derived from complex
theoretical considerations, making them difficult to parse and
understand. In practice, experience shows that software developers use
both of them only as static string identifiers to identify a field
(i.e. software does not try to make use of the hierarchy of classes).
For this purpose, the JSON variable name is enough, and all it needs is
to be clear and self-explanatory, e.g. 'instrument' instead of
'meta.id;instr' or 'ssa:DataID.Instrument', 'centralPosition' instead
of 'pos.eq.ra;meta.main' and 'pos.eq.dec;meta.main', etc. (see the
small sketch after point 1.3).
1.3 Stop using STC. I am not saying that STC is bad or useless, but I
believe it is not needed for characterization. The simple way to avoid
using it is to fix in the standard which reference frame should be used
for each characterization parameter. The same can be done for units: if
a fixed unit is defined in the standard, there is no need to even
specify it in the serialization (it is implicit). For example, I would
suggest defining sky directions in ICRS, in degrees.
The usual reaction to this idea is that people who own data recorded in
another native reference frame and unit feel discriminated against. But
it has to be clear that we are speaking here about the characterization
metadata only. It does not mean that the real data need to be converted
from their native units (for that, STC would be perfect), only that
descriptors such as the central position should be given in a common
reference frame. Having a single unit and reference frame allows
software developers to concentrate on only one, instead of being forced
to parse and convert all of them. Furthermore, if a data provider with
observations in a non-standard reference frame (such as planetary
observations) is unable to convert its own data to a standard one, it
cannot expect all client VO software to do that for it!
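To illustrate points 1.1-1.3 together, here is a minimal sketch of what
such a flat JSON record could look like, and how little code a client
needs once the frame and unit are fixed by the standard (ICRS,
degrees). All field names and values below are hypothetical, not part
of any adopted standard:

    # Minimal sketch of a flat JSON characterization record; the frame
    # (ICRS) and units (degrees) are implicit, fixed by the standard.
    import json, math

    record_json = """{
      "instrument": "GIRAFFE",
      "centralPosition": {"ra": 201.365, "dec": -43.019},
      "spatialExtent": 0.02,
      "exposureTime": 1800.0
    }"""

    rec = json.loads(record_json)

    # A trivial cone selection: no STC parsing, no unit conversion needed.
    def angular_sep_deg(ra1, dec1, ra2, dec2):
        ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
        c = (math.sin(dec1) * math.sin(dec2)
             + math.cos(dec1) * math.cos(dec2) * math.cos(ra1 - ra2))
        return math.degrees(math.acos(min(1.0, max(-1.0, c))))

    pos = rec["centralPosition"]
    print(angular_sep_deg(pos["ra"], pos["dec"], 201.3, -43.0) < 0.5)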
1.4 Standardize only what can be standardized without ambiguity. The
worst thing that can happen is that two implementations use the same
concept with slightly different meanings. This also means that there is
a need for an extension mechanism (already planned in the
characterization document) that data providers can use to add whatever
we could not agree on unambiguously.
This approach is pragmatic, and will encourage the usage of
characterization, even if some parts of the data do not have a standard
way of being described. My hope is that this approach will create "de
facto" standards through the pioneer users.
The set of fields which is standardized could be seen as a base class
from which other specialized classes can derive (adding extra fields).
A client application relying only on the standardized elements can then
safely assume that they are correct.
--------------------------------------------------------------------------------
*Step3* FB answers
--------------------------------------------------------------------------------
----> It is not that complicated: you can have a very small
----> serialisation if you take only some options. For example, is the
----> VOTable serialisation of char in SSA complicated?
--------------------------------------------------------------------------------
*Step4* Anita's answer
--------------------------------------------------------------------------------
I agree with Fabien that Char may seem too complicated, but I think
that a lot of the problem is where we are trying to be too specific. We
also have to think about who will use it and what knowledge they
already have.
Most astronomer-archivists are not at all fond of Java or of new
languages. On the other hand, most such people like the idea of UCDs,
although there is confusion over UCD1 vs. UCD1+ (how to convert
automatically?), and I still don't understand utypes - but maybe I
don't need to.
VOTable is generally well liked. Hence I am against adopting yet
another language...
--------------------------------------------------------------------------------
*Step5* Peter Skoda's feedback on using Spectrum and SSA
--------------------------------------------------------------------------------
Hi all!
I did not want to enter the interesting discussion, as I do not know
the details of JSON too well, but the example in the attachment did not
impress me too much in comparison with XML ;-)
I have just successfully tested the spectra cutout service based on
Pleinpot, which is the main engine of the ELODIE and GIRAFFE SSA
services, and now I am facing the decision of how to deliver the output
metadata - the system is still using only several VOX keywords for
position, axes, etc. I need the client to use the information, but even
the simple BAND is a problem for Specview - SPLAT can handle it, but it
does not support a TIME range for selection... So I am even afraid of
what the server should provide to allow the client basic functionality.
So far we are able to select spectra by position, and in some clients
by BAND, in all available SSA services. My server can now do the
cutout. That's all.
We are lacking implementation (in clients) of even the basic SSAP
requirements.
I do not know what the characterization can CURRENTLY be used for.
Even getting information about the wavelength range of a given spectrum
is not always possible.
In addition, we have a number of examples from real science where even
the current characterization cannot give enough information about the
observation and the reduction process involved (e.g. how differential
refraction, seeing and slit width influence the spectrum of a close
double-star pair, how the merging of echelle orders was done and how it
was rebinned, etc.).
So I am on both sides. As an astronomer I know the tricks and have
difficulty describing what I have done with the data (I think the log
from the reduction pipeline is essential to access as part of charac,
but most spectra are still reduced manually, in an intuitive manner,
even after pipeline processing). As an archive creator I am not able to
use, in a sensible manner in clients, the information I put in as an
astronomer.
Has anyone seen a VO system that would take the HJD of the spectra and
pass it to the plotting program as a parameter describing the stacked
spectra (i.e. with a small vertical offset between each other)?
That is the reality, and in addition there are only very few useful
spectra available in current SSA services - so you cannot restrict the
search too much and still get anything...
So this is the frustrating part of the issue, and I agree with Fabien's
original radical view!
--------
On the other hand, in the future, when the VO starts to bring benefits
to everyday spectroscopy analysis (it is now rather a toy for
engineers) and the VO client becomes an indispensable tool, people will
start to think about the characterization anyway.
So we should be prepared and have some solution - but let us not
require today that every server implement it in a rigorous way!
The STC and even the Paper III keywords are nice, but the real spectra
you can see are mostly reduced in IRAF and use either IRAF WCS keywords
like WAT_ or are rebinned into CRVAL1 and CDELT1 constants.
So even here you have a discrepancy between reality and wishes.
> If the VO would work, astronomer would not have to see how it works.
> They would just use the tools.
Absolutely! The tools should use the charac information to do something
useful (e.g. take the HJD to sort or stack the spectra; a small sketch
follows below).
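As a minimal sketch of this kind of use (the HJD values and fluxes are
invented, and the offset scheme is just one obvious choice), a plotting
tool could sort the spectra by the HJD taken from the characterization
metadata and stack them with a small vertical offset:

    # Minimal sketch: sort spectra by HJD (from char metadata) and stack
    # them with a small vertical offset, ready for plotting.
    spectra = [
        {"hjd": 2454741.52, "flux": [1.0, 1.2, 0.9]},   # invented values
        {"hjd": 2454739.31, "flux": [1.1, 1.0, 1.0]},
        {"hjd": 2454740.47, "flux": [0.8, 1.3, 1.1]},
    ]

    offset_step = 0.5
    for i, spec in enumerate(sorted(spectra, key=lambda s: s["hjd"])):
        shifted = [f + i * offset_step for f in spec["flux"]]
        print(spec["hjd"], shifted)   # a real tool would plot, not print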
--------------------------------------------------------------------------------
*Step 6* Igor's 2 cents
--------------------------------------------------------------------------------
Since the discussion is getting hot, I'll add my 2 cents...
Not about characterisation/STC, but about DMs in general. As I
understand it, we would like to reach the general astronomical public
and to promote data models among them (Characterisation, STC, later
Observation).
No need to repeat again that the success of the VO will be its complete
transparency for the end-user.
Regarding this, I would say that presently THE ONLY SUCCESSFUL ATTEMPT
at dealing with a data model that is easily understandable by an
astronomer (observer) who is completely unaware of (or who doesn't care
at all about) the VO technologies and standards is VOSpec, with its
on-the-fly unit conversion (thanks to SSAP + Spectrum DM). Here it is
clear: the resource and data are standard-compliant, making it very
easy to take a MIR spectrum expressed in um:mJy, an optical one in
A:erg/cm^2/s/A and an X-ray one in keV:photons/cm^2/s and plot them
together on the same graph. That's a real use case, where you can tell
the astronomer: "Look, it works because it's DM compliant, and if it
were not, you'd have to write yet another buggy Fortran code to convert
the units."
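A minimal sketch of the kind of conversion hidden behind such a plot,
restricted to the spectral axis only (flux conversion needs the same
metadata-driven logic), assuming the unit strings come from the
Spectrum DM metadata:

    # Minimal sketch: bring spectral axes declared in different units onto
    # a common wavelength scale (Angstrom), driven by the unit string.
    HC_KEV_ANGSTROM = 12.398  # h*c in keV * Angstrom (approximate)

    def to_angstrom(values, unit):
        if unit == "Angstrom":
            return list(values)
        if unit == "um":
            return [v * 1.0e4 for v in values]
        if unit == "keV":                  # photon energy -> wavelength
            return [HC_KEV_ANGSTROM / v for v in values]
        raise ValueError("unit not handled in this sketch: " + unit)

    print(to_angstrom([0.55], "um"))       # optical, ~5500 A
    print(to_angstrom([2.0], "keV"))       # X-ray, ~6.2 A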
A counter-example is SIAP. Since there is no data model for images, at
present it is NOT POSSIBLE to automatically mosaic images coming from
different telescopes/surveys, although normally it should be.
All the rest - the sophisticated query interface of ASPID-SR, metadata
in Aladin, CAMEA, and so on and so forth - is very far away from the
real astronomer's life. We have to change the strategy. If we want to
reach the community, the best way to do it is to publish a scientific
paper (and put it on astro-ph) identifying some *real applications* and
explaining (1) how to do the real things; (2) what is behind these
things, i.e. the DMs; (3) why it would be impossible without them. I am
repeating this for the Nth time: I'll be ready to lead the effort of
writing this paper and bringing it to a publication-ready state, but we
need to define what exactly we want to put in it and who will
contribute.
--------------------------------------------------------------------------------
*Step 7* Alberto, with his new ESO hat!
--------------------------------------------------------------------------------
Fabien is challenging the current poor status of the VO, whereby even
the simplest things are not working as they should. Try to find three
SSAP services covering different spectral regimes (like X-ray, optical
and infrared, for example) and prove to me that they work smoothly -
and I'm not saying that they have to be "fully compliant" with the
latest SSAP version, just three "workable" SSAPs will do...
Fabien went through this - he has hands-on experience - and he is very
right: it is a very hard and discouraging exercise.
And the same criticism is valid for many other VO standards.
He is hence trying, pragmatically, to point out simple solutions to
what should be simple problems.
Specifically regarding CharDM...
I'd like to emphasize the difference between the Data Discovery aspects
on one side, and the Analysis ones on the other.
This version of CharDM was supposed to address mainly Data Discovery.
Calibration and other similar aspects are not addressed by CharDM,
while they should be - in Provenance, for example.
The discovery aspect is by far the easiest. CharDM should permit very
straightforward queries and return quick and intelligible answers that
*simple* software can easily make sense of, so as to provide the
astronomers with an effective and quick selection mechanism.
Only later, after retrieval, will they want to analyse the data in
detail.
Did we succeed in that? Only partially, I'm afraid. While we do have a
model (it took years of compromises to get it published), I share
Fabien's opinion that it is difficult to read, and difficult for data
providers to implement quickly (when computing, or anyway extracting,
metadata from their data collections). And we failed, because after
more than a year we, ourselves, the authors of that document, have not
been able to come up with a simple tool that makes a sensible,
remarkable, enlightening and interoperable use of the CharDM (and it is
EASY to do, once we get a couple of data collections "characterised").
-------------------------------------------------------------------------------
*Step 8* My personal (FB) comment on the way the discussion is going
--------------------------------------------------------------------------------
About the discussion so far:
- In my first posting on char dissemination I tried to summarise the
difficulties we have had. I identified the lack of a full science
application using char as one of the difficulties we have in convincing
people to use it.
- The other basic difficulty is that even though char is present in
SSA, we lack it in SIA, because SIA2 is not there. I hope (and am
working with Doug and Jesus on this) that we will launch the last phase
of this from Baltimore onwards.
- This Observation DM work is a collective effort and is about building
the standard step by step, under the IVOA rules.
There are two errors to avoid:
- saying that everything has been done already and that "you just have
to apply standard so-and-so";
- saying that nothing good has been done yet and that we have to
restart from scratch (we could do that each time there is a difficulty,
and in that case we would have to do it all the time).
The good approach is to identify the actual problems and try to solve
them, in order to show something new.
4 examples of the new Observation DM concept will be shown in Baltimore.
----------------------------------------------------------------------------
*Step 9* Gretchen's comments
-------------------------------------------------------------------------------
Maybe I'm not reading all the details, because these messages get a
little lengthy, yet I'm finding a gap in the communication and a view
that reads very black and white.
What I mean is that the VO infrastructure is being described as data
models and then clients, with nothing in between.
The question I have is: what is the objective of the data models? If
the data models are to capture and describe the data accurately, with
preservation in mind, then the data models require completeness. If, on
the other hand, they are viewed simply as a transport mechanism, then
that is not the case, and read no further.
My own understanding is that it is the former: that the data models
are accurate and provide a structure for the native representations,
capturing data integrity.
The VOTable data model is, however, in my view more a transport
mechanism that provides a simple framework for higher-level abstraction
and generalization. The specific data models which characterize region,
time, spectral distribution, etc. need to account for the complexity
and heterogeneity, or information is lost. I don't see how this can not
be obvious.
If we are ONLY providing clients with higher-level views which omit the
scientific content of the data by forcing it into a format convenient
for software, then we are defeating the VO and doing what I often hear
scientists fear from the VO... changing the integrity and quality of
the data.
So is the VO a set of nifty client tools, or a framework that allows
real science to evolve in a modern network and grid computing
environment?
I challenge us to build the data models to be complete and accurate,
to build the tools to work with them, and to continue to provide
applications which make it possible for scientists to do the work they
do now in a richer framework.
----------------------------------------------------------------------------
*Step 10* The DM chair's advice
----------------------------------------------------------------------------
Sorry to those who have suffered mutely through this long list of
emails. Still, I use this contact list as I do not know who among you
is interested or not.
I just want to summarise the different points that have appeared in
this discussion and recap the strategy we have adopted in the DM WG.
Data model requirements:
As mentioned by many of you, DM was in charge of describing all the
metadata available to interpret astronomical data - mainly
observations, but possibly simulated observations - and of covering as
many aspects as possible in order to:
- propagate metadata, in relationship with the protocol definitions
inside the DAL WG;
- describe the information content in order to organise and carry out
data analysis of observations (e.g. image processing).
Therefore the DMs try to cover many use cases, and to be comprehensive.
Characterisation focuses on physical information, STC on coordinate
representation, RSM on any possible resource in the VO.
This generality is needed, I am afraid, but:
- not everybody needs this complexity; that is why we have designed
various levels in Characterisation, for instance. We also have a
Spectrum DM, focusing on simple spectra and taking care of the attached
data too.
There is a difference between the binding of a DM (the implementation
that a developer makes of a DM, by re-using a subset of its concepts)
and the rich set of classes of a data model.
Examples: Characterisation has a PostgreSQL binding via XML instance
documents included in relational tables (a small sketch follows below).
STC has a binding for footprint representation in the NVO footprint
service currently under development.
STC is also re-used in the VOEvent serialisation.
Etc.
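As a minimal sketch of what such a relational binding can look like
(SQLite is used here purely to keep the example self-contained; the
actual CDS binding uses PostgreSQL, and the table layout and XML
fragment are invented):

    # Minimal sketch: store Characterisation XML instance documents in a
    # relational table, with a few extracted columns for quick queries.
    import sqlite3

    char_xml = "<characterisation><spatialAxis>...</spatialAxis></characterisation>"

    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE observation (
                       obs_id   TEXT PRIMARY KEY,
                       ra_deg   REAL,   -- extracted for indexing / fast selection
                       dec_deg  REAL,
                       char_doc TEXT    -- full Characterisation XML instance
                   )""")
    con.execute("INSERT INTO observation VALUES (?, ?, ?, ?)",
                ("obs-001", 201.365, -43.019, char_xml))
    row = con.execute("SELECT obs_id, char_doc FROM observation "
                      "WHERE ra_deg BETWEEN 200 AND 202").fetchone()
    print(row[0])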
Data model interactions
The "big picture" model that we envisaged designing at the beginning of
the IVOA was a big challenge and not achievable at the time. There were
no semantic tags or recommended vocabulary then; it was just starting,
and many different jargons of FITS keywords were used in the various
archives.
That is why we started with Characterisation and focused on physical
axes and properties.
The Observation concepts are now more mature, protocols have been
settled to propagate data, and various points of view have been
discussed with the help of archive managers, the theory group, pipeline
designers...
So the next step is the integration of the working data models together
in an Observation DM.
It seems to me that it is not at all reasonable to redesign all the
levels, from Observation down to the coordinate definitions and
serialisation.
Improvements are OK; a new serialisation format (JSON, but also KML)
can be supported, but it needs to be at the same level of reliability
as XML (a W3C recommendation) to be widely used.
Simplification is OK, for example by distributing a small STC Java
library, with the most used STC classes, to build up new applications.
A Characterisation library dealing with the first 3 levels could be
developed too.
In the case of a large collection of objects, each of them described
with a small metadata subset, the table structure is still by far the
most effective; so the UCD tags, to classify metadata between tables,
and the utypes, to identify which part of a data model a piece of
metadata relates to, are necessary.
It is the opposite use case from what you have: a rich metadata set
about one or a few related observations; that is why the hierarchical
serialisation is necessary in applications dealing with data
visualisation, representation and analysis, like your Virgo
application.
This was just a short piece of history about the DM group. :-) I want
to point out that we are not so many contributors collaborating in this
effort, so it is important to converge and to pool our efforts with
constructive criticism.
Thanks to all,
Mireille Louys, DM chair
--------------------------------------------------------------------------------
*Step 11* Juan de Dios joins in
--------------------------------------------------------------------------------
Sorry for the delay in joining the discussion. I'll add my two cents:
Fabien started by indicating that characterisation is seeing little
use, and that he thinks that is because:
1. Characterisation is too complicated (for readability, implementers
and data providers).
2. It is not integrated with the other protocols.
I think maybe the CharDM is complicated, but I don't think it is so
because it is "baroque", or because it tries to comprise too much
information. I think we are all thinking about the bare minimum!
The other point is more interesting AND more difficult to deal with.
CharDM is an effort which tries to provide most of the metadata for an
observation in a way that is much more informative than the way that
information is stored in FITS files, and that does not require
downloading the file.
An additional aim for characterisation is to describe datasets as a
whole, and in that regard we have a less detailed version which is part
of the Registry, and which might be part of a potential "VOPackage" to
deliver large parts of datasets.
So characterisation should be part of the protocols, as long as there
are ways to query about properties in the CharDM. But that, I think, is
somewhat secondary, because we are still defining CharDM.
As for the solutions proposed, I don't think that CharDM is too
complicated for data providers or implementers, and as for humans,
there might be alternate representations. But I don't think JSON is
much better than XML for readability, and I think it is more fragile
than XML in case of partial truncation. And relationships (hierarchical
or purely relational) have to be specified by foreign keys, which
hamper readability.
-------------------------------------------------------------------------------
*Step 12* That's all for dissemination and generalities
---------------------------------------------------------------------------------