I ) Models "dissemination"
bonnarel at alinda.u-strasbg.fr
Sun Oct 26 16:28:58 PDT 2008
*Step1* I start with the summary I (F. Bonnarel) gave to the team last August
-----------------------------------------------------------------------
About DISSEMINATION of the Characterisation data model, which we
discussed in Garching and Trieste last spring.
A large part of what we discussed dealt with dissemination, i.e. how to
get IVOA partners (users as well as data providers and application
developers) to make use of characterization.
Concerns fall into different categories:
- a) As a provider, how do I decide which values I have to put in which
characterization DM attributes?
- b) Is characterization able to deal with my favorite data? E.g. 3D
spectroscopy, long-slit spectra, visibility data, polarimetry data,
X-ray or gamma-ray data, etc.
- c) Once I know that, how do I publish it in the VO?
- d) How can my users use these characterization data for different
use cases? Are there tools for that?
For a), it is possible to use Alberto Micol's tutorial document (about
2D images). I also wrote some slides for the last Euro-VO DCA workshop,
based on ACS image examples. Igor wrote a couple of examples based on
ASPID-SR data or, more recently, Giraffe... These are all pragmatic "by
example" approaches.
On the other hand, Fabien would like all the char stuff to be firmly
based on theoretical grounds. I will discuss his proposal elsewhere, in
relation to the Characterisation Level 4 material.
For b), we know from Igor's work that Char is perfectly able to handle
3D spectroscopy data.
Discussions are going on with Catherine Boisson about gamma-ray data.
Anita studied the "complex visibility" data case as well as
polarimetric data. I think she has shown how we can do it, but we have
to build a full example (this will be done this week; she sent me
example files).
Igor has also shown that it is possible to give some description of
snapshots from simulations in Char terms (see also elsewhere how the
Char concepts apply to simulated data).
As for c), a number of techniques have now been explored to format and
serialize characterisation data and publish it.
For utype-based (VOTable) formats we can tie the Level 1+2 attributes
to SIA1 query responses (as in SaadaDB, or the VOTable output of the
Aladin image server); a small sketch follows below. A new version of
the XMM images with characterization using SaadaDB will be released
after Baltimore. SSA does it naturally through its characterization
utypes. The hope is to get the same for images and cubes in SIA2 (one
of the top priorities for pushing this protocol forward).
The DM mapper and MEX also allow utypes to be matched to the outputs of
VO services.
For both XML and utype outputs, CDS developed CAMEA to build and check
valid characterization documents interactively, and more recently a
MappingGenerator mapper, allowing FITS keywords and combinations of
them to be mapped into char files...
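To make the utype route a bit more concrete, here is a minimal sketch
(the VOTable fragment, field names and utype strings are illustrative
only; the normative utype list is in the document linked below) of how
a client could pick out Characterisation attributes from a query
response by looking at the FIELD utypes:

    # Minimal sketch: find Characterisation utypes in a (hypothetical)
    # VOTable query response, using only the Python standard library.
    import xml.etree.ElementTree as ET

    votable = """<VOTABLE><RESOURCE><TABLE>
      <FIELD name="s_ra"   datatype="double" unit="deg"
             utype="char:SpatialAxis.Coverage.Location.Value"/>
      <FIELD name="em_min" datatype="double" unit="m"
             utype="char:SpectralAxis.Coverage.Bounds.Start"/>
      <FIELD name="obs_id" datatype="char" arraysize="*"/>
    </TABLE></RESOURCE></VOTABLE>"""

    root = ET.fromstring(votable)
    for field in root.iter("FIELD"):
        utype = field.get("utype", "")
        if utype.startswith("char:"):      # a Characterisation attribute
            print(field.get("name"), "->", utype)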
d) is probably the most critical part, because no VO scientific
application fully makes use of characterization.
In utype mode, Aladin only displays metadata with their utypes, but
cannot filter fields according to utype values as it does for UCDs.
TOPCAT can plot VOTable query response records against char fields, but
utypes are not directly visible on the plot axes (this was discussed
with Mark Taylor during the Cambridge AIDA meeting).
SaadaQL and Igor's ASPID-SR interface, as well as Alberto's
implementation of the ISO image log, allow queries based on utype
constraints, all using SQL-oriented queries, as extensions of ADQL will
probably do in the near future. But none of these usages is fully
standard at the moment.
Direct use of the XML structure has been implemented in the ASPID-SR
search engine, and is being extensively tested at CDS in connection
with workflow developments. A demo was given during the recent AIDA
meeting in Cambridge and will be improved for the Baltimore meeting in
the APPLICATIONS session!
Experiments with E. Auden on Observation/VOEvent matching have been
attempted... We are building a science case with Catherine Boisson.
But we need a fully working science case along these lines for
characterization to be fully understood by the community (this is what
Alberto has been saying for some time already, and it is also a
conclusion of the last char exercise at the Euro-VO DCA tutorial in
Garching).
Documents and links
Char Implementations (including links to ASPID-SR, Aladin server and
CAMEA):
http://www.ivoa.net/Documents/latest/ImplementationCharacterisation.html
Char Utype list:
http://www.ivoa.net/Documents/latest/UtypeListCharacterisationDM.html
CAMEA (Characterization Editing Tool):
http://wiki.eurovotech.org/twiki/bin/view/VOTech/CharacEditorTool
Alberto's user guide/tutorial:
http://www.ivoa.net/internal/IVOA/IvoaDataModel/ivoa_char_2d_image_tutorial_1.0.pdf
Bulgarian DCA Info Day, January 2008:
http://cds.u-strasbg.fr/twikiDCA/pub/EuroVODCA/Sofia_workshop_jan08/UCDandCharUtypes.pdf
EuroVO DCA workshop June 2008 material :
http://cds.u-strasbg.fr/twikiDCA/pub/EuroVODCA/DcaJune2008CDSMetadata/CharMetadataAndUtypes.pdf
Workflow validation using characterization :
http://wiki.eurovotech.org/twiki/pub/VOTech/DS6PlanningStage08/CDSDS6S7.pdf
-----------------------------------------------------------------------------------------------
*Step2* The discussion is launched by Fabien
--------------------------------------------------------------------------------
I would like to take advantage of the discussion initiated by Francois
to add a few points, and I hope that you will comment on them. I will
also give a presentation about this during the Baltimore meeting.
These are general comments about dissemination, which are another way
of expressing the concerns in Francois' document.
It occurred to me that the main reasons why only a few people use
characterization now are that:
1- it is too complicated: complicated for humans to read, for
programmers to implement, and for data providers to understand (STC,
units, ucd/utype);
2- it is not integrated with the other VO data models and protocols.
For point 1, I propose the following solutions:
1.1 To increase readability, allow a serialization in the JSON format.
JSON is very easy to parse, and also easy for humans to read. Note that
it is possible to convert JSON to XML if needed. See json.org for more
info.
1.2 Stop using UCDs and utypes. What is needed is a unique,
straightforward identifier. UCDs and utypes are derived from complex
theoretical considerations, making them difficult to parse and
understand. In practice, experience shows that software developers use
both of them only as static string identifiers to identify a field
(i.e. software does not try to make use of the hierarchy of classes).
For this purpose, the JSON variable name is enough, and all it needs is
to be clear and self-explanatory, e.g. 'instrument' instead of
'meta.id;instr' or 'ssa:DataID.Instrument', 'centralPosition' instead
of 'pos.eq.ra;meta.main' and 'pos.eq.dec;meta.main', etc. (see the
small sketch after point 1.3).
1.3 Stop using STC. I am not saying that STC is bad or useless, but I
believe it is not needed for characterization. The simple way to avoid
using it is to fix in the standard which reference frame should be used
for each characterization parameter. The same can be done for units: if
a fixed unit is defined in the standard, there is no need to even
specify it in the serialization (it is implicit). For example, I would
suggest defining sky directions in ICRS, in degrees.
The usual reaction to this idea is that people who own data recorded in
another native reference frame and unit feel discriminated against. But
it has to be clear that we are speaking here about the characterization
metadata only. It does not mean that the real data need to be converted
from their native units (for that, STC would be perfect), only that
descriptors such as the central position should be given in a common
reference frame. Having a single unit and reference frame allows
software developers to concentrate on only one, instead of being forced
to parse and convert all of them. Furthermore, if a data provider with
observations in a non-standard reference frame (such as planetary
observations) is unable to convert its own data to a standard one, it
cannot expect all client VO software to do that for it!
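To illustrate points 1.1-1.3 together, here is a minimal sketch of what
such a flat JSON record could look like, and how little code a client
needs once the frame and unit are fixed by the standard (ICRS,
degrees). All field names and values below are hypothetical, not part
of any adopted standard:

    # Minimal sketch of a flat JSON characterization record; the frame
    # (ICRS) and units (degrees) are implicit, fixed by the standard.
    import json, math

    record_json = """{
      "instrument": "GIRAFFE",
      "centralPosition": {"ra": 201.365, "dec": -43.019},
      "spatialExtent": 0.02,
      "exposureTime": 1800.0
    }"""

    rec = json.loads(record_json)

    # A trivial cone selection: no STC parsing, no unit conversion needed.
    def angular_sep_deg(ra1, dec1, ra2, dec2):
        ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
        c = (math.sin(dec1) * math.sin(dec2)
             + math.cos(dec1) * math.cos(dec2) * math.cos(ra1 - ra2))
        return math.degrees(math.acos(min(1.0, max(-1.0, c))))

    pos = rec["centralPosition"]
    print(angular_sep_deg(pos["ra"], pos["dec"], 201.3, -43.0) < 0.5)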
1.4 Standardize only what can be standardized without ambiguity. The
worst thing that can happen is that two implementations use the same
concept with slightly different meanings. This also means that there is
a need for an extension mechanism (already planned in the
characterization document) that data providers can use to add whatever
we could not agree on unambiguously.
This approach is pragmatic, and will encourage the usage of
characterization, even if some parts of the data do not have a standard
way of being described. My hope is that this approach will create "de
facto" standards through the pioneer users.
The set of fields which is standardized could be seen as a base class
from which other specialized classes can derive (adding extra fields).
A client application relying only on the standardized elements can then
safely assume that they are correct.
--------------------------------------------------------------------------------
*Step3* FB answers
--------------------------------------------------------------------------------
----> It is not that complicated: you can have a very small
----> serialisation if you take only some options. For example, is the
----> VOTable serialisation of char in SSA complicated?
--------------------------------------------------------------------------------
*Step4* Anita's answer
--------------------------------------------------------------------------------
I agree with Fabien that Char may seem too complicated, but I think
that a lot of the problem is where we are trying to be too specific. We
also have to think about who will use it and what knowledge they
already have.
Most astronomer-archivists are not at all fond of Java or of new
languages. On the other hand, most such people like the idea of UCDs,
although there is confusion over UCD1 vs. UCD1+ (how to convert
automatically?), and I still don't understand utypes - but maybe I
don't need to.
VOTable is generally well liked. Hence I am against adopting yet
another language...
--------------------------------------------------------------------------------
*Step5* Peter Skoda's feedback on using Spectrum and SSA
--------------------------------------------------------------------------------
Hi all!
I did not want to enter the interesting discussion, as I do not know
the details of JSON too well, but the example in the attachment did not
impress me too much in comparison with XML ;-)
I have just successfully tested the spectra cutout service based on
Pleinpot, which is the main engine of the ELODIE and GIRAFFE SSA
services, and now I am facing the decision of how to deliver the output
metadata - the system is still using only several VOX keywords for
position, axes, etc. I need the client to use the information, but even
the simple BAND is a problem for Specview - SPLAT can handle it, but it
does not support a TIME range for selection... So I am even afraid of
what the server should provide to allow the client basic functionality.
So far we are able to select spectra by position, and in some clients
by BAND, in all available SSA services. My server can now do the
cutout. That's all.
We are lacking implementation (in clients) of even the basic SSAP
requirements.
I do not know what the characterization can CURRENTLY be used for.
Even getting information about the wavelength range of a given spectrum
is not always possible.
In addition, we have a number of examples from real science where even
the current characterization cannot give enough information about the
observation and the reduction process involved (e.g. how differential
refraction, seeing and slit width influence the spectrum of a close
double-star pair, how the merging of echelle orders was done and how it
was rebinned, etc.).
So I am on both sides. As an astronomer I know the tricks and have
difficulty describing what I have done with the data (I think the log
from the reduction pipeline is essential to access as part of charac,
but most spectra are still reduced manually, in an intuitive manner,
even after pipeline processing). As an archive creator I am not able to
use, in a sensible manner in clients, the information I put in as an
astronomer.
Has anyone seen a VO system that would take the HJD of the spectra and
pass it to the plotting program as a parameter describing the stacked
spectra (i.e. with a small vertical offset between each other)?
That is the reality, and in addition there are only very few useful
spectra available in current SSA services - so you cannot restrict the
search too much and still get anything...
So this is the frustrating part of the issue, and I agree with Fabien's
original radical view!
--------
On the other hand, in the future, when the VO starts to bring benefits
to everyday spectroscopy analysis (it is now rather a toy for
engineers) and the VO client becomes an indispensable tool, people will
start to think about the characterization anyway.
So we should be prepared and have some solution - but let us not
require today that every server implement it in a rigorous way!
The STC and even the Paper III keywords are nice, but the real spectra
you can see are mostly reduced in IRAF and use either IRAF WCS keywords
like WAT_ or are rebinned into CRVAL1 and CDELT1 constants.
So even here you have a discrepancy between reality and wishes.
> If the VO would work, astronomer would not have to see how it works.
> They would just use the tools.
Absolutely! The tools should use the charac information to do something
useful (e.g. take the HJD to sort or stack the spectra; a small sketch
follows below).
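As a minimal sketch of this kind of use (the HJD values and fluxes are
invented, and the offset scheme is just one obvious choice), a plotting
tool could sort the spectra by the HJD taken from the characterization
metadata and stack them with a small vertical offset:

    # Minimal sketch: sort spectra by HJD (from char metadata) and stack
    # them with a small vertical offset, ready for plotting.
    spectra = [
        {"hjd": 2454741.52, "flux": [1.0, 1.2, 0.9]},   # invented values
        {"hjd": 2454739.31, "flux": [1.1, 1.0, 1.0]},
        {"hjd": 2454740.47, "flux": [0.8, 1.3, 1.1]},
    ]

    offset_step = 0.5
    for i, spec in enumerate(sorted(spectra, key=lambda s: s["hjd"])):
        shifted = [f + i * offset_step for f in spec["flux"]]
        print(spec["hjd"], shifted)   # a real tool would plot, not print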
--------------------------------------------------------------------------------
*Step 6* Igor's 2 cents
--------------------------------------------------------------------------------
Since the discussion is getting hot, I'll add my 2 cents...
Not about characterisation/STC, but about DMs in general. As I
understand it, we would like to reach the general astronomical public
and to promote data models among them (Characterisation, STC, later
Observation).
No need to repeat again that the success of the VO will be its complete
transparency for the end-user.
Regarding this, I would say that presently THE ONLY SUCCESSFUL ATTEMPT
at dealing with a data model that is easily understandable by an
astronomer (observer) who is completely unaware of (or who doesn't care
at all about) the VO technologies and standards is VOSpec, with its
on-the-fly unit conversion (thanks to SSAP + Spectrum DM). Here it is
clear: the resource and data are standard-compliant, making it very
easy to take a MIR spectrum expressed in um:mJy, an optical one in
A:erg/cm^2/s/A and an X-ray one in keV:photons/cm^2/s and plot them
together on the same graph. That's a real use case, where you can tell
the astronomer: "Look, it works because it's DM compliant, and if it
were not, you'd have to write yet another buggy Fortran code to convert
the units."
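A minimal sketch of the kind of conversion hidden behind such a plot,
restricted to the spectral axis only (flux conversion needs the same
metadata-driven logic), assuming the unit strings come from the
Spectrum DM metadata:

    # Minimal sketch: bring spectral axes declared in different units onto
    # a common wavelength scale (Angstrom), driven by the unit string.
    HC_KEV_ANGSTROM = 12.398  # h*c in keV * Angstrom (approximate)

    def to_angstrom(values, unit):
        if unit == "Angstrom":
            return list(values)
        if unit == "um":
            return [v * 1.0e4 for v in values]
        if unit == "keV":                  # photon energy -> wavelength
            return [HC_KEV_ANGSTROM / v for v in values]
        raise ValueError("unit not handled in this sketch: " + unit)

    print(to_angstrom([0.55], "um"))       # optical, ~5500 A
    print(to_angstrom([2.0], "keV"))       # X-ray, ~6.2 A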
A counter-example is SIAP. Since there is no data model for images, at
present it is NOT POSSIBLE to automatically mosaic images coming from
different telescopes/surveys, although normally it should be.
All the rest - the sophisticated query interface of ASPID-SR, metadata
in Aladin, CAMEA, and so on and so forth - is very far away from the
real astronomer's life. We have to change the strategy. If we want to
reach the community, the best way to do it is to publish a scientific
paper (and put it on astro-ph) identifying some *real applications* and
explaining (1) how to do the real things; (2) what is behind these
things, i.e. the DMs; (3) why it would be impossible without them. I am
repeating this for the Nth time: I'll be ready to lead the effort of
writing this paper and bringing it to a publication-ready state, but we
need to define what exactly we want to put in it and who will
contribute.
--------------------------------------------------------------------------------
*Step 7* Alberto, with his new ESO hat!
--------------------------------------------------------------------------------
Fabien is challenging the current poor status of the VO, whereby even
the simplest things are not working as they should. Try to find three
SSAP services covering different spectral regimes (like X-ray, optical
and infrared, for example) and prove to me that they work smoothly -
and I'm not saying that they have to be "fully compliant" with the
latest SSAP version, just three "workable" SSAPs will do...
Fabien went through this - he has hands-on experience - and he is very
right: it is a very hard and discouraging exercise.
And the same criticism is valid for many other VO standards.
He is hence trying, pragmatically, to point out simple solutions to
what should be simple problems.
Specifically regarding CharDM...
I'd like to emphasize the difference between the Data Discovery aspects
on one side, and the Analysis ones on the other.
This version of CharDM was supposed to address mainly Data Discovery.
Calibration and other similar aspects are not addressed by CharDM,
while they should be - in Provenance, for example.
The discovery aspect is by far the easiest. CharDM should permit very
straightforward queries and return quick and intelligible answers that
*simple* software can easily make sense of, so as to provide the
astronomers with an effective and quick selection mechanism.
Only later, after retrieval, will they want to analyse the data in
detail.
Did we succeed in that? Only partially, I'm afraid. While we do have a
model (it took years of compromises to get it published), I share
Fabien's opinion that it is difficult to read, and difficult for data
providers to implement quickly (when computing, or anyway extracting,
metadata from their data collections). And we failed, because after
more than a year we, ourselves, the authors of that document, have not
been able to come up with a simple tool that makes a sensible,
remarkable, enlightening and interoperable use of the CharDM (and it is
EASY to do, once we get a couple of data collections "characterised").
-------------------------------------------------------------------------------
*Step 8* My personal (FB) comment on the way the discussion is going
--------------------------------------------------------------------------------
About the discussion so far:
- In my first posting on char dissemination I tried to summarise the
difficulties we have had. I identified the lack of a full science
application using char as one of the difficulties we have in convincing
people to use it.
- The other basic difficulty is that even though char is present in
SSA, we lack it in SIA, because SIA2 is not there. I hope (and am
working with Doug and Jesus on this) that we will launch the last phase
of this from Baltimore onwards.
- This Observation DM work is a collective effort and is about building
the standard step by step, under the IVOA rules.
There are two errors to avoid:
- saying that everything has been done already and that "you just have
to apply standard so-and-so";
- saying that nothing good has been done yet and that we have to
restart from scratch (we could do that each time there is a difficulty,
and in that case we would have to do it all the time).
The good approach is to identify the actual problems and try to solve
them, in order to show something new.
4 examples of the new Observation DM concept will be shown in Baltimore.
----------------------------------------------------------------------------
*Step 9* Gretchen's comments
-------------------------------------------------------------------------------
Maybe I'm not reading all the details, because these messages get a
little lengthy, yet I'm finding a gap in the communication and a view
that reads very black and white.
What I mean is that the VO infrastructure is being described as data
models and then clients, with nothing in between.
The question I have is: what is the objective of the data models? If
the data models are to capture and describe the data accurately, with
preservation in mind, then the data models require completeness. If, on
the other hand, they are viewed simply as a transport mechanism, then
that is not the case, and read no further.
My own understanding is that it is the former: that the data models
are accurate and provide a structure for the native representations,
capturing data integrity.
The VOTable data model is, however, in my view more a transport
mechanism that provides a simple framework for higher-level abstraction
and generalization. The specific data models which characterize region,
time, spectral distribution, etc. need to account for the complexity
and heterogeneity, or information is lost. I don't see how this can not
be obvious.
If we are ONLY providing clients with higher-level views which omit the
scientific content of the data by forcing it into a format convenient
for software, then we are defeating the VO and doing what I often hear
scientists fear from the VO... changing the integrity and quality of
the data.
So is the VO a set of nifty client tools, or a framework that allows
real science to evolve in a modern network and grid computing
environment?
I challenge us to build the data models to be complete and accurate,
to build the tools to work with them, and to continue to provide
applications which make it possible for scientists to do the work they
do now in a richer framework.
----------------------------------------------------------------------------
*Step 10* The DM chair's advice
----------------------------------------------------------------------------
Sorry to those who have suffered mutely through this long list of
emails. Still, I use this contact list as I do not know who among you
is interested or not.
I just want to summarise the different points that have appeared in
this discussion and recap the strategy we have adopted in the DM WG.
Data model requirements:
As mentioned by many of you, DM was in charge of describing all the
metadata available to interpret astronomical data - mainly
observations, but possibly simulated observations - and of covering as
many aspects as possible in order to:
- propagate metadata, in relationship with the protocol definitions
inside the DAL WG;
- describe the information content in order to organise and carry out
data analysis of observations (e.g. image processing).
Therefore the DMs try to cover many use cases, and to be comprehensive.
Characterisation focuses on physical information, STC on coordinate
representation, RSM on any possible resource in the VO.
This generality is needed, I am afraid, but:
- not everybody needs this complexity; that is why we have designed
various levels in Characterisation, for instance. We also have a
Spectrum DM, focusing on simple spectra and taking care of the attached
data too.
There is a difference between the binding of a DM (the implementation
that a developer makes of a DM, by re-using a subset of its concepts)
and the rich set of classes of a data model.
Examples: Characterisation has a PostgreSQL binding via XML instance
documents included in relational tables (a small sketch follows below).
STC has a binding for footprint representation in the NVO footprint
service currently under development.
STC is also re-used in the VOEvent serialisation.
Etc.
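As a minimal sketch of what such a relational binding can look like
(SQLite is used here purely to keep the example self-contained; the
actual CDS binding uses PostgreSQL, and the table layout and XML
fragment are invented):

    # Minimal sketch: store Characterisation XML instance documents in a
    # relational table, with a few extracted columns for quick queries.
    import sqlite3

    char_xml = "<characterisation><spatialAxis>...</spatialAxis></characterisation>"

    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE observation (
                       obs_id   TEXT PRIMARY KEY,
                       ra_deg   REAL,   -- extracted for indexing / fast selection
                       dec_deg  REAL,
                       char_doc TEXT    -- full Characterisation XML instance
                   )""")
    con.execute("INSERT INTO observation VALUES (?, ?, ?, ?)",
                ("obs-001", 201.365, -43.019, char_xml))
    row = con.execute("SELECT obs_id, char_doc FROM observation "
                      "WHERE ra_deg BETWEEN 200 AND 202").fetchone()
    print(row[0])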
Data model interactions
The "big picture" model that we envisaged designing at the beginning of
the IVOA was a big challenge and not achievable at the time. There were
no semantic tags or recommended vocabulary then; it was just starting,
and many different jargons of FITS keywords were used in the various
archives.
That is why we started with Characterisation and focused on physical
axes and properties.
The Observation concepts are now more mature, protocols have been
settled to propagate data, and various points of view have been
discussed with the help of archive managers, the theory group, pipeline
designers...
So the next step is the integration of the working data models together
in an Observation DM.
It seems to me that it is not at all reasonable to redesign all the
levels, from Observation down to the coordinate definitions and
serialisation.
Improvements are OK; a new serialisation format (JSON, but also KML)
can be supported, but it needs to be at the same level of reliability
as XML (a W3C recommendation) to be widely used.
Simplification is OK, for example by distributing a small STC Java
library, with the most used STC classes, to build up new applications.
A Characterisation library dealing with the first 3 levels could be
developed too.
In the case of a large collection of objects, each of them described
with a small metadata subset, the table structure is still by far the
most effective; so the UCD tags, to classify metadata between tables,
and the utypes, to identify which part of a data model a piece of
metadata relates to, are necessary.
It is the opposite use case from what you have: a rich metadata set
about one or a few related observations; that is why the hierarchical
serialisation is necessary in applications dealing with data
visualisation, representation and analysis, like your Virgo
application.
This was just a short piece of history about the DM group. :-) I want
to point out that we are not so many contributors collaborating in this
effort, so it is important to converge and to pool our efforts with
constructive criticism.
Thanks to all,
Mireille Louys, DM chair
--------------------------------------------------------------------------------
*Step 11* Juan de Dios joins in
--------------------------------------------------------------------------------
Sorry for the delay in joining the discussion. I'll add my two cents:
Fabien started by indicating that characterisation is seeing little
use, and that he thinks that is because:
1. Characterisation is too complicated (for readability, implementers
and data providers).
2. It is not integrated with the other protocols.
I think maybe the CharDM is complicated, but I don't think it is so
because it is "baroque", or because it tries to comprise too much
information. I think we are all thinking about the bare minimum!
The other point is more interesting AND more difficult to deal with.
CharDM is an effort which tries to provide most of the metadata for an
observation in a way that is much more informative than the way that
information is stored in FITS files, and that does not require
downloading the file.
An additional aim for characterisation is to describe datasets as a
whole, and in that regard we have a less detailed version which is part
of the Registry, and which might be part of a potential "VOPackage" to
deliver large parts of datasets.
So characterisation should be part of the protocols, as long as there
are ways to query about properties in the CharDM. But that, I think, is
somewhat secondary, because we are still defining CharDM.
As for the solutions proposed, I don't think that CharDM is too
complicated for data providers or implementers, and as for humans,
there might be alternate representations. But I don't think JSON is
much better than XML for readability, and I think it is more fragile
than XML in case of partial truncation. And relationships (hierarchical
or purely relational) have to be specified by foreign keys, which
hamper readability.
-------------------------------------------------------------------------------
*Step 12* That's all for dissemination and generalities
---------------------------------------------------------------------------------