Cambridge working group minutes

Jonathan McDowell jcm at head-cfa.cfa.harvard.edu
Tue May 20 08:25:37 PDT 2003


IVOA Data Models WG 

Cambridge May 2003 - Meeting Notes

A summary is available at
http://hea-www.harvard.edu/~jcm/vo/cam/summary.html

At the Interop meeting the Data Models group met on Tues May 13. The
work of the UCD and DAL working groups was also highly relevant to the
DM process and you are encouraged to read their summaries too. Thanks to
Alberto Micol and Ray Plante for taking notes on the meeting.

The morning was taken up with presentations followed by a
discussion of process. In the afternoon we had open discussion,
which led to a list of possible objects to model and the selection
of a subset of objects to take on as immediate work packages.
The outcome was an agreement on process and a list of
work packages with individuals assigned as leaders.

I did not take attendance at the WG - I was very encouraged to
see the large number of people present. The following people either
gave presentations or took on work package (WP) responsibilities; anyone
else who wants to be associated formally with the WG is welcome to
do so and should contact me.

*** PRESENTATIONS ***

The presentations are on the Wiki page. They were:

Jonathan McDowell (CfA/NVO) DM Process
Ray Plante (NCSA-Illinois/NVO) Quantity data model
Mireille Louys (CDS/AVO-F)  IDHA Model
Patrick Dowler (CADC/CVO)   Canadian VO data model
Jonathan McDowell (CfA/NVO) Spectral data model
David Giaretta (Starlink/Astrogrid) HDX model
Norman Gray    (Starlink/Astrogrid) HDX model
Dave Berry (Starlink/Astrogrid) Starlink WCS
Arnold Rots (CfA/NVO)       SpaceTime Coords

To summarize (very unevenly, depending on my available notes):

Ray's talk argued that we should define a model for a simple
value-unit-uncertainty object which could be used as an atom for building
more complicated models. Mireille presented a revision of the Aladin
IDHA model which handles the question of pipeline processing. Patrick
presented the Canadian model, which models an archive query over data
covering the spectral, spatial and temporal domains, with uniformly
generated index catalogs giving statistical information on source
detection in each field. David and Norman presented HDX, which is a
container model, usually serialized as XML, which includes the NDX model
for n-dimensional images with variance and metadata. HDX draws a
distinction between coordination metadata which describes how different
components interact, and true metadata within the components. Dave
described the toolkit WCS model used in Starlink, in which simple
software components are chained together to make complex transformations
(such as, but not limited to, FITS WCS transforms). It uses the concepts
of Mapping (a transformation between spaces), Frame (a physical domain,
including domain-specific knowledge), and FrameSet (a network of Frames
connected by Mapping objects). Arnold reviewed his Space Time
Coordinates paper (already available on the mailing lists).
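
To make the Mapping/Frame/FrameSet idea concrete, here is a minimal
Python sketch of the concepts as I understood them; all class and
method names below are hypothetical illustrations, not the actual
Starlink interfaces:

    # Minimal sketch of the Mapping / Frame / FrameSet concepts; all
    # class and method names are hypothetical, not the Starlink API.
    class Mapping:
        """A transformation between two coordinate spaces."""
        def __init__(self, forward):
            self.forward = forward       # callable: tuple -> tuple

        def then(self, other):
            """Chain two Mappings into a compound Mapping."""
            return Mapping(lambda c: other.forward(self.forward(c)))

    class Frame:
        """A physical domain (e.g. pixel or sky coordinates)."""
        def __init__(self, domain):
            self.domain = domain

    class FrameSet:
        """A network of Frames connected by Mapping objects."""
        def __init__(self, base):
            self.frames = {base.domain: base}
            self.mappings = {}           # (from, to) -> Mapping

        def add_frame(self, frame, from_domain, mapping):
            self.frames[frame.domain] = frame
            self.mappings[(from_domain, frame.domain)] = mapping

        def convert(self, coords, from_domain, to_domain):
            return self.mappings[(from_domain, to_domain)].forward(coords)

    # A pixel-to-sky transform built by chaining two simple Mappings:
    scale = Mapping(lambda xy: (0.001 * xy[0], 0.001 * xy[1]))
    shift = Mapping(lambda xy: (xy[0] + 180.0, xy[1] - 30.0))
    wcs = FrameSet(Frame("PIXEL"))
    wcs.add_frame(Frame("SKY"), "PIXEL", scale.then(shift))
    print(wcs.convert((512, 512), "PIXEL", "SKY"))   # (180.512, -29.488)

The point of the FrameSet is that new Frames can be grafted onto the
network, and conversions between connected domains then follow by
composing Mappings.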

*** PROCESS ***

Here I try to reflect the discussions as well as the
conclusions.

What is a data model? How can we tell when we have completed one, where
do we put it after that, and what do we do with it then?

1)  Is a data model
       a UML class diagram, or
       an XSD file (XML schema)?

The issue is whether the language-independence and abstract
nature of the UML is more important than the practical convenience
of an XML schema which can be automatically converted to code.
We concluded that the fundamental reference definition of each data
model would be a UML class diagram, but that we would also require a
reference representation as an XSD file to clarify intent and to
serve as a starting point for software implementations. 

Other possibilities were also discussed; Norman Gray recommended
a look at RDF (Resource Description Framework), which has triples
of the form "resource, relationship, value", where value can
be either a scalar value or another resource. This allows
nested inferences about the relationships between resources
(which mostly equate to objects in our context).
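
As a rough illustration of the triple idea (mine, not discussed at the
meeting), here is a Python sketch in which relationships are chained
to make a nested inference; all resource names are invented:

    # Rough sketch of RDF-style triples (resource, relationship, value);
    # a value may itself be a resource, allowing nested inference.
    # All resource names below are invented for illustration.
    triples = [
        ("spectrum42", "observedBy", "instrumentA"),
        ("instrumentA", "mountedOn", "telescopeB"),
        ("telescopeB", "location", "Mauna Kea"),
    ]

    def related(resource, relationship):
        """Return the values linked to a resource by one relationship."""
        return [v for (r, rel, v) in triples
                if r == resource and rel == relationship]

    # Chained inference: where was spectrum42 observed?
    for inst in related("spectrum42", "observedBy"):
        for tel in related(inst, "mountedOn"):
            print(related(tel, "location"))          # ['Mauna Kea']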


2)  Does a VO data model
       include definitions of its methods (functions), or
       just the attributes?

David Giaretta and others argued for the importance of defining
interoperable interfaces to our objects, but a consensus was reached
that the first step was just to model attributes so that files
could be interoperably interchanged using e.g. XML, possibly with
several different ways of accessing the data. Meanwhile, the
standard interfaces to the objects will be the work of the Data Access
Layer WG. However, the DM process may lay out some methods if they
are important and obvious, to prevent them being implemented in
many different ways. It was suggested that a reference Java
implementation should be required if methods are included, but
I don't think there was consensus on this.

3)  Do we want to concentrate on
       an overarching astronomical data model, or
       small components?

The danger with concentrating on the big picture is that we
won't converge for years and the other WGs need input now. The
danger with working on components is that two components
may later be seen to have commonality (e.g. they should be subclasses
of a single thing) and that may be missed if they are not modelled
in the larger context. We concluded that we should start with small
components, so that we can deliver something on a short timescale,
but also work on the big picture model as an evolving thing to refer
to, with a less defined schedule for completion.

Work on the big picture provides a vision of the larger scope which
will inform work on the lower levels. Big-picture models should
be posted to the DM twiki 
http://www.ivoa.net/twiki/bin/view/IVOA/IvoaDataModel


4)  Are we modelling
       simple information of the kind needed for archive queries,
       or detailed information as needed for data analysis?

We noticed that the presentations tended to fall into these two
categories. We decided we should cover both cases; usually the simple
concept will be a subclass of the deeper model. For Bandpass, for
example, the query usually needs only a simple range, since it
can live with a false positive, while data analysis may need
the detailed transmission profile.
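
As an illustrative sketch of that subclass relationship (all names
here are hypothetical, not agreed models), the deep Bandpass could
carry the transmission profile while the query-level subclass carries
only the range:

    # Hypothetical sketch: the deep model carries the transmission
    # profile; the query-level subclass needs only a range.
    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class Bandpass:
        """Analysis-level model: full transmission profile."""
        profile: List[Tuple[float, float]] = field(default_factory=list)
        # (wavelength, transmission) samples

    @dataclass
    class QueryBandpass(Bandpass):
        """Query-level subclass: a simple range; false positives are
        acceptable, so the profile may be left empty."""
        lo: float = 0.0                  # lower wavelength bound
        hi: float = 0.0                  # upper wavelength bound

        def overlaps(self, lo, hi):
            return self.lo <= hi and lo <= self.hi

Here QueryBandpass(lo=400.0, hi=700.0).overlaps(650.0, 800.0) is True
even if the detailed profile shows negligible transmission in the
overlap - exactly the false positive the query can live with.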

5)  Does the scope of the WG include
       just astronomical data, or
       a complete model of all VO components?

It was argued that more computer-science oriented things like grid
resources also need a data model. Ray suggested, and others supported
the proposal, that for such non-astronomical-data parts of the model
the relevant WG (e.g. the Registry WG for the registry resource data
model) should develop the data model following the DM WG process and
submit it to the DM WG for final approval. The DM WG will then
concentrate on developing models for astronomical data (but not
just observational data - derived and theoretical data should be
included).

6) Where do we put the data models?

This is subject to revision by the Process WG, but possibilities
included the Wiki, a static web page, or a registry. I think we
concluded that a registry was the proper long-term solution.

7) What do we do with them?

Some uses mentioned were: 
  - Provide standard terminology for discussing concepts
  - Provide metadata structures for the data access layer
    and other applications.
  - Automatically generate software classes 

JCM presented the DM process from the 2002 Cambridge, MA technical
meeting, and this was lightly amended by the WG. The new process
involves the following steps; it is anticipated that each step will
usually trigger a loop of discussion and result in revisions to
earlier steps.

1) Write a text white paper defining and discussing the concepts, including
   algorithmic details when appropriate.

2) Generate a document containing the UML class diagram and text
   describing the classes and attributes. (This may be combined with the
   general discussion in (1).)

   All attributes should be specified and should be tagged with UCDs. 
   Methods are not required, but if specified their input and output
   arguments should be tagged with UCDs, and appropriate UML use case
   diagrams should be included in the document.

   The document (and model) should include versioning of some kind.

   To avoid huge diagrams we recommend, where needed, a set of nested
   class diagrams with only half a dozen boxes per page.


3) Generate a reference representation XML Schema (XSD) file, to provide
   the basis for an initial implementation, and some XML instance
   examples to clarify the intent and provide test data for interchange
   (an illustrative sketch follows step 4 below). At this point the
   model can be considered compliant with IVOA standards for data models
   and is a candidate for approval as the recommended model for the
   concept.

   The XSD file should be compatible with standard code generation tools
   (Will O'M should please provide the names of two such tools to serve
   as the qualifying tools for this requirement). We note that if the
   XSD is generated by tools from the UML (rather than hand coded), this
   compliance is likely to be met.

4) The IVOA DM WG will consider the model for approval. Normally
   this should not happen unless at least one pair of groups has
   successfully interchanged data using software based on the model.
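
As an illustration of step 3's XML instance examples, here is a
minimal Python sketch that serializes a hypothetical Quantity-like
instance; the element names and the UCD string are invented for
illustration, not an agreed IVOA schema:

    # Minimal sketch of an XML instance example for a Quantity-like
    # model (step 3). Element names and the UCD string are invented
    # for illustration; they are not an agreed IVOA schema.
    import xml.etree.ElementTree as ET

    q = ET.Element("Quantity", ucd="phot.flux")      # hypothetical UCD
    ET.SubElement(q, "value").text = "1.23e-14"
    ET.SubElement(q, "unit").text = "erg/s/cm2/A"
    ET.SubElement(q, "uncertainty").text = "0.05e-14"

    print(ET.tostring(q, encoding="unicode"))
    # -> <Quantity ucd="phot.flux"><value>1.23e-14</value>...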


*** OBJECTS ***

We called out a heterogeneous list of interesting objects; then
JCM tried to organize them in an overall model, subject to
appropriate heckling by audience members. Here is a summary
of the objects mentioned; I hope the general meaning of each
is fairly obvious.

We called "GRID RESOURCES" an object outside the main data model
(we'll probably have a VO model eventually that includes both
this and DATA).

DATA was deemed to include ND-IMAGE and TABLE data, with
IMAGE, SPECTRA, 2D SPECTRA, 3D CUBE, and TIMESERIES probably under
ND-IMAGE, and CATALOG, SIMULATIONS, EVENTS and INTERFEROMETRY possibly under
TABLE. Both kinds of data share the OBSERVATION metadata model,
including COVERAGE (which in turn includes TIME_OBS, SPACE TIME COORDS
and SKY REGION). ND-IMAGE data has AXES (including POLARIZATION
and SPACE TIME COORDS) and an OBSERVABLE (e.g. FLUX; here OBSERVABLE
just means the dependent-variable pixel values, and is not
meant to require that these values be a direct measurement - they could
be theoretical values too. A better name is solicited.) Both AXES and
OBSERVABLE are examples of QUANTITY (with extra information such
as axis values) and may have TRANSFORM associated with them.
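
A rough Python transcription of this tentative hierarchy may help; all
names and relationships below are provisional sketches of the list
above, not agreed models:

    # Purely illustrative transcription of the tentative hierarchy
    # above; every name and relationship here is provisional.
    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Transform:
        description: str = ""            # e.g. a WCS-style mapping

    @dataclass
    class Quantity:                      # see the [QUANTITY] package
        value: object = None
        unit: str = ""
        transform: Optional[Transform] = None

    @dataclass
    class Coverage:                      # TIME_OBS, STC, SKY REGION
        time_obs: object = None
        space_time_coords: object = None
        sky_region: object = None

    @dataclass
    class Observation:                   # shared metadata model
        coverage: Coverage = field(default_factory=Coverage)

    @dataclass
    class Data:
        observation: Observation = field(default_factory=Observation)

    @dataclass
    class NDImage(Data):                 # IMAGE, SPECTRA, cubes, ...
        axes: List[Quantity] = field(default_factory=list)
        observable: Optional[Quantity] = None   # e.g. FLUX pixel values

    @dataclass
    class Table(Data):                   # CATALOG, EVENTS, ...
        columns: List[Quantity] = field(default_factory=list)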

The OBSERVATION model may be the host for RESOLUTION and SENSITIVITY
as well as OBSERVATION INFO, PROCESSING and FIELD VALUES (like LIMITING FLUX
and SPATIAL FREQUENCIES BANDPASS).

A special kind of DATA is a SOURCE, which is a subset of
other data thought to be physically connected 
and with extra properties that might form a set of catalog entries.
Another kind is OBJECT, the real thing in the sky that SOURCE is
identified with. (Is OBJECT just a kind of DATA - because it is
a one-line CATALOG - or is it something different?)

Arnold noted that there are at least two orthogonal aspects to
these models - the computer-science data types and storage
organization, versus the structures and relationships that
encode scientific meaning.


*** WORK PACKAGES ***

We picked a number of objects for further analysis, with a
work package leader for each one. There are two groups of
work packages: Group 1 which we hope to bring to completion
on a short timescale, with the white paper (process step 1)
no later than the 2003 ADASS meeting (Oct 12), and Group 2
which are more in the nature of ongoing research projects,
with no specific timescale yet assigned.

Each work package is assigned a tag, e.g. [SPECTRA], which
should be included in the subject line of messages to the 
mailing list. The responsibility of the package leader is
to initiate (and keep alive) discussion on the topic on
the dm at ivoa.net mailing list, and to delegate someone to
draft the white paper. We agreed that occasional (perhaps
monthly) DM WG telecons might be useful to review progress.

The packages and their leaders are:

Group 1:

[SPECTRA]    Jonathan McDowell (CfA/NVO)    1-dimensional spectra
[TIME-OBS]   Patrick Dowler (CADC/CVO)      Observation time
[RESOLUTION] Patrick Dowler (CADC/CVO)      Resolution of various kinds
[QUANTITY]   Ray Plante (Illinois/NVO)      Value/unit/uncertainty

Group 2:

[LIMIT-FLUX] Patrick Dowler (CADC/CVO)      Limiting flux
[TRANSFORMS] Dave Berry (Starlink/AG)       WCS, units, etc. on quantities
[INTERFEROMETRY] Peter Lamb (ANU/OzVO)      Particularly, radio issues
[SIMULATIONS] Gerard Lemson (GAVO)          Theory in the VO
[OBSERVATIONS] Alberto Micol (ESO/AVO)      Big picture, and observing 
                                             metadata.

Among the packages, the ones being led by Patrick Dowler are those
requested by Doug Tody as needed soon for the Data Access Layer WG. It
was suggested by Arnold that progress on limiting flux [LIMIT-FLUX] will
be difficult (for many cases, esp. X-ray, this varies wildly across a
single field), but Doug argued that even something imprecise would be
useful. Although [SPECTRA] is largely a subclass of the N-dimensional
image, we decided it was worth tackling the simpler case first,
while of course keeping its ultimate generalization in mind.
It was argued that [RESOLUTION] might be something possessed by
all [QUANTITY] objects, although others (including me) felt that
one should distinguish between a general [QUANTITY], which could
be any physical variable, and a
MEASUREMENT, which represents something measured in a detection
process (for which RESOLUTION is meaningful). Of course, in
general there is often a choice to make between subclassing
(MEASUREMENT is a QUANTITY but with extra attributes) and
providing default values (all QUANTITY objects have the measurement
attributes but often they are not filled in); this is a choice
about overhead versus flexibility. [TRANSFORMS] will depend
strongly on [QUANTITY].
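
To illustrate that subclassing-versus-defaults choice, here is a
minimal Python sketch of the two options (all names are hypothetical):

    # Two illustrative ways to relate QUANTITY and MEASUREMENT;
    # all names here are hypothetical.
    from dataclasses import dataclass
    from typing import Optional

    # Option A: subclassing - MEASUREMENT is a QUANTITY with extras.
    @dataclass
    class Quantity:
        value: float
        unit: str = ""

    @dataclass
    class Measurement(Quantity):
        resolution: Optional[float] = None   # meaningful for detections

    # Option B: defaults - every Quantity carries the measurement
    # attributes, usually unfilled (more overhead, more flexibility).
    @dataclass
    class QuantityWithDefaults:
        value: float
        unit: str = ""
        resolution: Optional[float] = None   # often simply None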



