Converging the Data Models

Thu Sep 16 10:58:38 PDT 2004

A proposed simplified XML schema for Char, STC and Q.
-----------------------------------------------------

During my long silence, I've been thinking hard about how to unify the
approaches of CDS (Characterization), Arnold (STC), and Brian et al.
(Quantity).

First, some observations:

  - The objects in Char are very close to the concepts elaborated in STC,
    but Char redefines these because Francois and Mireille don't want the
    Char definition to be dependent on the complexities of STC.

  - Elements in Arnold's schema are very similar to Quantity, but
    don't use the Quantity schema.

The problem here is that each schema, when done right, is so big that
it's hard to buy into. But we must reuse components rather than all
going off in separate directions. So I have a proposal for converging:

1) The Char schema should use the appropriate STC and Q elements, but in
   an initial definition it should use relatively tiny toy STC and Q
   schemas, which would be included in the Char definition.

2) Toy-STC would have the property that element instances of Toy-STC would
   be valid instances of full STC. Same for Toy-Q. Therefore, 
   a simple change to the namespace references would be all that is needed to
   layer Char on the full STC and Q. The existence of the full STC and Q
   schemas gives us the guarantee that Char elements defined in this way
   are suitably extensible in the future.

3) Of course, there will be many valid instances of full STC that will
   not be instances of Toy-STC, and so anyone using full STC underneath Char
   might make Char instances that are not valid instances of the simple Char.
   So, interoperable deployment of the full version would need to be
   coordinated.
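
The subset relationship in points 2) and 3) can be sketched in a few
lines of Python. This is only an illustration under stated assumptions:
the element names are hypothetical (not taken from toystc.xsd or the
full STC schema), and a "schema" is crudely modelled as a set of allowed
element names rather than a real XSD.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names; not taken from toystc.xsd or full STC.
TOY_STC = {"AstroCoords", "Position", "Time"}
FULL_STC = TOY_STC | {"Velocity", "Redshift"}  # full schema is a superset


def is_valid(instance_xml, allowed):
    """Return True if every element in the instance is allowed by the schema."""
    root = ET.fromstring(instance_xml)
    return all(elem.tag in allowed for elem in root.iter())


toy_instance = (
    "<AstroCoords><Position>10.0 20.0</Position>"
    "<Time>2004-09-16</Time></AstroCoords>"
)
full_instance = (
    "<AstroCoords><Position>10.0 20.0</Position>"
    "<Velocity>3.0</Velocity></AstroCoords>"
)

# Point 2: every Toy-STC instance is also a full STC instance.
print(is_valid(toy_instance, TOY_STC), is_valid(toy_instance, FULL_STC))
# Point 3: a full STC instance need not be a Toy-STC instance.
print(is_valid(full_instance, FULL_STC), is_valid(full_instance, TOY_STC))
```

The asymmetry in the last line is the reason deployment of the full
version would need coordination: a consumer that only knows the toy
schema rejects the second instance.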

I know that Francois and Mireille would prefer to just redefine the
concepts and perhaps use choice groups to access their definitions
versus the STC definitions. I'm not keen on this because the serializations
are not interoperable.

I have collected the existing XSDs on a web page at

http://hea-www.harvard.edu/~jcm/vo/xsd/xsd.html

I have made the following new XSDs:

char.xsd   - a characterization schema layered on STC and Q but with the
             same functionality (and model) as the Bonnarel/Louys proposal.

toystc.xsd - a schema for STC containing only the stuff needed for char.xsd.

toyq.xsd   - a schema for Quantity containing only the stuff needed for char.xsd
             and toystc.xsd.

char.xml   - a sample instance of Char showing how this works in practice.
             I haven't validated the instance, though, so it may be broken.

Now let me admit right away that my schemas are not complete: there
are things in char.xsd that don't have definitions. But there's enough
there to illustrate the idea. We could probably simplify even further.

Next, I made an attempt to converge full STC and Quantity. There is a
fundamental difference of approach here in that Arnold has automatic
instance validation as a major goal, constraining tightly what
dimensionality, units, etc. can be used in an STC instance, while
Quantity aims to provide very general and flexible containers. This
difference could be reconciled by liberal use of <xsd:restriction>, but
I have not pushed this very far in my example. My main goal has been to
find a path to evolve STC so that it is consistent with the Quantity
approach.

This is not to say that the STC 1.0 currently on the table shouldn't be
approved as the interim standard, but I want to arrive in Pune with at
least a story about how STC 1.0 would evolve to coexist with the
Observation and Quantity work. I started by looking at Brian's attempt
in CoordinateSystems.xsd, but this didn't attempt the critical task of
implementing the actual coordinates as Quantities, so I ended up going a
slightly different route.

Conversely, working with Char and STC has led me to tweak Q a bit
relative to Brian's excellent initial work on the Q schema. 
I have generated a simplified Q schema (which omits mappings and axes
for now), to illustrate a possible approach. Basically this is 
the 'toy' approach again, making a full STC on top of a toy Q,
although slightly less toy than above - a mini Q, say. So I have

 stc1r.xsd
 miniq.xsd

(For Q aficionados, the main change I've made to Q is to make Q's Frame
an actual subelement which we can restrict to make more specific Frames
like an AbsoluteTimeFrame, without having to repeat the metadata and
value sections. I did this because of the utility of being able to
use an IDREF to a Frame (as in char.xml), the frequent need to do
restrictions on Q which were really just restrictions on the Frame bits,
and the very painful way that XSD makes you define restrictions by
repeating the whole darn structure. I know other Q team members didn't
want this, and I'll be happy to change it back later.)
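
To illustrate why a referenceable Frame is convenient, here is a small
Python sketch in which several quantities share one Frame by reference.
The element and attribute names (Observation, frameRef, TimeScale, and
so on) are hypothetical and are not taken from miniq.xsd or char.xml;
the point is only that the Frame metadata is written once and resolved
by ID.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance: element and attribute names are illustrative only.
DOC = """
<Observation>
  <Frame id="utc-frame"><TimeScale>UTC</TimeScale></Frame>
  <Quantity frameRef="utc-frame"><Value>42.3</Value></Quantity>
  <Quantity frameRef="utc-frame"><Value>101.6</Value></Quantity>
</Observation>
"""


def resolve_frames(root):
    """Pair each Quantity's value with the Frame its frameRef points at."""
    frames = {f.get("id"): f for f in root.iter("Frame")}
    return [
        (float(q.findtext("Value")), frames[q.get("frameRef")])
        for q in root.iter("Quantity")
    ]


root = ET.fromstring(DOC)
resolved = resolve_frames(root)
for value, frame in resolved:
    print(value, frame.findtext("TimeScale"))
```

Both quantities resolve to the same Frame element, so a restriction
that only concerns the Frame can be stated in one place.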

I also made use of named sequence groups and attribute groups to clarify
the structure, and avoided the use of choice groups since Ray and Gerard
have argued against them as being non-UML-like and not well supported by
let's-make-automatic-code-from-xsd tools.

The biggest problem I had was multiple inheritance. For instance:
  - An STC astronTimetype can be an ISOTimeType or a JDTimeType. Both
    are TimeTypes, but they should also derive from stringType and
    floatType respectively, and all of these derive from dataType.
    So we have
                                  dataType
                                     |
                            ------------------
                           /         |        \
                          /       TimeType     \
                      stringType   /   \      floatType
                         |        /     \       /
                         |       /       \     /
                       ISOTimeType       JDTimeType

    I don't know what to do about that. The only immediate idea I had
    is to relegate one of the distinctions (e.g. string vs float)
    to an attribute value rather than a class distinction, but I'm
    sure the UML purists won't like that. So I've stuck with using
    substitution groups for now, pending a consensus on how to proceed.
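
    For comparison, here is how the diamond and the "attribute value"
    alternative look in a language that does allow multiple
    inheritance. This is a Python sketch, not a schema: the class names
    follow the diagram, while FlatTimeType and its representation
    attribute are hypothetical names introduced for illustration. XSD
    itself lets a complex type derive from only one base type, which is
    why the first form has no direct equivalent there.

```python
# Class names follow the diagram above; the bodies are illustrative only.
class DataType: ...
class StringType(DataType): ...
class FloatType(DataType): ...
class TimeType(DataType): ...


# With multiple inheritance the diamond can be written directly.
class ISOTimeType(StringType, TimeType): ...
class JDTimeType(FloatType, TimeType): ...


# Alternative: single inheritance, with the string/float distinction
# relegated to an instance attribute (hypothetical name: representation).
class FlatTimeType(DataType):
    def __init__(self, value, representation):
        if representation not in ("string", "float"):
            raise ValueError(f"unknown representation: {representation}")
        self.value = value
        self.representation = representation


# Illustrative values only.
iso = FlatTimeType("2004-09-16T12:00:00", "string")
jd = FlatTimeType(2453265.0, "float")

print([c.__name__ for c in ISOTimeType.__mro__])
print(iso.representation, jd.representation)
```

    The second form keeps a single derivation chain at the cost of
    moving the string/float check from the type system to the instance.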

I also still feel rather strongly that we are over-typing. Where
should we make distinctions using instance values and where should we use types?

 I think (hope) that even the most enthusiastic OO people would agree that
 (A) "Time = 42.3 seconds" and "Time = 101.6 seconds" should not be two
 different types (a Time42.3SecondsType and a Time101.6SecondsType)!

 But should there be (B) a TimeSecondsType and a TimeHoursType, or just a
 TimeType with instance element <units>seconds</units>? Or (C) not a
 TimeType at all, but just a CoordinateType with instance element value
 <ucd>time</ucd>? Why are (B) and (C) different from (A)? This is a
 matter of taste and really depends on your view of the domain being
 modelled.

 In STC, to get automatic validation by the schema, Arnold defines things
 like Position3DType where I would be inclined to simply let the application
 use a VectorPositionType and put n=3 in the instance. The current approach
 leads to a huge proliferation of types and greatly exacerbates the
 multiple inheritance issue.
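
 The trade-off can be made concrete with a short Python sketch. The
 VectorPosition class below is a hypothetical stand-in for a
 VectorPositionType, with n and units carried as instance values; it is
 not derived from any of the schemas. The dimensionality check that a
 Position3DType would get from the schema moves into the application.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class VectorPosition:
    """Generic position; n and units are instance values, not types.

    Hypothetical stand-in for a VectorPositionType, not the schema itself.
    """

    n: int
    units: str
    values: Tuple[float, ...]

    def __post_init__(self):
        # The check a Position3DType would get from the schema
        # is done here by the application instead.
        if len(self.values) != self.n:
            raise ValueError(f"expected {self.n} values, got {len(self.values)}")


p2 = VectorPosition(n=2, units="deg", values=(10.0, 20.0))
p3 = VectorPosition(n=3, units="m", values=(1.0, 2.0, 3.0))

try:
    VectorPosition(n=3, units="m", values=(1.0, 2.0))
    rejected = False
except ValueError:
    rejected = True

# One type serves both dimensionalities; the bad instance is still caught.
print(type(p2) is type(p3), rejected)
```

 One type covers every dimensionality, so there is nothing to multiply
 and nothing extra to inherit from, but a schema validator alone can no
 longer reject the malformed instance.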

Your comments would be appreciated; I suggest that overview
comments on the general approach would be more useful at this stage
than nitpicks on small details of the schemas.

 - Jonathan