utypes: a proposal

Thu Oct 30 10:28:51 PDT 2008

Hello

I have obviously missed the session on UTYPE-s.
But I would like to draw your attention to the way we propose to handle
assigning UTYPE-s (which we see as labels identifying syntactic elements in
a data model) in SimDB.
It is one of the issues we need to discuss in the larger IVOA community (and
we presented it already in Trieste as well).

In the context of SimDB we would like to have rules to derive UTYPE-s
automatically so we don't have to think about them separately from the main
modelling exercise. I think here we agree with Norman.

We do, however, try to keep the UML representation central, and our UTYPE
generation rules are based on the UML representation. Since we derive a
"natural" (again to be discussed) XML representation from the UML, it would
be interesting to see whether we can make the approach "commute" with
Norman's. I.e. we should see whether it is possible to follow Norman's rules
to derive UTYPE-s from the XML we generate, and compare these to the UTYPE-s
we generate directly from the UML.
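To make the comparison concrete, here is a minimal sketch of how one might derive XPath-style UTYPE-s from an XML instance, in the spirit of Norman's rule. It is only an illustration: the namespace URIs, the prefix table and the function names are assumptions of mine, and the fragment is a simplified version of the Characterisation example in Norman's message quoted below (C1 and C2 are placed in the stc namespace here).

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, chosen only for this illustration.
PREFIXES = {
    "http://example.org/cha": "cha",
    "http://example.org/stc": "stc",
}

SAMPLE = """
<characterization xmlns="http://example.org/cha"
                  xmlns:stc="http://example.org/stc">
  <spatialAxis>
    <ucd>pos.eq</ucd>
    <unit>deg</unit>
    <coverage><location><coord>
      <stc:Position2D><stc:Value2>
        <stc:C1>132.4210</stc:C1>
        <stc:C2>12.1232</stc:C2>
      </stc:Value2></stc:Position2D>
    </coord></location></coverage>
  </spatialAxis>
</characterization>
"""


def qname(tag):
    """Turn ElementTree's '{uri}local' tag into 'prefix:local'."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return f"{PREFIXES[uri]}:{local}"
    return tag


def xpath_utypes(xml_text):
    """Return the sorted set of element paths that lead to literal values."""
    root = ET.fromstring(xml_text)
    found = set()

    def walk(elem, path):
        path = path + [qname(elem.tag)]
        children = list(elem)
        # A leaf element with non-blank text is treated as a literal.
        if not children and (elem.text or "").strip():
            found.add("/".join(path))
        for child in children:
            walk(child, path)

    walk(root, [])
    return sorted(found)


for utype in xpath_utypes(SAMPLE):
    print(utype)
```

Checking whether the two approaches "commute" would then amount to mapping each such path onto the corresponding UML-derived string and comparing the two sets; that mapping is not attempted here.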

We have tried to come up with a minimal, necessary set of rules to produce a
string that uniquely represents any of the fundamental syntactic elements in
the model. We rely on the following uniqueness properties:

- Property names are unique in a Class.
  Note there are three types of properties:
  An Attribute is a property whose datatype is a value type (NOT an
  object type/class); it need not be primitive but may be structured
  (i.e. have attributes of its own).
  A Collection is a named, 1-to-many composition relation of a parent to a
  child class.
  A Reference is a named, many-to-one shared association to another class.

- Class names are unique in a Package (namespace).

- Package names are unique in either an enclosing parent package, or in the
Model (the root of all).

So a serialisation like (in a pseudo regexp)

<model-name>:[<package-name>/]*<class-name>.<attribute-name>[.<attribute-name>]*

is a unique pointer to an attribute in a data model. Similarly

<model-name>:[<package-name>/]*<class-name>.<reference-name>
<model-name>:[<package-name>/]*<class-name>.<collection-name>

are unique pointers to the reference and collection properties of a Class.

(We used different delimiters, ':', '/' and '.', to make a more readable and
explicit distinction between the different sorts of elements. The choice is
not important for the argument as long as uniqueness is guaranteed, i.e. as
long as only a single element is represented by a given string.)
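As a rough illustration of why distinct delimiters help, the sketch below splits a UTYPE of the form above back into its parts with a single regular expression. The assumption that names look like programming-language identifiers, the function name and the example strings are mine, not part of the SimDB model.

```python
import re

# Assumption: model, package, class and property names are identifier-like.
NAME = r"[A-Za-z_]\w*"

UTYPE_RE = re.compile(
    rf"^(?P<model>{NAME}):"          # <model-name>:
    rf"(?P<packages>(?:{NAME}/)*)"   # [<package-name>/]*
    rf"(?P<cls>{NAME})"              # <class-name>
    rf"(?P<props>(?:\.{NAME})*)$"    # [.<property-name>]*
)


def parse_utype(utype):
    """Split a UTYPE string into model, packages, class and property chain."""
    m = UTYPE_RE.match(utype)
    if m is None:
        raise ValueError(f"not a valid UTYPE: {utype!r}")
    return {
        "model": m.group("model"),
        # "a/b/" -> ["a", "b"]; the empty string gives an empty list.
        "packages": m.group("packages").split("/")[:-1],
        "class": m.group("cls"),
        # ".x.y" -> ["x", "y"]; the empty string gives an empty list.
        "properties": m.group("props").split(".")[1:],
    }


# Hypothetical names, for illustration only.
print(parse_utype("SimDB:resource/experiment/Simulation.name"))
```

Because each delimiter marks exactly one kind of boundary, the split is unambiguous; with a single delimiter one would need the model itself to tell packages from classes from attributes.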

The rule allows for an arbitrary nesting of packages, which is necessary to
ensure a unique encoding. Since attributes can be structured, we allow for
chaining these until the final primitive attribute is reached, i.e. one whose
value will be a literal. UTYPE-s for higher level, less primitive elements
such as Classes are obtained simply by not following the rule to the end.

References and collections are NOT followed further, because it is not
necessary for "a string which encodes a pointer to a unique element in a
data model".
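The rules can be sketched on a toy model as follows. All class, type and property names here are hypothetical, and the dictionaries are just a convenient stand-in for the UML; the point is only that structured attributes are chained down to primitives, while references and collections each contribute a single string and are not followed.

```python
# Toy model with hypothetical names. Structured value types map
# attribute name -> datatype; anything not listed here is primitive.
VALUE_TYPES = {
    "Quantity": {"value": "real", "unit": "string"},
}

# Each class maps property name -> (kind, type).
CLASSES = {
    "Snapshot": {
        "time": ("attribute", "Quantity"),
        "label": ("attribute", "string"),
        "objects": ("collection", "ObjectCollection"),
        "simulation": ("reference", "Simulation"),
    },
}


def attribute_chains(type_name):
    """Yield attribute-name chains below a value type, shortest first."""
    yield []  # stopping here points at the (possibly structured) attribute
    for name, sub_type in VALUE_TYPES.get(type_name, {}).items():
        for rest in attribute_chains(sub_type):
            yield [name] + rest


def class_utypes(prefix, class_name):
    """List the UTYPE-s of a class and of all its properties."""
    base = f"{prefix}{class_name}"
    utypes = [base]  # the class itself: the rule not followed to the end
    for prop, (kind, type_name) in CLASSES[class_name].items():
        if kind == "attribute":
            for chain in attribute_chains(type_name):
                utypes.append(".".join([base, prop] + chain))
        else:
            # References and collections are NOT followed further.
            utypes.append(f"{base}.{prop}")
    return utypes


for utype in class_utypes("SimDB:snap/", "Snapshot"):
    print(utype)
```

Uniqueness of the resulting strings follows from the uniqueness of names at each level, which is easy to check by comparing the length of the list with the size of the corresponding set.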

For examples in SimDB please see the (generated) HTML documentation of the
model in
http://volute.googlecode.com/svn/trunk/projects/theory/snapdm/output/html/SimDB.html

The main conceptual difference from Norman's prescription, I think, is that
we do not follow paths between different object types. The only paths one
might wish to follow in our approach are parent-child relations, since these
are not shared, but doing so seems redundant.

Again, this assumes that the goal of UTYPE-s is that one can assign a label
(in the sense of UCDs) to a column in a VOTable for example, to indicate
that that column contains values corresponding to a given attribute (for
example) in a given data model.

If instead one needs more context, for example if one wants to indicate
that the column only contains values obtained by a certain complex query
path through the data model, this prescription is not enough. But this
seems to stray quite far from the simple ideas behind UTYPE-s.

So our proposal gives a unique string value for any element in the data
model.
But if that is not what UTYPE-s are about, we may want to change our
approach.

Cheers

Gerard Lemson

> -----Original Message-----
> From: Norman Gray [mailto:norman at astro.gla.ac.uk] 
> Sent: Thursday, October 30, 2008 4:45 PM
> To: dm at ivoa.net
> Subject: utypes: a proposal
> 
> 
> Folks,
> 
> Sorry I had to leave the DM session early yesterday -- I was 
> due to give a talk in the parallel GWS session.
> 
> In the session, the comments I'd like to draw attention to are:  
> Jonathan's remark that we probably shouldn't force utypes to 
> do everything, and that worrying about uniqueness may be out 
> of scope; Doug's emphasis that a goal was to 'flatten' a 
> possibly strongly hierarchical structure into key-value 
> pairs; and Tom's agreement that talking about namespaces does 
> not imply XML.
> 
> The notion of 'utypes as xpaths' has been floating around for 
> ever, it seems, and I'm sure I remember it being proposed by 
> someone as an obvious solution at the first Cambridge-UK 
> interop, back when the universe was a lot smaller.
> 
> Here, I want to make that proposal concrete.  Is there 
> anything _really_ wrong with this model?
> 
> 
> 
> How about this:
> 
>      For each literal value defined in a data model, define as its utype
>      that XPath which would retrieve the literal values from
>      the 'natural' XML serialisation of the data model.
> 
> That's all -- beginning and end of proposal.  The following 
> illustrates how this would appear.
> 
> [I'm _not_ suggesting we import all of XPath -- merely adopt 
> a syntax which uses a tiny subset of XPath, which is 
> therefore trivially compatible with it]
> 
> If the DM is actually _defined_ using an XML Schema, then 
> this is immediate, since the XSchema defines a serialisation, 
> so the utypes become 'xpaths in the instance'.
> 
> If the DM is defined in some other way -- such as the case of 
> Characterisation, which is defined using UML -- then there is 
> still almost certainly a 'natural' XML version of the model, 
> such as the XML fragment of Char'n which Mireille showed in 
> her presentation yesterday.
> 
> Even if the DM has no 'natural XML serialisation' (and I 
> don't think we've seen one of them in the DM group's 
> history), then there will surely be some part-of relationship 
> which takes you to the value from the 'top' of the model.
> 
> Mireille showed a sample Char'n document in the session, 
> which I think was something like:
> 
> 	<characterization>
> 		<spatialAxis>
> 			<axisName>Sky</axisName>
> 			<ucd>pos.eq</ucd>
> 			<unit>deg</unit>
> 			<coverage>
> 				<location>
> 					<coord coord_system_id="TT-ICRS-TOPO">
> 						<stc:Position2D>
> 							<stc:Value2>
> 								<C1>132.4210</C1>
> 								<C2>12.1232</C2>
> 							</stc:Value2>
> 						</stc:Position2D>
> 					</coord>
> 				</location>
> 			</coverage>
> 		</spatialAxis>
> 	</characterization>
> 
> There are four literals in that example, namely 'pos.eq', 
> 'deg', '132.4210' and '12.1232'.  Their utypes in this 
> proposal would be
> 
> cha:characterization/cha:spatialAxis/cha:ucd
> cha:characterization/cha:spatialAxis/cha:unit
> cha:characterization/cha:spatialAxis/cha:coverage/cha:location/cha:coord/stc:Position2D/stc:Value2/C1
> cha:characterization/cha:spatialAxis/cha:coverage/cha:location/cha:coord/stc:Position2D/stc:Value2/C2
> 
> ...presuming some 'cha' namespace declaration.  Making utypes 
> compatible with XPath ends up looking pretty much like the 
> existing proposal, except that '.' -> '/' and the namespace 
> prefix is repeated.
> 
> Given that in many cases there would be only one primary data 
> model in use, defining a default utype namespace would make these
> 
> characterization/spatialAxis/coverage/location/coord/stc:Position2D/stc:Value2/C1
> 
> XPaths are of course potentially a lot more complicated than that.   
> But I'm not suggesting we permit anything but this tiny 
> fragment of XPath; merely that we use a syntax which is 
> trivially compatible with XPath, and has a precisely 
> definable meaning.
> 
> This has a number of advantages:
> 
>    * It uses a fragment of an existing syntax -- we really, 
> really, don't have to reinvent this wheel;
> 
>    * it's very clear where namespaces fit in;
> 
>    * in some cases where applications are actually processing 
> XML, the utype might be incidentally useful as a way of 
> extracting the literal values;
> 
>    * this syntax provides a _very_ clear 
> illustration/definition of the cases where UFIs are 
> potentially required, namely those situations where a simple 
> hierarchy-based XPath such as this matches multiple literals 
> in a file (is The Unicity Problem anything other than that?);
> 
>    * and so if it really _does_ turn out that UFIs are 
> required, then it will be clear how to extend this syntax in 
> a principled and controlled way, to create UFIs by 
> cherrypicking one or two further elements of XPath.
> 
> I don't believe it's sensible to omit namespacing.  If you go 
> for fixed prefixes -- 'cha:' and only 'cha:' -- then you 
> can't sanely version the Char'n model.  As was noted in the 
> session, namespaces != XML, and if nothing else declaring a 
> prefix/URL combination provides a very natural mechanism for 
> linking a data object with its documentation.  XML obviously 
> provides a straightforward means of declaring namespaces; I 
> can think of a couple of ways of doing so in FITS; it must 
> surely be equally straightforward to do the same for ADQL.
> 
> Best wishes,
> 
> Norman
> 
> 
> --
> Norman Gray  :  http://nxg.me.uk
> Dept Physics and Astronomy, University of Leicester
> 
>