UType proposals

Fri Jun 12 10:30:22 PDT 2009

Gerard, hello.

On 2009 Jun 12, at 15:31, Gerard wrote:

>>> The latter allows one to find the element represented by the string in
>>> the model by parsing the string.
>>
>> I'm nervous of 'parsing the string'.
>
> I meant really only parsing as in "reading", or what you call  
> "informally
> parsing by humans".

Ah right -- I'm with you there.

>>> So for the DM group to propose a rule for deriving Utype-s from the
>>> data model that does not require a separate lookup, i.e. is parsable
>>> by humans I think is not a bad thing as it promotes homogeneity and
>>> readability between different modelling efforts.
>>
>> The recipe that Mireille proposes is informally parseable by humans,
>> and that has substantial mnemonic value if it is used
>> homogeneously.
>> It's just that turning that into a formally parseable thing
>> would have substantial costs with few benefits.
>>
> So are you suggesting that it might be ok for the Utype document to
> RECOMMEND, or SUGGEST
> that data modellers use these rules? Or should the document get rid  
> of these
> rules altogether,
> in which case their "substantial mnemonic value" would be lost as  
> well.

I was thinking of SHOULD in the RFC 2119 sense: "This word, or the  
adjective "RECOMMENDED", mean that there may exist valid reasons in  
particular circumstances to ignore a particular item, but the full  
implications must be understood and carefully weighed before choosing  
a different course."

>>     <param id='foo' utype='http://www.ivoa.net/dm/simdb/v1.0#Simulated.Foo'>
>>       xxx
>>     </param>
>>
>> or
>>     <VOTABLE xmlns:simdb='http://www.ivoa.net/dm/simdb/v1.0#'>
>>       ...
>>       <param id='foo' utype='simdb:Simulated.Foo'>
>>         xxx
>>       </param>
>>     </VOTABLE>
>>
[...]
>> nt way of representing this, if that were
>> deemed necessary.  That doesn't matter, because it's the full
>> URI -- possibly after some string concatenation -- which the
>> resulting application would be required to recognise.
>>
> I was assuming that you meant for utypes to be something like your  
> first
> case.
> I was referring to the second usage, including the explicit  
> xmlns:simdb=...

I did mean that the UTypes should be the full URI as in the first  
case, but that the difference between this and the second case is  
merely a matter of syntax -- the definition of the serialisation.   
That is, when processing the second case, an application's first step  
would be to concatenate the two bits of information into a single URI  
string -- this namespace information is readily available when  
processing a SAX stream or within an XSLT template.  Or rather -- as  
is the usual way of specifying these things -- it should act _as if_  
it had done that, since it might well be more efficient or  
straightforward in a particular case to do something more direct.  As  
before, there would be a different normalisation step in the case of a  
FITS serialisation (I stress this in order to emphasise that there  
is nothing here which is fundamentally coupled to 'XML namespaces').
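
A minimal sketch of that normalisation step for the XML case, assuming
the prefix-and-colon form shown in the second example above.  The
function names (expand, resolve_utypes) are illustrative only and are
not part of any IVOA specification; the point is simply that both
serialisations end up as the same URI string.

```python
import io
import xml.etree.ElementTree as ET


def expand(utype, ns_in_scope):
    """Return utype as a full URI, expanding a declared prefix if present."""
    prefix, sep, local = utype.partition(":")
    if sep:
        # Search the innermost declarations first.
        for declared_prefix, uri in reversed(ns_in_scope):
            if declared_prefix == prefix:
                return uri + local
    # No declared prefix matched: assume the value is already a full URI.
    return utype


def resolve_utypes(xml_text):
    """Return (id, full-URI utype) for every element carrying a utype."""
    ns_in_scope = []  # (prefix, uri) pairs currently in scope
    resolved = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    events = ("start-ns", "end-ns", "start")
    for event, payload in ET.iterparse(source, events=events):
        if event == "start-ns":
            ns_in_scope.append(payload)
        elif event == "end-ns":
            ns_in_scope.pop()
        elif "utype" in payload.attrib:
            full = expand(payload.get("utype"), ns_in_scope)
            resolved.append((payload.get("id"), full))
    return resolved


FULL_FORM = """<VOTABLE>
  <param id='foo' utype='http://www.ivoa.net/dm/simdb/v1.0#Simulated.Foo'>xxx</param>
</VOTABLE>"""

PREFIXED_FORM = """<VOTABLE xmlns:simdb='http://www.ivoa.net/dm/simdb/v1.0#'>
  <param id='foo' utype='simdb:Simulated.Foo'>xxx</param>
</VOTABLE>"""
```

Calling resolve_utypes on either document gives the same result, so
anything downstream only ever compares full URI strings.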

> What I gathered from the discussion about the new 'xtype' attribute  
> in the
> VOTable session seemed to indicate
> that no such xmlns declaration was desired in a case that has  
> similarities
> to what we discuss here.

I confess I didn't follow all of the xtype discussion.

>>> Finally, I think one thing that Mireille's note does not make clear is
>>> that to be able to have a rule deriving parsable Utypes from a data
>>> model such as the one used in SimDB, one must have defined the
>>> syntactic elements for expressing one's data model. In SimDB we do
>>> this explicitly and we have proposed a similar approach to the DM
>>> group. Once one has that one may also have hope of creating instances
>>> for a specified Utype.
>>
>> I don't think I follow you.  By 'syntactic elements' do you
>> mean some parseable syntax for UTypes?
>>
>
> No. What I meant was that it seems to me that if one wants to  
> associate
> Utypes in a meaningful way to a data model, one needs to understand  
> what
> kind of data model construct they may refer to/correspond to.
> The BNF-like syntax for utype-s assumes implicitly the existence of  
> certain
> data model concepts (the "syntactic elements"): Class, Model,  
> Package etc.
> Utypes may refer to any of these.

I see what you mean.  I think there are some who would disagree  
emphatically with you here, and assert that UTypes can only describe  
things which have literal values.  That's what I take from the  
emphasis on the use-case of reconstructing an instance of a model from  
a set of key-value pairs.

Myself, I agree with you, that it would be useful to associate  
'UTypes' with each of the Classes, Models, and Packages in a data  
model.  Given a UML data model (or a XSchema data model, or whatever  
modelling framework you prefer), it would be straightforward to  
develop a simple ontology which reflected it, and RDFS or OWL would be  
the languages to do that in.  However I want to keep this fuller use- 
case separate from the proposal for Utypes-as-URIs, in order to keep  
their respective advantages distinct, and to avoid too much talking at  
cross-purposes.

> Currently the DM group does not have an agreed upon language in  
> which to
> express data models.
> But in particular when you want machines to do something with utypes  
> they
> must be able to find out what kind of thing they are referring to.
> There is quite some difference between an attribute and a reference,  
> or
> between a package and a class.
> If we do not agree on a language for expressing the data models, it  
> will be
> hard to code against them.
> Again, this is what SimDB has actually done. Because of this we can  
> write
> code that uses metadata about a model (expressed in our intermediate
> representation), to infer things about instances of the model in  
> various
> forms (XML, Java, RDB). Admittedly we are not using Utypes, these  
> are too
> limited for this purpose.

All I think the IVOA UType standard has to do is agree on a way of  
naming bits of models.  In certain circumstances, it'll be possible to  
know more about the relationships between those bits of models: in the  
case of SimDB for example, there will be lots of extra information in  
the XMI (say) describing the rich interrelationships between these  
model items; the same would be true of SSA, say, though there the  
interrelationships are described primarily in text (if I recall  
correctly; at any rate, I don't think there is an SSA XMI, nor do I  
believe there rfc2119-should be).  These interrelationships can be  
exploited by code hand-written or generated from an XMI file.

It's at a higher layer of interoperability that a restricted view of  
what UTypes are for will pay off.  I can imagine an application which  
might want to handle bits of SSA, bits of SimDB, and some (SKOS)  
vocabulary terms, perhaps using information pulled from an RDB, FITS  
files and a registry query.  That sort of application probably isn't  
going to benefit from an intricately described structure for each of  
the data models, but it _can_ benefit from a consistent and technology- 
neutral way of naming entities (ie, UTypes), and a consistent way of  
finding display labels, and (here moving into a potential payoff from  
RDF) a consistent way of finding lightweight interrelationships, such  
as that a simulated galaxy is the same sort of thing as a SIMBAD- 
galaxy.  [Just to be clear: that last one goes beyond what I'm  
suggesting in this UTypes proposal].

>> This is indeed much less clear.  I think it would be valuable
>> and fairly easy to do this, and it would mean that you could
>> envisage a future query which asked for (all of) a table by
>> giving the table's UType.  The framework for this is in, for
>> example, section 3 of the proposal
>> <http://nxg.me.uk/note/2009/utype-proposals/#composite>,
>> which effectively suggests that char:coverage.location.coord
>> be a UType which has a structured value, and this is perfectly  
>> strict.
>>
>> However, as Doug and others have argued, the current primary
>> use-case for UTypes is the notion of a list of key-value
>> pairs, where the keys are UTypes and the values are literals
>> (ie, columns or single values).  I think there are benefits
>> and few costs to going beyond that (if one has a clear idea
>> of what one is doing), but that's where this argument would live.
>>
>
> I think the origin for utype was as an extra attribute on FIELD,  
> where I
> thought it was supposed to assign extra meaning to the column in the  
> table,
> somewhat (but not very much) different from UCDs.

I remember this, too.  As I mentioned in my utype-questions posting, I  
think there are multiple conceptions of what UTypes are and are for,  
and that these are not always compatible, nor written down with much  
precision.  I listed the key-value-pair use-case as an explicit goal  
in the utype-proposals posting, just so it was explicit which problem  
I thought I was solving.
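
To make that use-case concrete, here is a small sketch of rebuilding
an instance from key-value pairs whose keys are URI UTypes.  The only
UType taken from this thread is ...#Simulated.Foo; Simulated.Bar, the
Simulated class and the lookup table are hypothetical and exist only
for illustration.

```python
from dataclasses import dataclass
from typing import Optional

SIMDB = "http://www.ivoa.net/dm/simdb/v1.0#"


@dataclass
class Simulated:
    """Hypothetical stand-in for a model class; not the real SimDB model."""
    foo: Optional[str] = None
    bar: Optional[str] = None


# UTypes are opaque names: recognising one is plain string equality.
FIELD_FOR_UTYPE = {
    SIMDB + "Simulated.Foo": "foo",
    SIMDB + "Simulated.Bar": "bar",  # hypothetical UType
}


def build_simulated(pairs):
    """Build a Simulated from (utype, literal) pairs; report unknown keys."""
    known = {}
    ignored = []
    for utype, value in pairs:
        field = FIELD_FOR_UTYPE.get(utype)
        if field is None:
            ignored.append(utype)
        else:
            known[field] = value
    return Simulated(**known), ignored
```

Nothing here parses the UType string; an unrecognised key is simply
passed over, which is all the key-value use-case requires.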

> But utype attributes are now everywhere in VOTable, also on GROUP,  
> TABLE and
> RESOURCE. These are all complex constructs, and one might worry that  
> in
> general it may not be correct to assume a 1-1 relation to a complex
> construct in a data model (unless designed to be so, like in SimDB's  
> TAP
> mapping).
>
> For example consider a model for people with a class Individual having
> attributes
> - firstName
> - lastName
> - age
> - email
> - telephone number
>
> In a VOTable one might encounter a TABLE with name="Person" and FIELDs
> (surname, dateOfBirth, emailAddress).
> Could I add the utype people:Individual to the Person table?

That would seem fine and sensible to me.  And the URI UTypes would  
comfortably handle that, too.

> I seem to recall you telling me about a concept in ontologies/ 
> vocabularies
> that seems similar to me to the utype.
> If I am not mistaken in ontologies one can point from one ontology to
> another and declare that a thing in the former is
> similar(/equivalent/equal/?) to a thing in the latter. Is utype a  
> similar
> construct, and if so which (if any) of these meanings might it  
> correspond
> to?

If we want to talk about ontologies, then yes, you can declare  
relationships between classes A and B in different ontologies.  If A  
sameAs B, then if you state that a thing is a member of the class A,  
then it'll appear when you ask for the members of class B.  Or if A is  
a subClass of B, then if you state that x is in A, then it'll appear  
when you ask for the members of B, but not vice versa.  But, again,  
this is separate from the notion of URI UTypes: I don't want to be  
thought to be smuggling ontologies here -- URI UTypes are a pragmatic  
solution to a simple problem, but they don't block off sophisticated  
solutions to harder problems.  Here a UType is just a name for a  
class, or a property -- rather than 'A' and 'B' above, you'd use  
URIs.  That's all I'm suggesting.
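
For what it's worth, the membership behaviour described above can be
sketched in a few lines of plain Python, with URIs standing in for 'A'
and 'B'.  This is only an illustration of the inference, not an RDFS
or OWL implementation, and the example.org URIs are made up.

```python
def members(cls, asserted, sub_class_of, same_as):
    """Return the individuals that belong to cls after simple inference.

    asserted:     set of (individual, class) pairs stated explicitly
    sub_class_of: set of (subclass, superclass) pairs
    same_as:      set of (class, class) pairs declared equivalent
    """
    pending = [cls]
    contributing = set()  # classes whose members also belong to cls
    while pending:
        current = pending.pop()
        if current in contributing:
            continue
        contributing.add(current)
        for a, b in same_as:  # equivalence works in both directions
            if a == current:
                pending.append(b)
            if b == current:
                pending.append(a)
        for sub, sup in sub_class_of:  # subclass works in one direction
            if sup == current:
                pending.append(sub)
    return {x for x, c in asserted if c in contributing}


# Hypothetical class URIs, for illustration only.
SIM_GALAXY = "http://example.org/sim#Galaxy"
SIMBAD_GALAXY = "http://example.org/simbad#Galaxy"
SIM_SPIRAL = "http://example.org/sim#SpiralGalaxy"

same_as = {(SIM_GALAXY, SIMBAD_GALAXY)}
sub_class_of = {(SIM_SPIRAL, SIM_GALAXY)}
asserted = {("x", SIM_GALAXY), ("y", SIM_SPIRAL), ("z", SIMBAD_GALAXY)}
```

Asking for the members of SIMBAD_GALAXY picks up x, y and z, whereas
asking for the members of SIM_SPIRAL returns only y.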

All the best,

Norman

-- 
Norman Gray  :  http://nxg.me.uk
Dept Physics and Astronomy, University of Leicester