Schemas (and utypes)

Tue Jul 21 09:33:38 PDT 2009

Arnold, hello.

On 2009 Jul 2, at 16:51, Arnold Rots wrote:

> You touch on one of the central issues that have made me very
> uncomfortable with Utypes (but I assume everyone is well aware that I
> don't like them). See below.

I've taken the liberty of adjusting the subject line here, partly (and  
_very_ importantly) in order to keep this separate from the ongoing  
what-is-a-utype discussion, but also because I believe your points  
touch on a larger and interesting issue to do with schemas in  
general.  By 'schemas' here, I mean RDBMS or XML Schemas (in RDF  
'schema' means something different).

>> This is presuming that ns:target.class isn't one of those utypes that
>> only makes sense when it's coordinated with a set of other utypes
>> from the same model (the goal 1 of utypes, as I understand it).  If it
>> makes sense by itself, then that's excellent, it means that it's been
>> artfully repurposed here, and an application can reliably/safely
>> understand this bit of XML without necessarily having heard of the
>> <whatisit> element before.
>
> This is the crux of the matter. A model never consists of a single
> item. It is usually described by a set of information items (for lack
> of a better term) that together convey the full meaning that the
> author intends to convey.

I agree with that to a pretty good approximation.  However, a key
point in your remark is "the full meaning that the author intends to
convey", to which we can add "the full meaning that the reader
intends/hopes/aspires to extract", which may be very different.

> The problem with Utypes is that it allows cherry picking of
> information items with no guarantee that the information is complete,
> or even makes sense. Consistency, completeness, and uniqueness have
> been abandoned.

You say "cherry picking", I say "loose coupling".  I want to argue  
that utypes, like simple schemas, do indeed "[allow] cherry picking of  
information items with no guarantee that the information is complete,  
or even makes sense", but that this is not a practical problem.

I presume you're thinking of the consistency which the STC schema  
provides, by virtue of its _syntactical_ insistence that all the  
elements of a point's coordinates (for example) are included in a  
message.  I recall watching STC discussions on the virtues or vices of  
defaulting versus explicit 'not known' remarks, and as you know I'm  
aware of many of the complications of specifying astronomical  
coordinate systems.

In the more-or-less loosely coupled network environment we're all  
talking about, which is too complicated for one-size-fits-all rules, I  
believe that this level of syntactical specification adds consistency  
at the expense of adding brittleness and unnecessary complication.   
That is because, ultimately, the schema doesn't add much value to the  
message: if there are relevant information items missing from the  
message then it is the consuming application -- and _only_ the  
consuming application -- which is competent to say so, and to default,  
fail, or respond appropriately to the originator.  Further, a message  
could pass even the most stringent syntactic validation and still be  
nonsense as far as the application is concerned.
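
To make that concrete, here is a minimal Python sketch, not anything
taken from STC: the field names, the type table and the range check are
all assumptions made up for illustration.  It shows a message which
satisfies a purely structural check and which the consuming
application nevertheless has to reject.

```python
# Hypothetical field names and types, chosen only for illustration.
REQUIRED_FIELDS = {"frame": str, "ra_deg": float, "dec_deg": float}


def passes_schema(message):
    """Syntactic check: every required field is present with the right type."""
    return all(
        name in message and isinstance(message[name], ftype)
        for name, ftype in REQUIRED_FIELDS.items()
    )


def application_problems(message):
    """Semantic checks which this particular consumer chooses to make."""
    problems = []
    if not 0.0 <= message["ra_deg"] < 360.0:
        problems.append("ra_deg out of range")
    if not -90.0 <= message["dec_deg"] <= 90.0:
        problems.append("dec_deg out of range")
    return problems


# Structurally complete and well typed, but the declination is impossible.
message = {"frame": "ICRS", "ra_deg": 187.7, "dec_deg": 123.4}

schema_ok = passes_schema(message)        # the structural check is satisfied
problems = application_problems(message)  # the consumer still objects
```

A richer schema language could of course express the range check too,
but there will always be some further condition that only the consumer
knows it cares about.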

Thus schemas can act as sanity-checks and no more.  They don't  
realistically relieve the consuming application from any  
responsibility for error-checking.[1]

What that means in turn is that the _real_ role of schemas and utypes  
is a fairly modest one, concerned simply with indicating which parts  
of a message are to be identified as what, at a syntactic level or not  
much higher (this is the intuition behind "a pointer into a data  
model").

The job of reassembling all these information items into a data model
instance, ontology, Java object, FITS file or whatever you want, is a
job which happens at a different layer, and it's in that layer that
appropriate cherry-picking will be accepted, and inappropriate
cherry-picking rejected, depending on the needs of the application
that's doing the reassembling.  The utype model is therefore a good
match to a world of heterogeneous applications, data and uses (my
suggestions are intended to make this good match better, but the utype
model is a good one nevertheless).
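
As a rough sketch of that reassembly layer, again in Python and again
under loud assumptions: apart from ns:target.class, which appears
earlier in this thread, the utype names below are invented, and the
assemble() helper is hypothetical rather than part of any existing
library.  Two consumers look at the same bag of utype-tagged items;
one is content with a single cherry-picked item, the other rejects the
message as incomplete for its purposes.

```python
def assemble(items, required, optional=()):
    """Build a consumer-specific record from utype-tagged items.

    Utypes the consumer has not asked for are ignored.  Missing required
    utypes are an error which this consumer alone decides to raise.
    """
    missing = [u for u in required if u not in items]
    if missing:
        raise ValueError("missing utypes: " + ", ".join(missing))
    wanted = list(required) + [u for u in optional if u in items]
    return {u: items[u] for u in wanted}


# One message, seen as a flat mapping from utype to value (all hypothetical).
items = {
    "ns:target.class": "galaxy",
    "ns:target.name": "M87",
    "other:whatisit": "ignored by both consumers",
}

# A classifier needs only the class, so picking out one item is fine.
classifier_view = assemble(items, required=["ns:target.class"])

# A pointing tool needs a position, so the same message is rejected.
try:
    assemble(
        items,
        required=["ns:target.name", "ns:target.pos.ra", "ns:target.pos.dec"],
    )
    pointing_error = None
except ValueError as exc:
    pointing_error = str(exc)
```

The decision about which items are indispensable lives in each call to
assemble(), that is, in the consumer, and not in the message format.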

Best wishes,

Norman

[1] I wouldn't go as far as to say that schemas are useless.  I can
see that there are some situations where code-generation is useful,
and they can provide for contract checking ("whose fault is it that
this message couldn't be parsed?"), but they don't have the
semi-magical properties that would warrant the amount of interop agony
sustained when arguing over them.

-- 
Norman Gray  :  http://nxg.me.uk
Dept Physics and Astronomy, University of Leicester, UK