Fw: A suggested revision for UCDs

Wed Oct 22 10:15:37 PDT 2003

Here is a discussion that Tom and I had off-list, but I think a number of
points of more general interest are raised.  Warning -- it is quite long!

Bob

- - - - -

Hi Bob,

Thanks for the review and comments.  I'm particularly interested
in the areas that were unclear.  It seemed to me that I needed to
actually put the ideas out where I could get some detailed reactions.
A fair number of typo issues were addressed in the version I uploaded
to the Twiki and announced to the UCD, DM and DAL groups.  Haven't
heard of any reaction.

I've responded to your comments below (there's a lot of detail,
but I thought it useful to think these things through).

Tom

Robert Hanisch wrote:
> Hi Tom.  I read through your revised UCD document this evening.  Phew.
> There is much in it I like, much I don't, and much I don't follow.  Perhaps
> the two latter categories mix together.
>
> I guess my biggest problem is that the roles of concept, attribute, and
> modifier are partly defined by syntax (where they appear in the string) and
> partly by having to know what names (em, pos, flux) have been allocated to
> which category.  This seems very arbitrary (and very confusing) to me.
> Although I have never written a parser in my life, it looks to me like a
> parser for this would be a zillion if statements.  Maybe this is fewer if
> statements than for other approaches, but it still looks very complex.
>
I agree that this is a major issue, although my biggest concern with it is a
little different.  I'm giving a long answer to help me organize my thoughts.

The writer of a table presumably has access to the documentation for
UCDs so it shouldn't be a big problem dealing with the three types --
especially once there are examples.  The problem is more in using UCDs when
reading tables.

In practice I'm not sure this would be a big deal for 'real' tools.  E.g.,
something like VOPlot is going to need to know about the value and meas.error
attributes internally so that it can plot values and error bars for a given
quantity.  I.e., it's just going to look for pairs of columns within the same
group of the form:
     SomeString;value and  SomeString;meas.error
A spectral processing tool is going to look for pairs like phot.flux*;value
and phys.wavelength;value.  Specific tools internalize this kind of
knowledge -- or even better read it in as a data model.  These tools don't
really know about how UCDs are organized.  The organization is intended to
make it easy for them to search for the appropriate strings, but they just
take advantage of that.
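
As a rough illustration of the pairing described above, here is a minimal
Python sketch of how a plotting tool might match value columns with their
error columns.  The function name and the example column UCDs are
hypothetical, and the sketch assumes all the columns it is given already
belong to the same group.

```python
def find_value_error_pairs(ucds):
    """Return (value_ucd, error_ucd) pairs that share the same leading string.

    Assumes every UCD in `ucds` comes from the same group of columns.
    """
    present = set(ucds)
    pairs = []
    for ucd in ucds:
        if ucd.endswith(";value"):
            # Strip the trailing attribute to recover "SomeString".
            stem = ucd[: -len(";value")]
            error_ucd = stem + ";meas.error"
            if error_ucd in present:
                pairs.append((ucd, error_ucd))
    return pairs


# Hypothetical column UCDs, for illustration only.
columns = [
    "phot.flux;em.optical;value",
    "phot.flux;em.optical;meas.error",
    "phys.wavelength;value",
]
print(find_value_error_pairs(columns))
```

The tool never has to understand how the UCD is organized; it only compares
strings, which is the point made above.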

Generic tools for manipulating UCDs and for validating them are where the
problems really begin to show up.  Currently there are only 6 trees that are
not basic concepts (em, frame and intent for modifiers and filter, stat and
meas for attributes).
I think the single word attributes are important enough that they will
not cause a problem.  So a complete algorithm to determine what word
belongs in what vocabulary is currently pretty easy...  Pseudocode is just:

     // firstAtom is the text before the first "." (the whole word if none)
     firstAtom = substring(word, 0, index(word, "."))
     switch (firstAtom) {
         'em', 'frame', 'intent': return thisIsAModifier
         'stat', 'filter', 'meas': return thisIsAnAttribute
         'value', 'local', 'instance', 'multiplet', 'vector':
             return thisIsAnAttribute
     }
     return thisIsAConcept
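
The same lookup can be written as a short runnable Python sketch.  The tree
names are the ones listed above; the function and constant names are
hypothetical, and the sketch classifies one ';'-separated word at a time.

```python
MODIFIER_TREES = {"em", "frame", "intent"}
ATTRIBUTE_TREES = {"stat", "filter", "meas"}
SINGLE_WORD_ATTRIBUTES = {"value", "local", "instance", "multiplet", "vector"}


def classify_word(word):
    """Classify one UCD word as 'modifier', 'attribute' or 'concept'."""
    # The first atom is everything before the first '.', or the whole word.
    first_atom = word.split(".", 1)[0]
    if first_atom in MODIFIER_TREES:
        return "modifier"
    if first_atom in ATTRIBUTE_TREES or first_atom in SINGLE_WORD_ATTRIBUTES:
        return "attribute"
    # Anything not in the short lists above is treated as a concept.
    return "concept"


for word in "phot.flux;em.optical;intent.calculated;value".split(";"):
    print(word, "->", classify_word(word))
```

Note that any unrecognized first atom falls through to 'concept', which is
exactly why non-standard trees are a problem for this approach.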

Alternatively we're talking about validating UCDs against an IVOA schema to
define the valid words, and the match against this could give the type.

There are other simple ways to deal with this:  Begin all modifiers with m.
and attributes with a.  Or I've suggested in the draft that all modifiers
could be in the frame tree -- the idea is that the role of modifiers is to
limit the context to which the concept applies.
I don't think the attribute trees join as easily but if it's important
enough we could pick a name for all of the attribute trees.

The biggest problem is non-standard namespaces.  How do we handle a new UCD
tree?  In some sense the issue is moot.  Non-standard words shouldn't be used
outside of some developer's local context.  They can be responsible for
handling them.  However I suspect that non-standard words will escape into
the wild.  The validate-against-the-schema approach still works, but it's
impossible for writers of tables to know how to use these UCDs.

There are some other ideas that might help address this issue:  Your
suggestion of another separator character is nice.  I thought about it but
decided that it was too radical a change.  Maybe separate attributes and
modifiers within themselves by commas but separate the groups from one
another by '-'s.  E.g., a complex UCD might be:
     flux.phot-em.optical,intent.calculated-meas.error,stat.max
I'd still like to keep the vocabularies separate, but now it's trivial to
parse the UCD.
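
To show how little parsing the alternative separators would need, here is a
minimal Python sketch.  It assumes the order is always
concept-modifiers-attributes and that an empty group is written as an empty
string between the '-'s; neither assumption is settled anywhere, and the
function name is hypothetical.

```python
def parse_ucd(ucd):
    """Split 'concept-modifiers-attributes' into its three parts."""
    parts = ucd.split("-")
    if len(parts) > 3:
        raise ValueError(f"too many '-' separated parts in {ucd!r}")
    # Pad so that a bare concept, or concept plus modifiers, still unpacks.
    parts += [""] * (3 - len(parts))
    concept, modifiers, attributes = parts
    return {
        "concept": concept,
        "modifiers": [m for m in modifiers.split(",") if m],
        "attributes": [a for a in attributes.split(",") if a],
    }


print(parse_ucd("flux.phot-em.optical,intent.calculated-meas.error,stat.max"))
```

No list of magic words is consulted: the role of each word follows from its
position alone.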

For the moment I tried to minimize the change from the original proposal.
Note that this is all much harder in the original proposal.  There is no way
to tell what anything after the first word is.  In that proposal the first
word is a property, but all subsequent words can be either properties or
concepts.  Nor is there any lexical definition of what a property is (i.e.,
any word can be a property).

> The document has a lot of signs of a rush job -- is it Uniform or Unified?
> (Unified, I think.)

I always thought it was Uniform so that wasn't a typo but an error on my
part...  Sigh...

> Is flux a 0-level concept?  Or is it phot.flux?

That I think is fixed in the published version (it's always phot.flux).

> On p.3 you say that units are not part of UCDs, but on p.16 you create a
> UCD, phys.degrees;value, that is all about units.

I wasn't quite sure what the UCD should be there.
Maybe phys.angle.separation;value?

> On p.12, I really like the typo(?) in 'pudding' (pubbing).

Alas that is also fixed.  [That kind of error must reflect some curious
things about the mind.  I clearly picked the mirror image letter even
though the typing motion for it is nothing like 'd']
>
> I'm not sure how others have reacted -- have not gone to the UCD list yet to
> see.  But I was particularly confused by the following things.
>
> o  p.4, you say that
>
>     phot.flux;em.optical;intent.calculated;value
>
> is equivalent to
>
>     phot.flux;intent.calculated;em.flux;value
>
> But there must be a mistake here.  Shouldn't 'flux' in the second line be
> 'optical'?  And isn't the first form illegal if alphabetical order is
> required?

The typos in the UCD were fixed and I hope that would help clarify what
I was trying to say.  The two UCDs should have been
    phot.flux;em.optical;intent.calculated;value
and
    phot.flux;intent.calculated;em.optical;value
The statement I was trying to make was that there is no natural reason to
prefer one of these to the other, so we had to choose an arbitrary rule
to try to ensure uniqueness of UCDs.  Thus indeed the second is illegal.
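
A small Python sketch of such an arbitrary ordering rule follows.  It assumes
the rule is "concept first, then modifiers in alphabetical order, then
attributes", which is only my reading of the alphabetical-order remark; the
function names are hypothetical.

```python
MODIFIER_TREES = {"em", "frame", "intent"}


def canonical_ucd(ucd):
    """Return the UCD with its modifier words sorted alphabetically."""
    concept, *rest = ucd.split(";")
    # Modifiers are recognized by their first atom and sorted.
    modifiers = sorted(w for w in rest if w.split(".", 1)[0] in MODIFIER_TREES)
    # Everything else keeps its original order after the modifiers.
    attributes = [w for w in rest if w.split(".", 1)[0] not in MODIFIER_TREES]
    return ";".join([concept, *modifiers, *attributes])


def is_canonical(ucd):
    """True if the UCD is already in the assumed canonical order."""
    return ucd == canonical_ucd(ucd)


first = "phot.flux;em.optical;intent.calculated;value"
second = "phot.flux;intent.calculated;em.optical;value"
print(is_canonical(first), is_canonical(second))
print(canonical_ucd(second) == first)
```

Under this rule the two strings above reduce to the same canonical form, so
only one of them needs to be accepted as legal.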

>
> I find the goal of brevity at conflict with the goal of clarity.  What does
> 'em' mean to a human reader?  Why 'src' and not 'source'?  Why 'value' and
> not 'scalar' (parallel structure to 'vector')?  Why default on 'value' in
> an otherwise well-defined ontology?
I can't really argue with most of these.  The tension between various goals
is why I tried to list them all together.  I would be happy to change to
longer words.

The default for value was just meant to be a convenience for writers of
tables.  If it confuses things I'm happy to drop it.

I like value rather than scalar because a value can be a vector quantity.
E.g., if we have a cell that contains an array of fluxes its UCD might be
    phot.spectrum;value
That's because the concept of spectrum is inherently non-scalar.  A field
that had a UCD of
    phot.spectrum;vector
would imply that each cell contained an array of spectra (i.e., that the cell
was presumably a 2-d array).  However this is no big deal.

>
> I think if a clear distinction is to be made between attributes and
> modifiers, it must be encoded explicitly (i.e., not just based on a list of
> magic words).  I do not like the semicolons as delimiters; this is not what
> they mean in English grammar.  (The semicolon in the last sentence was used
> properly.  The second clause is not necessarily a direct modifier of the
> first, but rather is related in some intimate way.)

This is fine by me -- I gave an example above using different separators.  I
think the grammar is just as simple.

>
> I don't understand how to use the concept 'concept' in a practical sense.
>

Well I tried to give two examples:  If you have a VOTable in an editor how
do you find the fields that don't have a defined concept?  If a user simply
omits the UCD field it's kind of painful to find them.  However one can just
do a string search for "concept" if the user has entered ucd='concept;value'
to explicitly mark that the underlying UCD is unknown.
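
The difference between the two cases can be seen in a small Python sketch.
The table snippet is a simplified, made-up stand-in for a VOTable (no
namespaces, hypothetical field names), and the function name is an
assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical fragment; a real VOTable carries more structure.
VOTABLE_SNIPPET = """
<TABLE>
  <FIELD name="ra" ucd="pos.eq.ra;value"/>
  <FIELD name="flag7" ucd="concept;value"/>
  <FIELD name="notes"/>
</TABLE>
"""


def fields_without_concept(xml_text):
    """Return (explicitly marked, silently missing) FIELD names."""
    root = ET.fromstring(xml_text)
    marked, missing = [], []
    for field in root.iter("FIELD"):
        ucd = field.get("ucd")
        if ucd is None:
            # No ucd attribute at all: nothing for a string search to hit.
            missing.append(field.get("name"))
        elif ucd.split(";", 1)[0] == "concept":
            # Explicit placeholder: the underlying concept is unknown.
            marked.append(field.get("name"))
    return marked, missing


print(fields_without_concept(VOTABLE_SNIPPET))
```

The marked field can be found with a plain text search in any editor, while
the field with no ucd attribute can only be found by parsing the table.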

The real reason is given in the last example in section 5.  When correlating
two tables that describe different kinds of quantities, e.g., sources and
observations, I need to be able to describe what the output table is.  There
are two objects in every row so it's a multiplet (in my scheme), but what
kind of multiplet?  I can't call it a source, and I can't call it an
observation, so I need to go up to a more generic word, i.e., concept.
Basically it just provides the root for the entire concept hierarchy.  If we
really wanted to be regular, we could start all of the base concepts with
this word...

> Your definition of 'pos' does not include solar or planetary coordinate
> systems, though later you give an example that does.

I don't know what the current hierarchy under pos is...  What I'd guess
is that it would contain something like:

     pos.body.lat and pos.body.lon

and then the frame modifier would be used to specify which body.
[Or maybe I left an inconsistency in from the previous version]

>
> 'intent' is defined as the 'human context' of the concept.  Huh?  How are
> 'calculated', 'predicted', and 'simulated' anymore human concepts than
> 'observed' or 'measured'?

Observed and measured would be fine additions here except that they are
likely to be considered the default.  I.e., a time.exposure;value
is assumed to be the measured time, so I don't need to put that in.
[Note that meas is short for measurement].  The explanation probably needs
to be better, but I think we need some kind of modifier that distinguishes
between 'real' values and predicted, scheduled, calculated, ... values.
This doesn't come up so much in VizieR tables, but many of the tables that I
deal with are riddled with situations where I may have an allocated exposure
time, a predicted exposure time and an actual exposure time.  So something is
assumed to be actual/measured/observed unless an intent is specified.
>
> In 4.4 you insist that full words should be used ('electron' instead of
> 'el'), but at the same time assert that 'phys', 'temp', 'em', etc., are all
> ok.

I don't have a horse in this race...  I tried to match the usage of the
previous paper, but I'd be happy to go either way.
>
> Example 2 (p.14) does not convey to me anything semantically different if I
> disregard your comments.  How am I supposed to understand something about
> guide stars and plate centers from the structure of the UCDs alone?  I take
> issue with your assertion that "both software and humans should have no
> trouble distinguishing the very different semantics of the two tables."
>
Well...  I'd hope that by looking at the table UCD, you would immediately
note that one table returned source information and the other returned
observation information.  That's no small matter.  The structure immediately
shows which concept is subordinate to the other.  The actual semantics of
the relationship were not described.  You could do that if you want that
level of detail.  I'm not sure what the right UCDs are.

E.g., the source table might have included (hope the indentation survives
the mail):

      obs.instance
           meta.id;value
           pos;meas.center
                pos.eq.ra;value
                pos.eq.dec;value

I guess if we really want to include the concept of a guide star in the
UCD hierarchy, they probably belong in the base concept or maybe
in frame somehow, but I think this is
too detailed.  If we went ahead with it...  The guide star might be

      src;frame.usage.guiding;instance
           meta.id;value
           pos.instance

Note that in the first case it's the position that got the
extra information, because the observation is just a standard observation
(as far as we know).  In the second case we're suggesting that
this is a special kind of source.

But I don't think I want to put that in the relatively simple examples.
What I was trying to show was how the need for main columns has disappeared
and that we could get source or observation information from either table
with equivalent ease.

> I don't like 'arith' as a concept.  'math' would be ok.  If we need it at
> all.

Well I did try to discourage it...  I have no problem with math.
>
> I don't like 'soft' as a concept.  Is it so bad to just say 'software'?  All
> this stuff will be encoded in XML, which is notoriously verbose.  If we
> choose unclear abbreviations we will obscure whatever semantic meaning is to
> be found.

Fine with me...

>
> OK, a lot of these criticisms are not really directed to you, but to the
> predecessor document.  I understood your presentation in Strasbourg (I
> thought) but do not follow the document sufficiently well that I would ever
> be comfortable promoting it forward.  I did not like Roy and Sebastien's
> premise that concept and property could morph, one into the other, depending
> on context.  I do like your attempt to structure things more rigidly.  It
> seems to me not rigid enough.  And when I ran into phys.degrees I felt like
> the whole thing was falling down around me.  The concept is an angular
> distance, which of course can be expressed in degrees, radians, arcsec, etc.

Agreed...  (see above)
>
> It might be worth our time to look at the AIPS++ measures definitions.  If I
> were to construct a quick hierarchy, what we are trying to do here is
> distinguish various sorts of measurements, metadata about those
> measurements, and metadata about the people/organizations associated with
> those measurements.  So our fundamental concept is a measurement, of which
> there are various sorts:
>
> measurement
>   photometric
>   spectroscopic (which is just photometric per wavelength in an ordered
>     sort of way)
>   astrometric ('pos')
>   temporal
>   instrumental
>
> Ancillary information about measurements comes in the form of metadata:
>
> metadata
>   identifiers
>   people
>   organizations
>
> And we may have some special classes:
>
> software
> source (to collect measurements of an object in space-time)
>
> Measurements are taken in bandpasses, and in certain coordinate frames, and
> from either the real universe or from computer simulations.  A bandpass is a
> 'frame' restricting coverage in the em-spectrum.  A coordinate frame
> describes a restriction on the spatial coverage.  The idea of 'intent' has
> nothing to do with anything; it is simply a mode of collecting measurements.
>
> All right, enough of my rantings for this evening.  I applaud your attempt to
> add rationality to Roy and Sebastien's work, but feel we still have some way
> to go.
>

Thanks...  I don't disagree with what you are saying and I hope that we
can at least reopen the discussion.

Tom