MyUCDs & Registry

Tom McGlynn Thomas.A.McGlynn at nasa.gov
Tue May 20 07:21:30 PDT 2003


> 5. Should the Registry store the column names and units used in a catalog or
> data table? I would say 'definitely' for the column names and 'probably' for
> the units. The column names are essential to resolve duplicate UCDs before a
> generic query is farmed out. 

Hi Tony,

I find this unconvincing.  How is a 'generic' query supposed
to do anything with column names?

Generic queries must use some standardized mechanism
to identify the elements used in the query (else they aren't
generic).  We have no standardized
mechanism for specifying the names of columns (and as far as I know
we have not even begun an effort to define such) so manifestly it is
currently impossible to use column names for a generic query.

However, I believe that the currently planned implementation of
UCDs will in practice be adequate virtually all of the time.

The putative problem with UCDs is that multiple columns
in a given table may share the same UCD.  Let's look at some
scenarios in more detail.  One example given in Cambridge
was an object catalog whose entries contained not only the object position
but the center of the image on which the object was detected.
This is precisely the kind of case the 'main' UCD qualifier would (and does)
address.  It is easily handled using either the current or proposed
UCD frameworks.  A user querying this table is getting objects, and the position
of the object should be the 'main' position.  This doesn't mean that
someone couldn't make a query against the observation centers, but
that would be a 'manual' query.
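
As a rough sketch of how an automated client might use the 'main'
qualifier, consider the following.  The UCD strings assume the proposed
dotted syntax (e.g. "pos.eq.ra;meta.main"), and the column names and the
pick_column helper are invented for illustration; they are not part of
any registry interface.

```python
from typing import Dict, Optional

# Hypothetical object catalog: column name -> UCD (names are made up).
OBJECT_CATALOG: Dict[str, str] = {
    "ra": "pos.eq.ra;meta.main",
    "dec": "pos.eq.dec;meta.main",
    "img_ra": "pos.eq.ra",    # center of the image the object was found on
    "img_dec": "pos.eq.dec",
}


def pick_column(columns: Dict[str, str], base_ucd: str) -> Optional[str]:
    """Return the column a generic query should use for base_ucd.

    Returns None when the choice is ambiguous or no column matches.
    """
    candidates = []
    mains = []
    for name, ucd in columns.items():
        words = ucd.split(";")
        if words[0] != base_ucd:
            continue
        candidates.append(name)
        if "meta.main" in words[1:]:
            mains.append(name)
    if len(mains) == 1:
        return mains[0]        # the table creator flagged exactly one column
    if not mains and len(candidates) == 1:
        return candidates[0]   # only one column of this kind, so no ambiguity
    return None                # several candidates and no usable hint


print(pick_column(OBJECT_CATALOG, "pos.eq.ra"))   # ra
```

The image-center columns remain available to a 'manual' query that names
them explicitly; the generic path simply never selects them.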

There are cases where the 'main' column is not obvious.  E.g., we might
have a table which is the cross-correlation between the SDSS and USNO-B object
catalogs.  This contains two positions, neither really subordinate to the other.
Which should be used in querying the table?

I'll grant that UCDs do not magically solve this problem, but the column names
don't really help.  How does any kind of automated program
pick between 'ra_usno' and 'ra_sdss'?

In practice in this case I would expect the creators of this table to
pick one of the sets of positions -- perhaps the one with the smallest
error -- and suggest that this is the primary set of positions.  How might
they do this most easily?  The 'main' qualifier in the UCD descriptions
is again the obvious candidate.  The choice here is a bit arbitrary, but so
would any automated choice based on position.  Regardless, it will
not matter very much, since the positions will be very close to one another.
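
On the table creator's side, that choice could be made with something
like the sketch below.  The mark_main helper, the column names, and the
error values are all placeholders invented for this example (they are
not real survey accuracies), and the UCD strings again assume the
proposed dotted syntax.

```python
from typing import Dict, Tuple

# label -> (RA column, Dec column, typical positional error in arcsec).
# The error figures are placeholders, not measured values.
PositionSets = Dict[str, Tuple[str, str, float]]


def mark_main(position_sets: PositionSets) -> Dict[str, str]:
    """Assign UCDs, flagging the set with the smallest error as 'main'."""
    best = min(position_sets, key=lambda label: position_sets[label][2])
    ucds = {}
    for label, (ra_col, dec_col, _err) in position_sets.items():
        suffix = ";meta.main" if label == best else ""
        ucds[ra_col] = "pos.eq.ra" + suffix
        ucds[dec_col] = "pos.eq.dec" + suffix
    return ucds


crossmatch = {
    "sdss": ("ra_sdss", "dec_sdss", 0.1),
    "usno": ("ra_usno", "dec_usno", 0.2),
}
for column, ucd in mark_main(crossmatch).items():
    print(column, ucd)
```

The rule used here (smallest error wins) is just one possible
convention; the point is that the hint comes from the table creator
rather than from anything the software could infer from the column names.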

For a third scenario let's consider a catalog of gamma-ray bursts where
the position of each burst is given as a quadrilateral on the sky with
four bounding positions.  Here it's clearly impossible to pick any
one of these positions over the others.   The best resolution of this
problem is probably to define a UCD for the group of columns that define
the bounding box.  Again, having column names doesn't help automated software
pick a column -- this table just isn't easily searchable using simple
cone-search techniques.  Fortunately this is a relatively rare sort of table.



To my mind UCDs provide precisely the information that is useful
in making automated decisions about which column to query.
They identify the kind of information that is in the column in a standard
way.  When there are multiple columns that have the same kind of information
I don't think it's reasonable to expect automated choices to be made
unless there is a hint from the creator of the table.

This doesn't mean that we don't want to have column names in the registry.
They may be helpful for directed (rather than automated) queries, for
textual matches, and to help the user understand what is in a table
before actually running a query.


I agree with you about units.  Basically they are an implementation detail
and the registry should not be exposing this implementation choice.

		Tom McGlynn


