[Semantics] Re: UCDs and arrays : histogram case

Tue May 22 15:00:05 CEST 2018

Hi Mireille,

On Mon, May 21, 2018 at 09:19:19PM +0200, Mireille Louys wrote:
> Le 21/05/2018 à 14:31, Markus Demleitner a écrit :
> > Analogously, a <histogram of X> is a different beast from <X>, and
> > clients that, for instance, cut off trailing atoms to figure out
> > roughly what something is would be completely misled.
> >
> Yes, you are right, the context to interpret the aggregated measure
> is needed.  I had in mind we might want to trace the histogram of
> an error (a very usual requirement); then, if stat.error and
> stat.histogram are both P, we cannot choose:

I see.  That's a good point, yes, and it may actually indicate that
UCDs, as used right now, cannot really handle twice-derived
quantities.

To give a frame for that discussion, see

http://dc.zah.uni-heidelberg.de/ucds/ui/ui/form?__nevow_form__=genForm&description=Error%20in%20magnitude&_FORMAT=HTML&submit=Go

The "explanation" column in there is computer-generated based on the
UCD list (and frankly, I'm fairly satisfied with how well it works
out; in case you're curious, the source code is at
http://svn.ari.uni-heidelberg.de/svn/gavo/hdinputs/ucds/res/ucdexplainer.py).

I claim that a histogram of errors should produce, as explanation in
your example,

"Histogram of statistical error of photometric magnitude"

(or close).  This would translate into

stat.histogram;stat.error;phot.mag

-- which is, of course, illegal as well because it puts stat.error in an
S position.
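
To make the position problem concrete, here is a minimal sketch of such
a check.  It is not the official UCD validator: the flag table is
hand-written for this example, stat.histogram is assumed to be P as
discussed above, phot.mag is simplified to "either position", and the
function name is made up.

```python
# Hand-written flags for illustration only; this is not the official UCD list.
FLAGS = {
    "stat.histogram": "P",  # assumption taken from this thread
    "stat.error": "P",
    "phot.mag": "Q",        # simplification: usable in either position
}


def position_errors(ucd):
    """Return a list of position-rule violations for a ';'-separated UCD."""
    errors = []
    for index, atom in enumerate(ucd.split(";")):
        flag = FLAGS[atom]
        if index == 0 and flag == "S":
            errors.append(f"{atom} is S-only but is used as the primary atom")
        elif index > 0 and flag == "P":
            errors.append(f"{atom} is P-only but is used in an S position")
    return errors
```

With these assumed flags, stat.error;phot.mag passes, while
stat.histogram;stat.error;phot.mag is flagged because of stat.error.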

You could go on; if you computed a median of that histogram, one
could logically build

stat.median;stat.histogram;stat.error;phot.mag.

I doubt that's a sensible thing to do, but then having extra rules to
prevent it when I do UCD inference in ADQL query processing feels
very wrong, too.

Without having thought deeply about the whole problem, I think we'll
need something like a new atom class "D" ("deriver") with a rule
like:

  D atoms can be prepended to any UCD to create new terms.
  Conceptually, they derive a new quantity from an existing one.  For
  instance, stat.error turns a measurement (which has an underlying,
  implicit distribution) into a single value (in this case, something
  like the standard deviation).  Over a set of different
  measurements, errors have a distribution of their own.  If you
  obtained a histogram of that distribution, you would get a
  stat.histogram;stat.error.

Then, the constraints on P would have to be adapted to allow Ds in
front of Ps.
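
A rough sketch of how such a rule might be checked follows.  Everything
in it is an assumption made for illustration: the set of atoms in the
hypothetical D class, the hand-written flags for the other atoms, and
both function names.  It only illustrates the idea of stripping leading
D atoms before applying the usual position rules.

```python
# Hypothetical "D" class; which atoms would belong to it is an assumption.
DERIVERS = {"stat.histogram", "stat.error", "stat.median"}

# Hand-written flags for the remaining atoms, for illustration only.
FLAGS = {"phot.mag": "Q", "meta.main": "S"}


def split_derivers(ucd):
    """Split a UCD into its leading D atoms and the remaining base atoms."""
    atoms = ucd.split(";")
    count = 0
    while count < len(atoms) and atoms[count] in DERIVERS:
        count += 1
    return atoms[:count], atoms[count:]


def is_valid_with_derivers(ucd):
    """Check the sketched rule: D atoms first, then the usual P/S positions."""
    _, base = split_derivers(ucd)
    for index, atom in enumerate(base):
        if atom in DERIVERS:
            # A D atom after a non-D atom is not covered by the sketched rule.
            return False
        flag = FLAGS[atom]
        if index == 0 and flag == "S":
            return False
        if index > 0 and flag == "P":
            return False
    return True
```

Under these assumptions, stat.histogram;stat.error and
stat.histogram;stat.error;phot.mag would both be accepted, and so would
the stat.median chain from above, which is exactly the kind of thing
extra rules would be needed to prevent.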

I can't say I like it much, but there *are* things like
doubly-derived quantities out there, and I see no way to accommodate
them in the current framework.

Of course, we could say "well, we don't annotate them with UCDs, as
it's unlikely a machine could be coaxed into doing something sensible
with them".  But then it seems to me as if the information that a
given array is a histogram could be something a machine might want to
know about.

> What are the other use cases where we would have to use
> stat.histogram for the content of a column?  TAP 1.1 allows that,
> but do we want to encourage multiple values inside a column?  I
> assume that if this were generalized, the risk of breaking the
> consistency of UCD labeling is not zero.

Well, we have allowed arrays in VOTables since day one, and SIAP has
pushed out arrays since its version 1.  I've had more applications for
arrays recently[1], and these experiences lead me to believe arrays
aren't going anywhere.  If we can't make UCDs work for them, that'd be
a shame.

           -- Markus

[1] An example not published yet, but coming up: dust maps of the
galaxy, where the arrays are quantised extinction curves over
distance.  In this particular case, one could bring the data to first
normal form by expanding the arrays, and perhaps I should have done
so; but since that's also multi-level on HEALPix, the resulting
tables would have become *much* larger. And then there's the Gaia
light curves that I also store in arrays, and again a fully
normalised representation would yield much larger, and presumably
slower, tables.