Next step towards blind discovery: float-valued column metadata

Markus Demleitner msdemlei at ari.uni-heidelberg.de
Tue Jan 26 11:40:41 CET 2021


Dear Registry,

For newcomers: "Blind discovery" means "finding data by physics
rather than name"; you'd not be looking for 2MASS but, perhaps,
"K-band data around Rigel going down to at least 15 mag" (or even
better "0.01 Jy").

With VODataService 1.2's adoption in the field growing, I'd say we're
getting there as far as STC coverage goes (cf.
https://ivoa.net/documents/Notes/Regstc/).

The next step I'd take would be the part about "going down to at
least 15 mag".  That requires a bit more column metadata than we
currently have in either standard TAP_SCHEMA or VODataService 1.2
(and, in consequence, VOSI tables) or even VOTable.  The question is:
what additional metadata?

The service that has been breaking some ground here is ARI's Gaia
mirror (at https://gaia.ari.uni-heidelberg.de/tap).  If you go there,
you will see that its TAP_SCHEMA.columns has, in addition to the
standard columns:

* min_value
* max_value
* mean
* std_dev
* q1, median, q3 (i.e., the three quartiles)
* filling (which here just gives the number of non-NULLs)
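
For illustration, here is a minimal sketch of how a client could
already exploit these non-standard columns against that one service
today.  It assumes pyvo, that the column names are exactly as listed
above, and that the statistics columns are numeric:

    import pyvo

    # The ARI Gaia mirror mentioned above.  Note that min_value,
    # max_value etc. are not standard TAP_SCHEMA columns, so this
    # only works against services that chose to add them.
    svc = pyvo.dal.TAPService("https://gaia.ari.uni-heidelberg.de/tap")

    # A rough, single-service version of the "going down to at least
    # 15 mag" use case: magnitude-like columns whose values reach 15.
    result = svc.search("""
        SELECT table_name, column_name, min_value, max_value
        FROM TAP_SCHEMA.columns
        WHERE ucd LIKE 'phot.mag%'
          AND max_value >= 15
        """)
    for row in result:
        print(row["table_name"], row["column_name"], row["max_value"])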

Grégory has also provided a small Registry extension that adds these
statistics to the VODataService table metadata (see
https://gaia.ari.uni-heidelberg.de/tap-stats), so the stats are
available through /tables as well (where given).
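
And a quick sketch of poking at the /tables output without assuming
anything about the element names that extension uses; it just dumps
whatever per-column children a service chooses to expose:

    import requests
    import xml.etree.ElementTree as ET

    # VOSI tables endpoint of the ARI Gaia mirror.
    resp = requests.get("https://gaia.ari.uni-heidelberg.de/tap/tables")
    root = ET.fromstring(resp.content)

    def local(tag):
        # strip a possible XML namespace prefix from a tag name
        return tag.split("}")[-1]

    # Print, for each column, the names of its child elements; extra
    # statistics (if present) show up as additional children here.
    for elem in root.iter():
        if local(elem.tag) == "column":
            name = next((c.text for c in elem if local(c.tag) == "name"), "?")
            print(name, sorted({local(c.tag) for c in elem}))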

Following the example of the STC discovery, I would now like to write
another roadmap document laying out how we could bring this kind of
thing into the VO as a whole.

First question: Is anyone interested in co-authoring this?

And then, again: What metadata do we want?

min and max are, I think, a given, if only because they're useful in
VOTable, too.

Mean and stddev I'm less convinced about.  While understanding them
as the first two moments of the underlying distribution makes them
general terms, the arguably most interesting columns for this kind of
thing (photometry, distance/parallax) will in most resources be
severely non-gaussian, and so mean and stddev would give a really
poor idea of how the values are actually distributed.  Also, they're
highly unstable under nonlinear arithmetic (as in: 1/mean(parallax)
normally isn't anywhere near mean(1/parallax)).
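
To make that instability concrete, here's a quick numpy sketch with
made-up, log-normal "parallaxes" (nothing Gaia-specific about the
numbers):

    import numpy as np

    rng = np.random.default_rng(0)
    # A strongly skewed stand-in for a parallax column (in, say, mas).
    plx = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)

    print(1 / plx.mean())      # "distance" computed from the mean parallax
    print((1 / plx).mean())    # mean of the individual "distances"
    # The two differ by a large factor; the mean does not survive the
    # nonlinear 1/x.  The median, by contrast, essentially does:
    print(1 / np.median(plx), np.median(1 / plx))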

On the other hand, of course, people are used to dealing with, well,
averages and FWHMs; they're straightforward to compute, and there are
built-ins for them in probably every database out there.  Hm.

From a principled point of view, the quantiles are a lot better
behaved, in particular with the highly skewed distributions we'll be
dealing with most of the time.  So, going for purity of essence, we'd
be using the median plus two or more percentiles.

Me, I'd not be using quartiles, though.  What I think these
statistics are most useful for is an outlier-resistant min and max.
You see, for most of our tables I'd expect min and max to be
representative not of the actual data content but of the noise.  For
instance, translating the use case above into "WHERE max_value > 15"
would give you quite a few tables that in reality peter out at 14 or
so.

Hence, I'd use what would be close to a two-sigma interval for a
gaussian, perhaps the 3rd and 97th percentiles (2 sigma in a gaussian
covers about 95% of the area, so strictly you'd want the 2.5th and
97.5th, but I don't like the looks of those).
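
A small sketch of why I think the robust variant matters, with a
synthetic magnitude column that really ends around 15 but carries a
few junk values:

    import numpy as np

    rng = np.random.default_rng(1)
    # A "catalogue" of magnitudes centred on 14, plus a handful of
    # spurious values standing in for bad measurements.
    mags = np.concatenate([
        rng.normal(14.0, 0.5, 100_000),
        [3.2, 25.8, 29.9],
    ])

    print(mags.min(), mags.max())        # min/max dominated by the junk
    print(np.percentile(mags, [3, 97]))  # roughly [13.1, 14.9]: where
                                         # the data actually is

A raw "max_value > 15" would happily select this table, while the
97th percentile correctly says it peters out below 15.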

Computing those is more expensive than mean and stddev in either time
or space (or both).  I've not looked at postgres' implementation, but
I suspect it'll be at least O(n log(n)) (against O(n) for mean and
stddev).
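
For reference, a sketch of where that asymmetry comes from: mean and
stddev can be accumulated in a single pass with constant extra memory
(Welford's update), whereas an exact percentile needs access to a
sorted or partitioned copy of the full column (or an approximation
structure):

    import numpy as np

    def streaming_mean_std(values):
        """One pass, O(1) extra memory (Welford's algorithm)."""
        n, mean, m2 = 0, 0.0, 0.0
        for x in values:
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)
        return mean, (m2 / (n - 1)) ** 0.5

    data = np.random.default_rng(2).normal(size=10_000)
    print(streaming_mean_std(data))   # never needs the whole array at once
    print(np.percentile(data, 97))    # needs the full data in memory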

So... what do people think?  Going statistically clean or quick and
simple?  Or both (ouch! from me on that one)?  Quartiles?  A central
95% interval?  A histogram with 10 bins, perhaps?  Or something
different entirely?

Thanks,

          Markus

