Next step towards blind discovery: float-valued column metadata

Mark Taylor m.b.taylor at bristol.ac.uk
Tue Mar 16 18:16:33 CET 2021


Markus,

my take would be:

For data discovery, I feel like min and max are what you're going to want.
I do admit that outliers are going to reduce the usefulness of those
quantities in practice, but attempting to craft a data discovery query
that makes meaningful use of quartiles or 2-sigma regions sounds a bit
ambitious to me, especially for datasets you don't already know
a lot about, which is the obvious data discovery context.
On the other hand this kind of discovery isn't something I spend
much time doing, so if you or others disagree I might be wrong.

As far as metadata that's useful/interesting beyond the bounds of
data discovery goes, I'd favour keeping it much less constrained.
That's not really what you're asking about here, but if we're
going to be defining quantitative metadata for table columns,
it's probably a good opportunity to provide sufficiently flexible
hooks in VODataService and TAP_SCHEMA to make such information
available.  I would suggest allowing the service to provide custom
quantiles (e.g. enough to paint a little histogram characterising
the column; or maybe just the median and quartiles, depending on
what the service wants to provide).  Tom Donaldson gave a talk
in Hawaii in 2013 suggesting something along those lines:
  
   https://wiki.ivoa.net/internal/IVOA/InterOpSep2013Applications/SummaryGuidedQueries.pdf
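
To make that a bit more concrete, here is a purely hypothetical sketch
of what a client could do with such per-column quantiles; none of the
names below come from an existing standard, they are only meant to
illustrate the "little histogram" idea:

    # hypothetical: a service publishes per-column quantiles as
    # (fraction, value) pairs, however many it cares to provide
    phot_mag_quantiles = {0.0: 3.2, 0.25: 17.1, 0.5: 18.6,
                          0.75: 19.7, 1.0: 21.4}

    # a client can read this as a coarse cumulative distribution,
    # e.g. to report where the central half of the values lies
    print("central 50%% of values between %.1f and %.1f"
          % (phot_mag_quantiles[0.25], phot_mag_quantiles[0.75]))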

I guess mean and S.D. are reasonable additions too since they're easy
to calculate (unlike quantiles), though without knowing more about
the distribution they don't tell you that much.

Just my 2c.

Mark


On Tue, 26 Jan 2021, Markus Demleitner wrote:

> Dear Registry,
> 
> For newcomers: "Blind discovery" means "finding data by physics
> rather than name"; you'd not be looking for 2MASS but, perhaps,
> "K-band data around Rigel going down to at least 15 mag" (or even
> better "0.01 Jy").
> 
> With VODataService 1.2's adoption in the field growing, I'd say we're
> getting there as far as STC coverage goes (cf.
> https://ivoa.net/documents/Notes/Regstc/).
> 
> The next step I'd take would be the part about "going down to at
> least 15 mag".  That requires a bit more column metadata than we
> currently have in either standard TAP_SCHEMA or VODataService 1.2
> (and, in consequence, VOSI tables) or even VOTable.  The question is:
> what additional metadata?
> 
> The service that has been breaking some ground here is ARI's Gaia
> mirror (at https://gaia.ari.uni-heidelberg.de/tap).  If you go there,
> you will see that its TAP_SCHEMA.columns has, in addition to the
> standard columns:
> 
> * min_value
> * max_value
> * mean
> * std_dev
> * q1, median, q3 (i.e., the three quartiles)
> * filling (which here just gives the number of non-NULLs)
> 
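> A quick way to look at these with pyvo (the extra columns are just the
> ones listed above, and gaiadr2.gaia_source is merely an example table
> name on that service):
> 
>     import pyvo
> 
>     svc = pyvo.dal.TAPService("https://gaia.ari.uni-heidelberg.de/tap")
>     res = svc.search("""
>         SELECT column_name, min_value, max_value, mean, std_dev,
>                q1, median, q3, filling
>         FROM TAP_SCHEMA.columns
>         WHERE table_name = 'gaiadr2.gaia_source'
>     """)
>     print(res.to_table())
> 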
> Grégory has also provided a small Registry extension that adds these
> statistics to VODataService tables,
> https://gaia.ari.uni-heidelberg.de/tap-stats, so these stats are
> available through /tables as well (where given).
> 
> Following the example of the STC discovery, I would now like to write
> another roadmap document laying out how we could bring this kind of
> thing into the VO as a whole.
> 
> First question: Is anyone interested in co-authoring this?
> 
> And then, again: What metadata do we want?
> 
> min and max are, I think, a given, if only because they're useful in
> VOTable, too.
> 
> Mean and stddev I'm less convinced about.  While understanding them
> as the first two moments of the underlying distribution makes them
> general terms, the arguably most interesting columns for this kind of
> thing (photometry, distance/parallax) will in most resources be
> severely non-gaussian, and thus the mean and stddev would give a
> really poor idea of how the values are really distributed.  Also,
> they don't behave well under nonlinear operations (as in
> 1/mean(parallax) normally isn't anywhere near mean(1/parallax)).
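> 
> A toy numpy illustration of that; the lognormal sample is made up and
> just stands in for a skewed parallax column:
> 
>     import numpy as np
> 
>     rng = np.random.default_rng(0)
>     plx = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # made up
> 
>     print(1.0 / plx.mean())    # "distance" from the mean parallax
>     print((1.0 / plx).mean())  # mean of the individual "distances"
> 
> For this toy distribution the two are about a factor of e apart.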
> 
> On the other hand, of course, people are used to dealing with, well,
> averages and FWHMs; they're straightforward to compute, and there are
> built-ins in probably every database out there.  Hm.
> 
> From a principled point of view, the quantiles are a lot better
> behaved, in particular with the highly skewed distributions we'll be
> dealing with most of the time.  So, going for purity of essence, we'd
> be using the median and two or more percentiles.
> 
> Me, I'd not be using quartiles, though.  I think what these things
> are most useful for is outlier-resistant min and max.  You see, I'd
> expect that for most of our tables, min and max will not really be
> representative of the data content but rather of the noise.  For
> instance, translating the use case above into "where max_val>15"
> would give you quite a few tables that in reality peter out at 14
> or so.
> 
> Hence, I'd use what would be close to a two-sigma interval for a
> gaussian, perhaps the 3rd and 97th percentiles (2 sigma in a gaussian
> is about 95% of the area, so strictly you'd want the 2.5th and 97.5th,
> but I don't like the looks of those).
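> 
> As a made-up illustration (a toy magnitude column whose bulk stops
> near 14, plus a handful of spurious faint outliers):
> 
>     import numpy as np
> 
>     rng = np.random.default_rng(1)
>     mags = np.concatenate([rng.normal(12.0, 0.5, 100_000),  # the bulk
>                            rng.normal(18.0, 1.0, 50)])      # outliers
> 
>     print(mags.max())                    # > 15, driven by the outliers
>     print(np.percentile(mags, [3, 97]))  # outlier-resistant "min"/"max"
> 
> A "where max_val>15" constraint would match this table, while the 97th
> percentile keeps it out.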
> 
> Computing those is more expensive than mean and stddev in either time
> or space (or both).  I've not looked at postgres' implementation, but
> I suspect it'll be at least O(n log(n)) (against O(n) for mean and
> stddev).
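> 
> For what it's worth, Postgres does have exact quantiles built in via
> the ordered-set aggregate percentile_cont, which needs the column in
> sorted order, whereas avg/stddev are single-pass.  A sketch of the
> kind of per-column query a service could run to fill such stats,
> shown as a Python string with placeholder table/column names:
> 
>     # sketch only: my_schema.my_table and mag are placeholders
>     STATS_SQL = """
>         SELECT min(mag), max(mag),
>                avg(mag), stddev(mag),
>                percentile_cont(array[0.03, 0.25, 0.5, 0.75, 0.97])
>                    WITHIN GROUP (ORDER BY mag),
>                count(mag) AS filling    -- counts non-NULLs
>         FROM my_schema.my_table
>     """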
> 
> So... what do people think?  Going statistically clean or quick and
> simple?  Or both (ouch! on this from me)?  Quartiles?  95% CI?  A
> histogram with 10 bins perhaps?  Doing something different entirely?
> 
> Thanks,
> 
>           Markus
> 

--
Mark Taylor  Astronomical Programmer  Physics, Bristol University, UK
m.b.taylor at bristol.ac.uk          http://www.star.bristol.ac.uk/~mbt/

