Next step towards blind discovery: float-valued column metadata

Wed Mar 17 09:38:51 CET 2021

Hi Mark,

Thanks for giving this some thought.

On Tue, Mar 16, 2021 at 05:16:33PM +0000, Mark Taylor wrote:
> For data discovery, I feel like min and max are what you're going to want.
> I do admit that outliers are going to reduce the usefulness of those
> quantities in practice, but attempting to craft a data discovery query
> that makes meaningful use of quartiles or 2-sigma regions sounds a bit
> ambitious to me.  Especially for datasets which you don't already

In what sense?  I mean, you'd just be saying

  where percentile_97>20

just like you're saying

  where max_value>20

-- granted, *establishing* percentile_97 is a bit harder than
max_value, but that's what we have databases for, and there are many
cases in which percentile_97 would be a lot more helpful than
max_value (extreme example: USNO-B 2.0 with its spurious magnitude
50s...).
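
To make the difference concrete, here is a minimal sketch using an
in-memory SQLite table as a stand-in for TAP_SCHEMA.columns.  The
table name tap_columns, the extra columns max_value and percentile_97,
and all the numbers are assumptions made up for illustration; none of
this is part of a current standard.

```python
import sqlite3

# Hypothetical stand-in for TAP_SCHEMA.columns; max_value and
# percentile_97 are assumed extension columns, not standard ones.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tap_columns (table_name TEXT, column_name TEXT,"
    " max_value REAL, percentile_97 REAL)")
conn.executemany(
    "INSERT INTO tap_columns VALUES (?, ?, ?, ?)",
    [
        # made-up numbers: shallow catalogue with spurious magnitude 50s
        ("cat_a.main", "mag", 50.0, 19.5),
        # made-up numbers: catalogue that really goes deeper than 20 mag
        ("cat_b.main", "mag", 24.1, 22.8),
    ])


def discover(stat):
    """Return tables whose magnitude statistic `stat` exceeds 20."""
    # stat is interpolated into the SQL, so only accept known names
    if stat not in ("max_value", "percentile_97"):
        raise ValueError(stat)
    rows = conn.execute(
        f"SELECT table_name FROM tap_columns WHERE {stat} > 20"
        " ORDER BY table_name")
    return [row[0] for row in rows]


by_max = discover("max_value")      # matches both tables
by_p97 = discover("percentile_97")  # matches only the deep one
print(by_max, by_p97)
```

With these invented values, the max_value constraint cannot tell the
two tables apart, while the percentile_97 constraint only returns the
one that actually has a useful number of faint objects.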

> As far as metadata that's useful/interesting beyond the bounds of
> data discovery, I'd favour keeping it much less constrained.
> That's not really what you're asking about here, but if we're
> going to be defining quantitative metadata for table columns,
> it's probably a good opportunity to provide sufficiently flexible
> hooks in VODataService and TAP_SCHEMA to make such information
> available.  I would suggest to allow the service to provide custom
> quantiles (e.g. enough to paint a little histogram characterising
> the columns; or maybe just the median and quartiles, depending on
> what the service wants to provide).  Tom Donaldson gave a talk
> in Hawaii in 2013 suggesting something along those lines:
>   
>    https://wiki.ivoa.net/internal/IVOA/InterOpSep2013Applications/SummaryGuidedQueries.pdf

Hmyes.  I suppose it's a good idea to keep the use case of "I've got
a candidate metadata record, now let me quickly see if it's really
interesting" in mind.  But coming up with a flexible scheme that will
produce something clients and users can do anything sensible with is, I
think, hard.

Hence, I'd be interested in any ideas that would add a bit of extra
utility to what we can currently do (where you can just add any
attributes from outside the namespace in VODataService and any
columns to TAP_SCHEMA.columns).  

Perhaps just saying "*if* you're doing histograms, do them in this
way and let the client know what they are in this way" could already
go a long way?

> I guess mean and S.D. are reasonable additions too since they're easy
> to calculate (unlike quantiles), though without knowing more about
> the distribution they don't tell you that much.

I've largely discounted those because they're not terribly robust,
and as I said, most of our interesting distributions are badly
non-Gaussian.  I'd be somewhat relaxed about the fact that they're a
lot easier to compute than the quantiles -- anything we talk about
here will be in RDBMSes, and I'd suspect all of them have some
facilities for establishing quantiles, if only because they need that
for their query planners.
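
As a rough illustration that the database can do the work even
without a dedicated quantile function, here is a sketch of a
nearest-rank percentile computed through plain ORDER BY/OFFSET in
SQLite.  The table obs, its column mag, and the data are invented for
the example, and nearest-rank is just one of several possible quantile
definitions, picked here because it is simple.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (mag REAL)")
# made-up data: 97 plausible magnitudes from 10.0 to about 19.6,
# plus three spurious magnitude-50 rows
values = [10.0 + 0.1 * i for i in range(97)] + [50.0] * 3
conn.executemany("INSERT INTO obs VALUES (?)", [(v,) for v in values])


def percentile(conn, p):
    """Nearest-rank p-th percentile (integer p) of obs.mag."""
    (n,) = conn.execute(
        "SELECT count(*) FROM obs WHERE mag IS NOT NULL").fetchone()
    # 1-based nearest rank: ceil(p*n/100), done in integer arithmetic
    rank = max(1, -(-p * n // 100))
    (value,) = conn.execute(
        "SELECT mag FROM obs WHERE mag IS NOT NULL"
        " ORDER BY mag LIMIT 1 OFFSET ?", (rank - 1,)).fetchone()
    return value


(max_value,) = conn.execute("SELECT max(mag) FROM obs").fetchone()
p97 = percentile(conn, 97)
print(max_value, p97)
```

Here max_value is dominated by the three spurious rows, while the
97th percentile stays with the bulk of the distribution.  On a real
service one would presumably use whatever quantile facility the RDBMS
offers (PostgreSQL, for instance, has the percentile_cont and
percentile_disc ordered-set aggregates) rather than sorting by hand.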

Having said that, I don't think mean and stddev would hurt much, so
I'd not struggle to keep them out.

       -- Markus