Next step towards blind discovery: float-valued column metadata

Mark Taylor m.b.taylor at bristol.ac.uk
Wed Mar 17 13:20:19 CET 2021


On Wed, 17 Mar 2021, Markus Demleitner wrote:
> On Tue, Mar 16, 2021 at 05:16:33PM +0000, Mark Taylor wrote:
> > For data discovery, I feel like min and max are what you're going to want.
> > I do admit that outliers are going to reduce the usefulness of those
> > quantities in practice, but attempting to craft a data discovery query
> > that makes meaningful use of quartiles or 2-sigma regions sounds a bit
> > ambitious to me.  Especially for datasets which you don't already
> 
> In what sense?  I mean, you'd just be saying
> 
>   where percentile_97>20
> 
> just like you're saying
> 
>   where max_value>20

I don't mean that you can't write down some ADQL, just that knowing
what ADQL to write to achieve your scientific aim might be difficult.
 
> -- granted, *establishing* percentile_97 is a bit harder than
> max_value, but that's what we have databases for, and there are many
> cases in which percentile_97 would be a lot more helpful than
> max_value (extreme example: USNO-B 2.0 with its spurious magnitude
> 50s...).

That's what I mean - if you know that your target service has
spurious magnitude 50s, and approximately how common they are,
you can pick a quantile and have a good idea how to use it.
But if you're not familiar with the specifics of the data in
question it's harder to make that judgement, or to apply the
same judgement to all the services out there from which you
want to select.

It may be in practice that e.g. 97th percentile really is a good
choice for all services, in which case maybe I'm worrying
unnecessarily (though you might still be interested in e.g. very
bright/nearby stars).
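
To make the concern concrete, here is a rough Python sketch (not taken
from any real service). It assumes a made-up magnitude column with values
between 10 and 20 plus some fraction of spurious 50s; the helper names
percentile and make_column are hypothetical, and the nearest-rank
percentile definition is just one possible choice.

```python
import random


def percentile(values, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(values)
    # Ceiling of q * n / 100 using integer arithmetic only.
    rank = max(1, -(-q * len(ordered) // 100))
    return ordered[rank - 1]


def make_column(n, bad_fraction, seed=0):
    """Synthetic column: plausible magnitudes plus spurious 50s (assumed data)."""
    rng = random.Random(seed)
    n_bad = round(n * bad_fraction)
    good = [rng.uniform(10.0, 20.0) for _ in range(n - n_bad)]
    return good + [50.0] * n_bad


if __name__ == "__main__":
    for bad_fraction in (0.001, 0.01, 0.05):
        col = make_column(10_000, bad_fraction)
        print(bad_fraction, max(col), round(percentile(col, 97), 2))
```

In this toy setup max_value is 50 whenever any spurious value is present,
while percentile_97 stays within the real range only as long as fewer than
3% of the rows are bad; at 5% it is 50 as well. So the quantile helps if
you know roughly how common the outliers are, which is the judgement that
is hard to make for an unfamiliar service.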

> Perhaps just saying "*if* you're doing histograms, do them in this
> way and let the client know what they are in this way" could already
> go a long way?

Yes, that sounds sensible.

--
Mark Taylor  Astronomical Programmer  Physics, Bristol University, UK
m.b.taylor at bristol.ac.uk          http://www.star.bristol.ac.uk/~mbt/

