Fwd: Advanced Column Statistics

Tue Oct 26 14:00:52 CEST 2021

Dear Ada,

On Thu, Oct 21, 2021 at 04:42:32PM +0200, ada nebot wrote:
> I think it would be great to be able to get information on the distribution 
> of a specific column. So I’d like to follow-up on this Note you distributed 
> a while ago…  
> 
> Has there been any further take-up from providers?   

As far as I know, no, not yet.  I'm trying to convince various
operators of publishing Registries to try it out.

[In case anyone is working on colstats already and I just don't know
about it: I'm grateful for any signal, private or public.]

> Adding higher order moments, the skewness and the kurtosis, might
> give information on the underlying distribution.  That said, any
> choice can be debated. And percentiles are easy to understand and
> calculate. I just find it a bit odd not to add values so used as
> mean and standard deviation, but given the fact that underlying
> population might not be a normal distribution, if added / used it
> deserves a bit of caution.

The current design essentially tries to convey mean and (double)
standard deviation in ways that are more robust against severe
non-gaussianity, which is the rule rather than the exception for
most of the distributions that seem relevant for the use cases
presented (think especially of the magnitudes in common catalogues).
The same robustness is wanted against operations like turning
magnitudes into fluxes or parallaxes into... well, somewhat funky
distance estimations.
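
[To illustrate the point about transformations, here is a minimal
sketch, not anything taken from the Note.  The magnitude column is
made up, and mag_to_flux with its zero point is a hypothetical helper.
The median survives the magnitude-to-flux conversion unchanged, the
mean does not:]

```python
import random
import statistics


def mag_to_flux(mag, zero_point=0.0):
    """Convert a magnitude to a flux in arbitrary units.

    Hypothetical helper; the zero point is an assumption for illustration.
    """
    return 10 ** (-0.4 * (mag - zero_point))


rng = random.Random(42)
# Made-up, deliberately skewed magnitude column: many faint objects,
# few bright ones.  An odd sample size makes the median an actual sample.
mags = [20.0 - 8.0 * rng.random() ** 3 for _ in range(1001)]
fluxes = [mag_to_flux(m) for m in mags]

# The conversion is monotonic, so the median commutes with it.
median_then_convert = mag_to_flux(statistics.median(mags))
convert_then_median = statistics.median(fluxes)

# The conversion is non-linear, so the mean does not commute with it.
mean_then_convert = mag_to_flux(statistics.mean(mags))
convert_then_mean = statistics.mean(fluxes)

print("median:", median_then_convert, convert_then_median)
print("mean:  ", mean_then_convert, convert_then_mean)
```

[Other percentiles behave like the median here, except that a
decreasing conversion such as this one swaps the low and high ends,
and interpolating percentile definitions may differ slightly between
sample points.]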

Adding mean and standard deviation would therefore, I think, do little
to add useful information to the currently proposed set of
statistics.  Having them would perhaps be somewhat more, well,
welcoming; but then I think nudging people towards more robust
statistics would be a public service all around if we can get away
with it.

Skewness and kurtosis would, indeed, add information that is not yet
(properly) conveyed in the statistics as proposed (although comparing
percentile03-median and median-percentile97 could be a useful
stand-in for skewness in many practical applications).  But before
adopting them (and all their robustness problems): Is there a strong
use case where you'd do data discovery using them?
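
[A rough sketch of what I mean by that stand-in.  The function names,
the nearest-rank percentile definition, and the normalisation by the
percentile spread are assumptions for illustration, not part of the
proposed statistics:]

```python
import math


def percentile(values, p):
    """Nearest-rank percentile, one of several common definitions."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def percentile_asymmetry(values):
    """Compare median-percentile03 with percentile97-median.

    Returns a number in [-1, 1]: positive when the upper tail is longer,
    negative when the lower tail is longer, 0.0 for a symmetric sample.
    """
    p03 = percentile(values, 3)
    med = percentile(values, 50)
    p97 = percentile(values, 97)
    spread = p97 - p03
    if spread == 0:
        return 0.0  # degenerate column, no asymmetry to report
    return ((p97 - med) - (med - p03)) / spread


# Made-up columns: one symmetric, one with a long upper tail.
symmetric = list(range(-50, 51))
right_skewed = [x ** 2 for x in range(101)]

print(percentile_asymmetry(symmetric))
print(percentile_asymmetry(right_skewed))
```

[A client would of course not recompute the percentiles; it would
take the three values from the column metadata and only do the final
subtraction.]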

Thanks,

            Markus