Statistics metadata in TAP

Gregory Mantelet gmantele at ari.uni-heidelberg.de
Fri Oct 21 01:55:35 CEST 2016


On 21/10/2016 01:29, Walter Landry wrote:
> Gregory Mantelet <gmantele at ari.uni-heidelberg.de> wrote:
>> ** Columns metadata
>>
>> The idea is to add basic statistics like a count, min, max, ... for
>> some numerical columns of tables published in a TAP service. For that,
>> I have just added the following columns in TAP_SCHEMA.columns:
>>
>>      - min_value
>>      - max_value
>>      - mean
>>      - std_dev
>>      - q1          (i.e. first quartile)
>>      - median      (i.e. second quartile)
>>      - q3          (i.e. third quartile)
>>      - filling     (number of rows having a NOT NULL value for this column)
> As a data point, at IRSA we already calculate min, max, and number of
> rows for internal purposes.  Mean, std_dev, and filling would not be
> difficult to calculate at the same time.  Quartiles would be somewhat
> onerous.  We have rather large tables that grow over time (the project
> takes more data), and calculating the quartiles requires either sorting
> the data or lots of external storage.

I can see the difficulty there. Well, I never said I wanted these
metadata to be mandatory. So I agree these statistics may be a good
thing, but only for "stable" tables. Otherwise they are, as you say,
complicated to maintain, and it is probably better not to provide them
for this kind of table.
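
To illustrate why min, max, mean, std_dev and filling are cheap while
quartiles are not, here is a rough Python sketch of a single pass over
one column. It is only an illustration, not what IRSA or any TAP
library actually does: the function name one_pass_stats is made up,
None stands in for SQL NULL, the running variance uses Welford's
update, and I assume a population standard deviation. Nothing is
sorted or stored, which is exactly what exact quartiles would require.

```python
import math
from typing import Iterable, Optional


def one_pass_stats(values: Iterable[Optional[float]]) -> dict:
    """Hypothetical helper: one pass over a column, no sorting, O(1) memory.

    None plays the role of SQL NULL and is skipped.
    """
    filling = 0                  # number of NOT NULL values seen so far
    min_value = max_value = None
    mean = 0.0
    m2 = 0.0                     # running sum of squared deviations (Welford)

    for v in values:
        if v is None:
            continue
        filling += 1
        min_value = v if min_value is None else min(min_value, v)
        max_value = v if max_value is None else max(max_value, v)
        delta = v - mean
        mean += delta / filling
        m2 += delta * (v - mean)

    if filling == 0:
        return {"filling": 0, "min_value": None, "max_value": None,
                "mean": None, "std_dev": None}

    return {
        "filling": filling,
        "min_value": min_value,
        "max_value": max_value,
        "mean": mean,
        # Assumption: population standard deviation (divide by filling).
        "std_dev": math.sqrt(m2 / filling),
    }
```

Because the state is just a handful of numbers, such a pass can be
rerun or resumed as a table grows, whereas exact quartiles need the
sorted values (or extra storage) every time.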

> As a side point, I am a little worried about what it means to take the
> mean of a table with NULL's.  I can define it, but I do not know if I
> like it.

Hence the additional column "filling". All the statistics I am proposing
are computed only from the NOT NULL values, and "filling" gives the
number of rows used to compute all of them, including the mean. I
should probably have mentioned it in my email...sorry.
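
To make the NULL handling concrete, here is a minimal Python sketch of
how the proposed values could be derived for one column. It is only an
illustration under my own assumptions: the function name
column_statistics is hypothetical, None stands in for SQL NULL, the
standard deviation is the population one, and the quartiles use the
"inclusive" interpolation of Python's statistics module. The proposal
itself does not fix any of these choices.

```python
import statistics
from typing import Optional, Sequence


def column_statistics(column: Sequence[Optional[float]]) -> dict:
    """Hypothetical helper: statistics over the NOT NULL values only."""
    # Drop NULLs first; every statistic below is based on these values.
    present = sorted(v for v in column if v is not None)

    result = dict.fromkeys(
        ["min_value", "max_value", "mean", "std_dev", "q1", "median", "q3"]
    )
    # "filling" is the number of rows that actually contributed.
    result["filling"] = len(present)
    if not present:
        return result  # all-NULL column: nothing to compute

    result.update(
        min_value=present[0],
        max_value=present[-1],
        mean=statistics.fmean(present),
    )
    if len(present) >= 2:
        # Assumption: "inclusive" quartiles and population std deviation.
        q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
        result.update(
            std_dev=statistics.pstdev(present), q1=q1, median=median, q3=q3
        )
    return result
```

For a column holding 3, NULL, 1, 5, NULL, 2, 4, this reports a filling
of 5 and a mean of 3, so the two NULL rows change neither the mean nor
its denominator.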

Cheers,
Grégory

