Statistics metadata in TAP
Gregory Mantelet
gmantele at ari.uni-heidelberg.de
Fri Oct 21 01:55:35 CEST 2016
On 21/10/2016 01:29, Walter Landry wrote:
> Gregory Mantelet <gmantele at ari.uni-heidelberg.de> wrote:
>> ** Columns metadata
>>
>> The idea is to add basic statistics like a count, min, max, ... for
>> some numerical columns of tables published in a TAP service. For that,
>> I have just added the following columns in TAP_SCHEMA.columns:
>>
>> - min_value
>> - max_value
>> - mean
>> - std_dev
>> - q1 (i.e. first quartile)
>> - median (i.e. second quartile)
>> - q3 (i.e. third quartile)
>> - filling (number of rows having a NOT NULL value for this column)
> As a data point, at IRSA we already calculate min, max, and number of
> rows for internal purposes. Mean, std_dev, and filling would not be
> difficult to calculate at the same time. Quartiles would be somewhat
> onerous. We have rather large tables that grow over time (the project
> takes more data), and calculating the quartiles requires either sorting
> the data or lots of external storage.
I can see the difficulty there. Well, I never said I wanted these
metadata to be mandatory. So I agree that these statistics may be a good
thing, but only for "stable" tables. Otherwise they are, as you say,
complicated to maintain, and it is probably better not to provide them
for this kind of table.
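
To illustrate why min, max, mean, std_dev and filling are cheap while
quartiles are not, here is a minimal sketch of a single-pass computation
that never sorts or stores the data. It assumes Python, uses None as a
stand-in for SQL NULL, and computes the population standard deviation
(the thread does not say which variant is meant). The function name
streaming_stats is hypothetical and not part of any TAP implementation.

```python
import math


def streaming_stats(values):
    """Compute filling, min, max, mean and std_dev in one pass.

    `values` is any iterable of numbers; None stands in for SQL NULL
    and is skipped. Uses Welford's online update, so memory use is
    constant no matter how many rows are read.
    """
    filling = 0          # number of NOT NULL values seen so far
    mean = 0.0
    m2 = 0.0             # running sum of squared deviations from the mean
    min_value = None
    max_value = None

    for v in values:
        if v is None:
            continue
        filling += 1
        delta = v - mean
        mean += delta / filling
        m2 += delta * (v - mean)
        if min_value is None or v < min_value:
            min_value = v
        if max_value is None or v > max_value:
            max_value = v

    if filling == 0:
        # No NOT NULL value at all: every statistic is undefined.
        return {"filling": 0, "min_value": None, "max_value": None,
                "mean": None, "std_dev": None}

    # Assumption: population standard deviation (divide by filling).
    std_dev = math.sqrt(m2 / filling)
    return {"filling": filling, "min_value": min_value,
            "max_value": max_value, "mean": mean, "std_dev": std_dev}


stats = streaming_stats([1.0, None, 3.0, 2.0, None])
```

Because the running state can be carried over, new rows appended to a
growing table only require the new values to be fed in. Exact quartiles
have no such constant-memory update, which matches the concern raised
above.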
> As a side point, I am a little worried about what it means to take the
> mean of a table with NULLs. I can define it, but I do not know if I
> like it.
Hence the additional column "filling". All the statistics I am proposing
are computed only from the NOT NULL values, and "filling" gives the
number of rows used to compute all of them, including the mean. I
should probably have mentioned that in my email, sorry.
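
As a small sketch of this convention, standard SQL aggregates already
skip NULLs, and COUNT(column) gives exactly the "filling" value. The
example assumes Python's built-in sqlite3 module as a stand-in database;
the table name demo and the column name mag are hypothetical.

```python
import sqlite3

# In-memory database with one numeric column containing two NULLs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (mag REAL)")
conn.executemany(
    "INSERT INTO demo (mag) VALUES (?)",
    [(10.0,), (None,), (14.0,), (None,), (12.0,)],
)

# COUNT(*) counts every row; COUNT(mag), MIN, MAX and AVG ignore NULLs.
row = conn.execute(
    "SELECT COUNT(*), COUNT(mag), MIN(mag), MAX(mag), AVG(mag) FROM demo"
).fetchone()
total_rows, filling, min_value, max_value, mean = row
conn.close()

print(total_rows, filling, min_value, max_value, mean)
```

Here the mean is taken over the 3 NOT NULL rows out of 5, so a client
comparing filling with the table's row count can see how much of the
column the statistics actually cover.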
Cheers,
Grégory