Statistics metadata in TAP

Gregory Mantelet gmantele at ari.uni-heidelberg.de
Fri Oct 21 12:34:18 CEST 2016


Hi Thomas, Tom,

To answer to Thomas, yes, I originally planned to provide also 
histograms in the TAP metadata, but since it is not as simple 
information as the others, I preferred to go step by step and keep that 
idea for later.

As proposes Tom, providing histograms in TAP could be a second 
interesting step. But I admit being a bit sceptical about the interest 
from the users for such information. For the moment, since the moment 
the statistics I presented are online, I have very few download (nearly 
none) of the histograms ; maybe because this service is not so well 
known for the moment or maybe because the image of the histogram is 
enough for the users, but I doubt about all these reasons. Then, maybe 
users are merely not aware yet of such possibility and if they were, 
maybe they would use it. So, I do not know...

These sceptical ideas apart, in order to have histograms in TAP I 
originally thought to three solutions:

     1- A column of type ARRAY (1D or 2D depending of whether the 
histogram is contiguous or not). Very simple and minimalist solution, 
but ARRAY are not yet supported in TAP.

     2- A URL toward a VOTable containing the histogram ; the 
structure/data-model would be then to be defined. Easy to store and to 
get in TAP_SCHEMA but probably not really query-able by users, although 
I assume that most of them would prefer to get the whole histogram 
rather than just parts of it.

     3- Having an additional table in TAP_SCHEMA. I never thought to any 
precise structure here as you did Tom. However I am not sure I like the 
idea to have 4 tables to store an histogram, but I may have wrongly 
interpreted your suggestion. Considering the structure of your tables, I 
would say that just one table with the following structure would be 
fairly enough and would cover the case of continuous histogram or not:

CREATE TABLE "TAP_SCHEMA".histogram_bins (
     table_name VARCHAR,
     column_name VARCHAR,
     step INTEGER,
     lb DOUBLE PRECISION,
     hb DOUBLE PRECISION,
     value INTEGER,
     PRIMARY KEY (table_name, column_name)
);

Note 1: I think "step" could be optional if the minimum and maximum of 
the histogram are exactly the minimum and maximum values provided in 
TAP_SCHEMA.columns (min_value and max_value).

Note 2: Depending of the implementation, "step" and "value" could be 
BIGINT instead of INTEGER.


Considering these solutions and the fact that I have already all 
histograms for the columns of my TAP service (for Gaia, Hipparcos, 
Tycho2, PPMXL and 2MASS), I could easily implement a prototype for one 
(or more if really needed) of these solutions or any other preferred 
suggestion, if we all agree that having histograms in TAP metadata is 
really a good idea and if we all merge toward of solution.

Cheers,
Grégory


On 21/10/2016 09:21, Tom Donaldson wrote:
> Hello All,
>
> At MAST, we are currently computing and using min and max for numeric 
> columns, and of course could compute the other basic statistics.
>
> Our main use of the min/max values is to define bin sizes for 
> histograms that support the user interface for certain queries.  We 
> also store histograms for discrete-valued columns.  It seems to me 
> that it is worth considering adding this histogram information as 
> optional TAP metadata.
>
> We use 4 tables to store the histogram-related information.  Our 
> default number of bins per column is 145, mostly because that gives a 
> decent looking histgram in the UI without being too large.  I’m sure 
> these could all be rearranged to some degree to make it more 
> TAP-suitable, but the current tables look like this:
>
> CREATE TABLE [dbo].[histogramBins](
>        [columnName] [varchar](32) NULL,
>        [step] [int] NULL,
>        [lb] [float] NULL,
>        [hb] [float] NULL
> )
> CREATE TABLE [dbo].[histogramContinuous](
>        [columnName] [varchar](32) NULL,
>        [step] [int] NULL,
>        [lb] [float] NULL,
>        [hb] [float] NULL,
>        [theCount] [bigint] NULL
> )
> CREATE TABLE [dbo].[histogramDiscrete](
>        [columnName] [varchar](32) NULL,
>        [distinctValue] [varchar](2048) NULL,
>        [theCount] [bigint] NULL,
>        [hasBeenSplit] [char](1) NULL
> )
> CREATE TABLE [dbo].[histogramMinMax](
>        [columnName] [varchar](32) NULL,
>        [minimum] [float] NULL,
>        [maximum] [float] NULL
> )
>
> Thanks,
> Tom
>
>
>
>> On Oct 20, 2016, at 5:07 PM, Gregory Mantelet 
>> <gmantele at ari.uni-heidelberg.de 
>> <mailto:gmantele at ari.uni-heidelberg.de>> wrote:
>>
>>
>> Dear DAL and Apps members,
>>
>> Since I do not attend to this interop, I would like to highlight 
>> quickly one of my last development concerning TAP because I think it 
>> may be interested to either do the same in your own TAP service or 
>> merely use it. As suggested by the title of this email it is about 
>> adding metadata in TAP.
>>
>> (I send this email also to Apps because of the last point I make in 
>> this email: a compatibility with a new feature of TOPCAT)
>>
>>
>> ** Columns metadata
>>
>> The idea is to add basic statistics like a count, min, max, ... for 
>> some numerical columns of tables published in a TAP service. For 
>> that, I have just added the following columns in TAP_SCHEMA.columns:
>>
>>     - min_value
>>     - max_value
>>     - mean
>>     - std_dev
>>     - q1          (i.e. first quartile)
>>     - median (i.e. second quartile)
>>     - q3          (i.e. third quartile)
>>     - filling     (number of rows having a NOT NULL value for this 
>> column)
>>
>> Except for "filling" which must be an integer (INTEGER or BIGINT in 
>> PostgreSQL), I have chosen to set all these columns as DOUBLE 
>> PRECISION since most of the columns to describe are, in the "worst" 
>> case, double values.
>>
>> When no statistics can be provided for a column, all these additional 
>> metadata would be NULL.
>>
>>
>> ** Tables metadata
>>
>> In addition, I have also added another column in TAP_SCHEMA.tables:
>>
>>     - row_count (of type INTEGER or BIGINT)
>>
>>
>> ** VOSI description of tables
>>
>> Since in TAP all tables and columns metadata MUST be the same in 
>> TAP_SCHEMA and /tables, I have also updated our /tables resource.
>>
>> Besides, on a recommendation of Mark Taylor, I designed and linked a 
>> simple XSD schema  in order to have a valid XML document. You can 
>> find this schema at the following address:
>>
>> http://gaia.ari.uni-heidelberg.de/tap-stats.xsd
>>
>>
>> ** Visibility in TOPCAT
>>
>> Thanks to Mark Taylor, any custom metadata (non-standard TAP columns) 
>> can be displayed in the last version of TOPCAT. Thus, all the 
>> statistics described above can be seen there for our Gaia TAP service 
>> (n.b. you can find this TAP service easily in the registry with the 
>> keywords "Gaia" and "ARI", but in case you can not, here is the root 
>> access URL: http://gaia.ari.uni-heidelberg.de/tap).
>>
>>
>> ** Last words...
>>
>> According to me all these basic statistics may be useful to discover 
>> the content of a table, especially when this one is as large as Gaia, 
>> PPMXL, 2MASS, ... It can indeed prevent some users to perform 
>> apparently simple and short queries such as "SELECT COUNT(*) FROM 
>> a_big_table" which on the contrary to what most people think is not 
>> often a quick query on large tables. Having already computed such 
>> information is then an economy of time and resources for the users 
>> and the server.
>>
>> Finally, I am not trying to convince anybody to have such metadata, 
>> but I just want to highlight a possible extension of TAP helping in 
>> simple data discovery. Besides, this use-case also demonstrates how 
>> easy it could be to add more simple metadata inside a TAP service. So 
>> maybe it could be interested, if possible, to write an appendix about 
>> that in the next version of TAP or just as an IVOA note. What do you 
>> think?
>>
>> If anybody has questions or wants further details about the TAP 
>> "extension" I presented here, do not hesitate to ask ; I am not at 
>> the interop, but I am fully available by email
>>
>> Regards,
>> Grégory
>>
>>
>> PS: For those who are interested, I also provide histograms and 
>> sky-maps (using Healpix) for most of the published columns on the 
>> page http://gaia.ari.uni-heidelberg.de/tap/tables. Both can be 
>> downloaded as images but also as tables that you can then 
>> display/process as you want (e.g. display the histogram in TOPCAT, 
>> display and navigate inside the Healpix map in Aladin, ...).
>>
>>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ivoa.net/pipermail/dal/attachments/20161021/5bfc183a/attachment-0001.html>


More information about the dal mailing list