Statistics metadata in TAP

Gregory Mantelet gmantele at ari.uni-heidelberg.de
Thu Oct 20 23:07:46 CEST 2016


Dear DAL and Apps members,

Since I do not attend to this interop, I would like to highlight quickly 
one of my last development concerning TAP because I think it may be 
interested to either do the same in your own TAP service or merely use 
it. As suggested by the title of this email it is about adding metadata 
in TAP.

(I send this email also to Apps because of the last point I make in this 
email: a compatibility with a new feature of TOPCAT)


** Columns metadata

The idea is to add basic statistics like a count, min, max, ... for some 
numerical columns of tables published in a TAP service. For that, I have 
just added the following columns in TAP_SCHEMA.columns:

     - min_value
     - max_value
     - mean
     - std_dev
     - q1          (i.e. first quartile)
     - median (i.e. second quartile)
     - q3          (i.e. third quartile)
     - filling     (number of rows having a NOT NULL value for this column)

Except for "filling" which must be an integer (INTEGER or BIGINT in 
PostgreSQL), I have chosen to set all these columns as DOUBLE PRECISION 
since most of the columns to describe are, in the "worst" case, double 
values.

When no statistics can be provided for a column, all these additional 
metadata would be NULL.


** Tables metadata

In addition, I have also added another column in TAP_SCHEMA.tables:

     - row_count (of type INTEGER or BIGINT)


** VOSI description of tables

Since in TAP all tables and columns metadata MUST be the same in 
TAP_SCHEMA and /tables, I have also updated our /tables resource.

Besides, on a recommendation of Mark Taylor, I designed and linked a 
simple XSD schema  in order to have a valid XML document. You can find 
this schema at the following address:

http://gaia.ari.uni-heidelberg.de/tap-stats.xsd


** Visibility in TOPCAT

Thanks to Mark Taylor, any custom metadata (non-standard TAP columns) 
can be displayed in the last version of TOPCAT. Thus, all the statistics 
described above can be seen there for our Gaia TAP service (n.b. you can 
find this TAP service easily in the registry with the keywords "Gaia" 
and "ARI", but in case you can not, here is the root access URL: 
http://gaia.ari.uni-heidelberg.de/tap).


** Last words...

According to me all these basic statistics may be useful to discover the 
content of a table, especially when this one is as large as Gaia, PPMXL, 
2MASS, ... It can indeed prevent some users to perform apparently simple 
and short queries such as "SELECT COUNT(*) FROM a_big_table" which on 
the contrary to what most people think is not often a quick query on 
large tables. Having already computed such information is then an 
economy of time and resources for the users and the server.

Finally, I am not trying to convince anybody to have such metadata, but 
I just want to highlight a possible extension of TAP helping in simple 
data discovery. Besides, this use-case also demonstrates how easy it 
could be to add more simple metadata inside a TAP service. So maybe it 
could be interested, if possible, to write an appendix about that in the 
next version of TAP or just as an IVOA note. What do you think?

If anybody has questions or wants further details about the TAP 
"extension" I presented here, do not hesitate to ask ; I am not at the 
interop, but I am fully available by email

Regards,
Grégory


PS: For those who are interested, I also provide histograms and sky-maps 
(using Healpix) for most of the published columns on the page 
http://gaia.ari.uni-heidelberg.de/tap/tables. Both can be downloaded as 
images but also as tables that you can then display/process as you want 
(e.g. display the histogram in TOPCAT, display and navigate inside the 
Healpix map in Aladin, ...).


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ivoa.net/pipermail/apps/attachments/20161020/19263a7b/attachment.html>


More information about the apps mailing list