Statistics metadata in TAP
Gregory Mantelet
gmantele at ari.uni-heidelberg.de
Fri Oct 21 12:34:18 CEST 2016
Hi Thomas, Tom,
To answer to Thomas, yes, I originally planned to provide also
histograms in the TAP metadata, but since it is not as simple
information as the others, I preferred to go step by step and keep that
idea for later.
As proposes Tom, providing histograms in TAP could be a second
interesting step. But I admit being a bit sceptical about the interest
from the users for such information. For the moment, since the moment
the statistics I presented are online, I have very few download (nearly
none) of the histograms ; maybe because this service is not so well
known for the moment or maybe because the image of the histogram is
enough for the users, but I doubt about all these reasons. Then, maybe
users are merely not aware yet of such possibility and if they were,
maybe they would use it. So, I do not know...
These sceptical ideas apart, in order to have histograms in TAP I
originally thought to three solutions:
1- A column of type ARRAY (1D or 2D depending of whether the
histogram is contiguous or not). Very simple and minimalist solution,
but ARRAY are not yet supported in TAP.
2- A URL toward a VOTable containing the histogram ; the
structure/data-model would be then to be defined. Easy to store and to
get in TAP_SCHEMA but probably not really query-able by users, although
I assume that most of them would prefer to get the whole histogram
rather than just parts of it.
3- Having an additional table in TAP_SCHEMA. I never thought to any
precise structure here as you did Tom. However I am not sure I like the
idea to have 4 tables to store an histogram, but I may have wrongly
interpreted your suggestion. Considering the structure of your tables, I
would say that just one table with the following structure would be
fairly enough and would cover the case of continuous histogram or not:
CREATE TABLE "TAP_SCHEMA".histogram_bins (
table_name VARCHAR,
column_name VARCHAR,
step INTEGER,
lb DOUBLE PRECISION,
hb DOUBLE PRECISION,
value INTEGER,
PRIMARY KEY (table_name, column_name)
);
Note 1: I think "step" could be optional if the minimum and maximum of
the histogram are exactly the minimum and maximum values provided in
TAP_SCHEMA.columns (min_value and max_value).
Note 2: Depending of the implementation, "step" and "value" could be
BIGINT instead of INTEGER.
Considering these solutions and the fact that I have already all
histograms for the columns of my TAP service (for Gaia, Hipparcos,
Tycho2, PPMXL and 2MASS), I could easily implement a prototype for one
(or more if really needed) of these solutions or any other preferred
suggestion, if we all agree that having histograms in TAP metadata is
really a good idea and if we all merge toward of solution.
Cheers,
Grégory
On 21/10/2016 09:21, Tom Donaldson wrote:
> Hello All,
>
> At MAST, we are currently computing and using min and max for numeric
> columns, and of course could compute the other basic statistics.
>
> Our main use of the min/max values is to define bin sizes for
> histograms that support the user interface for certain queries. We
> also store histograms for discrete-valued columns. It seems to me
> that it is worth considering adding this histogram information as
> optional TAP metadata.
>
> We use 4 tables to store the histogram-related information. Our
> default number of bins per column is 145, mostly because that gives a
> decent looking histgram in the UI without being too large. I’m sure
> these could all be rearranged to some degree to make it more
> TAP-suitable, but the current tables look like this:
>
> CREATE TABLE [dbo].[histogramBins](
> [columnName] [varchar](32) NULL,
> [step] [int] NULL,
> [lb] [float] NULL,
> [hb] [float] NULL
> )
> CREATE TABLE [dbo].[histogramContinuous](
> [columnName] [varchar](32) NULL,
> [step] [int] NULL,
> [lb] [float] NULL,
> [hb] [float] NULL,
> [theCount] [bigint] NULL
> )
> CREATE TABLE [dbo].[histogramDiscrete](
> [columnName] [varchar](32) NULL,
> [distinctValue] [varchar](2048) NULL,
> [theCount] [bigint] NULL,
> [hasBeenSplit] [char](1) NULL
> )
> CREATE TABLE [dbo].[histogramMinMax](
> [columnName] [varchar](32) NULL,
> [minimum] [float] NULL,
> [maximum] [float] NULL
> )
>
> Thanks,
> Tom
>
>
>
>> On Oct 20, 2016, at 5:07 PM, Gregory Mantelet
>> <gmantele at ari.uni-heidelberg.de
>> <mailto:gmantele at ari.uni-heidelberg.de>> wrote:
>>
>>
>> Dear DAL and Apps members,
>>
>> Since I do not attend to this interop, I would like to highlight
>> quickly one of my last development concerning TAP because I think it
>> may be interested to either do the same in your own TAP service or
>> merely use it. As suggested by the title of this email it is about
>> adding metadata in TAP.
>>
>> (I send this email also to Apps because of the last point I make in
>> this email: a compatibility with a new feature of TOPCAT)
>>
>>
>> ** Columns metadata
>>
>> The idea is to add basic statistics like a count, min, max, ... for
>> some numerical columns of tables published in a TAP service. For
>> that, I have just added the following columns in TAP_SCHEMA.columns:
>>
>> - min_value
>> - max_value
>> - mean
>> - std_dev
>> - q1 (i.e. first quartile)
>> - median (i.e. second quartile)
>> - q3 (i.e. third quartile)
>> - filling (number of rows having a NOT NULL value for this
>> column)
>>
>> Except for "filling" which must be an integer (INTEGER or BIGINT in
>> PostgreSQL), I have chosen to set all these columns as DOUBLE
>> PRECISION since most of the columns to describe are, in the "worst"
>> case, double values.
>>
>> When no statistics can be provided for a column, all these additional
>> metadata would be NULL.
>>
>>
>> ** Tables metadata
>>
>> In addition, I have also added another column in TAP_SCHEMA.tables:
>>
>> - row_count (of type INTEGER or BIGINT)
>>
>>
>> ** VOSI description of tables
>>
>> Since in TAP all tables and columns metadata MUST be the same in
>> TAP_SCHEMA and /tables, I have also updated our /tables resource.
>>
>> Besides, on a recommendation of Mark Taylor, I designed and linked a
>> simple XSD schema in order to have a valid XML document. You can
>> find this schema at the following address:
>>
>> http://gaia.ari.uni-heidelberg.de/tap-stats.xsd
>>
>>
>> ** Visibility in TOPCAT
>>
>> Thanks to Mark Taylor, any custom metadata (non-standard TAP columns)
>> can be displayed in the last version of TOPCAT. Thus, all the
>> statistics described above can be seen there for our Gaia TAP service
>> (n.b. you can find this TAP service easily in the registry with the
>> keywords "Gaia" and "ARI", but in case you can not, here is the root
>> access URL: http://gaia.ari.uni-heidelberg.de/tap).
>>
>>
>> ** Last words...
>>
>> According to me all these basic statistics may be useful to discover
>> the content of a table, especially when this one is as large as Gaia,
>> PPMXL, 2MASS, ... It can indeed prevent some users to perform
>> apparently simple and short queries such as "SELECT COUNT(*) FROM
>> a_big_table" which on the contrary to what most people think is not
>> often a quick query on large tables. Having already computed such
>> information is then an economy of time and resources for the users
>> and the server.
>>
>> Finally, I am not trying to convince anybody to have such metadata,
>> but I just want to highlight a possible extension of TAP helping in
>> simple data discovery. Besides, this use-case also demonstrates how
>> easy it could be to add more simple metadata inside a TAP service. So
>> maybe it could be interested, if possible, to write an appendix about
>> that in the next version of TAP or just as an IVOA note. What do you
>> think?
>>
>> If anybody has questions or wants further details about the TAP
>> "extension" I presented here, do not hesitate to ask ; I am not at
>> the interop, but I am fully available by email
>>
>> Regards,
>> Grégory
>>
>>
>> PS: For those who are interested, I also provide histograms and
>> sky-maps (using Healpix) for most of the published columns on the
>> page http://gaia.ari.uni-heidelberg.de/tap/tables. Both can be
>> downloaded as images but also as tables that you can then
>> display/process as you want (e.g. display the histogram in TOPCAT,
>> display and navigate inside the Healpix map in Aladin, ...).
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ivoa.net/pipermail/dal/attachments/20161021/5bfc183a/attachment-0001.html>
More information about the dal
mailing list