Parquet with STIL/STILTS/TOPCAT

Van Klaveren, Brian N. bvan at slac.stanford.edu
Mon Mar 8 23:39:20 CET 2021


Hi,

I don't have any good recommendations, but I'll just regurgitate a few things I know.

Parquet files have KeyValue metadata lists at the file level and at the column level. When I did a source-level evaluation a few years ago, support for column-level metadata was generally poor in higher-level libraries, despite it being in the spec (which I think is fairly readable if you consult the parquet.thrift file in the parquet repo). I'm not sure whether that was a consideration, but the pandas library generally serializes certain metadata as a JSON string into the file-level metadata, under a single "pandas" key.

You may see such use here, in the fastparquet engine, when writing out a pandas data frame:
https://github.com/dask/fastparquet/blob/efd3fd19a9f0dcf91045c31ff4dbb7cc3ec504f2/fastparquet/writer.py#L738-L774
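
As a rough sketch of what that looks like from the reading side, the snippet below decodes a "pandas" entry from a file-level key/value map. The dictionary is a hand-written stand-in for what a Parquet reader would return, not output from a real file, and the field names follow a simplified version of the pandas metadata layout; `pandas_column_types` is a hypothetical helper.

```python
import json

# Hypothetical file-level key/value metadata, shaped like what a Parquet
# reader might hand back (bytes keys and values). The content is made up
# for illustration and is a simplified version of the pandas layout.
file_metadata = {
    b"pandas": json.dumps({
        "index_columns": [],
        "columns": [
            {"name": "ra", "field_name": "ra", "pandas_type": "float64",
             "numpy_type": "float64", "metadata": None},
            {"name": "dec", "field_name": "dec", "pandas_type": "float64",
             "numpy_type": "float64", "metadata": None},
        ],
    }).encode("utf-8"),
}


def pandas_column_types(kv):
    """Return {column name: pandas_type} from the "pandas" JSON blob.

    Returns an empty dict if the file carries no "pandas" key.
    """
    blob = kv.get(b"pandas")
    if blob is None:
        return {}
    info = json.loads(blob.decode("utf-8"))
    return {col["name"]: col["pandas_type"] for col in info.get("columns", [])}


column_types = pandas_column_types(file_metadata)
print(column_types)
```

Everything here lives in one file-level value; nothing is attached to the Parquet column chunks themselves.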

I would note that pandas dataframes don't usually have a notion of user metadata that is stored in that way, and as far as I know the core library gives you no easy hook for getting it in there. So, I suppose if a service were writing out pandas files via this interface, it would be its responsibility to update the files afterwards with appropriate metadata (that's more of a note for implementors creating files, so not relevant to consumers). In any case, I believe GeoPandas was trying to solve that problem at one point at the DataFrame (table) level, but I don't know how that worked out. I had hoped that pandas or other arrow-based tabular projects might be able to attach metadata to a column in a way that would persist across certain operations (splitting/joining) on a table, but I don't think any library takes that into account right now.

With all that said, I'm not aware of specific applications in astronomy, because I'm not aware of many applications beyond pandas itself that actually use that feature, but I also haven't looked much in the last year or so. I would expect the default solution to be serializing VOTable metadata to XML, probably to an `ivoa:VOTable` key or something, in the parquet table, without attempting anything with column-level metadata.
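
A minimal sketch of that idea, assuming a data-less VOTable document is acceptable as the metadata carrier: the helper below builds FIELD elements with units and UCDs and places the resulting XML under an `ivoa:VOTable` key. The helper name, the example columns, and the omission of the VOTable XML namespace are all assumptions made for illustration; no convention for this key exists as far as this thread establishes.

```python
import xml.etree.ElementTree as ET


def votable_metadata_xml(columns):
    """Build a data-less VOTable document describing the given columns.

    Simplified for illustration: no XML namespace and no DATA element.
    """
    votable = ET.Element("VOTABLE", version="1.4")
    resource = ET.SubElement(votable, "RESOURCE")
    table = ET.SubElement(resource, "TABLE")
    for col in columns:
        attrs = {"name": col["name"], "datatype": col["datatype"]}
        # unit and ucd are optional FIELD attributes
        for opt in ("unit", "ucd"):
            if col.get(opt):
                attrs[opt] = col[opt]
        ET.SubElement(table, "FIELD", attrs)
    return ET.tostring(votable, encoding="unicode")


# Hypothetical column descriptions
columns = [
    {"name": "ra", "datatype": "double", "unit": "deg", "ucd": "pos.eq.ra"},
    {"name": "dec", "datatype": "double", "unit": "deg", "ucd": "pos.eq.dec"},
    {"name": "source_id", "datatype": "long"},
]

# File-level key/value metadata as it might be handed to a Parquet writer
key_value_metadata = {"ivoa:VOTable": votable_metadata_xml(columns)}

# A consumer recovers the units by parsing the XML back out
parsed = ET.fromstring(key_value_metadata["ivoa:VOTable"])
units = {f.get("name"): f.get("unit") for f in parsed.iter("FIELD")}
print(units)
```

With pyarrow, a dictionary like this could probably be attached with `Table.replace_schema_metadata` before writing, keeping in mind that the call replaces any metadata already present (including a "pandas" entry) unless the existing entries are merged in first.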

Brian


On Mar 8, 2021, at 10:36 AM, Mark Taylor <m.b.taylor at bristol.ac.uk> wrote:

[Crossposted to Apps and DAL: I suggest followups to Apps]

Hi all.

Gregory Dubois-Felsmann was talking in the Apps/DAL/DM/Edu joint
session at the last interop
(https://wiki.ivoa.net/internal/IVOA/InterOpNov2020Apps/CatalogFiles-IVOA-20201119-v2.pdf)
about possible use in the VO of the Apache Parquet file format
(apparently in current/future use within LSST and IPAC),
and requested some discussion of its use within the Apps/DAL/DM
working groups.  I've also had interest in this format in relation
to TOPCAT from DPAC/Gaia.

So I have implemented prototype Parquet I/O handlers for STIL.
You can find a parquet-capable TOPCAT here:

  ftp://andromeda.star.bris.ac.uk/pub/star/topcat/pre/topcat-full_parquet.jar

This seems to work OK with the (very small number of) example
parquet files containing astronomy data that I've tried it with.
Unlike FITS, loading arbitrarily large files is not instant,
since the layout of parquet files means that the data has to be
decompressed before use, but some of the I/O is done in parallel,
so read speed isn't too bad on a multi-core machine (in my tests).
Currently one parquet file maps to one topcat table, but aggregating
multiple files into a single table could come in future.
Other features could be added too.

One thing this doesn't so far do is any kind of metadata persistence:
apart from column name and datatype, no metadata (e.g. units, UCDs)
is read or written.  There are places in the parquet file format
that such information could be stored (e.g. as JSON or VOTable XML),
but I haven't come across any standard way to organise such information.

Somewhat annoying is the size of the dependencies: the implementation
requires 15 MB of Apache support libraries (single-handedly doubling
the size of the distributed STILTS jar file).  So I'm a bit hesitant
to include this capability in the standard STIL/STILTS/TOPCAT
releases if it is only going to be of marginal interest.

This message is to gauge interest and request input:

  - is Parquet a file format that people anticipate using with
    applications like TOPCAT and STILTS, or with the STIL library?

  - does anybody have example astronomy Parquet files I could look at?

  - does anybody know of Parquet metadata storage conventions in use
    in astronomy?

  - if you try out the parquet-capable topcat linked above, does it work?

Thanks,

Mark

--
Mark Taylor  Astronomical Programmer  Physics, Bristol University, UK
m.b.taylor at bristol.ac.uk          http://www.star.bristol.ac.uk/~mbt/
