Parquet with STIL/STILTS/TOPCAT

Mark Taylor m.b.taylor at bristol.ac.uk
Mon Mar 8 19:36:39 CET 2021


[Crossposted to Apps and DAL: I suggest followups to Apps]

Hi all.

Gregory Dubois-Felsmann was talking in the Apps/DAL/DM/Edu joint
session at the last interop
(https://wiki.ivoa.net/internal/IVOA/InterOpNov2020Apps/CatalogFiles-IVOA-20201119-v2.pdf)
about possible use in the VO of the Apache Parquet file format
(apparently in current/future use within LSST and IPAC),
and requested some discussion of its use within the Apps/DAL/DM
working groups.  DPAC/Gaia have also expressed interest to me in
using this format with TOPCAT.

So I have implemented prototype Parquet I/O handlers for STIL.
You can find a parquet-capable TOPCAT here:

   ftp://andromeda.star.bris.ac.uk/pub/star/topcat/pre/topcat-full_parquet.jar

This seems to work OK with the (very small number of) example
parquet files containing astronomy data that I've tried it with.
Unlike with FITS, loading arbitrarily large files is not instant,
since the layout of parquet files means that the data has to be
decompressed before use, but some of the I/O is done in parallel,
so read speed isn't too bad on a multi-core machine (in my tests).
Currently one parquet file maps to one topcat table, but aggregating 
multiple files into a single table could come in future.
Other features could be added too.
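
To illustrate the point about decompression and parallel I/O, here is a
toy sketch in Python.  It is not the STIL implementation and does not
read real parquet files: it assumes each column is stored as one
independently compressed chunk, uses zlib as a stand-in for whatever
codec a file actually uses, and the function names are made up for
illustration.  Because the chunks are independent, they can be
decompressed concurrently.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_columns(columns):
    """Compress each column's bytes separately (stand-in for column chunks)."""
    return {name: zlib.compress(data) for name, data in columns.items()}


def read_columns_parallel(chunks, max_workers=4):
    """Decompress all chunks using a thread pool; returns name -> raw bytes."""
    names = list(chunks)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each chunk is independent, so the decompression calls can overlap.
        decoded = pool.map(zlib.decompress, (chunks[n] for n in names))
        return dict(zip(names, decoded))


# Hypothetical column data, just bytes for the purposes of the sketch.
columns = {
    "ra": bytes(range(256)) * 100,
    "dec": bytes(reversed(range(256))) * 100,
    "parallax": b"\x00\x01" * 12800,
}
chunks = compress_columns(columns)
table = read_columns_parallel(chunks)
```

Nothing can be used until the relevant chunks have been decompressed,
which is why the load time grows with file size, unlike a format that
can simply be mapped into memory.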

One thing this doesn't do so far is any kind of metadata persistence:
apart from column name and datatype, no metadata (e.g. units, UCDs)
is read or written.  There are places in the parquet file format 
that such information could be stored (e.g. as JSON or VOTable XML), 
but I haven't come across any standard way to organise such information.
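
As a rough sketch of the JSON option, the following Python shows how
per-column units and UCDs might be packed into a single string suitable
for a parquet file's key-value metadata (which holds string keys and
string values).  The key name, the JSON layout and the function names
are all assumptions made up for illustration; this is not an existing
convention, and it only demonstrates the encoding round trip, not the
parquet I/O itself.

```python
import json

# Hypothetical key for the file-level key-value metadata; not a standard.
METADATA_KEY = "astro.column_metadata"


def encode_column_metadata(columns):
    """Serialise per-column metadata to a (key, JSON string) pair."""
    document = {"version": 1, "columns": columns}
    return METADATA_KEY, json.dumps(document, sort_keys=True)


def decode_column_metadata(key_value_pairs):
    """Return the per-column metadata dict, or {} if the key is absent."""
    text = key_value_pairs.get(METADATA_KEY)
    if text is None:
        return {}
    return json.loads(text)["columns"]


# Example metadata for three hypothetical columns.
columns = {
    "ra": {"unit": "deg", "ucd": "pos.eq.ra;meta.main"},
    "dec": {"unit": "deg", "ucd": "pos.eq.dec;meta.main"},
    "parallax": {"unit": "mas", "ucd": "pos.parallax"},
}
key, value = encode_column_metadata(columns)
file_metadata = {key: value}          # what a writer would store
restored = decode_column_metadata(file_metadata)  # what a reader would recover
```

A reader that does not know the key can ignore it, so files written
this way would stay readable by other parquet tools; the open question
is agreeing on the key and the layout.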

Somewhat annoying is the size of the dependencies: the implementation
requires 15 MB of Apache support libraries (single-handedly doubling
the size of the distributed STILTS jar file).  So I'm a bit hesitant
to include this capability in the standard STIL/STILTS/TOPCAT
releases if it is only going to be of marginal interest.

This message is to gauge interest and request input:

   - is Parquet a file format that people anticipate using with
     applications like TOPCAT and STILTS, or with the STIL library?

   - does anybody have example astronomy Parquet files I could look at?

   - does anybody know of Parquet metadata storage conventions in use
     in astronomy?

   - if you try out the parquet-capable topcat linked above, does it work?

Thanks,

Mark

--
Mark Taylor  Astronomical Programmer  Physics, Bristol University, UK
m.b.taylor at bristol.ac.uk          http://www.star.bristol.ac.uk/~mbt/

