Parquet with STIL/STILTS/TOPCAT

Wed Mar 10 12:22:47 CET 2021

Hi Mark,

We are using Parquet files to provide the main catalog data for our 
Apache Spark science platform.

We currently have copies of the Gaia DR2 and eDR3 source catalogs, plus 
source catalogs from Pan-STARRS, 2MASS and WISE, stored as Parquet 
files.

Each source catalog maps to a directory of Parquet files.

     6513 Parquet files for the Gaia DR2 source catalog (total 473G bytes)
    11931 Parquet files for the Gaia source catalog (total 533G bytes)
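
A minimal sketch of how a per-catalog summary like the one above could be 
produced, using only the Python standard library. It relies on the Parquet 
framing convention that a file starts and ends with the 4-byte marker 
"PAR1", with a 4-byte little-endian footer length just before the trailing 
marker. The function names are hypothetical and chosen for illustration; 
this is not how the platform, STIL or TOPCAT do it, and it only inspects 
files directly inside the given directory.

```python
import os
import struct

MAGIC = b"PAR1"  # 4-byte marker at both the start and the end of a Parquet file


def parquet_footer_length(path):
    """Return the footer metadata length in bytes, or None if the file
    does not have the PAR1 framing of a Parquet file."""
    size = os.path.getsize(path)
    if size < 12:  # leading magic + 4-byte footer length + trailing magic
        return None
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)
    if head != MAGIC or tail[4:] != MAGIC:
        return None
    # The footer length is stored as an unsigned 32-bit little-endian integer.
    return struct.unpack("<I", tail[:4])[0]


def summarise_catalog(directory):
    """Count the Parquet files directly inside `directory` and total their sizes."""
    count = 0
    total_bytes = 0
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and parquet_footer_length(path) is not None:
            count += 1
            total_bytes += os.path.getsize(path)
    return count, total_bytes
```

For example, `summarise_catalog("/path/to/catalog")` (a placeholder path) 
returns a `(file_count, total_bytes)` pair for one catalog directory. 
Checking the marker rather than the file extension means stray non-Parquet 
files in the directory are not counted.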

We haven't looked at publishing data using Parquet yet, but this might 
be a valid solution for packaging and copying large data sets in the 
future.

Having said that, we would be very interested in using Parquet-enabled 
versions of TOPCAT and STIL internally to examine individual 
Parquet files or collections of multiple files.

Thanks,
-- Dave

--------
Dave Morris
Research Software Engineer
Wide Field Astronomy Unit
Institute for Astronomy
University of Edinburgh
--------

On 2021-03-08 18:36, Mark Taylor wrote:
> [Crossposted to Apps and DAL: I suggest followups to Apps]
> 
> Hi all.
> 
> Gregory Dubois-Felsmann was talking in the Apps/DAL/DM/Edu joint
> session at the last interop
> (https://wiki.ivoa.net/internal/IVOA/InterOpNov2020Apps/CatalogFiles-IVOA-20201119-v2.pdf)
> about possible use in the VO of the Apache Parquet file format
> (apparently in current/future use within LSST and IPAC),
> and requested some discussion of its use within the Apps/DAL/DM
> working groups.  I've also had interest in this format in relation
> to TOPCAT from DPAC/Gaia.
> 
> So I have implemented prototype Parquet I/O handlers for STIL.
> You can find a parquet-capable TOPCAT here:
> 
>    ftp://andromeda.star.bris.ac.uk/pub/star/topcat/pre/topcat-full_parquet.jar
> 
> This seems to work OK with the (very small number of) example
> parquet files containing astronomy data that I've tried it with.
> Unlike FITS, loading arbitrarily large files is not instant,
> since the layout of parquet files means that the data has to be
> decompressed before use, but some of the I/O is done in parallel,
> so read speed isn't too bad on a multi-core machine (in my tests).
> Currently one parquet file maps to one topcat table, but aggregating
> multiple files into a single table could come in future.
> Other features could be added too.
> 
> One thing this doesn't so far do is any kind of metadata persistence:
> apart from column name and datatype, no metadata (e.g. units, UCDs)
> is read or written.  There are places in the parquet file format
> that such information could be stored (e.g. as JSON or VOTable XML),
> but I haven't come across any standard way to organise such information.
> 
> Somewhat annoying is the size of the dependencies: the implementation
> requires 15 MB of Apache support libraries (single-handedly doubling
> the size of the distributed STILTS jar file).  So I'm a bit hesitant
> to include this capability in the standard STIL/STILTS/TOPCAT
> releases if it is only going to be of marginal interest.
> 
> This message is to gauge interest and request input:
> 
>    - is Parquet a file format that people anticipate using with
>      applications like TOPCAT and STILTS, or with the STIL library?
> 
>    - does anybody have example astronomy Parquet files I could look at?
> 
>    - does anybody know of Parquet metadata storage conventions in use
>      in astronomy?
> 
>    - if you try out the parquet-capable topcat linked above, does it work?
> 
> Thanks,
> 
> Mark
> 
> --
> Mark Taylor  Astronomical Programmer  Physics, Bristol University, UK
> m.b.taylor at bristol.ac.uk          http://www.star.bristol.ac.uk/~mbt/