Parquet with STIL/STILTS/TOPCAT

Mark Taylor m.b.taylor at bristol.ac.uk
Thu Mar 11 19:07:46 CET 2021


David and Dave,

thanks for this feedback; I'm glad it works on most of the files
that you've tried it on.  I had a go at the GPIPS one too, and it seems
fine, though I note that this file has only a single row block,
which limits the possibility for parallel processing.
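
To illustrate why the number of row blocks matters, here is a minimal
sketch in Python.  It is not how STIL does it: the data are stand-in
lists rather than a real parquet file, and the function names are
hypothetical.  Each row block is handled as one independent task and
the partial results are merged, so a file with one row block yields
only one task however many cores are available.  (Threads are used
here only to show the task structure.)

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_pixels(row_block):
    """Count rows per sky-pixel index within one row block (stand-in data)."""
    return Counter(row_block)


def parallel_pixel_counts(row_blocks, max_workers=4):
    """Process each row block as a separate task and merge the partial counts.

    Returns the merged counts and the number of tasks that were available,
    which is simply the number of row blocks.
    """
    total = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for partial in pool.map(count_pixels, row_blocks):
            total.update(partial)  # Counter.update adds the counts together
    return total, len(row_blocks)


# Hypothetical input: each inner list is the pixel-index column of one row block.
single_block = [[0, 1, 1, 2, 2, 2]]
three_blocks = [[0, 1], [1, 2], [2, 2]]

counts_one, tasks_one = parallel_pixel_counts(single_block)
counts_three, tasks_three = parallel_pixel_counts(three_blocks)
```

Both layouts give the same merged counts, but only the three-block
layout offers more than one unit of work to share between workers.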

You're right that this version does not cope with
multi-dimensional-array-valued columns.
The model that parquet uses for that kind of data doesn't fit very
well with what STIL does, so it's a bit tricky for me to support.
For now I'm filing that issue under "wait and see whether anybody
cares enough to complain".

I think I could fairly easily add something to work with a collection
(e.g. a directory full) of mutually compatible parquet files.
The question is really whether people are going to want to do that.
I suppose the main platform for working with a directory full of
parquet files would be something like Spark as Dave M mentions.
Do you think there would be interest in using stilts to work with them?
It ought to be able to do things like calculate all-sky HEALPix maps
quite fast in parallel on multiple cores.  Attempting to run topcat
on a Gaia-scale dataset is *probably* a non-starter.  It could be done,
but it wouldn't be fun.

Anyway, I remain interested in discussion and requirements on this
topic if anybody wants to talk about it.

Mark


On Wed, 10 Mar 2021, Shupe, David L. wrote:

> Mark —
> 
> I am very happy to see this Parquet-capable prototype! Yes, it is an interesting capability for TOPCAT and STILTS.
> 
> This TOPCAT prototype worked successfully for me on a catalog for the Galactic Plane Infrared Polarization Survey, served by IRSA from the directory https://irsa.ipac.caltech.edu/data/GPIPS/catalog where there are also IPAC table and VOTable formats. (See https://irsa.ipac.caltech.edu/data/GPIPS/overview.html for all the context for these data.)
> 
> It also works on a number of other Parquet files I am working with, though not, as you noted, with a Parquet dataset that is a directory containing several Parquet files. It also did not work on some Parquet files with multidimensional columns. I could make these available to you if it would be helpful.
> 
> So far in our work at IRSA, we have not tried storing additional metadata. I think Brian Van Klaveren’s followup message in this thread has the latest thinking on that.
> 
> -David
> David L. Shupe, PhD (he/him/Dave)
> Scientist, NASA/IPAC Infrared Science Archive
> Lead, Zwicky Transient Facility
> 
> Caltech/IPAC
> Mail Code MR-100, Pasadena CA 91125
> 
> On Mar 8, 2021, at 10:36 AM, Mark Taylor <m.b.taylor at bristol.ac.uk> wrote:
> 
> This message is to gauge interest and request input:
> 
>   - is Parquet a file format that people anticipate using with
>     applications like TOPCAT and STILTS, or with the STIL library?
> 
>   - does anybody have example astronomy Parquet files I could look at?
> 
>   - does anybody know of Parquet metadata storage conventions in use
>     in astronomy?
> 
>   - if you try out the parquet-capable topcat linked above, does it work?
> 
> 

--
Mark Taylor  Astronomical Programmer  Physics, Bristol University, UK
m.b.taylor at bristol.ac.uk          http://www.star.bristol.ac.uk/~mbt/

