Parquet with STIL/STILTS/TOPCAT
Mark Taylor
m.b.taylor at bristol.ac.uk
Thu Mar 11 19:07:46 CET 2021
David and Dave,
thanks for this feedback. I'm glad it works on most of the files
that you've tried it on. I had a go at the GPIPS one too, and it seems
fine, though I note that this file has only a single row block,
which limits the possibility for parallel processing.
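
As a rough illustration of why a single row block matters, here is a
minimal sketch in Python. It assumes the row block is the smallest unit
of work that can be handed to a worker; the function names and the
round-robin scheme are hypothetical and are not taken from STIL.

```python
from typing import List


def assign_row_blocks(n_row_blocks: int, n_workers: int) -> List[List[int]]:
    """Deal row-block indices out to workers round-robin (hypothetical scheme)."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    plan: List[List[int]] = [[] for _ in range(n_workers)]
    for block in range(n_row_blocks):
        plan[block % n_workers].append(block)
    return plan


def busy_workers(plan: List[List[int]]) -> int:
    """Number of workers that were given at least one row block."""
    return sum(1 for blocks in plan if blocks)


# One row block: only one of eight workers has anything to do.
single = busy_workers(assign_row_blocks(1, 8))
# Sixteen row blocks: all eight workers can be kept busy.
many = busy_workers(assign_row_blocks(16, 8))
```

Under that assumption, a file written with one row block gives no
speed-up from extra cores, however large the file is.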
You're right that this version does not cope with
multi-dimensional-array-valued columns.
The model that parquet uses for that kind of data doesn't fit very
well with what STIL does, so it's a bit tricky for me to support.
For now I'm filing that issue under "wait and see whether anybody
cares enough to complain".
I think I could fairly easily add something to work with a collection
(e.g. a directory full) of mutually compatible parquet files.
The question is really whether people are going to want to do that.
I suppose the main platform for working with a directory full of
parquet files would be something like Spark as Dave M mentions.
Do you think there would be interest in using STILTS to work with them?
It ought to be able to do things like calculate all-sky HEALPix maps
quite fast in parallel on multiple cores. Attempting to run TOPCAT
on a Gaia-scale dataset is *probably* a non-starter. It could be done,
but it wouldn't be fun.
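
The kind of parallel sky-map calculation mentioned above can be
sketched as a count-then-merge over independent blocks of rows. This is
only an illustration in Python, not STILTS code: it assumes each row
already has an integer sky pixel index (for example from a HEALPix
library), and the names count_block and sky_counts are hypothetical.
Threads are used here only to show the structure; real CPU-bound work
in Python would more likely use separate processes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List


def count_block(pixel_indices: Iterable[int]) -> Counter:
    """Count rows per sky pixel within one block of rows."""
    return Counter(pixel_indices)


def sky_counts(blocks: List[List[int]], n_workers: int = 4) -> Counter:
    """Count each block independently, then merge the partial counts."""
    total: Counter = Counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for partial in pool.map(count_block, blocks):
            total.update(partial)  # adds the per-pixel counts together
    return total


# Three blocks of precomputed pixel indices (made-up values).
blocks = [[0, 1, 1], [1, 2], [0, 0, 3]]
counts = sky_counts(blocks)
```

Because each block is counted without reference to the others, the
same pattern would apply whether the blocks are row blocks in one file
or separate files in a directory.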
Anyway, I remain interested in discussion and requirements on this
topic if anybody wants to talk about it.
Mark
On Wed, 10 Mar 2021, Shupe, David L. wrote:
> Mark —
>
> I am very happy to see this Parquet-capable prototype! Yes, it is an interesting capability for TOPCAT and STILTS.
>
> This TOPCAT prototype worked successfully for me on a catalog for the Galactic Plane Infrared Polarization Survey, served by IRSA from the directory https://irsa.ipac.caltech.edu/data/GPIPS/catalog where there are also IPAC table and VOTable formats. (See https://irsa.ipac.caltech.edu/data/GPIPS/overview.html for all the context for these data.)
>
> It also works on a number of other Parquet files I am working with, though not, as you noted, with a Parquet dataset that is a directory containing several Parquet files. It also did not work on some Parquet files with multidimensional columns. I could make these available to you if it would be helpful.
>
> So far in our work at IRSA, we have not tried storing additional metadata. I think Brian Van Klaveren’s followup message in this thread has the latest thinking on that.
>
> -David
> David L. Shupe, PhD (he/him/Dave)
> Scientist, NASA/IPAC Infrared Science Archive
> Lead, Zwicky Transient Facility
>
> Caltech/IPAC
> Mail Code MR-100, Pasadena CA 91125
>
> On Mar 8, 2021, at 10:36 AM, Mark Taylor <m.b.taylor at bristol.ac.uk> wrote:
>
> This message is to gauge interest and request input:
>
> - is Parquet a file format that people anticipate using with
> applications like TOPCAT and STILTS, or with the STIL library?
>
> - does anybody have example astronomy Parquet files I could look at?
>
> - does anybody know of Parquet metadata storage conventions in use
> in astronomy?
>
> - if you try out the parquet-capable topcat linked above, does it work?
>
>
--
Mark Taylor Astronomical Programmer Physics, Bristol University, UK
m.b.taylor at bristol.ac.uk http://www.star.bristol.ac.uk/~mbt/