Parquet with STIL/STILTS/TOPCAT
Mark Taylor
m.b.taylor at bristol.ac.uk
Tue Mar 9 10:41:54 CET 2021
Brian,
thanks for this lot. I hadn't spotted the column-level key-value store
(ColumnMetaData.key_value_metadata), but I see this is at the level of
ColumnChunk rather than SchemaElement, so it's not really in the
place you'd expect for storing column metadata like units and UCDs -
maybe that's why it's not much supported in the libraries.
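To illustrate why that placement is awkward, here is a simplified Python model of the structures involved. The field names schema, row_groups, columns, meta_data, path_in_schema and key_value_metadata follow parquet.thrift, but the classes are cut down to a handful of fields, the key-value lists are modelled as plain dicts, and the "unit" key and the helpers make_chunk and units_per_chunk are hypothetical. It is only a sketch of the layout, not a reader for real files.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SchemaElement:
    name: str  # describes the column itself; has no key-value slot


@dataclass
class ColumnMetaData:
    path_in_schema: List[str]
    # Simplification: parquet.thrift uses a list of KeyValue structs.
    key_value_metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class ColumnChunk:
    meta_data: ColumnMetaData  # one chunk per column per row group


@dataclass
class RowGroup:
    columns: List[ColumnChunk]


@dataclass
class FileMetaData:
    schema: List[SchemaElement]
    row_groups: List[RowGroup]
    key_value_metadata: Dict[str, str] = field(default_factory=dict)


def make_chunk(name: str, unit: str) -> ColumnChunk:
    """Hypothetical helper: build a chunk carrying a 'unit' entry."""
    return ColumnChunk(ColumnMetaData([name], {"unit": unit}))


def units_per_chunk(meta: FileMetaData, column: str) -> List[str]:
    """Collect the 'unit' entry from every chunk of the named column."""
    return [
        chunk.meta_data.key_value_metadata.get("unit")
        for group in meta.row_groups
        for chunk in group.columns
        if chunk.meta_data.path_in_schema == [column]
    ]


# Three row groups, each holding one chunk for "ra" and one for "dec".
file_meta = FileMetaData(
    schema=[SchemaElement("ra"), SchemaElement("dec")],
    row_groups=[
        RowGroup([make_chunk("ra", "deg"), make_chunk("dec", "deg")])
        for _ in range(3)
    ],
)
```

In this model units_per_chunk(file_meta, "ra") yields one entry per row group, so a per-column item such as a unit would have to be written, and kept consistent, once for every row group in the file.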
I've seen what pandas does; in principle I could write something
compatible with or extending that, but it doesn't feel like good
practice to trespass on their namespace.
If there is no existing convention on column metadata that I can adopt,
then the way forward is probably either your suggestion of sticking
a VOTable header in the file-level metadata, or writing something
more generic in JSON - the advantage of the latter is it could be
understood, and if required extended, by software that doesn't
know or doesn't care about VOTable.
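As a rough sketch of the generic JSON option, the column metadata could be packed into a single string under one file-level key. Everything specific here is an assumption made for illustration: the key name "example:column_metadata", the "version" and "columns" fields, and the function names are hypothetical and do not correspond to any existing convention. Only the Python standard library is used; attaching the resulting key-value pair to a real parquet file is left to whichever parquet library is in use.

```python
import json
from typing import Dict

# Hypothetical key name, chosen so as not to trespass on the "pandas" key.
METADATA_KEY = "example:column_metadata"


def encode_column_metadata(columns: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Return a one-entry dict of file-level key/value strings."""
    payload = {"version": 1, "columns": columns}
    return {METADATA_KEY: json.dumps(payload, sort_keys=True)}


def decode_column_metadata(file_kv: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    """Recover per-column metadata; return {} if the key is absent."""
    raw = file_kv.get(METADATA_KEY)
    if raw is None:
        return {}
    return json.loads(raw).get("columns", {})


columns = {
    "ra": {"unit": "deg", "ucd": "pos.eq.ra"},
    "dec": {"unit": "deg", "ucd": "pos.eq.dec"},
}
file_kv = encode_column_metadata(columns)
file_kv["pandas"] = "{}"  # keys written by other software are left untouched
restored = decode_column_metadata(file_kv)
```

A reader that knows nothing about VOTable can still parse this, and unknown per-column fields can be added later without breaking older readers that simply ignore them.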
I'm willing to do either of those (or somebody's better idea)
*if* it looks like parquet is a format that TOPCAT/STIL users are
going to want to use.
Mark
On Mon, 8 Mar 2021, Van Klaveren, Brian N. wrote:
> Hi,
>
> I don't have any good recommendations, but I'll just regurgitate a few things I know.
>
> Parquet files have KeyValue metadata lists at the file level and the column level. When I did a source-level evaluation a few years ago, support for column-level metadata was generally poor in higher-level libraries, despite it being in the spec (which I think is most readable via the parquet.thrift file in the parquet repo). I'm not sure whether that was a consideration, but the pandas library serializes certain metadata as a JSON string in the file-level metadata, under a single "pandas" key.
>
> You may see such use here, in the fastparquet engine, when writing out a pandas data frame:
> https://github.com/dask/fastparquet/blob/efd3fd19a9f0dcf91045c31ff4dbb7cc3ec504f2/fastparquet/writer.py#L738-L774
>
> I would note that it was also the case that pandas dataframes don't usually have a notion of user-metadata which is stored in that way, and for the core library, you don't really have a hook as far as I know to easily get that into there. So, I suppose if a service was writing out pandas files via this interface, it would be their responsibility to update the files after with appropriate metadata (that's more of a note for implementors creating files, so not relevant to consumers). In any case I believe GeoPandas was trying to solve that problem at one point at the DataFrame (table) level, but I'm not aware how that worked out. I had hoped that pandas or other arrow-based tabular projects might be able to store metadata in a column in a way that would persist across certain operations (splitting/joining) on a table, but I don't think any library takes that into account right now.
>
> With all that said, I'm not aware of specific applications in astronomy, since I'm not aware of many applications beyond pandas itself that actually use that feature, but I also haven't looked much in the last year or so. I would expect the default solution to be serializing VOTable metadata to XML, probably under an `ivoa:VOTable` key or something similar, in the parquet table, and not attempting anything with column-level metadata.
>
> Brian
>
>
> On Mar 8, 2021, at 10:36 AM, Mark Taylor <m.b.taylor at bristol.ac.uk> wrote:
>
> [Crossposted to Apps and DAL: I suggest followups to Apps]
>
> Hi all.
>
> Gregory Dubois-Felsmann was talking in the Apps/DAL/DM/Edu joint
> session at the last interop
> (https://wiki.ivoa.net/internal/IVOA/InterOpNov2020Apps/CatalogFiles-IVOA-20201119-v2.pdf)
> about possible use in the VO of the Apache Parquet file format
> (apparently in current/future use within LSST and IPAC),
> and requested some discussion of its use within the Apps/DAL/DM
> working groups. I've also had interest in this format in relation
> to TOPCAT from DPAC/Gaia.
>
> So I have implemented prototype Parquet I/O handlers for STIL.
> You can find a parquet-capable TOPCAT here:
>
> ftp://andromeda.star.bris.ac.uk/pub/star/topcat/pre/topcat-full_parquet.jar
>
> This seems to work OK with the (very small number of) example
> parquet files containing astronomy data that I've tried it with.
> Unlike with FITS, loading arbitrarily large files is not instant,
> since the layout of parquet files means that the data has to be
> decompressed before use, but some of the I/O is done in parallel,
> so read speed isn't too bad on a multi-core machine (in my tests).
> Currently one parquet file maps to one topcat table, but aggregating
> multiple files into a single table could come in future.
> Other features could be added too.
>
> One thing this doesn't so far do is any kind of metadata persistence:
> apart from column name and datatype, no metadata (e.g. units, UCDs)
> is read or written. There are places in the parquet file format
> that such information could be stored (e.g. as JSON or VOTable XML),
> but I haven't come across any standard way to organise such information.
>
> Somewhat annoying is the size of the dependencies: the implementation
> requires 15 MB of Apache support libraries (single-handedly doubling
> the size of the distributed STILTS jar file). So I'm a bit hesitant
> to include this capability in the standard STIL/STILTS/TOPCAT
> releases if it is only going to be of marginal interest.
>
> This message is to gauge interest and request input:
>
> - is Parquet a file format that people anticipate using with
> applications like TOPCAT and STILTS, or with the STIL library?
>
> - does anybody have example astronomy Parquet files I could look at?
>
> - does anybody know of Parquet metadata storage conventions in use
> in astronomy?
>
> - if you try out the parquet-capable topcat linked above, does it work?
>
> Thanks,
>
> Mark
>
> --
> Mark Taylor Astronomical Programmer Physics, Bristol University, UK
> m.b.taylor at bristol.ac.uk http://www.star.bristol.ac.uk/~mbt/
>
>
--
Mark Taylor Astronomical Programmer Physics, Bristol University, UK
m.b.taylor at bristol.ac.uk http://www.star.bristol.ac.uk/~mbt/