<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
</head>
<body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">
Hi,
<div class=""><br class="">
</div>
<div class="">I don't have any good recommendations, but I'll just regurgitate a few things I know.</div>
<div class=""><br class="">
</div>
<div class="">Parquet files have KeyValue metadata lists at both the file level and the column level. When I did a source-level evaluation a few years ago, support for column-level metadata was generally poor in higher-level libraries, despite it being
in the spec (which I think is fairly readable if you consult the parquet.thrift file in the parquet repo). I'm not sure whether that was a consideration, but the pandas library serializes its own metadata as a JSON string and stores it
in the file-level metadata under a single "pandas" key.</div>
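<div class=""><br class="">
</div>
<div class="">As a rough sketch of what that looks like to a consumer, the snippet below decodes such a "pandas" blob from a file-level key-value mapping. The sample mapping is hand-written for illustration (it is not taken from a real file), the helper name decode_pandas_metadata is hypothetical, and the field names inside the blob follow the pandas metadata layout as I understand it. With pyarrow, I believe pyarrow.parquet.read_schema(path).metadata returns a bytes-to-bytes mapping of this shape.</div>
<pre class="">
```python
import json

# Hand-written stand-in for the file-level key-value metadata of a Parquet
# file written by pandas. Keys and values are bytes. Illustrative only.
sample_key_value_metadata = {
    b"pandas": json.dumps({
        "index_columns": [],
        "columns": [
            {"name": "ra", "field_name": "ra", "pandas_type": "float64",
             "numpy_type": "float64", "metadata": None},
            {"name": "dec", "field_name": "dec", "pandas_type": "float64",
             "numpy_type": "float64", "metadata": None},
        ],
        "pandas_version": "1.2.3",
    }).encode("utf-8"),
}


def decode_pandas_metadata(key_value_metadata):
    """Return the decoded "pandas" JSON blob, or None if the key is absent."""
    raw = key_value_metadata.get(b"pandas")
    if raw is None:
        return None
    return json.loads(raw.decode("utf-8"))


pandas_meta = decode_pandas_metadata(sample_key_value_metadata)
# Map each column name to the type pandas recorded for it.
column_types = {col["name"]: col["pandas_type"] for col in pandas_meta["columns"]}
print(column_types)
```
</pre>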
<div class=""><br class="">
</div>
<div class="">You can see an example of this in the fastparquet engine, where it writes out a pandas data frame:</div>
<div class="">
<div style="margin: 0px; font-stretch: normal; font-size: 13px; line-height: normal; font-family: "Helvetica Neue"; color: rgba(0, 0, 0, 0.85);" class="">
<a href="https://github.com/dask/fastparquet/blob/efd3fd19a9f0dcf91045c31ff4dbb7cc3ec504f2/fastparquet/writer.py#L738-L774" class="">https://github.com/dask/fastparquet/blob/efd3fd19a9f0dcf91045c31ff4dbb7cc3ec504f2/fastparquet/writer.py#L738-L774</a></div>
<div style="margin: 0px; font-stretch: normal; font-size: 13px; line-height: normal; font-family: "Helvetica Neue"; color: rgba(0, 0, 0, 0.85);" class="">
<br class="">
</div>
<div style="margin: 0px; font-stretch: normal; font-size: 13px; line-height: normal; font-family: "Helvetica Neue"; color: rgba(0, 0, 0, 0.85);" class="">
I would also note that pandas DataFrames don't usually have a notion of user metadata that gets stored in that way, and in the core library there is, as far as I know, no hook to easily get it in there. So I suppose that if a
service were writing out pandas files via this interface, it would be its responsibility to update the files afterwards with the appropriate metadata (that's more of a note for implementors creating files, so it's not relevant to consumers). In any case, I believe GeoPandas
was trying to solve that problem at one point at the DataFrame (table) level, but I don't know how that worked out. I had hoped that pandas or other Arrow-based tabular projects might be able to store metadata on a column in a way that would persist across
certain operations (splitting/joining) on a table, but I don't think any library takes that into account right now.</div>
<div style="margin: 0px; font-stretch: normal; font-size: 13px; line-height: normal; font-family: "Helvetica Neue"; color: rgba(0, 0, 0, 0.85);" class="">
<br class="">
</div>
<div style="margin: 0px; font-stretch: normal; font-size: 13px; line-height: normal; font-family: "Helvetica Neue"; color: rgba(0, 0, 0, 0.85);" class="">
With all that said, I'm not aware of specific conventions in astronomy, because I'm not aware of many applications beyond pandas itself that actually use this feature, though I also haven't looked much in the last year or so. I would expect the default
solution to be serializing the VOTable metadata to XML and storing it in the file-level metadata of the Parquet file, probably under an `ivoa:VOTable` key or something similar, without attempting anything with column-level metadata.</div>
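<div class=""><br class="">
</div>
<div class="">To make that concrete, here is a minimal sketch using only the Python standard library. It builds a data-less VOTable document that carries just the FIELD metadata and places it under the key guessed at above. The column descriptions, the helper name votable_header and the key itself are assumptions for illustration, and the XML namespace declaration a real VOTable would carry is left out for brevity.</div>
<pre class="">
```python
import xml.etree.ElementTree as ET

# Hypothetical column descriptions; names, units and UCDs are illustrative.
columns = [
    {"name": "ra", "datatype": "double", "unit": "deg", "ucd": "pos.eq.ra"},
    {"name": "dec", "datatype": "double", "unit": "deg", "ucd": "pos.eq.dec"},
]


def votable_header(columns):
    """Build a data-less VOTable document holding only FIELD metadata."""
    votable = ET.Element("VOTABLE", version="1.4")
    resource = ET.SubElement(votable, "RESOURCE")
    table = ET.SubElement(resource, "TABLE")
    for col in columns:
        # Each dict entry becomes an attribute of the FIELD element.
        ET.SubElement(table, "FIELD", attrib=col)
    return ET.tostring(votable, encoding="unicode")


# File-level key-value metadata as it might be handed to a Parquet writer.
# The key name is a guess, not an agreed convention.
key_value_metadata = {"ivoa:VOTable": votable_header(columns)}

# Reading it back: parse the XML and recover the per-column units.
parsed = ET.fromstring(key_value_metadata["ivoa:VOTable"])
units = {field.get("name"): field.get("unit") for field in parsed.iter("FIELD")}
print(units)
```
</pre>
<div class="">With pyarrow, passing a mapping like this to Table.replace_schema_metadata before writing should be one way to attach it, though note that this call replaces any existing schema metadata (including a "pandas" key) rather than merging into it.</div>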
<div style="margin: 0px; font-stretch: normal; font-size: 13px; line-height: normal; font-family: "Helvetica Neue"; color: rgba(0, 0, 0, 0.85);" class="">
<br class="">
</div>
<div style="margin: 0px; font-stretch: normal; font-size: 13px; line-height: normal; font-family: "Helvetica Neue"; color: rgba(0, 0, 0, 0.85);" class="">
Brian</div>
<div style="margin: 0px; font-stretch: normal; font-size: 13px; line-height: normal; font-family: "Helvetica Neue"; color: rgba(0, 0, 0, 0.85);" class="">
<br class="">
</div>
<div><br class="">
<blockquote type="cite" class="">
<div class="">On Mar 8, 2021, at 10:36 AM, Mark Taylor &lt;<a href="mailto:m.b.taylor@bristol.ac.uk" class="">m.b.taylor@bristol.ac.uk</a>&gt; wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<div class="">[Crossposted to Apps and DAL: I suggest followups to Apps]<br class="">
<br class="">
Hi all.<br class="">
<br class="">
Gregory Dubois-Felsmann was talking in the Apps/DAL/DM/Edu joint<br class="">
session at the last interop<br class="">
(<a href="https://wiki.ivoa.net/internal/IVOA/InterOpNov2020Apps/CatalogFiles-IVOA-20201119-v2.pdf" class="">https://wiki.ivoa.net/internal/IVOA/InterOpNov2020Apps/CatalogFiles-IVOA-20201119-v2.pdf</a>)<br class="">
about possible use in the VO of the Apache Parquet file format<br class="">
(apparently in current/future use within LSST and IPAC),<br class="">
and requested some discussion of its use within the Apps/DAL/DM<br class="">
working groups. I've also had interest in this format in relation<br class="">
to TOPCAT from DPAC/Gaia.<br class="">
<br class="">
So I have implemented prototype Parquet I/O handlers for STIL.<br class="">
You can find a parquet-capable TOPCAT here:<br class="">
<br class="">
<a href="ftp://andromeda.star.bris.ac.uk/pub/star/topcat/pre/topcat-full_parquet.jar" class="">ftp://andromeda.star.bris.ac.uk/pub/star/topcat/pre/topcat-full_parquet.jar</a><br class="">
<br class="">
This seems to work OK with the (very small number of) example <br class="">
parquet files containing astronomy data that I've tried it with. <br class="">
Unlike FITS, loading arbitrarily large files is not instant, <br class="">
since the layout of parquet files means that the data has to be <br class="">
decompressed before use, but some of the I/O is done in parallel,<br class="">
so read speed isn't too bad on a multi-core machine (in my tests). <br class="">
Currently one parquet file maps to one topcat table, but aggregating <br class="">
multiple files into a single table could come in future.<br class="">
Other features could be added too.<br class="">
<br class="">
One thing this doesn't so far do is any kind of metadata persistence:<br class="">
apart from column name and datatype, no metadata (e.g. units, UCDs)<br class="">
is read or written. There are places in the parquet file format <br class="">
that such information could be stored (e.g. as JSON or VOTable XML), <br class="">
but I haven't come across any standard way to organise such information.<br class="">
<br class="">
Somewhat annoying is the size of the dependencies: the implementation<br class="">
requires 15 Mb of Apache support libraries (single-handedly doubling<br class="">
the size of the distributed STILTS jar file). So I'm a bit hesitant<br class="">
to include this capability in the standard STIL/STILTS/TOPCAT<br class="">
releases if they are only going to be of marginal interest.<br class="">
<br class="">
This message is to gauge interest and request input:<br class="">
<br class="">
- is Parquet a file format that people anticipate using with<br class="">
applications like TOPCAT and STILTS, or with the STIL library?<br class="">
<br class="">
- does anybody have example astronomy Parquet files I could look at?<br class="">
<br class="">
- does anybody know of Parquet metadata storage conventions in use<br class="">
in astronomy?<br class="">
<br class="">
- if you try out the parquet-capable topcat linked above, does it work?<br class="">
<br class="">
Thanks,<br class="">
<br class="">
Mark<br class="">
<br class="">
--<br class="">
Mark Taylor Astronomical Programmer Physics, Bristol University, UK<br class="">
<a href="mailto:m.b.taylor@bristol.ac.uk" class="">m.b.taylor@bristol.ac.uk</a> <a href="http://www.star.bristol.ac.uk/~mbt/" class="">http://www.star.bristol.ac.uk/~mbt/</a><br class="">
</div>
</div>
</blockquote>
</div>
<br class="">
</div>
</body>
</html>