SQLite + TAP_SCHEMA or similar for self-describing SQLite files

Fri Sep 22 00:30:03 CEST 2017

Hi DAL,

First off, I had a hard time deciding which list to send this to, but ultimately just chose DAL.

One of the things we're working on for LSST is SQL support for output from our Table APIs in our stack (afw, github.com/lsst/afw), which is sort of similar to astropy.table or pandas DataFrame, for example (afw.table being based on NDArray).

I've ported some pandas' read_sql, to_sql to our system (which is based on SQLAlchemy), and that allows arbitrary table (de)serialization to arbitrary database backends, but pandas doesn't care about metadata, so we need to handle that ourselves for round-tripping to work properly.  We expect our users may be dynamically creating new table objects and persisting them to the database in our scientific platform (or externally to their own databases), so that's why storing the metadata is even more important. Our project will use multiple databases, including SQLite, Oracle, MySQL, and Postgres, depending on the context, so our solutions must take all of those into account.

Self-describing
===============

For the metadata describing the tables, I'm hoping to use a variant of the TAP_SCHEMA supporting tables and/or the RegTAP tables (e.g. res_table, table_column), sticking as close as possible to those tables. We are also planning on also using SQLite files as a portable data format, which doesn't have the notion of schemas (well it does, but they are separate files), so we need to keep those tables within the same schema as the tables we are writing, so actually having a separate "TAP_SCHEMA" or "rr" schema doesn't work, because you'd need two files.

Initially, we thought about creating corresponding metadata tables for every table with some namespace. For example, if you had a table named simply "result", you might have "result$res_table" (with one row), "result$table_column". That's okay, but potentially leads to a lot of tables and some some other issues around it, namely that you end up doing quite a bit of string interpolation since table names can't be parameterized, plus INFORMATION_SCHEMA isn't really standardized across databases, so we can't quite rely on that either for easy table discovery.

So we were thinking about having one table per schema. In that scenario, we need to pick a few good names, which hopefully never collide with the names people might otherwise choose. An easy option might be to just use "$" or "_" for the namespacing, e.g. "meta$tables" or "rr_res_table". Or we could just create new names, e.g. "meta_table", "rs_column".

There's a few other issues:
1. If these SQLite tables were an optional output of a TAP endpoint, I think we'd like some way of denoting which table is the primary result table. I was thinking we'd allow arbitrarily-named tables in out SQLite files, and the implementation would need to add a ";meta.main" to a table's UCD in the metadata table corresponding. Then, when reading in a file, a client could determine quickly which table is the default table for that file from a simple query.

2. Representing column groupings in a table. Something similar to gavo's TAP_SCHEMA.groups I think would be fine.

3. Other TAP/VOTable-specific features (params, fields, info, etc...). Do also we need to store information about indexes, keys, constraints, or would UCDs work fine for most of this?

4. Versioning - Since it's not an XSD, we would at least need one extra table to denote the version of the metadata tables we've included.

To be clear, I'm just trying to kick around a few ideas how we might implement this, so that semantics would be close to that of VOTable and TAP, in case others were interested in standardizing on this in the future as an alternative output format, especially from a TAP endpoint. There's obviously quite a bit of overlap with TAP, VOTable, and potentially parts of RegTAP, so I thought I'd be interested in hearing people's thoughts.

Thanks,
Brian