Afw.table SQL persistence and round-tripping

Proposed date: TBD
Connection: TBD
Suggested audience: Anyone interested in pipelines, data formats, afw.table, and interop with astropy.



Background

An afw.table can be indirectly persisted to a relational database by first converting it to an astropy table, then converting once more to a pandas.DataFrame, and finally using the pandas.DataFrame.to_sql method; reading back via read_sql is possible in principle, but as far as I know the reverse trip doesn't fully work.
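
For reference, a minimal sketch of that indirect path, assuming an existing afw SourceCatalog named `catalog` and a local SQLite file (names here are just illustrative):

```python
import sqlite3

import pandas as pd

# Assumes `catalog` is an existing lsst.afw.table SourceCatalog.
conn = sqlite3.connect("sources.sqlite3")

# afw.table -> astropy -> pandas -> SQL; units and docs are lost
# at the to_pandas() step.
df = catalog.asAstropy().to_pandas()
df.to_sql("Source", conn, if_exists="replace", index=False)

# Reading back yields a DataFrame, not an afw.table; reconstructing
# a catalog (schema, units, docs) from it is the part that doesn't
# fully work today.
df2 = pd.read_sql("SELECT * FROM Source", conn)
```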

Preliminary Proposals

  1. Support readSql and writeSql in afw.table (a sketch of what this API might look like follows this list)
  2. Design a portable way, or convention, of representing table and column metadata. Possibly add a createDdl API method with similar options that would output annotated CREATE TABLE statements.
  • This may also trigger an RFD about metadata use inside afw.table.
  3. Fully support round trips
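
The following hypothetical sketch shows what proposals 1 and 2 might look like in practice. None of these methods exist yet; the names simply mirror the existing readFits/writeFits pattern:

```python
# Hypothetical API -- mirrors the readFits/writeFits pattern.
catalog.writeSql("Source", connection)                 # CREATE TABLE + INSERTs
catalog = SourceCatalog.readSql("Source", connection)  # SELECT -> catalog

# createDdl would emit an annotated CREATE TABLE statement without
# touching a database, e.g. for inspection or offline loading.
ddl = catalog.schema.createDdl("Source", dialect="mysql")
```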

Motivation

The primary motivation of this RFD is to enable rich format and metadata support in dbserv, as well as to improve interoperability between afw.table and Qserv. The secondary motivation is to support SQLite as an official file format, which might be a data product generated by dbserv.

The ability to persist tables to and from a database, especially SQLite and MySQL (Qserv), is obviously valuable. While it's currently possible to do this with pandas, it's not possible to persist additional metadata, like field (column) units, doc/comments, or table metadata. Part of this is because pandas makes no attempt at supporting such metadata, due to the difficulty of determining propagation behavior across table operations like projections and joins. The other reason is that there's no good, portable way of supporting this across different databases (more on this later). Other formats, like FITS and VOTable, do have first-class support for this sort of metadata, with VOTable appearing the most thorough.

Implementation

Supporting readSql and writeSql should be straightforward, as we can probably port most of the code from pandas directly into afw.table with slight modifications.

Internally, pandas leverages SQLAlchemy and some type mapping to dynamically create a SQLAlchemy model, which can be used to generate SQL DDL (CREATE TABLE) and DML (INSERT, SELECT) statements to create, update, or read from SQL tables. Leveraging SQLAlchemy means the mechanism is relatively agnostic to the underlying database system (e.g. MySQL, SQLite, Postgres). Porting this would introduce an optional runtime dependency on SQLAlchemy into afw.table.
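
As a rough illustration of that mechanism (the type mapping and field list here are made up, not the actual pandas code):

```python
import sqlalchemy as sa

# Build a Table object dynamically from (name, type) pairs, then let
# SQLAlchemy emit dialect-appropriate DDL and DML.
fields = [("id", sa.BigInteger), ("ra", sa.Float), ("dec", sa.Float)]

engine = sa.create_engine("sqlite:///sources.sqlite3")
metadata = sa.MetaData()
table = sa.Table("Source", metadata,
                 *[sa.Column(name, dtype) for name, dtype in fields])
metadata.create_all(engine)  # CREATE TABLE, rendered per dialect

with engine.begin() as conn:  # INSERT via generated DML
    conn.execute(table.insert(), [{"id": 1, "ra": 10.5, "dec": -3.2}])
```

Swapping the engine URL to MySQL or Postgres changes the emitted SQL without changing this code, which is the portability argument.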

Supporting metadata in databases depends somewhat on the kinds of metadata supported in afw.table and other formats. To a first approximation, we would want a description string for tables and columns, and a unit string for columns. MySQL and Postgres both support comments on tables and columns, which have been used in the past to extend the capabilities of databases. SQLite allows nothing of this nature, but it does store the exact CREATE TABLE statement that was used to create a table, which could potentially be parsed for context; you could imagine something like a Sphinx-style docstring with annotations that another tool could parse. The drawback with both the MySQL and SQLite options (essentially encoding metadata in comments at table-creation time) is that the comments are effectively immutable, require some form of parsing, and can probably hold only relatively small amounts of data. Postgres seems less immutable, but is similarly constrained in size.
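
For the comment-based approach, SQLAlchemy can render table and column comments (a sketch, assuming a SQLAlchemy version with comment support; the "unit:/doc:" convention is made up and would need parsing on read):

```python
import sqlalchemy as sa

# Encode per-column unit/doc strings as engine-level comments.
# COMMENT syntax is rendered for MySQL and Postgres; SQLite has no
# equivalent, so a convention parsed out of the stored CREATE TABLE
# text would be needed there instead.
table = sa.Table(
    "Source", sa.MetaData(),
    sa.Column("psfFlux", sa.Float,
              comment="unit: count; doc: PSF instrumental flux"),
    comment="Single-epoch source measurements",
)
```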

Using another table would probably be beneficial. There are two options for something like that. The first would follow the INFORMATION_SCHEMA convention used in several databases: a global schema with tables for metadata about the tables and columns in the database. For example, you might have META_SCHEMA.Table and META_SCHEMA.Column tables. These tables would need to be writable by any process that creates a table.

Another convention would be to create additional metadata support tables for every table. For example, a Source table might also have META$Source$Table and META$Source$Column tables (note: $ is occasionally used, by convention, to denote a “special” table in some way). This has the benefit over the previous convention that any process which can create a Source table can also implicitly create the META$ tables; a sketch of these tables follows.
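
A sketch of what that per-table convention might look like (the names and column shapes here are hypothetical):

```python
import sqlalchemy as sa

# Alongside Source: one row of table-level metadata, plus one row
# per field for column-level metadata. Quoting of `$` in the names
# is left to the dialect.
metadata = sa.MetaData()
sa.Table("META$Source$Table", metadata,
         sa.Column("key", sa.String(64), primary_key=True),
         sa.Column("value", sa.Text))
sa.Table("META$Source$Column", metadata,
         sa.Column("column_name", sa.String(64), primary_key=True),
         sa.Column("unit", sa.String(32)),
         sa.Column("doc", sa.Text))
```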

I’m not super keen on any of these solutions, but something will be required for good round-tripping support. In the meantime, I do think adding a generic readMetadata/writeMetadata callback mechanism to the readSql/writeSql APIs would be a good way forward.
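
The callback idea might look something like this (purely hypothetical signatures):

```python
# readSql/writeSql stay agnostic about where metadata lives; callers
# supply functions that persist and recover it (comments, side
# tables, whatever the deployment supports).
def write_meta(connection, table_name, meta):
    """User-supplied: store {column: {"unit": ..., "doc": ...}}."""

def read_meta(connection, table_name):
    """User-supplied counterpart: return the same mapping."""

catalog.writeSql("Source", connection, writeMetadata=write_meta)
catalog = SourceCatalog.readSql("Source", connection,
                                readMetadata=read_meta)
```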

Alternatively, the generic callback mechanism could also be proposed to the pandas project, or the SQL mechanism could be ported into astropy as well, probably with the callback mechanism too. This may be fine as well, but it potentially adds an extra step to many operations. In any case, it's probable those communities would find these additions valuable, and the ability to read and write SQL from astropy specifically would allow SQLite to move forward as a file format for distribution of data.

Future work

At least for afw.table, we should also evaluate other file formats that pandas.DataFrame supports, with the next candidate being HDF5. We may also have metadata representation issues complicating round-tripping in other formats, so it would be good to think about those. Astropy does support HDF5, but only for metadata tagged on the table.

I’m certainly interested in joining this discussion.

I think it’s fair to say that metadata management in afw.table hasn’t really been thought through much, and there’s plenty of room either to include more directly in afw.table objects or to structure what we have differently. It shouldn’t be hard to do any of that while maintaining backwards compatibility in terms of reading existing FITS files written with older versions, and I think that’s really the only requirement.

I’m inclined to say that we should always use extra tables for metadata when converting to SQL, and write engine-specific comments for convenience but not round-tripping. But that’s largely an opinion I developed from experience trying to round-trip things through various FITS conventions and standards, and it may not be that similar to round-tripping through SQL.

First reaction: I would very much like to avoid having any SQL (or FITS for that matter) dependencies within the afw.table code. All persistence should be done “outboard” with a separate class (or even package) that ideally only uses the afw.table public interface but might have some knowledge of internals (the equivalent of a C++ friend) if necessary.

Second reaction: the “Background” paragraph doesn’t mention the database ingestion code (now in daf_ingest?) that was written previously and, while possibly out of date, can likely be recovered relatively easily.

I will add that as background. That was a mistake on my part.

A few points on that:

  • daf_ingest is MySQL-specific (no SQLite support, or other databases)
  • Doesn’t support metadata (but could be patched)
  • Doesn’t address round-tripping of data (no daf_consume), which I assume would be useful for the Butler.

So maybe we need an afw.io module which contains optional dependencies?

As mentioned, we could also rely on astropy.io (and add a SQL io layer there).

Is that fixable, though?

It’s hardcoded with MySQL strings, but we’d probably want to solve this for SQLite and Postgres (the L1 DB candidate) too, and at that point we should just use SQLAlchemy. The pandas code is straightforward; it’d be easier to port that than to worry about supporting n > 2 databases in daf_ingest.

So you are proposing that daf_ingest be completely dropped?

Third reaction: “round-tripping” is a little tricky. While input and output are usually thought of as mirror images, I think it is actually rare that they are exactly reversible. Often a retrieved table will be a subset of a persisted one, or a selection across multiple unioned tables, or even a join across multiple persisted tables.

Fourth reaction: it is not necessary for the process creating a table to also create its metadata if the metadata has been pre-specified because the table’s schema is defined in advance. That was the direction we were originally going with the baseline schema. The dynamic specification of table schemas (and – not yet implemented – their metadata) was a concession to flexibility during development but could be rethought at this point.
