Proposed date: TBD
Connection: TBD
Suggested audience: Anyone interested in pipelines, data formats, afw.table, and interop with astropy.
afw.table SQL persistence and round-tripping
Background
An afw.table can be indirectly persisted to (and probably restored from) a relational database by first converting it to an astropy table, then converting that to a pandas.DataFrame, and finally using the pandas.DataFrame.to_sql and/or read_sql methods, though as far as I know the reverse direction doesn't fully work.
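This indirect path can be sketched as follows. The afw/astropy conversion leg is shown only as a comment, since it requires the LSST stack; the DataFrame below is a stand-in for the converted catalog, and the column names are invented for illustration:

```python
import sqlite3

import pandas as pd

# Hypothetical afw.table leg (requires the LSST stack):
#   df = catalog.asAstropy().to_pandas()
# Stand-in DataFrame representing the converted catalog:
df = pd.DataFrame({"id": [1, 2, 3], "flux": [1.5, 2.5, 3.5]})

conn = sqlite3.connect(":memory:")
df.to_sql("Source", conn, index=False)           # write leg
df2 = pd.read_sql("SELECT * FROM Source", conn)  # read leg

# Note: column units, docs, and table-level metadata do not survive
# this round trip -- that gap is what this RFD is about.
```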
Preliminary Proposals
- Support readSql and writeSql in afw.table
- Design a portable way, or convention, for representing table and column metadata. Possibly add a createDdl API method with similar options that outputs annotated CREATE TABLE statements.
  - This may also trigger an RFD about metadata use inside of afw.table.
- Fully support round trips
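As a rough illustration of what a hypothetical createDdl might emit, the sketch below builds an annotated CREATE TABLE statement from a simple schema description. Both the input format and the comment-based annotation syntax are invented here for illustration, not an existing afw.table API:

```python
def create_ddl(table_name, columns, table_doc=""):
    """Build an annotated CREATE TABLE statement.

    `columns` is a list of (name, sql_type, unit, doc) tuples; unit and
    doc are encoded as trailing SQL comments (an invented convention).
    """
    lines = []
    if table_doc:
        lines.append(f"-- doc: {table_doc}")
    lines.append(f"CREATE TABLE {table_name} (")
    col_lines = []
    for name, sql_type, unit, doc in columns:
        parts = []
        if unit:
            parts.append(f"unit: {unit}")
        if doc:
            parts.append(f"doc: {doc}")
        comment = f"  -- {'; '.join(parts)}" if parts else ""
        col_lines.append(f"  {name} {sql_type},{comment}")
    # The last column definition gets no trailing comma.
    col_lines[-1] = col_lines[-1].replace(",", "", 1)
    lines.extend(col_lines)
    lines.append(");")
    return "\n".join(lines)

ddl = create_ddl(
    "Source",
    [("id", "BIGINT", "", "unique source identifier"),
     ("psfFlux", "DOUBLE", "nJy", "PSF flux")],
    table_doc="Single-epoch source measurements",
)
```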
Motivation
The primary motivation of this RFD is to enable rich format and metadata support in dbserv, as well as to improve interoperability between afw.table and Qserv. The secondary motivation is to support SQLite as an official file format, which might be a data product generated by dbserv.
The ability to persist tables to and from a database, especially SQLite and MySQL (Qserv), is obviously valuable. While it's currently possible to do this with pandas, it's not possible to persist additional metadata, like field (column) units, docs/comments, or table-level metadata. Part of this is because pandas makes no attempt at supporting such metadata, due to the difficulty of determining propagation behavior across table operations like projections and joins. The other reason is that there's no good, portable way of supporting this across different databases (more on this later). Other formats, like FITS and VOTable, do have first-class support for this sort of metadata, with VOTable appearing the most thorough.
Implementation
Supporting readSql and writeSql should be fairly easy, as we can probably port most of the code from pandas directly into afw.table with slight modifications.
Internally, pandas leverages SQLAlchemy and some type mapping to dynamically create a SQLAlchemy model, which can then be used to generate SQL DDL (CREATE TABLE) and DML (INSERT, SELECT) statements to create, update, or read from SQL tables. Leveraging SQLAlchemy means the mechanism is relatively agnostic to the underlying database system (e.g. MySQL, SQLite, Postgres). Porting this would introduce an optional/runtime dependency on SQLAlchemy in afw.table.
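The type-mapping step can be illustrated without SQLAlchemy itself. Below is a minimal stand-in that maps afw.table-style field type codes to portable SQL type names and emits a CREATE TABLE statement; the mapping table is an assumption for illustration, not pandas' actual one:

```python
# Illustrative mapping from afw.table-style field type codes to SQL types.
# The code letters mirror afw.table conventions, but this table is an
# assumption, not the real pandas/SQLAlchemy mapping.
SQL_TYPES = {
    "I": "INTEGER",           # 32-bit int
    "L": "BIGINT",            # 64-bit int
    "F": "FLOAT",             # 32-bit float
    "D": "DOUBLE PRECISION",  # 64-bit float
    "String": "TEXT",
}

def ddl_for(table, schema):
    """schema: list of (field_name, type_code) pairs."""
    cols = ", ".join(f"{name} {SQL_TYPES[code]}" for name, code in schema)
    return f"CREATE TABLE {table} ({cols})"

stmt = ddl_for("Source", [("id", "L"), ("ra", "D"), ("dec", "D")])
```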
Supporting metadata in databases depends somewhat on the types of metadata supported in afw.table and other formats. To a first approximation, we would want to support a description string for tables and columns, and a unit string for columns. MySQL and Postgres both support comments on tables and columns, which have been used in the past to extend the capabilities of databases. SQLite does not allow anything of this nature, but it does store the exact CREATE TABLE statement that was used to create a table, which could potentially be parsed to recover context. You could imagine something like a Sphinx-style docstring with annotations that could be parsed by another tool. The drawback of both the MySQL and SQLite options, which essentially encode metadata in comments when creating the table, is that the comments are effectively immutable, require some form of parsing, and can probably hold only relatively small amounts of data. Postgres seems less immutable, but similarly constrained in size.
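SQLite's behavior here is easy to demonstrate: it preserves the original statement text, comments included, in sqlite_master, so annotations embedded at creation time can be read back later. A sketch, with an invented annotation syntax:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Source (\n"
    "  id BIGINT,      -- doc: unique source identifier\n"
    "  psfFlux DOUBLE  -- unit: nJy\n"
    ")"
)

# SQLite stores the verbatim CREATE TABLE text, comments and all.
(ddl,) = conn.execute(
    "SELECT sql FROM sqlite_master WHERE name = 'Source'"
).fetchone()

# The annotations survive and could be parsed back out by a tool.
```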
Using another table would probably be beneficial. There are two options for something like that. The first would look like the INFORMATION_SCHEMA convention used in several databases: a global schema with tables holding metadata about the tables and columns in the database. For example, you might have META_SCHEMA.Table and META_SCHEMA.Column tables. These tables would need to be writable by any process which is creating a table.
Another convention would be to create additional metadata support tables for every table. For example, a Source table might also have META$Source$Table and META$Source$Column tables. (Note: $ is occasionally used, as a convention, to denote a "special" table in some way.) This has the benefit over the previous convention that any process which can create a Source table can also implicitly create the META$ tables.
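The per-table variant might look like this in SQLite (the $-separated names need quoting; the naming and column layout are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Source (id BIGINT, psfFlux DOUBLE)")

# Companion metadata tables, created alongside Source itself.
conn.execute('CREATE TABLE "META$Source$Table" (doc TEXT)')
conn.execute(
    'CREATE TABLE "META$Source$Column" '
    "(column_name TEXT PRIMARY KEY, unit TEXT, doc TEXT)")
conn.execute(
    'INSERT INTO "META$Source$Column" VALUES (?, ?, ?)',
    ("psfFlux", "nJy", "PSF flux"))

rows = conn.execute(
    'SELECT column_name, unit FROM "META$Source$Column"').fetchall()
```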
I'm not super keen on any of these solutions, but something would be required for good round-tripping support. In the meantime, I do think adding a generic readMetadata/writeMetadata callback mechanism to the readSql/writeSql APIs would be a good way forward.
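One possible shape for that callback hook is sketched below. Everything here, including the function names and the hard-coded two-column schema, is hypothetical rather than an existing afw.table API; the point is only that the writer hands its connection to a caller-supplied callback, which then decides how metadata gets stored:

```python
import sqlite3

def write_sql(rows, table, conn, write_metadata=None):
    """Hypothetical writeSql: persist rows, then hand the connection to
    an optional callback so callers decide how metadata is stored."""
    conn.execute(f"CREATE TABLE {table} (id BIGINT, flux DOUBLE)")
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    if write_metadata is not None:
        write_metadata(conn, table)

def store_units(conn, table):
    # Example callback: stash column units in a companion table.
    conn.execute(
        f'CREATE TABLE "META${table}$Column" (column_name TEXT, unit TEXT)')
    conn.execute(
        f'INSERT INTO "META${table}$Column" VALUES (?, ?)', ("flux", "nJy"))

conn = sqlite3.connect(":memory:")
write_sql([(1, 1.5), (2, 2.5)], "Source", conn, write_metadata=store_units)
```

A matching readMetadata callback on the read side would complete the round trip without afw.table having to pick one metadata convention up front.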
Alternatively, the generic callback mechanism could also be proposed to the pandas project, or the SQL mechanism could be ported into astropy as well, probably with the callback mechanism too. This may be fine, but it potentially adds extra steps to many operations. In any case, it's probable those communities would find these additions valuable, and the ability to read and write SQL from astropy specifically would allow SQLite to move forward as a file format for data distribution.
Future work
At least for afw.table, we should also evaluate other file formats that pandas.DataFrame supports, with the next candidate being HDF5. We may also have metadata representation issues complicating round-tripping in other formats, so it would be good to think about those. Astropy does support HDF5, but only for metadata tagged on the table.