Proposed date: TBD
Suggested audience: Anyone interested in pipelines, data formats, afw.table, and interop with astropy.
afw.table SQL persistence and round-tripping
afw.table can be indirectly persisted to (and probably read back from) a relational database by first converting it to an astropy table, then converting once more to a pandas.DataFrame, and finally using the to_sql/read_sql methods, though the reverse direction doesn't fully work as far as I know.
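To make the chain concrete, here is a minimal sketch of the pandas stage of that round trip against SQLite. The afw.table-to-astropy step (something like catalog.asAstropy() followed by Table.to_pandas()) is omitted, and the Source table name is just an illustrative choice:

```python
import sqlite3

import pandas as pd

# Stand-in for a DataFrame produced from an afw.table catalog via astropy.
df = pd.DataFrame({"id": [1, 2, 3], "flux": [1.5, 2.5, 3.5]})

con = sqlite3.connect(":memory:")
df.to_sql("Source", con, index=False)            # write leg
back = pd.read_sql("SELECT * FROM Source", con)  # read leg

# The data itself round-trips, but column units and descriptions do not.
assert back.equals(df)
```

This works for the raw values, which is exactly the limitation discussed below: nothing in this path carries per-column metadata.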
- Design a portable way or convention of representing table and column metadata. Possibly add a createDdl API method with similar options which will output annotated DDL.
- This may also trigger an RFD about metadata use inside of afw.table.
- Fully support round trips.
The primary motivation of this RFD is to enable rich format and metadata support in dbserv, as well as to improve interoperability with Qserv for afw.table. The secondary motivation is to support SQLite as an official file format, which might be a data product generated by dbserv.
The ability to persist tables to and from a database, especially SQLite and MySQL (Qserv), is obviously valuable. While it's currently possible to do this with pandas, it's not possible to persist additional metadata, like field (Column) units, doc/comments, or table metadata. Part of this is because pandas makes no attempt at supporting such metadata, due to the difficulty of determining propagation behavior across table operations like projections and joins. The other reason is that there's no good, portable way of supporting this across different databases (more on this later). Other formats, like FITS and VOTable, do have first-class support for this sort of metadata, with VOTable appearing the most thorough.
Supporting readSql and writeSql will be fairly easy, as we can probably port most of the code from pandas directly into afw.table with slight modifications.
Internally, pandas leverages SQLAlchemy and some type mapping to dynamically create a SQLAlchemy model, which can be used to generate SQL DDL (CREATE TABLE) and DML (SELECT) statements to create, update, or read from SQL tables. Leveraging SQLAlchemy means that the mechanism is relatively agnostic to the underlying database system (e.g. MySQL, SQLite, Postgres). Porting this would introduce an optional/runtime dependency on SQLAlchemy for afw.table.
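The dynamic-model technique looks roughly like the following sketch. The type-code mapping and field list are hypothetical stand-ins for afw.table schema introspection; only the SQLAlchemy Core calls are real API:

```python
from sqlalchemy import Column, Float, Integer, MetaData, Table, create_engine

# Hypothetical mapping from afw.table type codes to SQLAlchemy column types;
# pandas does something similar internally for DataFrame dtypes.
TYPE_MAP = {"I": Integer, "D": Float}

def make_model(name, fields, metadata):
    """Dynamically build a SQLAlchemy Table from (name, type-code) pairs."""
    columns = [Column(fname, TYPE_MAP[code]) for fname, code in fields]
    return Table(name, metadata, *columns)

engine = create_engine("sqlite:///:memory:")
metadata = MetaData()
source = make_model("Source", [("id", "I"), ("flux", "D")], metadata)
metadata.create_all(engine)  # emits backend-appropriate CREATE TABLE DDL
```

Because the DDL is generated from the model, swapping the engine URL for MySQL or Postgres changes the emitted SQL without changing this code.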
Supporting metadata in databases depends somewhat on the kinds of metadata supported in afw.table and other formats. To a first approximation, we would want to support a description string for tables and columns, and a unit string for columns. MySQL and Postgres both support comments on tables and columns, which have been used in the past to extend the capabilities of databases. SQLite does not allow anything of this nature, but it does store the exact CREATE TABLE statement that was used to create a table, which could potentially be parsed to add context. You could imagine something like a Sphinx-style docstring with annotations that could be parsed by another tool. The drawback with both the MySQL and SQLite options, which basically encode metadata in comments when the table is created, is that they are effectively immutable, require some form of parsing, and can probably store only relatively small amounts of data. Postgres seems to be less immutable, but similarly constrained in size.
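As a quick illustration of the SQLite side of this, the original DDL text (comments included) is recoverable from sqlite_master. The "-- unit:" / "-- doc:" annotation convention shown here is hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Annotations embedded as SQL comments in the DDL.
con.execute("""
    CREATE TABLE Source (
        id   INTEGER,  -- doc: unique source identifier
        flux REAL      -- unit: nJy
    )
""")

# SQLite keeps the original statement text, so a tool could parse the
# annotations back out later.
(ddl,) = con.execute(
    "SELECT sql FROM sqlite_master WHERE name = 'Source'"
).fetchone()
assert "unit: nJy" in ddl
```

This confirms the "parse the stored CREATE TABLE" approach is at least mechanically possible, though it inherits all the immutability and parsing drawbacks noted above.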
Using another table would probably be beneficial. There are two options for something like that. The first would look like the INFORMATION_SCHEMA convention that's used in several databases: a global schema with tables for metadata about tables and columns in the database. For example, you might have META_SCHEMA.Column tables. These tables would need to be writable by any process which is creating a table.
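A minimal sketch of that global convention, with hypothetical names and layout. Since SQLite has no named schemas, the META_SCHEMA prefix is folded into a quoted table name here:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# One global table keyed by (table_name, column_name) describes every
# column in the database.
con.executescript("""
    CREATE TABLE "META_SCHEMA.Column" (
        table_name  TEXT,
        column_name TEXT,
        unit        TEXT,
        description TEXT
    );
    INSERT INTO "META_SCHEMA.Column"
        VALUES ('Source', 'flux', 'nJy', 'Calibrated flux of the source');
""")
(unit,) = con.execute(
    """SELECT unit FROM "META_SCHEMA.Column"
       WHERE table_name = 'Source' AND column_name = 'flux'"""
).fetchone()
assert unit == "nJy"
```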
Another convention would be to create additional metadata support tables for every table. For example, a Source table might also have META$Source$Table and META$Source$Column tables. (Note: $ is occasionally used, as a convention, to denote a "special" table in some way.) This has the benefit over the previous convention that any process which can create a Source table can also implicitly create the accompanying metadata tables.
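The per-table convention can be sketched as a single helper that creates the data table and its side table together; the META$ naming and side-table layout are hypothetical:

```python
import sqlite3

def create_with_meta(con, name, columns):
    """Create table `name` plus a META$<name>$Column side table in one step.

    `columns` is a list of (column_name, sql_type, unit) triples.
    """
    decl = ", ".join(f"{cname} {ctype}" for cname, ctype, _ in columns)
    con.execute(f'CREATE TABLE "{name}" ({decl})')
    con.execute(
        f'CREATE TABLE "META${name}$Column" (column_name TEXT, unit TEXT)'
    )
    con.executemany(
        f'INSERT INTO "META${name}$Column" VALUES (?, ?)',
        [(cname, unit) for cname, _, unit in columns],
    )

con = sqlite3.connect(":memory:")
create_with_meta(con, "Source",
                 [("id", "INTEGER", ""), ("flux", "REAL", "nJy")])
```

Because both tables come from the same call, any process with permission to create Source automatically gets its metadata tables too, which is the implicit-creation benefit described above.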
I'm not super keen on any of these solutions, but something would be required for good round-tripping support. In the meantime, I do think adding a generic readMetadata/writeMetadata callback mechanism to the readSql/writeSql APIs would be a good way forward.
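To show the shape of that callback idea, here is a hypothetical write_sql sketch: the function persists rows, then delegates metadata handling to a user-supplied callback instead of hard-coding any one storage convention. All names here (write_sql, meta_to_side_table, the _meta table) are illustrative, not an existing API:

```python
import sqlite3

def write_sql(columns, name, con, write_metadata=None, metadata=None):
    """Persist a dict of column arrays, then hand metadata to a callback."""
    names = list(columns)
    con.execute(f"CREATE TABLE {name} ({', '.join(names)})")
    con.executemany(
        f"INSERT INTO {name} VALUES ({', '.join('?' for _ in names)})",
        zip(*columns.values()),
    )
    if write_metadata is not None:
        write_metadata(con, name, metadata or {})

def meta_to_side_table(con, name, metadata):
    # One possible callback: dump key/value metadata into a side table.
    con.execute(f"CREATE TABLE {name}_meta (key TEXT, value TEXT)")
    con.executemany(f"INSERT INTO {name}_meta VALUES (?, ?)", metadata.items())

con = sqlite3.connect(":memory:")
write_sql({"id": [1, 2], "flux": [3.5, 4.5]}, "Source", con,
          write_metadata=meta_to_side_table,
          metadata={"flux:unit": "nJy"})
```

The point of the design is that the core read/write path stays convention-agnostic: a site that prefers the global META_SCHEMA tables, the per-table META$ tables, or database comments just supplies a different callback.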
Alternatively, the generic callback mechanism could also be proposed to the pandas project, or the SQL mechanism could be ported into astropy as well, probably with the callback mechanism too. This may be fine, but it potentially adds an extra step to many operations. In any case, it's probable those communities would find these additions valuable, and the ability to read and write SQL from astropy specifically would allow SQLite to move forward as a file format for distribution of data.
At least for afw.table, we should also evaluate other file formats that pandas.DataFrame supports, with the next candidate being HDF5. We may also have metadata representation issues complicating round-tripping in other formats, so it would be good to think about those. Astropy does support HDF5, but only for metadata tagged on the table.