I mentioned in the AHM using SQLite as an alternative interchange format. It’s my plan to eventually offer this as a data format which can be returned from dbserv (Which can currently support a custom JSON format we’ve developed as well as VOTable, FITS, and likely HDF5 in the future). It may even be the case in the future that results from QServ will actually be stored as a SQLite file.
I’m not proposing project wide adoption of such a practice, but I thought it might be good to get some thoughts down on community and see if it piques any interest.
Some reasons why you might want to do such a thing is partially covered by the SQLite people themselves:
https://www.sqlite.org/appfileformat.html
From a project perspective, I think there’s several advantages of using SQLite as a file format:
- You can view the data and perform basic analysis on data using just SQL on any *nix machine via
sqlite3
CLI - Metadata about tables can be stored in a table. This can include UCDs, units, data types, and column grouping information, for example.
- Extracting metadata about the tables can be performed using just SQL
- The tables can be directly used by all python installations without any additional software. SQLite files are supported by nearly all programming languages.
- You can open several SQLite files at the same time and perform joins across them.
- via
ATTACH DATABASE
statement - The format is portable across all systems, 32 and 64 bit, little and big endian.
- No intermediate representation typically necessary if you want to use pandas
- You can also store BLOBs in a SQLite file, if necessary, using a key:value pattern in a table
- Some code can live with the data, for example, triggers.
- Views can directly support a type of polymorphism on the data
- Binary data is fairly small
Drawbacks:
- You can’t, as far as I know, stream such a format. An example of this might be streaming results from a synchronous database connection. It must be materialized fully to disk (or memory) before it can be transferred. This increases the latency in the case of large result sets. I believe HDF5 will have the same issue, and possibly FITS as well.
- Column data types aren’t strongly typed (edit: without hacks)
- Binary representation of columns can vary if not careful
- No standards or best practices for such a use are currently established (related to previous point)
- I believe we have the same issue with JSON and HDF5 representations
A User Story
Below is a user story in how one file format can be used in multiple ways across multiple languages and applications.
Alice logs into the LSST SUIT system. She performs a query on a patch of the sky, and gets the data out. The SUIT system builds a QServ SQL query for her, and dispatches that query to dbserv. dbserv executes the query, and stores the results as a SQLite file. dbserv also took the opportunity to insert scientific metadata about the query, such as the UCDs of the columns, units, and how columns group together logically to a few metadata tables in the SQLite file. The SUIT system transfers that resultant SQLite file over to an SUIT server and presents that information to Alice.
Alice interacts with the data. She wants a binned histogram from SUIT, which the SUIT server knows can be expressed as a GROUP BY query on the SQLite file SUIT received. The SUIT server builds a new query, executes it against the SQLite file, and presents a histogram back to Alice
On further examination, Alice decides she would like to perform some sort of correlation with an external catalog, in the form of a FITS file, that she has. She uploads that file to the SUIT servers which also convert it to a SQLite file. The SUIT system attaches both files to a SQLite session and performs a join between the two, presenting the results to Alice.
Alice is happy with her selection of the original dataset. She downloads it and explore it further on her own machine. Alice loads the data to a dataframe in pandas, and interacts with it on her own machine while on the plane to Tahiti. Alice should really stop working while on vacation.