SQLite as a cross-application data format, FITS/HDF5/VOTable alternative

brianv0 · August 26, 2016, 8:23pm

I mentioned in the AHM using SQLite as an alternative interchange format. It’s my plan to eventually offer this as a data format which can be returned from dbserv (Which can currently support a custom JSON format we’ve developed as well as VOTable, FITS, and likely HDF5 in the future). It may even be the case in the future that results from QServ will actually be stored as a SQLite file.

I’m not proposing project wide adoption of such a practice, but I thought it might be good to get some thoughts down on community and see if it piques any interest.

Some reasons why you might want to do such a thing is partially covered by the SQLite people themselves:
https://www.sqlite.org/appfileformat.html

From a project perspective, I think there’s several advantages of using SQLite as a file format:

You can view the data and perform basic analysis on data using just SQL on any *nix machine via sqlite3 CLI
Metadata about tables can be stored in a table. This can include UCDs, units, data types, and column grouping information, for example.
- Extracting metadata about the tables can be performed using just SQL
The tables can be directly used by all python installations without any additional software. SQLite files are supported by nearly all programming languages.
You can open several SQLite files at the same time and perform joins across them.
via ATTACH DATABASE statement
The format is portable across all systems, 32 and 64 bit, little and big endian.
No intermediate representation typically necessary if you want to use pandas
You can also store BLOBs in a SQLite file, if necessary, using a key:value pattern in a table
Some code can live with the data, for example, triggers.
Views can directly support a type of polymorphism on the data
Binary data is fairly small

Drawbacks:

You can’t, as far as I know, stream such a format. An example of this might be streaming results from a synchronous database connection. It must be materialized fully to disk (or memory) before it can be transferred. This increases the latency in the case of large result sets. I believe HDF5 will have the same issue, and possibly FITS as well.
Column data types aren’t strongly typed (edit: without hacks)
Binary representation of columns can vary if not careful
No standards or best practices for such a use are currently established (related to previous point)
I believe we have the same issue with JSON and HDF5 representations

A User Story

Below is a user story in how one file format can be used in multiple ways across multiple languages and applications.

Alice logs into the LSST SUIT system. She performs a query on a patch of the sky, and gets the data out. The SUIT system builds a QServ SQL query for her, and dispatches that query to dbserv. dbserv executes the query, and stores the results as a SQLite file. dbserv also took the opportunity to insert scientific metadata about the query, such as the UCDs of the columns, units, and how columns group together logically to a few metadata tables in the SQLite file. The SUIT system transfers that resultant SQLite file over to an SUIT server and presents that information to Alice.

Alice interacts with the data. She wants a binned histogram from SUIT, which the SUIT server knows can be expressed as a GROUP BY query on the SQLite file SUIT received. The SUIT server builds a new query, executes it against the SQLite file, and presents a histogram back to Alice

On further examination, Alice decides she would like to perform some sort of correlation with an external catalog, in the form of a FITS file, that she has. She uploads that file to the SUIT servers which also convert it to a SQLite file. The SUIT system attaches both files to a SQLite session and performs a join between the two, presenting the results to Alice.

Alice is happy with her selection of the original dataset. She downloads it and explore it further on her own machine. Alice loads the data to a dataframe in pandas, and interacts with it on her own machine while on the plane to Tahiti. Alice should really stop working while on vacation.

jbosch · August 26, 2016, 8:34pm

I’m quite intrigued. I think it’d be particularly useful if we could add a layer that stores column types and additional metadata in additional tables, and uses that to allow users to extract query results into appropriately-typed NumPy column arrays (which would extend naturally to getting astropy/pandas views).

In the past, I’ve thought of such a layer as living below the butler, in the sense that we’d execute any queries through the butler and get back the results as in-memory table objects of some kind, and only the butler would know the data is stored internally in a SQL database. But there might be some advantages of exposing the database more directly to users.

parejkoj · August 26, 2016, 8:36pm

+1 for the example use case, though it’s not very realistic (who does work while flying to Tahiti?).

This sounds like a potentially very useful dbserv product. My only real concern is the “Column data types aren’t strongly typed” - I don’t know much about SQLite, but is there any way around that? This feels similar to my big problem with JSON (does not correctly represent IEEE floats, including NaN and Inf).

Also, how, and how well, does it compress?

mwv · August 26, 2016, 8:39pm

ditto

[By previous agreement at the LSST2016 meeting, @parejkoj and I have realized that just saying “ditto” would often save one of us much typing.]

gpdf · August 26, 2016, 9:41pm

I’m really enthusiastic, particularly from an SUIT perspective. This goes a long way toward making it possible to offer a “workspace” environment for user interaction with outputs from queries that scales from laptop to full qserv-sharded formal Level 3 data products, and makes it easier to make query outputs in a local workspace accessible with the same tools that we use for the DAC-level data products.

brianv0 · August 26, 2016, 10:36pm

SQLite does store integers as ints or floating points as IEEE 754-2008 (both big endian). It can degrade to another format on a per-record basis, if needed.

The decision happens at insertion time.
https://www.sqlite.org/datatype3.html#section_3
https://www.sqlite.org/fileformat2.html#section_2_1

An example of this behavior is below:

bvan@PC92576:docker$ sqlite3 
SQLite version 3.11.1 2016-03-03 16:17:53
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> create table foo(x double);
sqlite> insert into foo values (1.23);
sqlite> insert into foo values (9e999);
sqlite> select x, typeof(x) from foo;
1.23|real
Inf|real
sqlite> insert into foo values ('Inf');
sqlite> insert into foo values (-9e999);
sqlite> insert into foo values (1/0);
sqlite> select x, typeof(x) from foo;
1.23|real
Inf|real
Inf|text
-Inf|real
|null
sqlite> 
sqlite> insert into foo values (4.321);
sqlite> insert into foo values (9871236540);
sqlite> select x, typeof(x) from foo;
1.23|real
Inf|real
Inf|text
-Inf|real
|null
4.321|real
9871236540.0|real
sqlite>

The issue is you can do something like this as well:

sqlite> insert into foo values ('foobarbaz');

…and it degrades to text. You’ll notice above this is also what happened when I inserted the text literal 'Inf' instead of the too-large double, 9e999.

So it would be up to the application to determine what ‘foobarbaz’ is and how to handle it. In practice, I think the database API should probably handle this okay.

It is possible to prevent this and do a CHECK constraint on all columns if you want to enforce such a behavior. You can combine this with a trigger to make sure all inserted values will be converted to IEEE 754 floating points, or an insertion shall fail, but it would be a lot of DDL boiler plate.

The following query will tell you if a column (from the above insert statements) is null, real, or text. It could be modified and put into a CHECK constraint or a trigger, or just used to validate:

select x, typeof(case when typeof(x) == 'text' then 
     (case when x = 'Inf' then 9e999
          when x = '-Inf' then -9e999
          when x = 'Nan' then 1/0
          else x end) else x end) -- assume real or text
      from foo;

As for compression, from the second link above, integers are effectively compressed by default. The SQLite format is a binary format at the end of the day, so it will compress as well as a binary format can compress.

brianv0 · August 26, 2016, 10:46pm

Alternatively, I think you optionally could just add the following check constraint:

create table foo(x double CONSTRAINT x_is_real CHECK (typeof(x) = 'real'));

This fails the text case:

sqlite> insert into foo values (1.23);
sqlite> insert into foo values (9e999);
sqlite> select x, typeof(x) from foo;
1.23|real
Inf|real
sqlite> insert into foo values ('Inf');
Error: CHECK constraint failed: x_is_real

xiuqin · August 27, 2016, 4:10am

I am excited to see that you are exploring SQLite as a output format. We discussed using SQLite in SUIT server as a way to do better SQL like filtering support for tabular data at one point. We should definitely explore SQLite format more. It could provide better support of user data in workspace and Level3 data.

afausti · August 27, 2016, 10:05pm

@brianv0 that’s an interesting ideia, in particular the possibility of open several SQLite files at the same time and perform joins seems useful. Perhaps SpatialLite would be useful to add spatial indexes and related functions. Just to mention that DECALs survey release their data in a db file (in this case postgreSQL) with defined schema and sample queries, that is particular useful for them because from the photometric data they do target selection and express that as SQL queries.

ctslater · August 29, 2016, 4:16pm

Some potential use cases come to mind: there are a few places in level 1 where SQL is the natural way to specify the input data. For example, after generating DIA sources, we need to associate them to any previous DIA object, and we need to retrieve the previous DIA source measurements associated with each DIA object that we match to. This is really a query on the sky, so SQL makes more sense than visit/tract/patch/etc. But in order to make this runnable and testable on small scales, I don’t think we want to bake in the assumption that there will always be an external database to connect to. One could imagine the butler acting as a pass through to send queries either to a local SQLite “repository” or a production database depending on the user context. I think that would give the pipelines a lot of flexibility.

The moving objects pipeline has many of the same needs.

natepease · August 31, 2016, 12:01am

It does seem reasonable that SQLite could be a type of butler storage format. It would require code that knew how to read & write objects from/to sqlite.