I would like to provide some input on this, as someone whose responsibility is to set up a platform for data release processing at the satellite site.
First of all, thanks for including in the new Butler design the possibility of using multiple storage technologies (and their associated protocols) and multiple storage formats. That will be very useful for us when investigating the most effective combination for each particular use case.
Let me first start with some motivating observations. At CC-IN2P3 we observed that reading CFHT data (in the form of FITS files) using a networked file system (GPFS in our case) was not very efficient for some workflows, or parts of a workflow.
We started digging to understand the reasons for this and came to the conclusion that when the application consuming that data is mostly interested in reading the HDUs of the FITS files, there was excessive network traffic between the GPFS file servers (where the FITS files are stored) and the compute nodes (where the application consuming the data runs). We made some modifications to the configuration of the GPFS file system and the situation improved, without being totally satisfactory: the typical block size as reported by the file system is the unit of I/O used by cfitsio, the library actually doing the I/O for the application.
We then decided to look at the contents of 9 million FITS files containing CFHT data to understand better what was happening. Those files all have a single HDU and a single data unit, on purpose (the scientists want the data to be organized that way). This is what we found (a sketch of how such a scan can be done follows the list):
- Each file’s data unit size is about 18 MB (which is basically the amount of data necessary to store the data collected by one CCD of MegaCam).
- The 98th percentile of the size of the HDUs is 36 KB
- The 98th percentile of the number of key-value pairs in an HDU is 438
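For reference, here is a rough sketch of how such a scan can be done. It is not the actual script we used; it assumes astropy and numpy are available, and the directory path is hypothetical:

```python
# Sketch: gather header statistics over a set of single-HDU FITS files.
import glob

import numpy as np
from astropy.io import fits

header_sizes = []    # size in bytes of the serialized header of each file
keyword_counts = []  # number of key-value pairs (cards) in each header

for path in glob.glob("/data/cfht/*.fits"):  # hypothetical location
    with fits.open(path, memmap=True) as hdul:
        header = hdul[0].header
        # A FITS header is stored as 80-byte "cards", padded to 2880-byte blocks.
        header_sizes.append(len(header.tostring()))
        keyword_counts.append(len(header))

print("98th percentile header size (bytes):", np.percentile(header_sizes, 98))
print("98th percentile keyword count:", np.percentile(keyword_counts, 98))
```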
Looking at the size of the typical HDU (36 KB), one can understand why the excessive network traffic was generated: each read operation issued by cfitsio asked GPFS to transport one block (1 MB initially, 256 KB after reconfiguration) from the file server to the client. However, the excess data was discarded, as the application was mostly interested (at least in that phase) in scanning the contents of the key-value pairs in the HDU.
All this made us wonder whether, given the small amount of data in a typical HDU, it would be better to store the key-value pairs in a database and have the application query the database to retrieve any value, instead of reading the FITS file from the file system. The FITS file in the file system would not be modified; a copy of the relevant data would instead be stored in the database. The next step would be to try this and quantify the benefit (if any) of each of the two solutions.
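To make that idea more concrete, here is a minimal sketch of what I have in mind, assuming SQLite as the database; the table layout, file paths and helper names are purely illustrative:

```python
# Sketch of the "headers in a database" idea. SQLite, the schema and the helper
# names are my own illustrative assumptions, not part of the Butler design.
import sqlite3

from astropy.io import fits

conn = sqlite3.connect("headers.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS header_cards ("
    "  filename TEXT, keyword TEXT, value TEXT,"
    "  PRIMARY KEY (filename, keyword))"
)

def ingest(path):
    """Copy the key-value pairs of a single-HDU FITS file into the database."""
    with fits.open(path) as hdul:
        # Duplicate keywords (e.g. COMMENT/HISTORY) are collapsed here for simplicity.
        cards = [(path, card.keyword, str(card.value)) for card in hdul[0].header.cards]
    conn.executemany("INSERT OR REPLACE INTO header_cards VALUES (?, ?, ?)", cards)
    conn.commit()

def lookup(path, keyword):
    """Retrieve one header value without touching the file system."""
    row = conn.execute(
        "SELECT value FROM header_cards WHERE filename = ? AND keyword = ?",
        (path, keyword),
    ).fetchone()
    return row[0] if row else None

ingest("/data/cfht/example.fits")                    # hypothetical path
print(lookup("/data/cfht/example.fits", "EXPTIME"))  # query instead of re-reading the file
```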
My feeling is that if the application reads a local FITS file, there is no significant benefit in having the HDU contents in a database. But if the FITS files are stored in a networked file system, that separation could be beneficial. Reading the data via HTTP (in the case of an object store) should give the application more flexibility, since you can download the exact number of bytes you want in a single request.
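For instance, a single HTTP range request could fetch just the header bytes. This is a sketch assuming the Python requests library and a hypothetical object-store URL:

```python
# Sketch: fetch only the leading header bytes of a FITS file over HTTP.
import requests

url = "https://object-store.example.org/cfht/example.fits"  # hypothetical URL
nbytes = 36 * 1024  # enough for the 98th percentile of header sizes we measured

# Ask the server for exactly the bytes we need: one round trip, no block overhead.
resp = requests.get(url, headers={"Range": f"bytes=0-{nbytes - 1}"})
resp.raise_for_status()
header_bytes = resp.content  # raw 80-byte FITS header cards, to be parsed as needed
```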
So, what does all this have to do with the Butler design? Looking at the documents, it is not clear to me whether it would be possible to implement a DatabaseStorage class that uses a database to read the data in the HDUs, while using the FilesystemStorage class for reading the contents of the data unit (e.g. pixels). I could not tell from the diagrams whether, at the level of abstraction of a concrete Storage class, there is enough information to infer what the application intends to do (e.g. retrieving a key-value pair from the HDU vs. retrieving a data unit) and optimize accordingly.
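To illustrate the kind of split I have in mind (and only that: these are not the actual Butler classes, and all the names and method signatures below are invented for this example):

```python
# Purely illustrative sketch; NOT the real Butler Storage interface.
from abc import ABC, abstractmethod

from astropy.io import fits


class Storage(ABC):
    @abstractmethod
    def read_header_value(self, dataset_id, keyword):
        """Return a single key-value pair from the dataset's header."""

    @abstractmethod
    def read_data(self, dataset_id):
        """Return the bulk data (e.g. pixels) of the dataset."""


class DatabaseStorage(Storage):
    """Serves header key-value pairs from a database (e.g. the SQLite sketch above)."""

    def __init__(self, connection):
        self.connection = connection

    def read_header_value(self, dataset_id, keyword):
        row = self.connection.execute(
            "SELECT value FROM header_cards WHERE filename = ? AND keyword = ?",
            (dataset_id, keyword),
        ).fetchone()
        return row[0] if row else None

    def read_data(self, dataset_id):
        raise NotImplementedError("bulk data stays in the file system or object store")


class FilesystemStorage(Storage):
    """Serves both headers and bulk data from FITS files on a (networked) file system."""

    def __init__(self, root):
        self.root = root

    def read_header_value(self, dataset_id, keyword):
        # Goes through the file system, so the block-size effects described above apply.
        return fits.getval(f"{self.root}/{dataset_id}", keyword)

    def read_data(self, dataset_id):
        return fits.getdata(f"{self.root}/{dataset_id}")
```

The open question for me is simply whether the real Storage abstraction exposes separate entry points for header access and bulk-data access, so that a split like this is expressible at all.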
To conclude, I’m not knowledgeable enough to say whether LSST will store its data in a way comparable to CFHT, or whether the number of key-value pairs LSST would embed in a FITS file would be comparable either. I also don’t know whether all this is specific to the FITS format and the problem would disappear if we used HDF5.
My point in making this (very long) contribution is that these are the kinds of issues we need to solve when deploying and operating the “plumbing” for a survey like LSST. Having a flexible software layer that allows us to evaluate alternative solutions is extremely valuable. I understand that flexibility has a development and maintenance cost, but I’m sure that effort will pay off over the 10+ years that this software will be in production.
Sorry for the length of this post; I’m happy to provide more details if that would be useful.