Serialization mechanisms

ktl · April 28, 2016, 10:13am

Continuing the discussion from How should we store 100 GB + data sets? Is git-lfs still the answer?:

This looks like it might be useful instead of Boost serialization for ORM-ish persistence of data.

Amazon Ion has also appeared recently and might be a potential competitor.

jbosch · April 28, 2016, 4:22pm

One thing I’ve been assuming we want in serialization is the ability to save some object state in a more structured way that can be naturally expressed in some of the standardized file formats we might target (like FITS or HDF5). I think that would need a serialization library that is a bit more extensible than flatbuffers appears to be (at first glance, at least). I think that’s critical if we ever want to be able to describe the data model of any of our objects in a way that would allow them to be read without our code (I continue to be skeptical that we can do this for many objects, but I believe @rhl and @timj have expressed at least somewhat incompatible opinions).

The biggest flaw of our current afw::table::io serialization library is that it requires all object state to be expressed as a set of structured tables, but that’s also an advantage, because it turns out most of our objects can be represented quite well this way - they just have a small amount of extra state that doesn’t really fit into a table (it just goes into a single-row table in our current system). I think we could do a lot better if we reimplemented this as a backend extension of a real third-party library, and saved the structured data in tables (or images) without stuffing the less-structured stuff into them.
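To make the "everything is a table" idea concrete, here is a minimal sketch in Python. It is not the afw::table::io API; the `Mixture` class, `to_tables`, and `from_tables` are hypothetical names invented for illustration. The per-component data maps naturally onto a multi-row table, while the leftover scalar state ends up in a single-row table.

```python
from dataclasses import dataclass


@dataclass
class Mixture:
    """Hypothetical object: tabular components plus a little scalar state."""
    name: str             # scalar state that does not fit the component table
    normalization: float  # scalar state
    components: list      # list of (weight, sigma) tuples: naturally tabular


def to_tables(obj):
    """Express all object state as tables (lists of row dicts)."""
    component_table = [{"weight": w, "sigma": s} for w, s in obj.components]
    # The extra state is forced into a table with exactly one row.
    extra_table = [{"name": obj.name, "normalization": obj.normalization}]
    return {"components": component_table, "extra": extra_table}


def from_tables(tables):
    """Rebuild the object from the tables written by to_tables."""
    (extra,) = tables["extra"]  # fails loudly unless there is exactly one row
    comps = [(row["weight"], row["sigma"]) for row in tables["components"]]
    return Mixture(extra["name"], extra["normalization"], comps)
```

The awkward part is the `extra` table: it works, but a backend that could store that state as plain metadata next to the real tables would describe the data model more honestly.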

The top hit for “C++ serialization library” on Google is Cereal, which looks like it was modeled heavily on Boost.Serialization but is header-only and relies on C++11 instead of Boost itself (much like the relationship between Boost.Python and pybind11). I think it’s worth a close look - I think Boost.Serialization is a good library that we used just a bit too naively last time around, but going back to it now just prolongs our inheritance of the Boost-wide dependency and inheritance problems. I certainly gained a tremendous amount of respect for it after the misadventure of writing my own serialization library.

As a side note, I find it extremely encouraging that reinventing good Boost libraries as standalone C++11 projects on GitHub seems to be becoming A Thing. IMO, that’s exactly what the legacy of Boost should be, and if it’s happened for other Boost libraries it could make it much easier for us to get away from Boost.

benepo · April 28, 2016, 4:57pm

I hadn’t seen Amazon Ion before. Thanks for the tip!

One of the key features of flatbuffers is the ability to access a portion of the data without parsing/unpacking the entire dataset. For EPO, this could be helpful: cloud users typically access only a small portion of the data at any given time, and efficient/inexpensive storage is a higher priority than it is for a use case where professional researchers work on the full data set in the NCSA data center. I’m not sure if this feature is the same as the Amazon Ion “binary format [which] supports rapid skip-scanning of data to materialize only key values within Ion streams”.
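As a rough illustration of what "access a portion without unpacking everything" means, here is a minimal Python sketch using only the standard `struct` module. This is not the FlatBuffers wire format or API; the layout (a count, an offset table, then length-prefixed payloads) and the names `pack_records` and `read_record` are assumptions made up for the example.

```python
import struct


def pack_records(values):
    """Pack strings as: uint32 count, uint32 offsets, length-prefixed payloads."""
    payloads = [v.encode("utf-8") for v in values]
    header_size = 4 + 4 * len(payloads)
    offsets = []
    pos = header_size
    for p in payloads:
        offsets.append(pos)   # absolute byte offset of this record
        pos += 4 + len(p)     # 4-byte length prefix plus the payload
    out = struct.pack("<I", len(payloads))
    out += struct.pack("<%dI" % len(offsets), *offsets)
    for p in payloads:
        out += struct.pack("<I", len(p)) + p
    return out


def read_record(buf, index):
    """Read one record by index, touching only the header and that record."""
    (count,) = struct.unpack_from("<I", buf, 0)
    if not 0 <= index < count:
        raise IndexError(index)
    # Look up the record's offset, then jump straight to it.
    (offset,) = struct.unpack_from("<I", buf, 4 + 4 * index)
    (length,) = struct.unpack_from("<I", buf, offset)
    start = offset + 4
    return bytes(buf[start:start + length]).decode("utf-8")
```

Because `read_record` accepts any buffer object, it could in principle be handed a memory-mapped file, so only the pages holding the header and the requested record would need to be read from storage.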

…and speaking of fun wordplay on “cereal”: https://capnproto.org/

benepo · April 28, 2016, 6:17pm

Note: for Flatbuffers and Amazon Ion, there may be a performance hit which, depending on your use case, may outweigh the load time and storage savings.

And for Ion:

JSON and Ion … aren’t quite as efficient because the general representation has to be a hash map where every value is dynamically typed

For example, the Performing Sparse Reads cookbook entry shows a potentially expensive loop for extracting data:

        // SYSTEM and getStream() are not defined in this excerpt.
        IonReader reader = SYSTEM.newReader(getStream());
        int sum = 0;
        IonType type;
        // Walk every top-level value in the stream.
        while ((type = reader.next()) != null) {
            if (type == IonType.STRUCT) {
                String[] annotations = reader.getTypeAnnotations();
                // Only step into structs whose first annotation is "foo".
                if (annotations.length > 0 && annotations[0].equals("foo")) {
                    reader.stepIn();
                    // Scan the struct's fields until "quantity" is found.
                    while ((type = reader.next()) != null) {
                        if (reader.getFieldName().equals("quantity")) {
                            sum += reader.intValue();
                            break;
                        }
                    }
                    reader.stepOut();
                }
            }
        }
        return sum;

https://probablydance.com/2015/12/19/quickly-loading-things-from-disk/ highlights some advantages of Flatbuffers (or the similar Cap’n Proto solution) over Boost and Protobuf

benepo · May 4, 2016, 8:27pm

One more to add to the list: Parquet

benepo · May 20, 2016, 9:27pm

HDF5 paired with h5serv might be a nice option.