Butler design proposal for multiple repository support

I’ve written up a design proposal for the multiple repository feature in Butler. @ktl has seen it and now I’d appreciate feedback from anyone else who is interested.

The design proposal is here.

The related Jira stories are DM-4625 and DM-4682.

Feel free to reply to this topic or email me directly (npease@slac.stanford.edu) with feedback or questions.

Nate, thanks very much for the initial note. I am just beginning to get my nose into this, so these are naive comments.

  1. Do you envision the concept to be all-encompassing – i.e., are all the data in the Survey Archives in a single “repository”? If so, we need to talk pretty seriously with Jason and me. We think of Archives as being an object store, with its own replication system under the sheets. There are also many other concerns I’m sure we will think of – think disaster recovery and all that.

  2. More or less apart from concern 1), there are cases where we need to ingest data into the Butler framework. An example is L1 processing, where we will have streams of data from, say, a socket, that need to be put into the Butler framework so that they can be accessed by the science codes.

  3. Is there a definition of “dataset”? I don’t know whether LSST has given this a formal meaning, or if this is just the normal usage – a collection of data. Here I have a lack of understanding of terminology: in DES we have a “unit of data management” that is used by software and has a special meaning, so I am wondering if dataset/repository have some sort of formal meaning.

  4. I see use of UID and GID. Are these consistent with (or, better put, sufficiently general for) the authentication and authorization work being done at NCSA?

Thanks!

Hey Don,

Thanks for the quick & thoughtful feedback.

In the context of Butler, a Repository class is an interface to data storage that can be local or remote. The Repository class will have a Storage interface class that implements the connection to the data store.

I don’t think having data stored in your object store would be a problem (the design intends to allow for this), and the replication system should be fine as well, provided that the Repository+Storage has a way to connect to the object store (via configuration info passed to Butler from client code), and so long as the Storage can implement its protocol. It needs to (see the sketch after this list):

  • be able to serialize an object and give it to the data storage to be written
  • be able to get the data for an object that has been written, so that it can reconstitute the object
  • store metadata about a repository including details needed to find other related persisted repositories in the object store.
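
For concreteness, here is a minimal sketch of the kind of protocol I mean. This is purely illustrative; the class and method names are placeholders, not the actual design:

```python
import abc


class Storage(abc.ABC):
    """Hypothetical sketch of the Storage protocol described above."""

    @abc.abstractmethod
    def write(self, obj, location):
        """Serialize obj and hand the bytes to the data store at location."""

    @abc.abstractmethod
    def read(self, pythonType, location):
        """Fetch the persisted bytes at location and reconstitute an object
        of type pythonType from them."""

    @abc.abstractmethod
    def put_repository_info(self, info):
        """Store metadata about the repository, including details needed to
        find other related persisted repositories in the same data store."""

    @abc.abstractmethod
    def get_repository_info(self):
        """Read back the repository metadata written above."""
```

A concrete object-store Storage would then only need to implement these few methods against your store’s native API.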

Will the data be pure streams? Or will the stream be read into some kind of object before being passed to butler?

There are some definitions related to Butler on this page. The definition for dataset there is: “A dataset is the persisted form of an in-memory object. It could be a single item, a composite, or a collection. Examples: int/long, PropertySet, ExposureF, WCS, PSF, set/list/dict.” That seems vague at best, but I think in LSST its meaning is at least informal, and I’ve found it can be used as a generic descriptor for a FITS file, and probably for an HDF5 file too.

I don’t know if they relate to authentication or authorization, although if there’s an intersection there, it’s something we should think more about. They are intended to be used similarly to a git SHA and will be important when we implement “git-style branching” (DM-4520). They’re probably not going to be used for multiple-repository support, although they may be implemented in preparation for DM-4520.

That’s Unique ID and Globally Unique ID. Nothing really to do with authentication, although it’s conceivable that a user ID could make up part of a Unique ID.

yes, we imagine we need to take the stream and make files out of it in a file system.
we need to create a repository and declare the files. An API for this exists, right?

ok that type of id. got it.

Overall looks quite reasonable; just have a few comments:

I think Gregory’s multiple-storage request requires multiple inputs and multiple outputs, but this could be verified, and if it turns out to be only multiple outputs, the scope could be reduced.

FWIW, we have a different use case for multiple inputs (but not multiple outputs): joint processing of data from multiple cameras, in which we expect the low-level reductions to be done in separate repositories (with different Roots, even, at least in the way repository Roots are used currently), and then both used as input parent repositories for joint processing. It also wasn’t clear to me whether this use case messes up your Root concept or not.

Storage is a pluggable protocol (or abstract base class, TBD) that defines the API for concrete Storage classes that implement read and write access.

I’d recommend that we put off trying to define too much about the Storage interface right now, and perhaps just use an explicit placeholder that only works for POSIX filesystems for now. Figuring out the Storage interface will require figuring out how we want to handle persistence of polymorphic objects to different types of Storage (it’s a double-dispatch problem – is it the responsibility of each Storage to know about all persistable classes, or the responsibility of all persistable classes to know about all possible Storages?). And that’s essentially a rewrite of the persistence framework, which I think we want to separate from the Butler overhaul as much as we can, since it’s also a giant can of worms (though it also needs doing).
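
To illustrate the shape of the problem (not to propose a design): one classic way to keep both sides ignorant of each other is an external registry keyed on every dispatch axis, with formatters registered from the outside. Everything below is invented for illustration:

```python
# Illustrative only: a formatter registry keyed on (python type, storage kind).
# Neither the persistable classes nor the Storage classes need to know about
# each other; the pairing is registered externally (e.g. driven by policy).
_FORMATTERS = {}


def register_formatter(pythonType, storageKind, formatter):
    _FORMATTERS[(pythonType, storageKind)] = formatter


def lookup_formatter(pythonType, storageKind):
    try:
        return _FORMATTERS[(pythonType, storageKind)]
    except KeyError:
        raise TypeError("no formatter registered for %s on %s"
                        % (pythonType.__name__, storageKind))

# Adding more dispatch axes (format, access protocol) just means widening the
# key, e.g. (pythonType, "FITS", "posix").
```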

Triple, actually, if we have persistable class, format (e.g. FITS vs. HDF5), and access protocol (e.g. POSIX file vs. HTTPS).

I have in mind a scheme where the Storage class will simply be a sequencer that chains protocol-conformant serializer and writer classes supplied by the user (by naming importable classes in the policy file, and/or at runtime if needed). I talked it over with KT just now; the problem is somewhat more complicated than I’d realized (WRT triple dispatch, even), but I think the idea still holds.
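
A rough sketch of the sequencing idea, assuming (hypothetically) that the policy names dotted import paths for the serializer and writer; none of these names are real, they just show the chaining:

```python
import importlib


def _load(dotted_name):
    """Import a class named in the policy, e.g. 'mypkg.FitsSerializer'."""
    module_name, class_name = dotted_name.rsplit(".", 1)
    return getattr(importlib.import_module(module_name), class_name)


class SequencingStorage:
    """Hypothetical Storage that chains user-supplied serializer and writer."""

    def __init__(self, policy):
        # e.g. policy = {"serializer": "mypkg.FitsSerializer",
        #                "writer": "mypkg.PosixWriter"}
        self._serializer = _load(policy["serializer"])()
        self._writer = _load(policy["writer"])()

    def write(self, obj, location):
        # serialize, then hand the bytes to the writer
        self._writer.write(self._serializer.serialize(obj), location)

    def read(self, pythonType, location):
        # read the bytes back, then reconstitute the object
        return self._serializer.deserialize(pythonType,
                                            self._writer.read(location))
```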

If you would like to see docs or diagrams and I don’t post something earlyish next week feel free to ping me.

I didn’t completely follow this, but it sounds reasonable to the extent I do follow it, and I’m happy to just wait for your design proposal for further clarification.

Could you clarify the relationship between the “previous” repository and the “parent” repository, please? In particular, in the diagrams at the bottom of the page, all repositories except the CompositeRepos (which are, I think, synonymous with InputAggregateRepositorys) have _previous references; the CompositeRepo has multiple _parents. Is a single _parent equivalent to a _previous, or is there some substantive distinction?

Nice catch. There’s no difference; I’d been calling them “previous” and changed to “parent” and didn’t update the diagrams. I’ll update that at some point.

I would like to provide some input on this, as someone whose responsibility is to set up a platform for data release processing at the satellite site.

First of all, thanks for including in the new Butler design the possibility of using multiple storage technologies (and their associated protocols) and multiple storage formats. That will be very useful for us in investigating the most effective combination for each particular use case.

Let me first start with some motivating background. At CC-IN2P3 we observed that reading CFHT data (in the form of FITS files) from a networked file system (GPFS in our case) was not very efficient for some workflows, or parts of a workflow.

We started digging to understand the reasons for this and came to the conclusion that, when the application consuming the data is mostly interested in reading the HDUs of the FITS files, there was excessive network traffic between the GPFS file servers (where the FITS files are stored) and the compute nodes (where the application consuming the data runs). We made some modifications to the configuration of the GPFS file system and the situation improved, without being totally satisfactory: the typical block size as reported by the file system is the unit of I/O used by cfitsio, the library actually doing the I/O for the application.

We then decided to look at the contents of 9 million FITS files containing CFHT data to understand better what was happening. Those files all have a single HDU and a single data unit, on purpose (the scientists want the data organized that way). This is what we found:

  • Each file’s data unit size is about 18 MB (which is basically the amount of data necessary to store the data collected by one CCD of Megacam).
  • The 98th percentile of the size of the HDUs is 36 KB
  • The 98th percentile of the number of key-value pairs in a HDU is 438

Looking at the size of a typical HDU (36 KB), one can understand why the excessive network traffic was generated: each read operation issued by cfitsio asked GPFS to transport one block (1 MB initially, 256 KB after reconfiguration) from the file server to the client. However, most of that data was discarded, as the application was mostly interested (at least in that phase) in scanning the contents of the key-value pairs in the HDU.
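
To put rough numbers on that (a back-of-envelope estimate from the figures above): serving a ~36 KB header read with 256 KB blocks transfers about 7 times the useful data, so roughly 85% of each block is wasted; with the original 1 MB blocks, the waste is closer to 96%.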

All this made us wonder whether, given the small amount of data in a typical HDU, it would be better to store the key-value pairs in a database and have the application query the database to retrieve any value, instead of reading the FITS file from the file system. The FITS file in the file system would not be modified; instead, a copy of the relevant data would be stored in the database. The next step would be to try this and quantify the benefit (if any) of the two approaches.
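
To make the experiment concrete, here is a sketch of what we have in mind; the table layout and names are invented for illustration only:

```python
import sqlite3

# Hypothetical schema: one row per header key-value pair, keyed by file.
conn = sqlite3.connect("headers.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS fits_header (
        file_id TEXT,
        key     TEXT,
        value   TEXT,
        PRIMARY KEY (file_id, key)
    )
""")


def header_value(file_id, key):
    """Fetch a single header keyword without touching the FITS file itself."""
    row = conn.execute(
        "SELECT value FROM fits_header WHERE file_id = ? AND key = ?",
        (file_id, key)).fetchone()
    return row[0] if row else None
```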

My feeling is that if the application reads a local FITS file, there is no significant benefit in having the HDU contents in a database. But if the FITS files are stored in a networked file system, that separation could be beneficial. Reading the data via HTTP (in the case of an object store) should give the application more flexibility, since you can download exactly the bytes you want in a single request.

So, what does all this have to do with the Butler design? Looking at the documents, it is not clear to me whether it would be possible to implement a DatabaseStorage class, so as to use a database to read the data in the HDUs while using the FilesystemStorage class to read the contents of the data unit (e.g. pixels). I could not tell from the diagrams whether, at the level of abstraction of a concrete Storage class, there is enough information to infer what the application intends to do (e.g. retrieving a key-value pair from the HDU vs. retrieving a data unit) and optimize accordingly.

To conclude, I’m not knowledgeable enough to say whether LSST will store its data in a way comparable to CFHT, nor whether the number of key-value pairs LSST would embed in a FITS file would be comparable. I also don’t know whether all this is specific to the FITS format and would disappear if we used HDF5.

My point in making this (very long) contribution is that these are the kinds of issues we need to solve when deploying and operating the “plumbing” for a survey like LSST. Having a flexible software layer which allows us to evaluate alternative solutions is extremely valuable. I understand that that flexibility has a development and maintenance cost, but I’m sure the effort will pay off over the 10+ years that the software will be in production.

Sorry for the long post; I’m happy to provide more details about this if that would be useful.

Hey Fabio,

Thank you for the thoughtful question and post.

I think the issue you are describing can be treated as wanting to retrieve two different dataset types, and I think the current design would accommodate it. It could even be done in a single repository:

One datasetType could be called “image” and the other “imageHeader”. The intention is that the Policy can indicate, for each dataset type:

  • the storage type (filesystem, database, etc)
  • the persistence format (FITS, HDF5, etc)
  • the python type (e.g. afw.image.ExposureF, etc.).

This should be enough for a butler.get(datasetType=..., dataId=...) to know which storage to look in, and your code would pass in the correct value for datasetType for the kind of data (header or image) you need to retrieve.
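
As a purely illustrative sketch (the keys and values below are invented, not the actual Policy schema), the per-datasetType information might look like:

```python
# Illustrative only: the kind of per-datasetType persistence info a Policy
# could carry.  None of these keys are the real Policy schema.
policy = {
    "datasets": {
        "image": {
            "storage": "filesystem",            # storage technology
            "format": "FITS",                   # persistence format
            "python": "lsst.afw.image.ExposureF",
        },
        "imageHeader": {
            "storage": "database",              # served by a DatabaseStorage
            "format": "key-value",
            "python": "lsst.daf.base.PropertySet",
        },
    },
}

# Client code then just names the dataset type it wants:
#   header = butler.get(datasetType="imageHeader", dataId={"visit": 1234, "ccd": 7})
#   image  = butler.get(datasetType="image",       dataId={"visit": 1234, "ccd": 7})
```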

It is still pretty undefined how to implement a storage that can take an object of type x and serialize it to y to write to z. There are some dependencies; e.g., if you’re writing an object to a database you might want to decompose the object into more than one serialization to be written to different columns. We have some ideas for this, but as of yet nothing has been explored deeply.

Please let me know if you see any red flags there.

update notice: I made some changes to the design document. Mostly it’s clarification and refinement of definitions. There was one substantial change, to the Butler framework relationship diagram: previously the Butler owned the Mapper, which owned the Repository. The design has been changed; the Butler now owns the Repository, which owns the Mapper.
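
In sketch form (names and constructors invented for illustration), the revised relationship is simply:

```python
# Sketch of the revised ownership: Butler owns Repositories, and each
# Repository owns its own Mapper (and, per the design, its Storage).
class Butler:
    def __init__(self, repositories):
        self._repositories = repositories  # Butler -> Repository


class Repository:
    def __init__(self, mapper, storage):
        self._mapper = mapper              # Repository -> Mapper
        self._storage = storage
```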

thx,
n