The New Butler design, as laid out in a design outline and architecture notes, is a daunting thing to attack, particularly for someone new to the Stack and even astronomy pipelines. It’s finally getting through to me that it’s likely more effective for @n8pease to begin by adding much-desired features to the Old Butler, even if clunky because of the Old Butler’s underpinnings, before starting the transformation to the new one. This topic is an attempt to get input on what people think should be those first “feet-wetting” features.
Ideas for things that should be relatively easy to implement:
Dataset type aliases.
Custom Mapper subclass for Firefly cache access (could be related to a single-file-repository Mapper that might be useful for processFile). DM-4167
Provenance recording.
Repository versioning and selection (for calibrations, cameraGeom, and other bitemporal datasets; probably also useful for reference catalogs like astrometry_net_data). DM-4168
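Dataset type aliases, in particular, could be little more than a name-to-name mapping resolved before the normal dataset type lookup. A minimal sketch of that idea (all class and method names here are hypothetical, not the actual Butler API):

```python
# Hypothetical sketch of dataset-type alias resolution; the names are
# illustrative only, not the real Butler interface.

class AliasRegistry:
    """Map user-facing alias names to concrete dataset type names."""

    def __init__(self):
        self._aliases = {}

    def define(self, alias, dataset_type):
        if alias in self._aliases:
            raise ValueError("alias %r already defined" % alias)
        self._aliases[alias] = dataset_type

    def resolve(self, name):
        # Follow at most one level of aliasing; chained aliases are
        # deliberately not supported in this sketch.
        return self._aliases.get(name, name)


aliases = AliasRegistry()
aliases.define("calexp_like", "calexp")
resolved = aliases.resolve("calexp_like")
passthrough = aliases.resolve("raw")  # non-aliases pass through unchanged
```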
I think these features from Jim and Robert’s requests from nearly two years ago, while highly desirable and even motivating for the New Butler in the first place, are off the table as too complex for a starter project:
Config-in-repository and Task-defined output dataset types (and PAF replacement)
DM-4170 Butler: move configuration (.paf) file into repository
DM-4171 Butler: change configuration from .paf to something else
DM-4173 Butler: add support for write-once-compare-same outputs
DM-4180 Butler: provide API so that a task can define the output dataset type
The ability to treat a filesystem as a registry seems critical to me. It’s required for
processing DECam data natively on the mountain without creating/updating registries
handling camera data without requiring a script to update registries (or a Butler front end to the SLAC exposure DB)
supporting processFile without a hand-coded top-level script (which is what it is now).
Output is less important than input (i.e. can be delayed to the next release).
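One way to think about a filesystem-as-registry is that data IDs are simply parsed back out of file paths that follow a known template, with no SQLite registry in the loop. A rough sketch under that assumption (the path template and function names are invented for illustration):

```python
# Sketch: treat the filesystem itself as the registry by parsing data IDs
# out of paths that follow a known template. All names are hypothetical.

import os
import re
import tempfile


def scan_repo(root, template=r"raw/v(?P<visit>\d+)/c(?P<ccd>\d+)\.fits"):
    """Yield (dataId, path) pairs for files under root matching template."""
    pattern = re.compile(template)
    for dirpath, _, filenames in os.walk(root):
        for fn in filenames:
            path = os.path.join(dirpath, fn)
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            m = pattern.match(rel)
            if m:
                yield ({k: int(v) for k, v in m.groupdict().items()}, path)


# Demonstrate on a throwaway repo laid out according to the template.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "raw", "v903334"))
open(os.path.join(root, "raw", "v903334", "c12.fits"), "w").close()

found = list(scan_repo(root))
```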
My guess is that proper "--rerun" support is a Butler issue rather than a pipe_base one, and I think this is important/very important too.
These all seem more important to me than provenance and improvements to versioning.
As someone helping build the data processing infrastructure for LSST, I’m interested in exploring alternatives for storing and transporting data. My understanding is that the current Butler supports filesystem hierarchies with POSIX I/O on FITS-formatted files. I don’t know whether adding support for other kinds of storage types, and even other storage formats, would qualify as “relatively easy to implement” in the current version of the Butler, whose internals I’m not familiar with.
The use cases I’m interested in are:
Using an object store (e.g. OpenStack Swift or Amazon S3) as an LSST data repository (e.g. of FITS files)
Using an in-memory database (relational or NoSQL) for storing the kind of metadata contained in the HDU sections of FITS files. The goal is to relieve the storage infrastructure (i.e. a networked filesystem) of the task of serving this kind of data (typically a few kilobytes per file) to the processing workflows and instead keep those data in memory in a central database.
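The second use case can be made concrete with a toy in-memory relational store for header cards, where lookups never touch the networked filesystem. The schema and function names below are illustrative only, not a proposal for the actual interface:

```python
# Sketch of keeping per-file header metadata (a few kB each) in an
# in-memory relational DB instead of re-reading it from files on a
# networked filesystem. Schema and names are illustrative only.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE header (filename TEXT, keyword TEXT, value TEXT, "
    "PRIMARY KEY (filename, keyword))"
)


def load_header(filename, cards):
    """Store the keyword/value cards extracted from one file's HDUs."""
    db.executemany(
        "INSERT INTO header VALUES (?, ?, ?)",
        [(filename, k, str(v)) for k, v in cards.items()],
    )


def lookup(filename, keyword):
    row = db.execute(
        "SELECT value FROM header WHERE filename = ? AND keyword = ?",
        (filename, keyword),
    ).fetchone()
    return None if row is None else row[0]


load_header("v903334_c12.fits", {"EXPTIME": 30.0, "FILTER": "r"})
filt = lookup("v903334_c12.fits", "FILTER")
```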
I don’t expect the Butler to support all possible combinations of storage and transport alternatives, but rather to make it possible to plug in specific implementations for several backends so that we can explore things until we find suitable solutions.
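The plug-in seam being asked for might look something like a minimal byte-level storage interface that both a POSIX backend and an object-store backend implement; here an in-memory dict stands in for an S3/Swift client. Everything below is a hypothetical sketch, not existing Butler code:

```python
# Hypothetical sketch of a pluggable storage seam: Butler-level code would
# speak to this interface, and backends (POSIX, S3, Swift, ...) implement
# it. InMemoryStore stands in for what a real object-store client wraps.

import abc
import os


class Storage(abc.ABC):
    @abc.abstractmethod
    def put(self, key, data): ...

    @abc.abstractmethod
    def get(self, key): ...


class PosixStorage(Storage):
    """Backend that maps keys onto paths under a filesystem root."""

    def __init__(self, root):
        self.root = root

    def put(self, key, data):
        path = os.path.join(self.root, key)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)

    def get(self, key):
        with open(os.path.join(self.root, key), "rb") as f:
            return f.read()


class InMemoryStore(Storage):
    """Stand-in for an object store; a real one would call S3/Swift here."""

    def __init__(self):
        self._blobs = {}

    def put(self, key, data):
        self._blobs[key] = bytes(data)

    def get(self, key):
        return self._blobs[key]


store = InMemoryStore()
store.put("raw/v1/c2.fits", b"SIMPLE  =                    T")
blob = store.get("raw/v1/c2.fits")
```

The point of the sketch is that client code only ever sees `Storage`, so swapping backends is a configuration choice rather than a rewrite.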
We’ve definitely talked about supporting HDF5 and I personally feel we should include HDF5 so that we can investigate whether we get better performance than CFITSIO.
It’s worth noting, though, that storage formats aren’t really the province of the Butler as it’s designed at present: it delegates all of that work to methods on the objects themselves, which do the actual I/O. That would obviously have to change if we wanted to try more exotic backends, but I think there’d be something of a dual-dispatch problem here in terms of where to put the smarts of how to serialize a particular kind of object to a particular kind of storage.
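One conventional answer to that dual-dispatch problem is a formatter registry keyed on the (object type, storage kind) pair, so neither the object class nor the storage backend needs to know about the other. A small sketch under that assumption (all names invented):

```python
# Sketch of one place to put the serialization "smarts": a registry keyed
# on (python type, storage kind), resolving the dual dispatch without
# coupling object classes to storage backends. Names are hypothetical.

import json

_FORMATTERS = {}


def register_formatter(obj_type, storage_kind):
    """Class decorator that records a formatter for one (type, storage) pair."""
    def decorator(cls):
        _FORMATTERS[(obj_type, storage_kind)] = cls()
        return cls
    return decorator


def find_formatter(obj, storage_kind):
    # Walk the MRO so a formatter registered for a base class also
    # serves its subclasses.
    for klass in type(obj).__mro__:
        formatter = _FORMATTERS.get((klass, storage_kind))
        if formatter is not None:
            return formatter
    raise LookupError("no formatter for %s on %s" % (type(obj), storage_kind))


@register_formatter(dict, "json-store")
class DictJsonFormatter:
    def serialize(self, obj):
        return json.dumps(obj).encode()


fmt = find_formatter({"visit": 1}, "json-store")
payload = fmt.serialize({"visit": 1})
```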
The current mapper class hierarchy depends on POSIX VFS semantics at multiple levels. Additionally, Butler client code does not have I/O fully abstracted and accesses the filesystem directly. This essentially mandates a shared filesystem between compute nodes. I am highly concerned that if an object store backend (S3 or HDFS) is not implemented early on, we will become locked into the shared filesystem model.
One intermediate step that we are considering that lies between an object-store-native implementation and the current Posix-filesystem-based implementation would involve staging from an object store to a local non-shared filesystem. This tends to increase latency (although some may be recovered with sufficient preparatory usage introspection), but it may be adequate for many uses.
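That intermediate staging step could be as simple as fetching a blob from the object store to local scratch on first access and handing existing POSIX-based readers a real filename thereafter. A hedged sketch, with the object store mocked as a dict (a real backend would call S3/Swift here):

```python
# Sketch of stage-in from an object store to a local (non-shared) scratch
# directory, so existing POSIX-based readers keep working unchanged. The
# object store is mocked as a dict; names are hypothetical.

import os
import tempfile


class StagingCache:
    def __init__(self, object_store, scratch=None):
        self.object_store = object_store          # maps key -> bytes
        self.scratch = scratch or tempfile.mkdtemp()

    def local_path(self, key):
        """Return a local filename for key, fetching it on first use."""
        path = os.path.join(self.scratch, key.replace("/", "_"))
        if not os.path.exists(path):              # stage in once, reuse after
            with open(path, "wb") as f:
                f.write(self.object_store[key])
        return path


cache = StagingCache({"raw/v1.fits": b"fits-bytes"})
p1 = cache.local_path("raw/v1.fits")
p2 = cache.local_path("raw/v1.fits")  # second call hits the local copy
```

The latency cost mentioned above shows up in the first `local_path` call; the "preparatory usage introspection" would amount to calling it ahead of time for keys a job is predicted to need.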
Yes, it’d be useful to have stage-in/stage-out as a start.
Being the gray-haired guy in all this, I’ve held a core belief for a long time that POSIX I/O is in the way, and that getting the right abstractions in scientific I/O libraries is the key to jailbreaking from POSIX.
So, can anyone articulate a program of work that would let us test an object store?
For example, CFITSIO has a gsiftp mode (drvrgsiftp.c) that at a glance supports reading and writing (with some involvement of temporary files). I think that’s an existence proof of a kind.
Or does the HDF5/alternative-to-FITS path have traction at the moment? I’m pretty sure that HDF5 has a driver architecture…