The New Butler design, as laid out in a design outline and architecture notes, is a daunting thing to attack, particularly for someone new to the Stack and even astronomy pipelines. It’s finally getting through to me that it’s likely more effective for @n8pease to begin by adding much-desired features to the Old Butler, even if clunky because of the Old Butler’s underpinnings, before starting the transformation to the new one. This topic is an attempt to get input on what people think should be those first “feet-wetting” features.
Ideas for things that should be relatively easy to implement:
Dataset type aliases.
Custom Mapper subclass for Firefly cache access (could be related to a single-file-repository Mapper that might be useful for processFile). DM-4167
Repository versioning and selection (for calibrations, cameraGeom, and other bitemporal datasets; probably also useful for reference catalogs like astrometry_net_data). DM-4168
As someone helping to build the data processing infrastructure for LSST, I’m interested in exploring alternatives for storing and transporting data. My understanding is that the current Butler supports filesystem hierarchies accessed via POSIX I/O on FITS-formatted files. I don’t know whether adding support for other kinds of storage, or even other storage formats, would qualify as “relatively easy to implement” in the current version of the Butler, whose internals I’m not familiar with.
The use cases I’m interested in are:
Using an object store (e.g. OpenStack Swift or Amazon S3) as a LSST data repository (e.g FITS files)
Using an in-memory database (relational or NoSQL) for storing the kind of metadata contained in the HDU sections of FITS files. The goal is to relieve the storage infrastructure (i.e. a networked filesystem) of the task of serving this kind of data (typically a few kilobytes per file) to the processing workflows and instead keep those data in memory in a central database.
I don’t expect the Butler to support all possible combinations of storage and transport alternatives, but rather to provide the possibility of plugging in specific implementations for several backends, so that we can explore things until we find suitable solutions.
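To make the plug-in idea concrete, here is a minimal sketch of what a pluggable storage-backend registry could look like. None of these names (`register_backend`, `PosixBackend`, `ObjectStoreBackend`) exist in the Butler; the object-store class is just an in-memory stub standing in for a real Swift/S3 client:

```python
import pathlib

# Hypothetical registry mapping a URI scheme to a backend class.
_BACKENDS = {}

def register_backend(scheme):
    """Associate a URI scheme (e.g. 'file', 's3') with a backend class."""
    def decorator(cls):
        _BACKENDS[scheme] = cls
        return cls
    return decorator

@register_backend("file")
class PosixBackend:
    """Plain files via POSIX I/O; stands in for the current behavior."""
    def get(self, path):
        return pathlib.Path(path).read_bytes()
    def put(self, path, data):
        pathlib.Path(path).write_bytes(data)

@register_backend("s3")
class ObjectStoreBackend:
    """Stub for an object store; real code would call boto3 or swiftclient."""
    def __init__(self):
        self._objects = {}  # in-memory stand-in for a bucket
    def get(self, key):
        return self._objects[key]
    def put(self, key, data):
        self._objects[key] = data

def backend_for(scheme):
    """Instantiate the backend registered for this scheme."""
    return _BACKENDS[scheme]()
```

A real S3 backend would implement the same two-method interface on top of boto3 or python-swiftclient; the point is only that client code selects a backend by scheme instead of assuming POSIX paths.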
It’s worth noting, though, that storage formats aren’t really the province of the Butler as currently designed: it delegates all of that work to methods on the objects themselves, which do the actual I/O. That would obviously have to change if we wanted to try more exotic backends, but I think there’d be something of a dual-dispatch problem here in terms of where to put the smarts for serializing a particular kind of object to a particular kind of storage.
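For illustration, the dual-dispatch problem can be sketched as a registry keyed on the (object type, storage format) pair, so that neither the object class nor the storage backend alone owns the serialization logic. All names here are hypothetical, not Butler API:

```python
# Registry of serializers keyed on (object type, storage format).
_FORMATTERS = {}

def register_formatter(obj_type, storage):
    """Register a function that serializes obj_type to the given storage."""
    def decorator(fn):
        _FORMATTERS[(obj_type, storage)] = fn
        return fn
    return decorator

class Exposure:
    """Toy stand-in for an image object that does its own I/O today."""
    def __init__(self, pixels):
        self.pixels = pixels

@register_formatter(Exposure, "fits")
def exposure_to_fits(exp):
    return b"FITS:" + bytes(exp.pixels)

@register_formatter(Exposure, "hdf5")
def exposure_to_hdf5(exp):
    return b"HDF5:" + bytes(exp.pixels)

def serialize(obj, storage):
    """Dispatch on both the object's type and the target storage format."""
    try:
        fn = _FORMATTERS[(type(obj), storage)]
    except KeyError:
        raise TypeError(f"no formatter for {type(obj).__name__} -> {storage}")
    return fn(obj)
```

The design choice is that adding a new backend means registering new (type, format) pairs rather than touching every object class.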
The current mapper class hierarchy depends on POSIX VFS semantics at multiple levels. Additionally, Butler client code does not have I/O fully abstracted and accesses the filesystem directly. This essentially mandates a shared filesystem between compute nodes. I am highly concerned that if an object store backend (S3 or HDFS) is not implemented early on, we will become locked into the shared-filesystem model.
One intermediate step we are considering, lying between an object-store-native implementation and the current POSIX-filesystem-based implementation, would involve staging from an object store to a local, non-shared filesystem. This tends to increase latency (although some of it may be recovered with sufficient preparatory introspection of usage patterns), but it may be adequate for many uses.
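A stage-in/stage-out layer of this kind can be sketched in a few lines with the standard library. Here the “object store” is just another directory, and the helper name is hypothetical:

```python
import contextlib
import pathlib
import shutil
import tempfile

@contextlib.contextmanager
def staged(store_dir, name):
    """Stage a dataset onto node-local scratch, then push results back.

    The task body sees a plain local file and can use ordinary POSIX I/O;
    only this layer knows about the (simulated) object store.
    """
    scratch = pathlib.Path(tempfile.mkdtemp(prefix="butler-stage-"))
    local = scratch / name
    shutil.copy2(pathlib.Path(store_dir) / name, local)       # stage in
    try:
        yield local
        shutil.copy2(local, pathlib.Path(store_dir) / name)   # stage out
    finally:
        shutil.rmtree(scratch)                                # clean scratch
```

Against a real object store, the two `copy2` calls would become GET/PUT requests, but the contract to the task, a local path valid for the duration of the `with` block, stays the same.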
Yes, it’d be useful to have the stage-in/stage-out as a start.
Being the gray-haired guy in all this, I’ve held core beliefs for a long time that POSIX I/O is in the way, and that getting the right abstractions into scientific I/O libraries is the key to jailbreaking from POSIX.
So, can anyone articulate a program of work that would let us test an object store?
For example, CFITSIO has a gsiftp mode (drvrgsiftp.c) that at a glance supports reading and writing (with the involvement of temporary files). I think that’s an existence proof of a kind.
Or does the HDF5/alternative-to-FITS path have traction at the moment? I’m pretty sure that HDF5 has a driver architecture…