Requirements & design for composite datasets in butler

natepease · August 8, 2016, 11:25pm

As part of DM-6226 F16 Butler Composite Dataset Design I’m working on defining the requirements for “composite datasets”: deserializing data into a Python object where the persisted data can come from more than one source (FITS file). I also have an initial design idea/proposal based on my current understanding of the requirements. Both the requirements and the proposal are on Confluence.

If you’re interested, please take a look. (Some people are explicitly on the hook to review this; I’ve been communicating with them separately.) Let me know if you have any issues with the requirements or the initial design proposal.

@KSK and @swinbank you may be interested; it was proposed that we may end up asking your team to write some custom object serializers & deserializers or do other afw object support for this feature.

swinbank · August 9, 2016, 1:59pm

Hey @natepease — thanks for the heads-up. This looks nice!

A couple of comments. First, I’m worried that in some sense this isn’t a true expression of requirements. That is, while it makes some statements about what the software is required to do, it does not motivate those statements by flowing them down from how we’ll use these capabilities within the Science Pipelines, or the SUI, or elsewhere. I worry that this means things can fall between the gaps (and, indeed, perhaps those requirements would answer my questions below).

I’m not sure if I’m reading the document correctly, but it seems to suggest that composite datasets can be persisted either as single objects or as a set of components (and, indeed, there are plugins that will read a single object and write the set of components, and vice versa). Is this necessary? Wouldn’t it be simpler to always persist everything as its simplest components, and only reconstitute a complex object upon request?

The above relates somewhat to the concept of “pure composites”, i.e. composites that do not have member data. Why not make everything either a simple (non-composite) dataset or a pure composite which aggregates non-composites?
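To make the "always persist the simplest components, reconstitute on request" idea concrete, here is a minimal sketch. Everything in it is hypothetical: `ToyExposure`, `storage`, `put` and `get` are illustrative stand-ins and not the Butler API, and a plain dict stands in for files on disk.

```python
from dataclasses import dataclass, fields

@dataclass
class ToyExposure:
    """Hypothetical pure composite: it holds nothing except its components."""
    image: list
    mask: list

# Hypothetical stand-in for a repository: one entry per persisted component.
storage = {}

def put(name, obj):
    """Disassemble the composite and persist each component separately."""
    for f in fields(obj):
        storage[f"{name}.{f.name}"] = getattr(obj, f.name)

def get(name, cls):
    """Reconstitute the composite from its separately persisted components."""
    return cls(**{f.name: storage[f"{name}.{f.name}"] for f in fields(cls)})

original = ToyExposure(image=[[1, 2], [3, 4]], mask=[[0, 0], [0, 1]])
put("calexp", original)
restored = get("calexp", ToyExposure)
```

In this sketch the composite type never appears in storage; only its components do, and the composite exists only as the recipe for putting them back together.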

Writing (de-)serializers is certainly something that sounds as though it might be in-scope for the Science Pipelines groups, but we should get a better feeling for exactly what that means and the timescale on which it’d be required before we definitively commit to anything.

natepease · August 9, 2016, 7:01pm

It’s a good point that to understand requirements you need to understand the use cases. @jbosch, how can we capture more description of how this feature will be used? We have a couple examples, but it sounds like the picture is not complete enough.

It’s an interesting idea. I’m not sure it’s feasible to say that e.g. datasets that are represented by an ExposureF and only want to replace one of their MaskedImage’s images must become a pure composite. On the other hand it may simplify other design & implementation. I’ll put a note under the Pure Composite section.

Definitely. Also, while we define timescale and scope we should consider who will do what work and what their loading is.

natepease · August 10, 2016, 8:30pm

@swinbank @jbosch let’s take some time at the AHM to talk about what needs to be defined regarding use cases, and at least make an initial pass at writing down those use cases.

ctslater · August 11, 2016, 4:52am

I initially had the same reaction as @swinbank regarding the unclear use cases, but then I re-read the discussion on “How to subsection a butler data repository” and it gave me a much better perspective on how this could be very useful. Maybe some of the discussion in that thread could help motivate a clear statement of the problem this is meant to solve, and then lead into why this design is the best solution for that problem.

That said, I’m not sure the discussion in that thread completely converged, and we don’t have a concrete proposal for splitting up an exposure into constituent pieces. Does that need to be developed further on the Science Pipelines side, synchronously with this design?

jbosch · August 11, 2016, 4:04pm

I’m personally not that worried about providing an exhaustive set of use cases; I think we can identify a couple of things we need now, and any kind of generic feature that supports those would probably do most of what we need in the future (and trying to do more than that is probably premature).

In particular, I’m thinking:

- We frequently want to write new Exposure components (e.g. a new Mask or a new Wcs) without rewriting the whole thing. We’ll then later want to access the updated Exposure as if it were a single dataset.
- We want to be able to load individual Exposure components without loading the whole thing.
- We may want to consider some Exposure components to be effectively shared by an Exposure and its associated SourceCatalog (e.g. we could imagine adding a Calib to the in-memory SourceCatalog class, but that this Calib would be the same one that’s in the related Exposure).
- We need to maintain the ability to write a complete Exposure to a single FITS file, though this may become an “import/export” mechanism rather than how we do things internally in butler-managed repositories.
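
The first three use cases above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyRepo`, its methods, and the dataset keys are all hypothetical names invented for illustration, not the Butler API, and strings stand in for the real Image, Mask, Wcs and Calib objects.

```python
class ToyRepo:
    """Hypothetical toy repository; not the Butler API."""

    def __init__(self):
        self._data = {}        # key -> persisted simple dataset
        self._composites = {}  # composite name -> {role: key}
        self.writes = 0        # counts how many simple datasets were written

    def put(self, key, value):
        """Persist a single simple dataset."""
        self._data[key] = value
        self.writes += 1

    def get(self, key):
        """Load a single simple dataset without touching any others."""
        return self._data[key]

    def define_composite(self, name, components):
        """Record which persisted datasets make up a composite."""
        self._composites[name] = dict(components)

    def get_composite(self, name):
        """Assemble a composite from its current components on request."""
        return {role: self._data[key]
                for role, key in self._composites[name].items()}

repo = ToyRepo()
for key in ("image", "mask", "wcs", "calib", "sources"):
    repo.put(key, f"{key}-v1")

# The Exposure and the SourceCatalog both reference the same Calib.
repo.define_composite("exposure", {"image": "image", "mask": "mask",
                                   "wcs": "wcs", "calib": "calib"})
repo.define_composite("catalog", {"sources": "sources", "calib": "calib"})

repo.put("wcs", "wcs-v2")                 # write a new Wcs only
exposure = repo.get_composite("exposure")  # read back as a single dataset
catalog = repo.get_composite("catalog")
mask_only = repo.get("mask")               # load one component by itself
```

Here updating the Wcs costs one write rather than a rewrite of the whole Exposure, and the shared Calib is stored once. The fourth use case (a complete Exposure in a single FITS file) is deliberately left out of the sketch.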

While there are a lot of choices in there in terms of how we would actually define composite datasets, I don’t think those decisions need to be made for @natepease to proceed.

There’s another more complex use case that I’m not sure it’s worth trying to address: a coadd Exposure's CoaddPsf is composed largely of all the Psfs and Wcss of the CCD-level Exposures that went into the coadd. We currently just duplicate all of those, and while we could use a more sophisticated composite datasets feature to normalize this and save some space, it’s not clear that’s a win if it means that reading a coadd Exposure would involve reading many more small files.

swinbank · August 11, 2016, 11:25pm

Capturing the right level of requirements documentation here is challenging: it’s neither useful nor practical to attempt to enumerate every possible use case in advance, and we want to preserve lots of flexibility in both design and implementation.

However, it is helpful for folks who haven’t taken part in previous detailed discussions to see the big picture of how this fits into what we’re building. Certainly, I sometimes find it hard to keep track of the big picture when thinking about the Butler’s internals. Already, the comments from Jim and Colin help in this case: thanks!

It would be great to find a few minutes next week to chat about how to make sure these dots remain joined. I’ll probably be tied up with the NSF review most of the week, but hopefully we’ll find time on Friday if we don’t get to it earlier.