How the Exposure class and afw::table::io wrecked the codebase

jbosch · November 1, 2018, 2:23pm

This is another blog-like post like my last one, but this one is more relevant (despite being a bit historical), and it involves a problem I still think we have a chance to solve.

Once upon a time, the afw.image.Exposure class was just a MaskedImage with a PropertySet header and a Wcs. That made it pretty simple - everything about it had a nice, FITS-standard representation (we only had simple, FITS-standard WCSs in those days) - but a bit inconvenient. If you wanted to use some other object that represented some other aspect of our knowledge about the image, like a Calib, you had to pass it around separately. So it was a bit inevitable that we ended up attaching Calib to Exposure, but that still left things pretty simple - it’s fundamentally just a zero-point, and we could put that in the headers, too.

When we added Psf and Detector classes to the codebase, the practice of attaching everything we know about an Exposure to that Exposure was pretty well established (I think; this was a long time ago), but those presented a new problem: there wasn’t a FITS-standard way to save them in the same file. We didn’t save the Detectors at all - instead we had the Butler create a full Camera object and attach a Detector to every Exposure on read. For a long time we just wrote the PSFs into separate files using Boost.Serialization, which wrote non-standard, vaguely-human-readable text files than none of us really liked (it could also do XML, which I suppose we liked even less?). That meant it was up to whoever was working with the Exposure to load the Psf separately and attach it.

That all changed in 2013, when we decided to write CoaddPsf, a Psf class that was implemented by lazily evaluating all of the Wcss and Psfs of all of the input images every time a PSF image was requested at a point on the coadd. That made it a much more complex and much larger data structure than any previous Psf. That made persistence very tricky, and I didn’t want to try to solve that tricky problem with Boost.Serialization.

That was, in retrospect, probably a mistake: Boost.Serialization is actually excellent at saving complex data structures, and while there were some real problems with the code we used to interoperate with it and its output format was non-ideal, I didn’t really consider the possibility of trying to understand and fix those problems. Writing a new persistence framework from scratch sounded much more promising, and having just written afw::table the year before, it made for very shiny hammer that made this problem seem very nail-like.

But even that wasn’t quite enough for me to decide that inventing a new persistence framework was a good idea. What really sold me on the idea was the thought of addressing three long-standing problems at the same time; in addition to implementing persistence for CoaddPsf, we could:

Start saving coadd provenance (what became the CoaddInputs class) - if you think about it, that’s essentially exactly the same information that CoaddPsf wants.
Start saving Psfs and other complex objects to the same FITS file as the Exposure. This is something @RHL did with his PSF models in SDSS, and we were all looking forward to not having to pass those separate PSF files around with the corresponding Exposure FITS files.

And so afw::table::io::Persistable was born, a home-brew persistence framework that writes almost-arbitrary data structures (including polymorphic objects and shared pointer relationships) to a sequence of FITS binary table HDUs. And while the home-brew-newss of it was a mistake, it does have some very nice features. Writing Exposure components to same FITS file is a big win for convenience, and the binary tables themselves are at least a little bit self-describing and language-agnostic; I have already written pure-Python, non-DM code to read some things from them, and I’m sure sufficiently-dedicated other people could do so as well (especially with some slightly better high-level docs from me). What I should have done was to adopt a mature serialization library (like Boost.Serialization), and try to write a FITS binary table archive back-end for it. I didn’t realize that until I was pretty far along, however, and by then it felt too late to back out.

Anyhow, along with with that new persistence framework came some big improvements to Exposure: we split off managing the non-image components into a separate object, ExposureInfo (which also had the new code for reading and writing them with afw::table::io). We also added the ExposureRecord and ExposureCatalog classes to make it possible to pass around Exposure components and metadata in table form.

All of that work seemed to have left us in a much better state, and on the surface that was true. But at this point, I think we’d made a enormous mistake in structuring the code (in addition to writing our own persistence framework), and we’ve been paying for it ever since. Consider:

The new persistence framework (afw.table.io) was part of afw. It requires persistable objects to inherit from a base class (in C++) and use data structures defined in afw::table. That means persistable objects need to be defined in or above afw in the package dependency tree, and they need to be written in C++. That’s already pretty bad, but it gets worse.
The Exposure class is also part of afw, and it holds its components as regular, strongly-typed data members. That means anything held by Exposure needs to be defined in or below afw in the package dependency tree (or at least have a polymorphic base class that is - but most of these components are concrete, nonpolymorphic classes). You can probably guess where I’m going here.
For a class to be convenient (and, at some level, be non-surprising) as an Exposure component, it also needs to be persistable with Exposure, and that means it needs to use afw::table::io. And hence any class that represents the properties of an image must be be in afw, and written in C++. It’s no wonder that package has gotten so big.

But wait, that’s not all:

afw.table.SourceRecord plays a similar (albeit milder) role for individual astronomical objects; any data structure beyond tabular data that we want to associate with those objects needs to be known to it and persisted with it (right now that’s “just” Footprint/HeavyFootprint/SpanSet/PeakRecord). So anything that plays that role needs to go in afw (and be written in C++), too.
afw.table.io (a package lower bound) and afw.table.ExposureRecord, and afw.table.SourceRecord (package upper bounds) are all not just part of afw, but part of afw.table. So any class that describes an image or an astronomical object in a persistent way actually needs to go in afw.table (or have a circular dependency with afw.table, which is what we’ve actually done).

And here’s the kicker:

Any time we add a new class that describes an image or an astronomical object in a persistent way we need to modify at least one of our core low-level primitives, Exposure and SourceRecord, and that includes updating their persistence code (and unless we have a very good reason not to, maintaining backwards compatibility with files we’ve already written).

We got to this point by doing what seemed convenient and trusting our own intuition and judgement as to what made for good interfaces, design, and code organization. The lesson I’ve learned since 2013 is that our judgement and especially our intuition in those areas are often terrible, especially given that most of us (in Science Pipelines, anyway) think like scientists, not software engineers.

Good design principles are often counter-intuitive, at least at first, and that’s especially true of the classic OO design principles we’ve very explicitly violated four (of five!) of here (which and how are left as an exercise to the reader). This is made more difficult by the fact that those design principles often encourage splitting software up into smaller units, and that exacerbates very real problems with packaging tooling and ecosystems that often (and legitimately!) feel much more important in the real world than adherence to abstract design principles (I think the discussion on RFC-539 is a perfect example of this).

Anyhow, I think there are a lot of ways out of this mess, and to first order I think they’re pretty straightforwardly derived from actually applying some dependency-inversion-principle in particular, but the devil is of course in the details…and the disruption. Given where we’re at now, any changes to these core classes is going to be quite disruptive. But if we ever want to get our package organization in order, or write more stuff in Python, or avoid churn at the bottom of the dependency tree, this is the hurdle we’ve got to overcome.

ctslater · November 2, 2018, 10:36pm

I’ve learned a lot from these “blog posts”, thanks for writing this up.

parejkoj · November 7, 2018, 10:37pm

I too appreciated the candor of these posts.

But I found this one to be less clear regarding the possible ways forward: some more specific examples of how to apply those design principles to the design of a new Exposure-like object and/or some discussion of how that would improve either the existing codebase, or our ability to extend it, would be much appreciated. It’s clear that you have some more concrete ideas in mind.

jbosch · November 13, 2018, 2:57pm

“Straightforwardly” may have been premature. Here are a bunch of ideas that I think would help, in no particular order; some are orthogonal, some are complementary, some really fix the problem, and some are band-aids that just give us a bit of breathing room.

Move afw.table core classes (Schema, BaseRecord, Catalog, etc) and afw.table.io to separate packages below afw in the dependency tree. (Moves package lower bound lower.)
Make it possible to implement afw.table.io persistence for a class without inheritance. (Moves package lower bound lower.)
Replace afw.table.io with a new persistence framework that doesn’t depend on afw.table, and instead depends on a third-party cross-language serialization library. (Moves package lower bound lower, removes one barrier to pure-Python components.)
Move “bucket” classes (Exposure, SourceRecord, ExposureRecord, etc) to separate packages above afw in the dependency tree. (Moves package upper bound higher.)
Make the bucket classes into dynamic heterogenous dictionaries instead of static strongly-typed structs, but leave them in C++. That’s a bit tricky because C++ is bad at heterogeneous containers, and components need to at least be persistable and perhaps have a bit more common behavior. So this would involve a new very minimal C++ ABC (ideally inheritable from Python) for those components, and some cleverness and/or dynamic_cast to make an ergonomic heterogeneous container interface. (Removes package upper bound, removes another blocker on pure-Python components.)
Remove the bucket classes entirely, and convert code using them to either (A) pass around just the components they actually need separately, or (B) pass around more locally-defined bucket classes with the subset of components relevant for a group of related APIs. (Removes package upper bound, fully enables pure-Python components.)
Remove bucket classes from C++, but retain them in Python, where heterogeneous containers are easy, and use the (A) and (B) approaches from the previous bullet in C++ only. (Removes package upper bound, fully enables pure-Python components.)

The last two ideas may seem extreme - surely we can’t live without Exposure! - and they probably are, given how much code we have that relies on those bucket classes, and Exposure in particular. But looking back I think they always traded away maintainability and stability for the convenience they genuinely do provide (they very much violate the Interface Segregation Principle), and if I was starting over (or advising Astropy on what to do with NdData) I’d want to see how far we could get without any general-purpose buckets.

kfindeisen · November 13, 2018, 6:28pm

Thanks for the ideas, it’s nice to have something concrete to think about. My initial reactions:

Persistence actually seem to me to be a sensible use of inheritance (at least, in the form of Persistable rather than PersistableFacade, which breaks substitution). So I’d say the right solution is to put an interface class somewhere very low (maybe even utils), and have that interface not depend on how persistence is actually implemented like the current framework does. Essentially, the only thing a class would need to do to implement NewPersistable is define a serialized form (perhaps as a sequence of primitives or something like that). Of course, I’m also all for using an existing library with good Python support as the implementation.

I think this would also get around the restriction that persistable classes need to be in C++, as long as the code that actually interacts with NewPersistable (possibly a bridge to a third-party library) is in Python. [Edit: @jbosch’s post below says we could do this with C++ persistence code as well].

I think an alternative to a Persistable-like mixin might be having classes return and be constructible from some kind of standard object that represents the serialized form, but I’d have to think about it some more. It’s not a standard pattern, so there might be some pitfalls that aren’t obvious at first glance.

I suspect this solution amounts to renaming afw, given your previous statement that afw is in many ways the set of all classes needed by Exposure and table.

I’ve always been annoyed at the fact that we have 2-3 completely independent heterogeneous containers, none of them type-safe. I’ve implemented a type-safe heterogeneous map in a strongly-typed language before, so I’d be happy to help if we decided to go this route (my previous solution involved including type information in the keys, but this can be easily hidden in the pybind11 layer at the cost of not being type-safe from Python). Though I suspect that your “bucket classes”, at least Exposure, should depend on such a container through composition (is-implemented-in-terms-of), not inheritance (is-a).

The main disadvantages I can see are that this approach might be too open-ended (surely there are some things that must be part of an Exposure), and that the keys and values must be C++ classes.

I dislike (A) because it would lead to parameter list bloat, something that’s also a problem in much of our code (especially afw!), and violates abstraction (you need to know exactly what information constitutes an Exposure concept, and need to update interfaces to reflect changes).

Option (B) seems kind of interesting – depending on how it’s done, it could be some combination of direct interface segregation and the Facade pattern. The main disadvantage of (B) is it won’t work unless we have well-segregated modules with clear interfaces…

jbosch · November 13, 2018, 7:31pm

I would love to see if we (@swinbank?) can carve out some time for you to work on this in particular - not only do I strongly agree with everything you’ve said about how you’d approach the problem, I think the changes that remove the package upper bound (like this one) have by far the biggest up-side,

I actually think there aren’t any components that can be expected to always be present on an Exposure, but we probably do need some way to define a kind of standard set of keys for the things that usually are present.

If the values have a common abstract base class, we could eventually inherit from that in Python via some pybind11 functionality we’re not currently using (https://pybind11.readthedocs.io/en/stable/advanced/classes.html#overriding-virtual-functions-in-python). But that might be blocked now by not being able to implement the persistence API in pure Python.

On that note, I’ve been thinking more about alternate-persistence framework ideas, too (and even coded a tiny bit along those lines in my spare time). I think we’ll want to tackle that eventually; more on that later. But I think it makes sense to tackle the heterogeneous container part first, both because I think it’s easier to do first and because I think it brings the most immediate benefit.

kfindeisen · November 13, 2018, 7:51pm

Standardized keys seem like a very error-prone API (at least, I find Mask very hard to work with because of the need to guess what they are). If we really want to both make everything optional and implement everything as a heterogeneous map, maybe we could still have getters for standard elements (implementing them as key lookups)?

I thought that if you have a Python class that inherits from a C++ interface, the class cannot be used from C++ even if you access it only via the interface, because there’s no way to call the (Python) implementation code. Isn’t that why the current Persistable framework forces all persistable classes to be in C++?

jbosch · November 13, 2018, 8:04pm

Getters for standard elements definitely works for me, too. I was thinking more of defining public constants for the standardized keys, but I’m by no means wedded to that (and haven’t even thought it through in any kind of detail).

Having the C++ call into Python definitely is possible. We’ve just never done it in our codebase, as it brings some overhead in both code and runtime for any hierarchy you add it to.

Along those lines, I suppose I should have included “Wrap afw::table::io with pybind11” in my bullet list of ideas to improve things; I don’t think there’s anything fundamentally blocking us from making the current afw::table::Persistable inheritable from Python.