FITS persistence for cell-based coadds

I’m currently working on data structures to hold “cell-based coadds” - coadds that are built in small, sub-patch cells, in which all pixels in each cell have exactly the same set of input images, and hence discontinuities in the PSF and other properties only appear on cell boundaries. I have a good sense for how to handle the in-memory data structures, even though there’s still plenty of prototyping work to be done, but there are some big choices to be made about how to handle the (FITS) persistence of these coadds.

First off, I do not think writing them via afw.image.Exposure is a good idea; we are adding a few image planes beyond the image-mask-variance triplet that really should be coequal with image-mask-variance, and ExposureInfo components like Psf, SkyWcs, PhotoCalib, ApCorrMap, and CoaddInputs don’t map naturally onto the cell-based structure. Even more importantly, the cells will have overlaps, so we can’t represent the full multi-cell content as a piecewise image. We can and will provide an Exposure view that is a piecewise image, with piecewise components, as that will play an important role in letting us use existing algorithms on these new coadds, but this view will be lossy, and that makes it unsuitable for persistence.

To seed discussion, here are a few ideas for how we might save these multi-cell coadds instead. I am assuming FITS because FITS is perfectly capable of representing all of these forms, we’re all familiar with it, and we have some requirements involving providing FITS versions of at least some data products. I think using e.g. HDF5 or ASDF instead is mostly orthogonal to the data-organization questions I’d prefer to focus on here.

Exploded piecewise images: Save the full cells for each plane in a single piecewise image HDU per plane, with cells at the right positions in the grid but overlaps duplicated. This is similar in spirit to how raw amplifier images (including overscan regions) are stitched into a single image on disk for HSC and DECam (but not LSST, though we simulate this form on read for LSST as well). We would presumably do the same for PSF postage stamp images (one per cell), even though these have different dimensions, with non-image components in more-or-less opaque binary tables (as they are for Exposure components today). The actual geometry would be written to the headers somehow.

  • This form is pretty good for humans who want to look at images by eye; they have to mentally clip out the overlap regions, but the eye is pretty good at that.
  • This form is pretty good for advanced measurement algorithms that want to use the full cell regions and handle overlaps themselves; they don’t have to stitch anything together, but they do have to seek around the file quite a bit to find all of the data relevant for a single cell, if they want to read only some cells.
  • This form is bad for naive algorithms that just want to work on the stitched-together piecewise image.
  • This form is fine for compression: a careful tile compression configuration could handle this well.
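
To make the layout concrete, here is a rough astropy-based sketch of how the exploded form could be written; the grid dimensions, cell sizes, header keywords, and dummy pixel data are all placeholders for illustration, not anything the pipeline actually uses:

```python
import numpy as np
from astropy.io import fits

n_rows, n_cols = 2, 3          # cells in the grid (placeholder)
outer_h, outer_w = 200, 200    # full cell size, including overlaps (placeholder)

# cells[(i, j)] -> dict of planes, each an (outer_h, outer_w) array of dummy data
rng = np.random.default_rng(0)
cells = {
    (i, j): {
        "image": rng.normal(size=(outer_h, outer_w)).astype(np.float32),
        "mask": np.zeros((outer_h, outer_w), dtype=np.int32),
        "variance": np.ones((outer_h, outer_w), dtype=np.float32),
    }
    for i in range(n_rows)
    for j in range(n_cols)
}

hdus = [fits.PrimaryHDU()]
for plane in ("image", "mask", "variance"):
    # One big "exploded" array per plane: full cells side by side, overlaps duplicated.
    exploded = np.zeros((n_rows * outer_h, n_cols * outer_w),
                        dtype=cells[0, 0][plane].dtype)
    for (i, j), planes in cells.items():
        exploded[i * outer_h:(i + 1) * outer_h,
                 j * outer_w:(j + 1) * outer_w] = planes[plane]
    hdu = fits.ImageHDU(exploded, name=plane.upper())
    # Hypothetical header keywords describing the cell geometry.
    hdu.header["CELLNX"] = (n_cols, "number of cells along x")
    hdu.header["CELLNY"] = (n_rows, "number of cells along y")
    hdu.header["CELLOW"] = (outer_w, "outer cell width, including overlaps")
    hdu.header["CELLOH"] = (outer_h, "outer cell height, including overlaps")
    hdus.append(hdu)

fits.HDUList(hdus).writeto("exploded_coadd.fits", overwrite=True)
```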

Per-plane data 4-cubes: Same as “Exploded piecewise images”, except each plane is a 4-d array, with the index of each cell in the grid providing the other two dimensions. Pros and cons seem to be pretty much the same, but with a bit more of the geometry encoded in the array rather than the header; for human inspection it makes looking at cells independently a bit easier while making it nearly impossible to view them together.
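
A similarly rough sketch of this variant, again with placeholder sizes and dummy data; each plane becomes a single 4-d array indexed by (cell row, cell column, y, x):

```python
import numpy as np
from astropy.io import fits

n_rows, n_cols = 2, 3          # cells in the grid (placeholder)
outer_h, outer_w = 200, 200    # full cell size, including overlaps (placeholder)
rng = np.random.default_rng(0)

hdus = [fits.PrimaryHDU()]
for plane in ("IMAGE", "MASK", "VARIANCE"):
    # One 4-d array per plane, indexed by (cell row, cell column, y, x).
    cube = rng.normal(size=(n_rows, n_cols, outer_h, outer_w)).astype(np.float32)
    hdus.append(fits.ImageHDU(cube, name=plane))
fits.HDUList(hdus).writeto("cube_coadd.fits", overwrite=True)
```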

Big binary table: Save full cells for each image as 2d array columns in a binary table, where each row is a cell and each column is a different image plane. Non-image components could be stored in other columns in the same row. The actual geometry could be written to the header and/or stored (with some duplication) in columns.

  • This form is bad for humans who want to look at images by eye, because most FITS viewers aren’t designed for binary tables that contain images.
  • This form is good for advanced measurement algorithms that want to use the full cell regions and handle overlaps themselves; all information for each cell is close together on disk.
  • This form is the easiest to map to our in-memory form, and may require fewer temporaries and less computation during I/O.
  • This form is good for providing a complete description of the FITS data model. Such a description would be possible for other forms, but I think this one is particularly easy to fully describe and understand.
  • This form is bad for compression (FITS binary table compression is experimental, not standard).
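
Here is an illustrative sketch of how the binary-table layout could be written with astropy; the column names, sizes, and dummy data are made up for the example and are not a real schema:

```python
import numpy as np
from astropy.io import fits

n_cols_grid, n_rows_grid = 3, 2
n_cells = n_cols_grid * n_rows_grid
outer_h, outer_w = 200, 200    # full cell size, including overlaps (placeholder)
rng = np.random.default_rng(0)

image = rng.normal(size=(n_cells, outer_h, outer_w)).astype(np.float32)
mask = np.zeros((n_cells, outer_h, outer_w), dtype=np.int32)
variance = np.ones((n_cells, outer_h, outer_w), dtype=np.float32)
cell_x = np.arange(n_cells, dtype=np.int32) % n_cols_grid   # cell column in the grid
cell_y = np.arange(n_cells, dtype=np.int32) // n_cols_grid  # cell row in the grid
x0 = (cell_x * 150 - 25).astype(np.int32)  # outer bbox origin (placeholder geometry)
y0 = (cell_y * 150 - 25).astype(np.int32)

npix = outer_h * outer_w
dim = f"({outer_w},{outer_h})"  # TDIM, in FITS (x, y) order
cols = fits.ColDefs([
    fits.Column(name="cell_x", format="J", array=cell_x),
    fits.Column(name="cell_y", format="J", array=cell_y),
    fits.Column(name="x0", format="J", array=x0),
    fits.Column(name="y0", format="J", array=y0),
    fits.Column(name="image", format=f"{npix}E", dim=dim, array=image),
    fits.Column(name="mask", format=f"{npix}J", dim=dim, array=mask),
    fits.Column(name="variance", format=f"{npix}E", dim=dim, array=variance),
])
fits.BinTableHDU.from_columns(cols, name="CELLS").writeto(
    "cell_coadd_table.fits", overwrite=True
)
```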

Piecewise stitched images with overlaps separate: Save the inner, non-overlapping regions of each cell together as a piecewise image, with one plane per HDU. For each plane, add another image HDU that has the alternate, non-primary pixels from those overlap regions, with zero or NaN where there is no alternate. The PSF images would be another plane, and it would probably make sense to use the same grid with some zero or NaN padding (it is safe to assume that a cell’s PSF image is smaller than its other planes). Non-image components would be saved in more-or-less opaque binary tables.

  • This form is very good for humans who want to look at images by eye; the image is as continuous as it can be, and the alternate pixels can be easily blinked with the primary pixels.
  • This form is bad for advanced measurement algorithms that want to use the full cell regions and handle overlaps themselves; these need to stitch together information from multiple HDUs to construct full-cell images for even a single plane.
  • This form is good for naive algorithms that just want to work on the stitched-together piecewise image, as this image is directly available as a natural FITS image per plane.
  • This form requires at least lossless compression to avoid inefficiencies due to the zero/NaN padding it implies. It is very bad for lossy compression, because the regions we would want to compress together are the full-cell images, and those are spread across HDUs.
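
A deliberately simplified sketch of this layout, for one plane only; it keeps just one alternate value per pixel, so corner pixels with several alternates get overwritten, and a real implementation would need to handle that. Cell sizes, border width, and data are placeholders:

```python
import numpy as np
from astropy.io import fits

n_rows, n_cols = 2, 3
ih, iw = 150, 150        # inner (non-overlapping) cell size (placeholder)
b = 25                   # overlap border on each side; outer size = inner + 2*b
rng = np.random.default_rng(0)

# cell_images[(i, j)] is the full outer image of cell (i, j), dummy data here.
cell_images = {
    (i, j): rng.normal(size=(ih + 2 * b, iw + 2 * b)).astype(np.float32)
    for i in range(n_rows)
    for j in range(n_cols)
}

stitched = np.zeros((n_rows * ih, n_cols * iw), dtype=np.float32)
alternate = np.full_like(stitched, np.nan)

for (i, j), outer in cell_images.items():
    # Primary pixels: the inner region of this cell, pasted at its grid position.
    stitched[i * ih:(i + 1) * ih, j * iw:(j + 1) * iw] = outer[b:b + ih, b:b + iw]

for (i, j), outer in cell_images.items():
    # Alternate pixels: this cell's border pixels land on neighbours' inner regions.
    y0, x0 = i * ih - b, j * iw - b  # outer-image origin in stitched coordinates
    for oy in range(outer.shape[0]):
        for ox in range(outer.shape[1]):
            if (b <= oy < b + ih) and (b <= ox < b + iw):
                continue  # primary pixel, already in `stitched`
            sy, sx = y0 + oy, x0 + ox
            if 0 <= sy < stitched.shape[0] and 0 <= sx < stitched.shape[1]:
                alternate[sy, sx] = outer[oy, ox]

fits.HDUList([
    fits.PrimaryHDU(),
    fits.ImageHDU(stitched, name="IMAGE"),
    fits.ImageHDU(alternate, name="IMAGE_ALT"),
]).writeto("stitched_with_alternates.fits", overwrite=True)
```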

Thanks Jim!

I think the option called “Exploded piecewise images” is best. fpack can be configured to handle the compression correctly. Internally, if you did that, the data would actually be stored as a FITS table with one entry per slice, which would give good I/O too.
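
For illustration, roughly what that configuration could look like via astropy’s tile compression, with one compression tile per full cell; the sizes and quantization level are placeholders, and note that recent astropy versions spell the keyword `tile_shape` while older ones use `tile_size`:

```python
import numpy as np
from astropy.io import fits

n_rows, n_cols = 2, 3
outer_h, outer_w = 200, 200    # full cell size, including overlaps (placeholder)
rng = np.random.default_rng(0)
exploded = rng.normal(size=(n_rows * outer_h, n_cols * outer_w)).astype(np.float32)

hdu = fits.CompImageHDU(
    exploded,
    name="IMAGE",
    compression_type="RICE_1",       # lossy for floats, via quantization
    quantize_level=16.0,             # placeholder quantization level
    tile_shape=(outer_h, outer_w),   # one compression tile per full cell
)
fits.HDUList([fits.PrimaryHDU(), hdu]).writeto(
    "exploded_compressed.fits", overwrite=True
)
```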

I was in favour of “Big binary table” (based on the similarity with the in-memory representation, with an expectation that there’d be a tool to convert the file to a human-viewable image, and lsst.afw.display support) until I read “bad for compression”, which I think immediately eliminates it. “Exploded piecewise images” is then on top.

I don’t have a strong opinion on the different options, but I do wonder whether the question of “human inspection by eye” is best handled by a Rubin-developed image viewing tool that allows the user to emphasize a given cell in the overlap regions, and takes care of the tiling. I don’t know that we have a group available to do that development though.

Certainly, and I’d even say that many of the other considerations here aren’t really relevant for internal-to-DM software either, because that software will be operating on the in-memory object, just like our own display code would. Compression probably is relevant for our own usage - if we decide to do lossy compression of coadds at all - and subset I/O performance might be, sometimes. But for the most part I think this is really a question about what we deliver to science users who prefer not to use our code.

Off the top of my head, I agree that “Exploded piecewise images” is best. This sounds a lot like the DES MEDS format, which has been used successfully. As Matt points out, this fpacks nicely, which also provides easy I/O for each cell.
As you say, the drawback is that you can’t just pop this into Source Extractor (or other non-DM code) and have everything work out of the box. But I don’t think this is a dealbreaker as long as we provide the tools to convert this into a non-overlap stitched-together image.
At the same time, what are the drawbacks of also persisting the “stitched-together-patch” images which are the same format as our current deepCoadd images? Is this something that is definitely not on the table?

Big drawback is extra storage cost, especially if we’re talking about making both available to science users.

Another benefit of not saving these is that it might save us from having to write serialization code for the piecewise Psf, BoundedField (for ApCorrMap), TransmissionCurve (etc.) implementations we’ll need to provide that Exposure view - the cell structure makes it easier to save per-cell versions of all of those, and that’s what we’ll do natively, but we need piecewise implementations to stuff into Exposure. If we always make those Exposure objects on the fly, then we might not need to be able to persist their components.

Overall, I can’t say that storing the stitched-together image is out of the question, but I don’t think we want to assume that we will be able to store both forms, and we certainly shouldn’t assume we can deliver both forms to science users.

Just wanted to point out that if you want to “deliver to science users who prefer not to use our code”, you’re not going to be able to get away with “more-or-less opaque binary tables”. At some point the contents will need to be documented.

While coadds are currently modeled as being large (PB range) in terms of storage, they don’t grow and so become subdominant in the sizing model.

I don’t think we’ve ever projected a need for using quantized floating point compression (“lossy” may be unnecessarily pejorative) for coadds, but if we had two equivalent representations, using such compression on one of them might be reasonable. On the other hand, with the amount of ancillary non-image data per cell, better compression of the image/mask/variance pixels might not reduce file size that much.

We will also have HiPS versions of the coadds; could these be effective for some of the use cases of the persistence strategies?

Coming back to this old thread after some new developments:

  • I’ve spent tiny fractions of several months “working” on implementing the exploded persistence for our current in-memory data structures, and made no progress for reasons essentially unrelated to the questions raised here.

  • I’ve now asked @arunkannawadi to take over that work, but to go with the “big binary table” format instead. Here’s why:

    • Rather than being full of “opaque blobs”, the binary table format is probably actually the most self-describing and easiest for external code to understand; all of the table’s columns are relatively self-explanatory and it’s easy to associate each cell’s pixels with its geometry in the larger grid. And we don’t have to struggle with FITS headers to achieve that level of self-description, so it’s probably the simplest to implement.
    • One of the big problems we want to solve with cell-based coadds is performance scaling for large numbers of visits, not just for coadd processing but for coadd creation as well (e.g. @jchiang’s scary problem reported on Slack, which better use of scratch filesystems would probably mitigate but not fully address). The binary table format is clearly the best for that I/O, to an extent I hadn’t appreciated in my first post: it’s the only candidate format in which you can read a full cell (a binary table row) without having to jump around between different HDUs on disk (see the sketch after this list). That makes it exactly what we’d want to use for cell-based warps, even if we ultimately go with some other layout for cell-based coadds, because we’ll really want to read the warps one cell (or a few complete cells) at a time.
    • As long as we’re thinking primarily about object stores for long-term dataset storage and staging to more-local POSIX caches for processing, doing full-file compression in long-term storage and full-file decompression when we stage seems like it makes a lot of sense. That makes FITS tile compression much less important, since we can’t read subimages directly from object stores anyway, at least not without writing a lot of our own low-level FITS code to work out byte offsets. And if we did want to do that (and forgo the compression), the binary table format is probably the one with the simplest byte offsets to compute, if you can rephrase all subimage requests as requests for specific cells.
    • As @ktl noted, for science users HiPS images will also be around to serve some of the visualization needs. And while we’ll still need visualization tooling for the cell-based coadds themselves, it’s clear we’ll need something bespoke to do that regardless, and then I don’t think the binary table approach is really any worse than the others.
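
To illustrate the single-cell read pattern mentioned above, here is a sketch that reads one row back from the hypothetical cell table written in the earlier binary-table example (the file and column names are from that example, not from any real pipeline output):

```python
from astropy.io import fits

with fits.open("cell_coadd_table.fits", memmap=True) as hdul:
    cells = hdul["CELLS"].data
    row = cells[4]                       # one cell == one binary-table row
    image = row["image"]                 # already 2-d thanks to TDIM
    mask = row["mask"]
    variance = row["variance"]
    x0, y0 = int(row["x0"]), int(row["y0"])
    print(image.shape, mask.shape, variance.shape, (x0, y0))
```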

I’m not opposed to having other formats down the line, but right now this one feels like a pretty good first choice.