Gen3 Instruments, registries, and ingestion

These are notes from a discussion @timj and I had today about the gen3 Instrument class, how Instruments are registered with a Butler, and how to handle ingesting “instrument signature data” (brighter-fatter, defects, transmission curves, camera definitions), “calibration products” (master bias, flat), and “raw data” into a butler. There are a fair number of open questions here that will need to be answered before we can complete the gen3 ingestion system.

Based on my discussion with Tim, I believe I have a path forward for generically ingesting instrument signature data (DM-21016), which is sketched out below. The big remaining questions involve calibration data and how to update a butler registry’s understanding of an Instrument’s dimensions.

One question that comes up from the notes below: should we rename Instrument to something else, given that its primary use case is registering and ingesting data?

The Instrument Python class

What distinguishes stuff that defines the instrument (i.e. goes into the Butler registry) vs. ingested instrument signature data?

  • When we register an Instrument, we add these Dimensions: “instrument”, each “filter”, and a “detector” for each detector in the Instrument (see the register() sketch after this list). The latter is somewhat redundant with the Camera Geometry, but is needed for looking up “detector” Dimensions in a repository. At present, it doesn’t let us identify detectors by their serial number, so if a sensor is swapped out, we can’t distinguish that in the registry.
  • Camera Geometry appears to be the only thing with a fuzzy boundary here.
  • Changes to instrument signature data are ingested into distinct Collections, which takes up more disk space, but will greatly simplify how we use the updated data.
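
For concreteness, here is a minimal sketch of the sort of records Instrument.register() inserts; the Registry method name and the record fields are assumptions based on the current code, not a definitive interface:

def register(self, registry):
    # One "instrument" dimension record for the camera as a whole.
    registry.insertDimensionData(
        "instrument",
        {"name": self.getName(), "detector_max": 200},
    )
    # One "physical_filter" record per filter.
    for filterDef in self.filterDefinitions:
        registry.insertDimensionData(
            "physical_filter",
            {"instrument": self.getName(),
             "name": filterDef.physical_filter,
             "abstract_filter": filterDef.abstract_filter},
        )
    # One "detector" record per detector. Note that nothing here carries
    # a serial number, hence the swapped-sensor problem noted above.
    for detector in self.getCamera():
        registry.insertDimensionData(
            "detector",
            {"instrument": self.getName(),
             "id": detector.getId(),
             "full_name": detector.getName()},
        )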

What do you need an Instrument for during ingestion?

  • Putting instrument data into the registry (filters, detector definitions).
  • Specifying the path to obs-specific data: obs_blah_data/something/.
  • Ingesting instrument signature data.
  • Identifying the formatter for raw data.

When would you need an Instrument after ingestion?

  • You need to get the formatter for the instrument (could the formatter be registered in the Butler itself?).
  • You need to get a camera from somewhere without a Butler (versioned Cameras may exist in the Butler at some point?).
  • You need to know where to find config overrides.

Steps to load new Instrument-related things.

Registry

  • Registering the Instrument the first time.
    • Handled by Instrument.register().
  • Registering new instrument data (e.g. updated filter, changed detector).
    • Could this also be handled by Instrument.register()?
    • Does insertDimensionsData raise if called with already-existing values?

Instrument signature data (changes slowly with time)

  • Writing new instrument signature data.
    • The user must specify a unique collection every time they instantiate a Butler to ingest the data into.
    • Taken care of by writeInstrumentSignatureData. What calls that?
    • Need a Task that takes a repo, an Instrument, and a collection. When called, everything is ingested into the new collection: that makes provenance easier.
    • Create a new IngestInstrumentSignatureDataTask to do this (sketched after this list).
  • What about HSC’s ingestStrayLightData - we’ll ignore it for now, since it isn’t called anywhere, and it’s just an HSC thing.
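
A rough sketch of what that Task might look like; every name here (the Task, its config, the run signature) is a proposal rather than existing code:

import lsst.pex.config as pexConfig
import lsst.pipe.base as pipeBase
from lsst.daf.butler import Butler


class IngestInstrumentSignatureDataConfig(pexConfig.Config):
    pass  # nothing obvious to configure yet


class IngestInstrumentSignatureDataTask(pipeBase.Task):
    ConfigClass = IngestInstrumentSignatureDataConfig
    _DefaultName = "ingestInstrumentSignatureData"

    def run(self, repo, instrument, collection):
        # A Butler pointed at the user-specified (unique) collection, so
        # that everything ingested here shares one provenance label.
        butler = Butler(repo, run=collection)
        instrument.writeInstrumentSignatureData(butler)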

Calibration data

  • Writing calibration data (nothing exists for this yet).
  • Regardless of where the data is coming from, the validity range likely has to come from somewhere other than the files themselves.
  • How do we go from the calibration production system producing a master bias to getting a “butler-metadata-aware” master bias?
    • In theory, cp_pipe should generate all the necessary metadata?
  • What to do with externally-provided calibrations?
    • Can we use RawIngestTask for this?
    • If so, would it be as simple as writing a new Formatter?

Raw data

Writing raw data (including raw calibration data) is handled by RawIngestTask. No more work should be necessary here?
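
For reference, driving RawIngestTask directly looks roughly like this sketch; the repo path, run collection, and file names are placeholders, and the config details may differ:

from lsst.daf.butler import Butler
from lsst.obs.base import RawIngestTask

butler = Butler("/path/to/repo", run="raw/mycam")
config = RawIngestTask.ConfigClass()
task = RawIngestTask(config=config, butler=butler)
task.run(["/path/to/raw/file1.fits", "/path/to/raw/file2.fits"])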

Thanks for taking the initiative on this; I’m more than happy to let some combination of the two of you take the lead on future-proofing Instrument/obs_* and pulling a big piece of Gen3 design out of my grubby little hands. I know he’s about to be on vacation for a bit, but it’d be worthwhile to loop @czw in on a lot of this too; he’s currently the expert both on how cameraGeom is evolving and how traditional master calibrations (flats/darks/biases) are being built in Gen3. Anyhow, in the spirit of trying not to be a critical part of this work, I’ll try to limit my replies to providing information and steer clear of pure opinion, but I’ll also try to make those replies pretty complete brain dumps.

This was intentional and I’d like to keep it that way (I don’t think you were proposing otherwise, but I wanted to clarify regardless): it’s important that Registry dimensions be static. That doesn’t prevent them from capturing time-dependent state, though:

  • their relationships with visit and exposure can be used to represent the state of the observatory as a function of time (e.g. we pretend all physical_filters have always existed and will always exist, because what we care about is when they’re associated with an observation)

  • they can be used to label datasets that also have validity ranges (e.g. a physical_filter is a natural dimension for an afw.image.TransmissionCurve “instrument signature” dataset that tracks the actual throughput of the glass over time).

Versioning the camera (cameraGeom) dataset - an example of the latter - was how I had planned to track detector swapping, and I think that probably would work well enough for on-sky instruments, though it’d be especially nice if we could make component datasets that did actually use the detector dimension, i.e. not just

camera = butler.get("camera", instrument="DECam",
                    ...something about a validity range...)
detector = camera[15]

but also

detector = butler.get("camera.detector", instrument="DECam", detector=15,
                      ...something about a validity range...)

That doesn’t work right now because some parts of the system don’t let a component dataset have different dimensions than the parent dataset, but I hope we’ll get there eventually. I’ll get back to the fact that we don’t know how validity ranges appear in data IDs later.

Anyhow, @timj and I have discussed adding a physical_detector dimension that could carry the serial number and represent a physical detector rather than just a slot; like physical_filter it would be associated with a visit or exposure as well as the existing detector (slot) dimension. This may turn out to be important for the test-stand datasets in particular. I still hope we won’t need it at all, but that’s how I think we’d want to add serial number (etc) to the Registry.

After data has been ingested (or put) into a Gen3 Registry, the formatter that was used for each dataset is saved in the datastore, so there should be no need to obtain the raw formatter from the Instrument after ingestion. The fact that the formatters are per-dataset is why the API for getting the raw formatter from the instrument allows it to return different formatter classes for different raw data IDs (e.g. HSC’s different formatters for differently-rotated chips). It’s important to maintain that flexibility, but if we get into using different formatters to handle significantly different versions of raw data as a function of time, it may be worth thinking about whether the data ID -> raw formatter mapping should be treated as data (and go in obs_*_data) rather than an API implementation that goes in obs_*.

The camera dataset in Gen3 already formally has a validity range; it just happens to be infinite for all obs packages right now. I’d feel most comfortable if that were (eventually) the only way to get a full afw.cameraGeom.Camera object; if we have use cases for a Camera that doesn’t change with time, it would be retrieved via butler.get on some other dataset that returned some other skeleton-of-camera class without any of the numbers that change. I’m not sure it’ll ever be worth our time to refactor cameraGeom to make that split between temporal and non-temporal stuff. In any case, from my perspective Instrument.getCamera is currently a useful convenience in the Gen2->Gen3 transition, and I’d love to be able to get rid of it, but it may be a pain to find and fix all of the code that currently expects to be able to get a Camera without knowing the epoch that Camera should correspond to.
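
Concretely, the lookup I’d want to standardize on is roughly this sketch, with the validity range expressed through an explicit data ID key; the "unbounded" label value is how I’d describe the current infinite validity range, but treat it as an assumption:

# Sketch: a camera lookup with an explicit validity-range label...
camera = butler.get("camera", instrument="HSC", calibration_label="unbounded")
# ...versus the Gen2-transition convenience, which has no notion of epoch.
camera = instrument.getCamera()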

I believe it will, but with ugly, low-level SQLAlchemy exceptions. In the future we could:

  • give it better error messages;
  • skip inserts that clash with existing records;
  • overwrite existing records that clash with inserts.

I’d be interested in getting opinions on which behavior would be best from the perspective of typical ingest/repo-setup workflows, though the “overwrite” case is at least scary and probably untenable from a provenance perspective. Another catch is that I’m not sure we can realistically define “clash” as anything more than “have conflicting primary key values”. In other words, we can detect when you try to insert the same visit ID twice, but may not be able to do different things depending on whether the other information about that visit is the same or different.
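
To make the “skip” option concrete, here is a sketch of the kind of helper we could provide, assuming the insertDimensionData-style Registry API discussed above:

import sqlalchemy.exc


def insertIfAbsent(registry, element, record):
    """Insert one dimension record, skipping primary-key clashes."""
    try:
        registry.insertDimensionData(element, record)
    except sqlalchemy.exc.IntegrityError:
        # All we can detect here is that the primary key already exists;
        # we cannot tell whether the rest of the record matches what is
        # already in the Registry.
        pass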

Right now, @czw has a PR in review where new master calibrations are produced one-at-a-time with a dataset type (e.g. “cpBiasProc”) that isn’t the same one that’s used later for ISR (“bias”). That’s because they have different dimensions:

  • cpBiasProc: instrument, detector
  • bias: instrument, detector, calibration_label

The calibration_label is what maps to a validity range. In @czw’s workflow, once a new master bias (etc.) has been produced and validated, he runs a separate script (blessCalibration.py) that adds the validity range by copying cpBiasProc to bias.
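
In other words, the bless step amounts to something like the following sketch; blessCalibration.py is the real script, but the calls and record fields here are my paraphrase of it:

def bless(butler, dataId, label, begin, end):
    # Record the validity range as a calibration_label dimension entry.
    butler.registry.insertDimensionData(
        "calibration_label",
        {"instrument": dataId["instrument"],
         "name": label,
         "datetime_begin": begin,
         "datetime_end": end},
    )
    # Re-put the validated master calibration under the dataset type
    # that ISR will actually look up.
    bias = butler.get("cpBiasProc", dataId)
    butler.put(bias, "bias", dataId, calibration_label=label)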

I already have some plans to make the representation of these datasets in the Registry more natural, while keeping the high-level workflow (“produce then bless”) the same; the idea is that validity ranges would go in the record that associates a dataset with a collection (of a special type) and the calibration_label dimension would go away (or at least cease to be tied to a validity range). That would let us use the same dataset type when producing and using master calibs (e.g. just “bias” instead of both “bias” and “cpBiasProc”), and it would let the same dataset exist in different calibration collections with different validity ranges. I don’t think it’s worthwhile yet for anyone else to worry too deeply about this change - I need to get the Registry house in order a bit more first, and @czw’s current approach works fine for now - but I hope it will smooth some things out eventually.

Anyhow, this matches the workflow use cases that we gathered at PCW in August and discussed further with the CPP team (@rhl, @merlin, @plazas) at a later meeting in Princeton, in which you’d only ever produce one master calibration at a time, and you’d validate it and explicitly bless it before using it to build any other master calibrations. At a recent middleware telecon, @MichelleGower and Robert Gruendl noted that in operations they’d also like to be able to build a suite of master calibrations together as part of the same Pipeline. I’ll need to extract some more information about the details of that use case before trying to fit it in; in general, you might want to combine both old and new master calibrations of one type (e.g. bias) when making another (e.g. flat), and hence I think this approach would require the user to provide in advance not just the validity ranges of what they’re building now, but how the validity ranges of existing calibrations would change when they’re associated with the new calibration collection.

If they’re truly external (e.g. DECam CP), a new Formatter is probably in order. If they’re coming from Gen2, our vanilla MaskedImage or Exposure Formatters will probably work. I think the trickier bit is associating them with validity ranges. It’d be ideal if we could make ingestion look just like the “produce” part of “produce-and-bless”, and then use the exact same “bless” step we’d use when making our own master calibrations, but that may be very inconvenient if you’re ingesting a ton of external calibrations, given that the (current) “produce-and-bless” workflow involves making only one master calibration at a time.

A better option may be to extend CalibRepoConverter, which is what I’ve been using to pull HSC master calibrations from Gen2 into Gen3, using the Gen2 validity ranges. While I’ve been recommending that people like @isullivan instead follow @czw’s lead in using a crude YAML importer to get up and running quickly, we do ultimately want to make CalibRepoConverter work for all other obs packages that have data repos we want to preserve through the transition. That system was written without a whole lot of knowledge about how the Gen2 calib registries worked and (in particular) how camera-specific they were, so it will probably involve some refactoring of the core stuff when we make the second obs package work with it.

Anyhow, it might be viable to make that system able to work with different sources of external validity ranges in addition to Gen2 calibration registries, and I think that option is at least worth exploring before anyone dives into making any custom external-ingestion drivers.
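
For comparison, the low-level mechanism a custom driver would build on is presumably butler.ingest with FileDataset entries, roughly as in this sketch; the Formatter class, collection name, and data ID values are all hypothetical:

from lsst.daf.butler import Butler, DatasetRef, FileDataset

butler = Butler("/path/to/repo", run="calib/external")
datasetType = butler.registry.getDatasetType("bias")
dataId = {"instrument": "DECam", "detector": 1, "calibration_label": "cp/v1"}
butler.ingest(
    FileDataset(
        path="/path/to/external/bias-c0001.fits",
        refs=DatasetRef(datasetType, dataId),
        formatter="my_pkg.MyExternalBiasFormatter",  # hypothetical Formatter
    ),
    transfer="symlink",
)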

I think that we need to be able to add more calibration data to a given collection; you take biases, build a master bias, validate it, then add it to the collection. Then you use that collection to do the same for darks, then flats, then …

You could have separate collections for each step, but that would require us to merge them later to produce the calibration collection that we’d use to reduce data.
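
Assuming Registry.associate behaves the way I think it does, that merge would look something like this sketch (collection names are hypothetical):

from lsst.daf.butler import Butler

butler = Butler("/path/to/repo")
merged = "calib/mycam/default"
for step in ("calib/mycam/bias", "calib/mycam/dark", "calib/mycam/flat"):
    # "..." asks queryDatasets for all dataset types in this collection.
    refs = butler.registry.queryDatasets(..., collections=[step])
    butler.registry.associate(merged, list(refs))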

I’m not sure this is in scope, but it is closely related.

We also need to think about how we track calibration data that lives in cameraGeom, such as gains and biases. This is OK with versioned cameras, but we need to think about how we update those values. I would very much like to be able to do detector.setGain(6.66) in code and use that value, and then have a way to persist it in a comprehensible way. At present, that would require writing the updated gain back to the YAML files, rather than just back to the easily-read camera that the butler usually reads (at least, I think that’s the plan). For cameras such as lsst_cam, where the camera .yaml files are built out of smaller pieces, we’d have to push the updated gain back to the per-detector files (or is it per raft?)
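
To make that workflow concrete, a purely hypothetical sketch; the mutable-builder API (rebuild/finish) does not exist in afw today and is only here to illustrate the shape of the problem:

# Purely hypothetical: fetch the versioned camera, tweak one gain, and
# persist the result through the butler instead of editing YAML files.
camera = butler.get("camera", instrument="LSSTCam")  # plus a validity range
builder = camera.rebuild()               # assumed mutation entry point
builder["R22_S11"].getAmplifiers()[0].setGain(6.66)
butler.put(builder.finish(), "camera", instrument="LSSTCam")
# ...plus whatever step records a new validity range for the updated camera.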

I think/hope that @KSK’s recent post on the extension of the standardized curated calibrations system will help address this nicely, though I’ll confess to posting this before having read the rest of this giant post in full.

On this particular subject, I think @parejkoj is only talking about the kinds of calibrations (what he’s called “instrument signature data” here) for which the ultimate source of truth is human-curated in obs_*_data packages (including validity ranges for all time). For those, I think it’s at least not unreasonable - and certainly simpler - to just dump the entire thing to a new collection rather than trying to identify what had already been ingested from the obs_*_data package in exactly the same form. After that dump, it would still be possible to merge that collection with one containing pipeline-produced (bias, dark, flat, fringe, …) calibrations to have one calibration collection for users.

Once a calibration collection is in regular use, I think it’s a convention/policy question whether we should be able to update it; the butler can certainly have that capability, but exercising it limits our ability to reproduce pipeline runs. We will of course have exact provenance that will let us exactly re-run DAGs that were produced (regardless of whether they were run or not) before the calibration change, but we wouldn’t be able to reproduce building those DAGs. I don’t think that matters, but it’s worth thinking about.

I am a bit skeptical that this is worth the effort. It’d be elegant, but I don’t see it as necessary, and I think round-tripping between expanded and normalized data structures is a fundamentally hard problem unless you hard-code a lot of the transformation both ways.

Maybe we’ll run into that problem either way, but I’m very worried those normalized obs_lsst YAML files represent the easy 80% of a problem we’re setting ourselves up to spend a lot of time fighting in the future. I don’t have a good alternative, though.

Unfortunately, not really. His system deals with how to read in that data, which is very useful, but it doesn’t answer the question of what to do with it in a gen3 butler. Most (all?) of what I’m calling “Instrument Signature Data” (to try to move us away from the overloaded word “Calibrations”) can be user-curated via the system @KSK describes there, and with the current Instrument design, there’s a writeInstrumentSignatureData method to put any such data into a Butler. But there is no Task yet to wrap that method, and the bigger question of calibration data (bias, flat) does need to be sorted out.