jointcal outputs and butler object histories
Notes from a discussion among @hsinfang, @jbosch, @laurenam, @natelust, @natepease, @parejkoj, @rowen about what we do with jointcal outputs (e.g. new WCS, updated source catalog coordinates, new spatially-varying photometric calibration) and how we manage processing history.
We will need to produce:
- a list of use-cases
- two (or more) example proposals
and then check whether our proposals meet the use-cases.
Initial notes
- What do we do when we update pieces of an exposure but don’t want to write the whole thing?
- Bosch: the clear answer is don’t write full exposures. But we shouldn’t have to care, as the butler will just have to handle it.
- rowen: why not just always save things as pieces?
- The butler will have to provide a mechanism to produce self-contained files for “sharing”.
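As a minimal sketch of the “save things as pieces” idea (a toy store, not the real butler API; all dataset names here are hypothetical):

```python
# Toy component-level persistence: each piece of an "exposure" is its own
# dataset, so a later stage can replace one piece without rewriting the rest.
class ToyButler:
    def __init__(self):
        self._store = {}  # (dataset_type, frozen dataId) -> object

    def put(self, obj, dataset_type, **data_id):
        self._store[(dataset_type, frozenset(data_id.items()))] = obj

    def get(self, dataset_type, **data_id):
        return self._store[(dataset_type, frozenset(data_id.items()))]

butler = ToyButler()
# singleFrame writes all the pieces...
butler.put("pixel data", "calexp_image", visit=1234, ccd=5)
butler.put("initial WCS", "calexp_wcs", visit=1234, ccd=5)
# ...and jointcal later replaces only the piece it improved.
butler.put("jointcal WCS", "calexp_wcs", visit=1234, ccd=5)
print(butler.get("calexp_wcs", visit=1234, ccd=5))  # "jointcal WCS"
```

Producing a self-contained file for “sharing” would then just be a matter of the butler reassembling all the pieces on export.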
photometry
- We need to save a calib object with a spatially-varying transmission to persist the new photometric fits.
- Bosch: “We don’t update the image, we write a new calib.”
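A sketch of what “write a new calib” might look like: the jointcal fit is persisted as its own small object mapping pixel position to a zero-point, and the image pixels are never rewritten. The class and coefficients are invented for illustration:

```python
import numpy as np

class SpatiallyVaryingCalib:
    """Toy calib: zero-point as a low-order polynomial in pixel position."""
    def __init__(self, coeffs):
        self.coeffs = coeffs  # [constant, x-slope, y-slope]

    def zero_point(self, x, y):
        c0, cx, cy = self.coeffs
        return c0 + cx * x + cy * y

    def counts_to_mag(self, counts, x, y):
        # The image pixels stay untouched; calibration is applied on read.
        return -2.5 * np.log10(counts) + self.zero_point(x, y)

calib = SpatiallyVaryingCalib([27.0, 1e-5, -2e-5])
print(calib.counts_to_mag(1e4, x=512.0, y=1024.0))
```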
astrometry
- Source catalogs contain sky coordinates.
- Could we not persist sky coordinates at all, and build them as necessary from the pixel positions?
- A “versioned” WCS: the butler can persist each WCS and keep a “default”.
- Numbered versions might be a problem (what is “version 5 WCS”?). Go by name instead? e.g. “version=jointcal” (see the sketch after this list).
- In-memory objects don’t know what they are. The butler knows what is what via the dataId.
- What if someone wants to compare different ones (e.g. single-frame vs. jointcal WCS)?
- We can put “markers” after every pipeline stage (“post-singleFrame”, “post-jointcal”, “post-blah”). Like commits on a git branch: a new version may not have touched everything, but it will contain everything up to that point.
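A sketch of the version-by-name idea above (the version names and the “default” pointer are invented for illustration), which also handles the “compare different ones” case:

```python
# Each WCS is stored under a named version; a movable "default" pointer
# plays the role of the convenience name, much like a git branch.
wcs_versions = {
    "post-singleFrame": "WCS from image characterization",
    "post-jointcal": "WCS from the jointcal fit",
}
default_version = "post-jointcal"

def get_wcs(version=None):
    return wcs_versions[version or default_version]

print(get_wcs())                    # whatever the default points at
print(get_wcs("post-singleFrame"))  # explicit, e.g. to compare the two fits
```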
Make supertask behave like git
Every new rerun makes an atomic update. You don’t see all the reruns in one change, but you can go back to any commit. To rerun a pipeline, you can just rerun from the point at which things hadn’t yet changed (e.g. a previous singleFrame/jointcal with a new coadd).
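A toy model of the git analogy (assumed semantics, not the actual butler design): each rerun records only what its supertask wrote plus a pointer to its parent, and a lookup walks back through the chain, so the tip “contains” everything up to that point:

```python
class Rerun:
    """A rerun as a git-like commit: a delta of datasets plus a parent."""
    def __init__(self, name, parent=None):
        self.name, self.parent, self.datasets = name, parent, {}

    def put(self, key, value):
        self.datasets[key] = value

    def get(self, key):
        rerun = self
        while rerun is not None:          # walk the chain like git history
            if key in rerun.datasets:
                return rerun.datasets[key]
            rerun = rerun.parent
        raise LookupError(key)

single_frame = Rerun("singleFrame")
single_frame.put("wcs/visit=1234", "initial WCS")
jointcal = Rerun("jointcal", parent=single_frame)
jointcal.put("wcs/visit=1234", "jointcal WCS")
coadd = Rerun("coadd", parent=jointcal)
print(coadd.get("wcs/visit=1234"))  # "jointcal WCS", found via the chain
```

Rerunning the coadd with a new jointcal would then just mean creating a new `Rerun` whose parent sits at the point in the chain that hadn’t yet changed.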
- Not a 1-1 relationship between supertasks and datasets, but there is a 1-1 relationship between supertasks and elements in the data repository chain (i.e. reruns).
- How to choose the parent is TBD.
- Associate the number of uses/most recent use with the repositories to help identify orphaned branches.
- This may need a “repo of repos”.
- In general, you point at the tip of the rerun and that’s what you want.
- npease: Provenance needs to go into the output repositories. We need to be able to go back and find “what was valid on this day?”
- Post-jointcal, do we overwrite the WCS in a given dataset (and use the “repo of repos” to find the old ones)? Or do we write a new dataset? Or do we give the new one a new name?
- Bosch: we’ll need two layers of names: “the real name of things” and “the convenience name”.
- Example: “image characterization WCS” and “jointcal WCS” would both get the convenience name “WCS”.
- Instead of the convenience names, each stage could be identified by the supertask names that were run.
- nlust: Why can’t we just use the branches to identify things?
- rowen: What happens if we reprocess 3 frames and one of them fails? What is the “correct” WCS for the failed one?
- npease: supertasks should be able to say “I failed”, so that the butler can know one stage is bad.
- This also needs to deal with the case where one piece failed but most of the processing is ok.
- If you name the WCS differently for each supertask, then a generic supertask that just wants “the best” will try the most recent repository and fail if that doesn’t have what it wants.
- We want earlier datasets to be replaceable. But once we’re in production, each step will deeply care about what the previous step did.
- If jointcal failed on one frame, then when we read it in, the butler shouldn’t look earlier in the history for the failed WCS. It should just say that frame has an invalid WCS (see the sketch at the end of these notes).
- Bosch: What if we want to run tasks A and B again (with a previous configuration) on a new tract?
- If the repository hashes include the dataIds, we will get a new hash, so we’ll have to be able to merge if we want to do something across those tracts later on.
- If the repository hashes don’t include the dataIds, we have to be able to update a previous repository (unchanged hash). Are we allowed to change any of the other data in that repository?
- More than writing “this has failed”, we need to write “this should have succeeded” when you start something.
- This is the same as having a lock.
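Tying the failure and lock points together, a sketch (marker names and semantics invented for illustration): a supertask writes “started” up front, upgrades it to “succeeded” or “failed” on exit, and the lookup refuses to fall back past an explicit failure:

```python
import contextlib

markers = {}   # dataset key -> "started" | "succeeded" | "failed"
reruns = [{}, {"wcs/visit=1234": "initial WCS"}]  # newest rerun first

@contextlib.contextmanager
def producing(key):
    """Record intent up front; doubles as a lock against concurrent writers."""
    if markers.get(key) == "started":
        raise RuntimeError(f"{key} is already being produced (locked)")
    markers[key] = "started"      # "this should have succeeded"
    try:
        yield
    except Exception:
        markers[key] = "failed"   # "I failed": an explicit, persisted state
        raise
    else:
        markers[key] = "succeeded"

def get(key):
    """Walk reruns newest-first, but never fall back past a known failure."""
    if markers.get(key) == "failed":
        raise RuntimeError(f"{key} failed; not falling back to an older version")
    for rerun in reruns:
        if key in rerun:
            return rerun[key]
    raise LookupError(key)

try:
    with producing("wcs/visit=1234"):
        raise ValueError("jointcal did not converge on this frame")
except ValueError:
    pass

try:
    get("wcs/visit=1234")
except RuntimeError as exc:
    print(exc)  # the frame is reported invalid, not silently given a stale WCS
```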