Requirements for overhauled calibration task?

We are about to refactor the calibration task pipe.tasks.CalibrateTask DM-4255, and this posting is a request for changes you want to see in that task.

Here is what we have so far:

Refactor the task so it is easier to subclass (mentioned in DM-3150). The run method of the current task is quite long, so subclasses tend to require too much code duplication. Examples include CfhtCalibrateTask and SdssCalibrateTask.

DM-3150 CalibrateTask should optionally not compute initial astrometry.

In addition, DM-2880 Create transformation task for CalibrateTask outputs hints at desired changes to the CalibrateTask, but does not specify them. More information would be much appreciated.

If you have other desired changes please specify them here, or as new issues linked to DM-3759.

I actually think we should generally discourage custom CalibrateTasks for individual cameras, and find another way to allow cameras to customize what they need, whether that’s loading PSF models from elsewhere (as in SDSS processing) or finding and repairing CRs at the Snap level instead of at the visit level (for LSST processing). As you’ve noted, the main logic of CalibrateTask is too complex and too important to expect cameras to rewrite the whole thing, and I think the best way to fix that is to have the substitutions happen at a lower (i.e. subtask) level.

I also think we need to take a look at the sketch of the longer-term plan we put together here: Single-frame processing sketch

I’m not sure how much we can move towards that model now, but any steps we can take that won’t cause problems in the short term would be welcome. That would include:

  • Splitting image characterization from Source measurement (breaking up ProcessCcd, and possibly promoting CalibrateTask to a CmdLineTask). We need to do image characterization before we can do anything else, but from there we may want to do other things before we do source measurements (and we frequently may not care about doing source measurement at all).

  • Having two stages of PSF estimation. The first would be more or less like the one we have now, but with no need to use a catalog for star selection, and hence no need for preliminary astronometry - I think that actually simplifies things considerably. The second would be a totally new stand-alone task which would only temporarily use the same algorithms and interfaces, and would use a catalog of stars that isn’t just the external reference catalog (it’d be something we define ourselves in another task, using our own processing results from multiple visits and possibly external catalogs). I think it’d be okay to defer implementing the second one for now, which means this action item might just mean temporarily removing support for CatalogStarSelector and taking advantage of the simplification that offers.

  • Re-detect after determining the initial PSF model instead of reusing the detections from before. This gives us two different icsrc outputs, but it’d remove all the SchemaMapper complexity that’s in the current CalibrateTask, which is a huge win for readability. We should also consider running the deblender between the second stage of detection and the measurement stage that follows, and at least have an option to run it after the first stage of detection.

  • We’d like to do matching to an external catalog (and hence initial astrometry and photometry) over the full visit rather than within individual CCDs. I think we probably want to tackle this after everything else is done, as I’m not sure what parallelization approach we’ll take here, but I don’t think we can call the CalibrateTask overhaul complete until this is done, too.

@KSK suggested the following initial refactoring of the calibration task. It’s not an exact fit for your longer-term plan, but not too bad.

Simon’s Calibration Task Proposal:

Change CalibrateTask.run to call sub-tasks that do the following:

  1. Image characterization
  • iterate N times:
    • estimate PSF
    • interpolate over cosmic rays
    • estimate background
  1. Detection and measurement (basically sextractor)
  • iterate N times:
    • detect sources (using background estimated above, if available)
    • estimate background
  1. Astrometric calibration

  2. Photometric calibration

Russell adds these comments:

Initially astrometric and photometric calibration will both use the same catalog. Longer term we will support different catalogs for each, but I hope that won’t be until we improve how we persist and unpersist such catalogs.

If we break processCcd.py into two tasks, I assume we’ll have something like this:
calibrateCcd

  • ISR
  • image characterization

measureSources

  • detection and measurement
  • astrometric calibration
  • photometric calibration

I am not proposing to attempt to support any full visit work in this initial refactoring

New outline looks good overall, but I have a few small comments:

  • In section (1), I think there are implicit detection and measurement stages, because those are needed to feed PSF estimation. Those can go either before or after PSF estimation - that just depends on what you consider the first bootstrapped PSF.

  • I actually don’t think we need to iterate over section (2); even if we did, we’d only need to iterate over detection and background estimation, since measurement won’t affect the background.

  • Your split into two top-level tasks makes sense to me, though I’d avoid including “calibrate” in the name for the first one, given that the two subtasks that include “calibration” in their names are in the second top-level task. Maybe call the first top-level task “image characterization”, with “ISR” and “PSF estimation” subtasks?

I suggest a slightly different breakup of processCcd. It would be very useful to have a command line tasks that just performs ISR, basically the only portion of processCcd that touches/corrects the pixel data in the images (the rest is metadata or source catalogs). The rest of the steps in processCcd could go in a separate “calibrate” task or “calibrate” and “measurement”. I opened a ticket for this (DM-4635) but Russell suggesting adding it to this thread.

@jbosch thank you for your suggestions. I would appreciate some clarification:

You say “In section (1), I think there are implicit detection and measurement stages, because those are needed to feed PSF estimation”. I think you are referring to what I call “1. Image characterization”. I had assumed that we would start with a configured crude initial estimate of PSF that would not have to close. Perhaps the best seeing that the telescope is capable of, or average seeing-something that could truly be hard-coded per camera. If that doesn’t seem reasonable the we indeed need more work on this proposal.

You say “I actually don’t think we need to iterate over section (2); even if we did, we’d only need to iterate over detection and background estimation, since measurement won’t affect the background.”. I am confused because “2. Detection and measurement” only performs those two steps; it doesn’t do any measurement. I’m happy to not iterate even that, but if there’s any chance it would help then it makes sense to me to support it even if the default is 1 iteration.

Regarding names…will people be surprised if image characterization also performs ISR? If not then that name seems fine to me. However, I am uneasy about the task “1” being called “estimate PSF” because it estimates both PSF and background and not sure what else to call it “EstimatePsfAndBackgroundTask” seems a mouthful.

@nidever Are you sure you want ISR to be performed separately from image characterization? I’m surprised these would not normally be performed as one operation, even if camera-specific versions of ISR or PSF+background estimation were used.

Note: if we do offer ISR as a top-level task we run into the issue that top-level tasks cannot be retargeted. Many cameras have their own version of ISR task and so will have to provide their own executable (in obs_*/bin) to run them, each with a unique name, e.g. obs_cfht/bin/cfhtIsrTask.

I think it’ll be simpler if I just write out my counter-proposal (which I think isn’t that different from @rowen’s original).

I think I agree with @nidever that having ISR as a top-level task would be useful; maybe what we want is to split processCcd.py into three CmdLineTasks, but continue to have ProcessCcdTask to aggregate them (there’s nothing that says a subtask can’t be a CmdLineTask). We may even want to make ProcessCcdTask inherit from the new lsst.ctrl.pool.BatchPoolTask, but that’s really a separate issue (DM-3368).

Anyhow, here’s my proposal for the subtask hierarchy:

  • ProcessCcdTask(CmdLineTask)
    • IsrTask(CmdLineTask)
    • ImageCharacterizationTask(CmdLineTask)
      • SourceDetectionTask (just detect bright stars; includes background estimation)
      • SourceDeblendTask (maybe unnecessary, but we should include it until/unless we can prove that)
      • SingleFrameMeasurementTask (just algorithms needed to select stars)
      • MeasurePsfTask (then go back to SourceDetectionTask.run and iterate as necessary)
      • RepairTask
    • CalibrateTask(CmdLineTask)
      • SourceDetectionTask (detect to 5-sigma limit; re-estimate background)
      • SourceDeblendTask
      • SingleFrameMeasurementTask (*all algorithms needed for Source table and calibration)
      • AstrometryTask
      • PhotoCalTask

Maybe reusing the name “CalibrateTask” here isn’t a great idea; I’m open to other ideas for the name, but I also feel like “MeasureSourcesTask” is a bit too easy to confuse with the old “SourceMeasurementTask” or “SingleFrameMeasurementTask”.

@jbosch your proposal seems fine to me.

I suggest IsrTask be made a cmd-line task as a separate ticket, e.g. DM-4635, but that is a minor detail.

Jim, your plan looks good to me.

This seems good to me too, and tractable. I like having IsrTask broken out. I think it’s worthwhile to make it a requirement that CalibrateTask be runnable on ISR’d data that haven’t been put through ImageCharacterizationTask. Even if it’s not common in the future, it is very common today for people to want to run a single command line and get a reasonable (though not correct) list of detections.

Note that we’d be relying on ImageCharacterizationTask to provide a PSF model in this case. So I only think it’d only be feasible for CalibrateTask to work on data that skipped ImageCharacterizationTask if an LSST Psf instance was attached to the input image some other way. I could imagine an SDSS script that did this by loading and attaching the SDSS Photo PSF, for instance. An image with a dummy Gaussian Psf could also be used, but with understandable degradation in measurement quality.

Note that this new design of ProcessImageTask affects ProcessCoaddTask because both inherit from ProcessImageTask, which does the following:

  • source detection
  • deblending
  • single frame measurement
  • match sources to reference sources

The first three steps of this match the first three steps of ImageCharacterizationTask (which iterates those steps) and CalibrateTask, so I suggest the following:

  • Implement CalibrateTask as a subclass of ProcessImageTask (possibly removing matching from ProcessImageTask, depending on how many subclasses use it).
  • Implement ImageCharacterizationTask to call ProcessImageTask as a subtask (so it can call the subtask multiple times).
  • Change ProcessCcdTask to no longer be a subclass of ProcessImageTask, since it no longer shares significant behavior.

It would probably make sense to save the output from ImageCharacterizationTask as a new dataset type. Suggestions for names would be most welcome.

This complicates running CalibrateTask without running ImageCharacterizationTask, since the latter would normally use the new dataset type as input. @jbosch’s suggestion of having a step that adds a PSF might make sense in this context, preferably as a dummy image char task (hence starting from post-ISR data), since saving images with bad PSFs as if they had image characterization properly run sounds iffy.

I’m just looking to replicate basic SExtractor functionality. In the case where no PSF is attached, a user supplied one is used (or the default). I think we may be getting somewhat into a philosophical debate, but I would like to see command line tasks run on data without knowledge of previous processing step as much as possible.

I actually think it’s getting to be time to drop ProcessImageTask and ProcessCoaddTask. We should now be using the multiband processing tasks on coadds whenever possible, and to the extent we still want to support single-band processing on coadds, I think we should reimplement that as a simplified version of the multiband tasks.

That will eliminate the need for ProccessImageTask, which only existed to share code between ProcessCcdTask and ProcessCoaddTask, and I’m sort of happy about that, because I think ProcessImageTask's relationship to it subclasses is a very big reason why the logic in ProcessCcdTask is hard to follow. Essentially, I think we’ve reached the point where processing CCDs and coadds is different enough that it doesn’t make sense to try to share code above detection/deblending/measurement subtask level.

But this obviously requires an RFC, and that would probably require coming up with a plan on how to make the multiband tasks more user-friendly on generic single-band coadd images. That’s otherwise fairly orthogonal to the CalibrateTask work and a fair amount of effort in its own right. So I guess I’ll just say that we shouldn’t worry too much about code duplication in ProcessCoaddTask, as I think we can consider it deprecated even if we can’t remove it. That might mean we can just remove ProcessImageTask right now, but I’m happy to leave that as an implementation question to @rowen.

What about making IsrTask responsible for installing an initial default Psf? That will make the IsrTask and ImageCharacterizationTask outputs the same in terms of which objects are present in the Exposure (they’d differ only in the quality/correctness of those objects).

And if we have a script for running these tasks on images that didn’t go through our own ISR, that’s going to need to do some other manipulation of the external images anyway to get MaskedImage and Exposure objects; it might as well install a default PSF (or whatever better one might be available) as well.

Yeah, I don’t want to dive too far into philosophy on this right now either, but I will say that I agree that this is a worthwhile goal as long as it doesn’t get in the way of making our pipelines as good as they can be on well-characterized data, and I do think the more we do on that the less relevant it will be to completely generic data.

Given the desire to be able to run CalibrateTask on the outputs of either IsrTask or ImageCharacterization (and, downstream, a desire to run coaddition on the outputs of any of these), I think we want to think of all the datasets produced by these tasks (including the one we currently call postIsrCcd) to be versions of calexp. I think we may want to think about what sort of features we’d want in a “versioning” feature intrinsic to the Butler, and put that in as a feature request to @npease. At present, I think we have two options, and I’m not a huge fan of either of them:

  • Just add a string to the front or back of “calexp”, indicating at what stage it was produced, and use a configuration parameter on the tasks to tell them which ones to read and write. This is obviously quite similar to how we deal with types of coadds.
  • Add a “version” data ID key (maybe a string or a number), and just use the “calexp” dataset for all of these inputs and outputs.

One key question would be whether we want to consider any kind of overwriting existing files, and whether we can consider any updates to subcomponents (e.g. do we have to write the whole image if we’re just updating one mask plane?).

It would be nice if users didn’t even need to run Isr if they didn’t want to (SDSS, community pipeline DECam, etc.). What if this was something the data repository builder did?

A Butler hook that adds it when a dataset is loaded would be another option, and maybe a better one. I certainly agree that we don’t want to require users to run ISR if they have calibrated datasets from elsewhere, and a dummy ISR task has the significant disadvantage that it’d (probably) involve copying the image data just to put it in the right format and add the PSF.

This discussion does make me a bit worried about provenance, though, so @jbecla may want to pay attention: up to now, there’s essentially just been one way to produce a particular dataset on a given camera, and hence it’s fairly easy to trace what processing has been done (just using the configuration files stored in the data repositories). Once we have multiple ways to produce a calexp, we need to have a way for all downstream processing to remember which of them was utilized.

Another option is to make it the task’s responsibility to deal with a missing PSF. I like this because some tasks may have more tolerance for this than others, plus it needs no new mechanism. For example the CalibrateTask could have a defaultFWHM config param that is used to instantiate a simple Gaussian PSF if the exposure doesn’t have one.

Given the frustrations @nidever has recently documented with initial FWHM configuration options whose default values are not very good, I’d like to have as few as possible in the stack, especially ones that are rarely exercised.

My philosophy is that users should almost always want to let us model the PSF, and if they’re not doing that, we should be waving all kinds of flags to force them to think about what they should do instead, rather than making it easy to get a bad default.