Requirements for overhauled calibration task?

I don’t see where this concern is coming from; if we replace boolean doThing flags with no-op subtasks that we can retarget to in order to disable an option, the butler should be completely unaware of that change, since it only has datasets for top-level tasks.

Granted, our decision to make some subtasks into CmdLineTasks that can also be run independently does require some new mapper datasets (and it definitely complicates provenance - see my previous post on this). But the number of such CmdLineTask subtasks is sufficiently small that we shouldn’t distort our design for these Tasks just because Butler functionality isn’t available yet.

I decided to try this out by writing a null astrometry task. I found it quite messy. To simulate it well every public method must be overridden and it must do something sensible, then callers must be careful with the results. @jbosch and I discussed it and agreed that it’s not a good plan in general, though it may well be an excellent plan for some tasks, such as IsrTask that have simple, well defined null implementations (e.g. load postISRCCD).

@jbosch suggested we consider making lsst.pex.config.ConfigurableField allow a default of None as an easy way to disable tasks. I agree it sounds interesting, but want to make it a new ticket.

I’d like to move deep detection into ImageCharacterizationTask and am proposing just one new data product ‘icExp’ (and its associated background icExpBackground).

The current plan for ProcessCcdTask is as follows:

  • ProcessCcdTask(CmdLineTask)
    • IsrTask(CmdLineTask)
    • CharacterizeImageTask(CmdLineTask)
      • Repeat config.psfIterations times:
        • InstallSimplePsfModelTask: install a simple PSF model;
          the default implementation uses a Gaussian whose sigma matches the existing PSF model
          (if the exposure has one) else is take as a config parameter
        • RepairTask (do not remove cosmic rays)
        • Subtract background (since detection messes up, otherwise)
        • DetectAndMeasureTask (bright stars only)
          • SourceDetectionTask (just detect bright stars, 10-sigma limit; includes background estimation)
          • SourceDeblendTask (maybe unnecessary, but we should include it until/unless we can prove that)
          • SingleFrameMeasurementTask (just algorithms needed to select stars)
        • MeasurePsfTask (then go back to SourceDetectionTask.run and iterate as necessary)
      • RepairTask (remove cosmic rays)
    • CalibrateTask(CmdLineTask)
      • DetectAndMeasureTask (deep detection, measure and apply aperture correction)
        • SourceDetectionTask (detect to 5-sigma limit; re-estimate background)
        • SourceDeblendTask
        • SingleFrameMeasurementTask (all algorithms needed for Source table and calibration)
      • AstrometryTask
      • PhotoCalTask

I would like to move the deep detection run of DetectAndMeasureTask from CalibrateTask to CharacterizeImageTask. It seems a more natural fit because it can be argued as part of image characterization, and image characterization already performs similar operations.

The result is:

  • ProcessCcdTask (a CmdLineTask)
    • IsrTask (a CmdLineTask)
    • CharacterizeImageTask (a CmdLineTask)
      • Repeat config.psfIterations times:
        • InstallSimplePsfModelTask: install a simple PSF model;
          the default implementation uses a Gaussian whose sigma matches the existing PSF model
          (if the exposure has one) else is take as a config parameter
        • RepairTask (do not remove cosmic rays)
        • Subtract background (since detection messes up, otherwise)
        • DetectAndMeasureTask (bright stars only)
          • SourceDetectionTask (just detect bright stars, 10-sigma limit; includes background estimation)
          • SourceDeblendTask (maybe unnecessary, but we should include it until/unless we can prove that)
          • SingleFrameMeasurementTask (just algorithms needed to select stars)
        • MeasurePsfTask (then go back to SourceDetectionTask.run and iterate as necessary)
      • RepairTask (remove cosmic rays)
      • DetectAndMeasureTask (deeper detection, measure and apply aperture correction)
        • SourceDetectionTask (detect to 5-sigma limit; re-estimate background)
        • SourceDeblendTask
        • SingleFrameMeasurementTask (all algorithms needed for Source table and calibration)
    • CalibrateTask (a CmdLineTask)
      • AstrometryTask
      • PhotoCalTask

I was also tempted to eliminate CalibrateTask and have ProcessCcdTask call AstrometryTask and PhotoCalTask directly, because CalibrateTask does very little and the name is easily confused with CharacterizeImageTask. However, a requirement is to make calibration independently runnable as a command-line task, so I retained it.

To make CalibrateTask independently runnable requires that save the exposure that would be fed to it as a new dataset type.

Also, it is common to want to disable calibration, and if it is disabled then at present the resulting exposure is still saved as calexp. This is very useful because it means calexp is the best available version of the science exposure; tasks that take an image like that always can ask for calexp without needing to mess with the name.

I suggest saving the pre-calibration exposure as icExp and also saving it as calexp if calibration is disabled. That way CalibrateTask can be run later, if desired, but meanwhile other tasks can run without adding extra configuration.

I suggest not saving any other data products (such as the exposure after bright star detection and PSF modeling), because the changes are so small between that and after deeper detection.

I welcome suggestions for better names than CharacterizeImageTask and CalibrateTask. @ksk suggested the first one could be called MatchedFilterMeasurementTask. Names that include the steps quickly get too long (e.g. DetectMeasureAndModelPsfTask which still omits background modeling, repair and the final deeper detection and measurement pass).

I have heard complaints about the existing dataset names icSrc and src, but unless somebody can suggest good alternatives, I would rather keep them.

This looks quite good. A few comments/questions:

  • On the second (and greater) iteration will InstallSimplePsfModelTask use the “current best” PSF found during the previous run of the MeasurePsfTask step?
  • Is it a problem that astrometry is happening after the CharacterizeImageTask where the PSF is defined? What if we want to use an external catalog to select PSF stars? We’ll need decent astrometry to match those up.
  • Does the measurement task need the photometric zero point that PhotoCalTask produces? If so, wouldn’t we need measurement after the photometric calibration is run?
  • Can we run detection/measurement “SExtractor style”, i.e. independently on the command line? It’s not clear to me if DetectAndMeasureTask is a CmdLineTask.
  • Have you guys thought about adding a step (in the 2+ iteration) to subtract stars near the PSF stars to produce a cleaner PSF on the next round (this is a common step in packages like DAOPHOT)? Or is this all taken care of by ignoring pixels that have the detect mask set? The latter wouldn’t work well in crowded fields (because that would mask out most of the pixels in the image), but the formed could help.

I haven’t been following this discussion as closely as I would have liked, so apologies if you already answered some of these.

I will answer the ones I can. I hope others who know more about data characterization can handle the others.

Unfortunately not. That seemed like the right thing to do. Unfortunately I found that the PSF computation simply would not converge; successive iterations had wild swings in sigma, and with enough iterations the computation would fail. So instead I start each iteration using a simple Gaussian PSF model whose sigma matches the sigma of the most recently computed PSF model.

The intention is that CharacterizeImageTask can be run standalone, which is similar to running PSFEx followed by SExtractor. There are no current plans to make a standalone DetectAndMeasure.

So what is actually updated from iteration to iteration?

Is there a technical reason for this, or just not a priority now?

The sigma of the simple Gaussian PSF model changes each time, and that model is used to repair the initial exposure and estimate background before performing detection, deblending and measurement.

Mostly the latter. Writing a simple task that read icExp and wrote src would not be difficult, though until we have the new butler it would require not persisting the config or adding more entries to various obs_* .paf files. An alternative I’m not keen on is to add a switch to CharacterizeImageTask that disables the initial detect/measure/fit PSF loop and call that our SExtractor. Switches tend to really complicate the task code, so we’re trying to avoid them.

@nidever Here’s a variant that might answer your issues:

  • The code that was called CharacterizeImageTask remains as it was (just iterate to find a PSF and background model). It acts like PSFEx and can be retargeted if you want to use some other technique to find PSF stars.
  • The deep detection pass in ProcessCcdTask is performed by a direct call to DetectAndMeasureTask, which becomes a command-line task and is like SExtractor

This means adding a few more more dataset types, which will be a naming headache, but keeps CharacterizeImageTask simpler.

I’ll try to reply to many things at once here.

First off, I’m not going to reply on dataset names yet. I think @npease has implemented some sort of aliasing scheme for dataset names that I think should inspire us to revisit the names of all of our datasets. But I don’t know enough about how it works to have an intelligent proposal right now, and I could easily be persuaded that this change should be done separately from @rowen’s work on restructuring the tasks.

Here’s my proposal for now:

  • I’m okay with moving the deep DetectAndMeasureTask run to CharacterizeImageTask.
  • CalibrateTask and CharacterizeImageTask should both write a full exposure dataset when they’re run independently. The same would be true of IsrTask, of course.
  • When run as part of ProcessCcdTask, config overrides there would cause IsrTask and CharacterizeImageTask not to write out their exposure datasets by default. I expect this to be the approach taken whenever we run data at large scale, and hence the only version we need to worry about when worrying about wasting disk space or I/O time.
  • Downstream processing would always depend on only calexp, and never icExp, and I don’t think that CharacterizeImageTask should ever write a calexp. That’s because while it’s important to be able to process an image without needing a reference catalog, there aren’t actually any downstream processing steps that can do anything with the image unless it has a Wcs and a Calib.

I think the right fix for this requires configuration features we don’t yet have: we need a way to configure each step in each iteration separately.

Of course, we could probably also do some work on the existing algorithms to make them more robust in this particular case, but that should be orthogonal to the refactoring. This approach sounds like it’s okay for now.

This is actually a very important tension I’d never really appreciated in previous debates about “reimplementing SExtractor with the stack”. I don’t think we want to think of any of these command-line tasks as a SExtractor replacement. And that’s because I think we want these tasks to be specific parts of an overall production, while the subtask components (e.g. DetectAndMeasureTask) are the reusable components, like SExtractor. I do think DetectAndMeasureTask essentially is SExtractor, albeit with a Python interface instead of a command-line one. And while it’d probably be very useful to have a command-line interface for DetectAndMeasureTask, I’m not sure it that interface should be a CmdLineTask, as that implies all sorts of Butler, workflow, and provenance connections that aren’t necessarily appropriate for one-off, ad-hoc use. It’s also not clear to me what relationship a command-line driver for a reusable subtask should have to the obs_* packages. I can imagine a close connection being very useful in some contexts, but being required to have an obs_* package before using such a tool is clearly not good in others.

In the eventual DRP pipeline I’m imagining, we’ll have to produce a preliminary PSF model without the aid of such a catalog in all cases before we can generate a final one later. In that context, all we’re talking about is that preliminary model - but it’s also all we have right now. If anyone can think up a case where they think it might be impossible to come up with any usable PSF model without the aid of an external catalog, I’d be very interested to discuss that.

It mostly does not need the zeropoint or other CalibrateTask outputs (since all outputs are in raw units). We do have a single algorithm (CModel galaxy fitting) that does use a zeropoint so it can make use of a Bayesian prior, but I’ve come to the conclusion that this was a mistake, in that I need to also provide a version of that algorithm that can work without a zeropoint in order to solve exactly this problem.

I’ve been thinking about how to fit it into the future pipeline sketch I’m working on, but I hadn’t planned to address it in this refactoring. But you make a very good point that adding a subtraction step to the iteration @rowen is adding here could be a natural way to try solve that problem. I also think that subtracting the extended profiles of bright stars fits in around the same place.

I agree. I had intended that CharacterizeImageTask write icExp. Then CalibrateTask would write calexp, which is the same as icExp if astrometric and photometric calibration are not run.

I wonder if it would be best to make DetectAndMeasureTask a separate step of ProcessCcdTask. The primary benefit is that it simplifies CharacterizeImageTask (which should probably be called MeasureInitialPsfTask) and makes it more modular. A drawback is that ProcessCcdTask is no longer a sequence of command-line tasks that can be run independently to produce the same result (unless we make DetectAndMeasureTask a command-line task and add some data products). I suppose this could also be taken as an argument for leaving the deep detection and measurement in the calibrate task. That might be best for now, with more refactoring later.

Thanks for the clarification. I had misunderstood this statement in your earlier post:

Now that (I think) I understand you better, I do still think it would be better to make CalibrateTask solely responsible for writing calexp, so calexp is not written when CalibrateTask is not run, and for ProcessCcdTask to warn in its config validatation when CalibrateTask is not being run but icExp is not being written. Does that make sense? I’m worried about us writing something that claims to be a calexp, but is actually an icExp, and actually isn’t useful for any of the downstream processing that needs a true calexp.

None of the options here jump out to me as obviously better than the others.

If we put the deep DetectAndMeasureTask run in CharacterizeImageTask, I’m worried that CharacterizeImageTask gets too complex in all the schemas and DetectAndMeasureTasks it needs to juggle, especially because that last DetectAndMeasureTask run is really only needed to feed CalibrateTask. But that may all be outweighed by the convenience it provides in allowing users who don’t want to run CalibrateTask to easily get a deep measurement catalog.

So if we think about how things fit together in a full pipeline, the deep DetectAndMeasureTask run clearly goes in CalibrateTask, since we’re doing it simply to feed those calibration routines (and, eventually, a bigger calibration routine like the one in meas_simastrom). I think I have a slight preference for this approach, but I’m willing to be overruled by @nidever if he’s concerned about not having an easy way to get to deep detection and measurement without going all the way to astrometric matching and photometric calibration.

I like this argument. Also, having one schema per task is much nicer.

As to running calibrate or not, and writing calexp, I know of these uses cases:

  • The image has excellent astrometry; don’t do astrometric calibration
  • We don’t have a reference catalog so we can’t do astrometric or photometric calibration

These suggest the following solution:

  • CalibrateTask contains two flags: doAstrometricCalibration and doPhotometricCalibration
  • CalibrateTask always writes calexp even if those two calibration steps have not been performed

I am open to other ideas.

The only case I’m aware of where the astrometry is already good is SDSS, for which the photometric calibration is also already good. If that is indeed the only such case, I’d be tempted to continue to have an SDSS-specific CalibrateTask to load and attach those to the exposure (it might now be more appropriate to do it here rather than earlier). Since CalibrateTask doesn’t have to do nearly as much as it used to, I think the downsides of having a camera-specific override would then be considerably lessened.

I had assumed that if we don’t have a reference catalog, we would just not run CalibrateTask itself at all (perhaps by running IsrTask and CharacterizeImageTask separately, perhaps by having a doCalibrate flag in ProcessCcdTask), but I’m not opposed to your proposal with two flags within CalibrateTask, and it does perhaps provide a better solution for users who do want deep measurements but don’t have a reference catalog.

Do all combinations of those two flags (e.g. doAstrometricCalibration=False, doPhotometricCalibration=True) make sense?

photoCalTask is not yet independent enough to do its own matching, though in the long run it surely will, since we want to be able to use separate astrometric and photometric catalogs. We can still have two flags if we want, but they won’t be fully independent yet.

Note that the current CalibrateTask has 4 flags: doAstrometry, doPhotoCal, requireAstrometry, requirePhotoCal. The latter two cause the task to raise an exception if the step fails This seems rather complex and I don’t see any code or config overrides that set the require flags true. Really these are tri-state: don’t do it, do it on a best-effort basis, do it and give up if it fails.

How about we do matching if either flag is set, but only solve for and attach a new WCS if doAstrometricCalibration=True?

I’d be happy to do that if desired. It clutters the task a bit, but not horribly so. The clutter could go away when we support separate astrometry and photometry catalogs. Do you have an opinion about the “required” flags?

I’m actually surprised to hear that the default behavior is to continue (with a warning, I’d guess) if any of the calibration steps fail. In a steady-state, full-scale processing mode, I think we want to fail by default and catch that at a higher level, because it doesn’t make sense to produce a calexp when you haven’t actually managed to determine its Wcs or zeropoint. But I can also imagine that being frustrating for everyone trying to get the stack working on new cameras.

I think that means we should probably leave those “require” configuration options, and have the debate about what the default should be some other day.

I propose to change the meaning of the “require” flags slightly. They will default to True and will be ignored if the corresponding “do” flag is False. That makes it trivial to enable or disable astrometry and photometry, but if they are enabled then they must run, by default.

Interesting. Why are you designing it in that way? And will there then be a second stage that does PSF determination with a catalog of “PSF stars”?

Thanks for all the other responses. Those answered my questions.

Yes, there will be a second stage that does PSF determination, and the reason we’re doing things this way is because I think we’ll need to generate that catalog of PSF stars at least to some degree from our own processing. And I think that’ll need to happen after or as part of our big internal relative astrometry fit (i.e. meas_simastrom), because at that point we’ll be able to:

  • get secure more star/galaxy classifications by combining information from multiple epochs;
  • get some estimate of the SED from the colors (needed for wavelength-dependent PSF modeling);
  • improve our estimates of the true positions of the stars, including proper motion.

So once we’ve done the internal relative astrometry fit, we’ll be able to go back and make better PSF models. I’m hoping that we don’t actually have to do any further processing (e.g. coadds) to build the full PSF star catalog, but I think we’ll have to go at least through the relative astrometry fit.