Requirements for overhauled calibration task?

rowen · December 17, 2015, 11:02pm

I see your point, but remember that @nidever’s frustrations at least partly stemmed from using bad data (bad enough that it’s worrisome that any images made it through processing). I think some tasks can probably manage with a guess. My understanding is that sextractor basically works this way, and as @KSK pointed out to me, astronomers are used to that and find it convenient.

jbosch · December 17, 2015, 11:04pm

I’m being a bit facetious here, but I don’t actually think that’s a good thing.

rowen · December 17, 2015, 11:32pm

jbosch:

I actually think it’s getting to be time to drop ProcessImageTask and ProcessCoaddTask. We should now be using the multiband processing tasks on coadds whenever possible, and to the extent we still want to support single-band processing on coadds, I think we should reimplement that as a simplified version of the multiband tasks.

That will eliminate the need for ProccessImageTask, which only existed to share code between ProcessCcdTask and ProcessCoaddTask, and I’m sort of happy about that, because I think ProcessImageTask’s relationship to it subclasses is a very big reason why the logic in ProcessCcdTask is hard to follow. Essentially, I think we’ve reached the point where processing CCDs and coadds is different enough that it doesn’t make sense to try to share code above detection/deblending/measurement subtask level.

I was hoping to retain ProcessImageTask (at least approximately) because the things that it does are the first three steps of ImageCharacterizationTask and CalibrateTask. I definitely don’t want to have ProcessCcdTask inherit from it any longer. I suspect ProcessCoaddTask could still inherit from it. In any case, its primary purpose becomes: detect sources, deblend, then run single-frame measurement. Perhaps a different name might be better.

I had already suggested this before you suggested eliminating the task, so I’m a bit concerned. Did I miss a reason to get rid of it, rather than simply repurposing it?

jbosch · December 17, 2015, 11:41pm

I just hadn’t noticed, but anyhow, repurposing is fine. The detect/deblend/measure combination is indeed a common one.

rowen · January 8, 2016, 10:25pm

Can somebody please explain what injecting fake sources is supposed to do? If I am reading the code correctly, the current ProcessCcdTask performs astrometric and photometric calibration first (by calling the Calibrate Task) before injecting fake sources and then calling ProcessImageTask.process, which performs one last detect/measure sequence. That is not possible using @jbosch’s proposed new sequence, which I will summarize here, since photometric and astrometric calibration are the final steps:

perform ISR
perform image characterization to estimate PSF and background:
- iterate N times:
  - detect (at high S/N), deblend and single-frame-measure
  - estimate PSF
- repair image
perform astrometric and photometric calibration
- detect (at low S/N), deblend and single-frame-measure
- astrometric calibration
- photometric calibration

jbosch · January 8, 2016, 10:37pm

It’s a useful way of testing detection, deblending, and measurement algorithms under more realistic conditions than tests on simulated data, with a bit more truth information than runs on real data.

You’ve identified a pretty clear problem with fitting it into the new sequence, and because fake source injection is not something we’ll be running in normal production mode I’d be inclined to turn it into something that’s completely an afterburner: a separate CmdLineTask (maybe still available as part of the new top-level Task we were considering) that adds the fake sources then runs detect/deblend/measure again with the same configuration as the pre-fake-injection detect/deblend/measure.

I imagine we’d want to write that out as a new catalog dataset, and I think we’d want to write the modified image as a new exposure dataset (rather than update src and calexp or their new counterparts in-place). But if we do make new datasets we’d want to make sure downstream processing (coaddition in particular) can run on the images with fakes, since we also want to see what happens to those sources in the downstream processing. Having multiple datasets that contain almost the same information is also a bit problematic from an efficiency standpoint, but that’s probably something we have to live with unless we can find clever ways to compress them (e.g. save the fake sources only in a separate image with mostly zeros, and reconstruct the full image with fakes on-the-fly by adding them).

ktl · January 8, 2016, 10:46pm

Are you sure about this? I was under the impression that DMS-REQ-0097 (LSE-61 1.2.10), which requires “detection efficiency for point sources vs. mag for each utilized filter,” implied a requirement to run fake source injection in the Alert Production at least.

rowen · January 8, 2016, 10:55pm

Fake sources sounds tricky. Pragmatically it would be easy to run a final detect/deblend/measure sequence after astrometric and photometric calibration – thus providing the current fake source behavior.

But then would we only do this when fake sources are added, or always. The latter gives more consistent output, but is that output better than the proposed technique of ending with astrometric and photometric calibration, or worse, or does it just waste time?

I’m a bit leery of accidentally getting fake sources where we don’t want them, which using special dataset names solves. I realize we can be careful how we save repos, and presumably that’s how its being handled now.

In any case, I will either want fake sources to be an after-burner or part of the calibration task, since I now have the calibration and image char tasks persisting data (because they are now cmd-line tasks), rather than the process CCD task.

rowen · January 8, 2016, 11:31pm

Another question is about propagating flags from data product “icSrc” to “src” (done in processCcd task). If I understand correctly, in the new design “src” is the result of characterizing the image and “icSrc” is the result of running calibration (in particular the final pass of detect/deblend/measure with lower S/N than the passes in image characterization).

Do we still want to propagate those flags? If so, I think we’ll want another source catalog data product, “charSrc”:

“charSrc” is written by image characterization, and is fully self-consistent with “charexp”, in that any sources marked as used for PSF were actually used to fit the PSF in “charexp”
“src” is “charSrc” with flags propagated

jbosch · January 9, 2016, 2:10pm

I was not aware of this, and while I think it sounds like a good idea to run in parallel to the actual Alert Production (if we have the compute power to do so), I’m worried that injecting fake sources into the processing we use to produce real alerts could be harmful.

jbosch · January 9, 2016, 2:16pm

I think we only do this when fake sources are added, as an afterburner. I’m worried about that processing being subtly inconsistent with the previous step too, but I don’t see any other option. To help maintain consistency, I think we should load the previous step’s configuration via the Butler rather than relying on a separate configuration for detect/deblend/measure steps of the fake-sources afterburner. And I think adding a unit test that runs the afterburner but actually adds no fake sources (or fake sources with zero flux) would be invaluable.

jbosch · January 9, 2016, 2:26pm

I didn’t quite follow all of the new dataset definitions, but I think that if we keep the “src” name, it should refer to the deepest catalog (lowest detection threshold), and if we keep both “src” and “icSrc”, “src” should be deeper than “icSrc”, and “icSrc” should refer to Image Characterization (“ic”), not Calibration.

We could also just drop the current names to avoid a lot of confusion (and maybe aid in a transitional deprecation period for the old version of the tasks, if we want one). How about this:

“charSrc” - associated with ImageCharacterizationTask. Would have an extra “iteration” data ID key (another name would be fine) since we will run detect/deblend/measure multiple times here, and we should at least have the option to persist them all.
“calSrc” - associated with CalibrateTask
“fake_calSrc” - associated with fake sources afterburner task; not actually used for calibration, but otherwise mimics “calSrc”.

rowen · January 9, 2016, 10:41pm

@jbosch I like your new names; they seem more self-documenting than what we have now (which brings up a butler suggestion: it would be useful to be able to associate a description with with dataset types).

That said, I think “icSrc” for sources from image characterization and “src” for the final sources (which now come out of the calibration task) is adequate, if folks would find it more comfortable to keep what we have. If we do go that route than I will propose “icexp” instead of “charexp” for the exposure output by image characterization, for unity.

I propose not saving each iteration of image characterization until the butler offers prototypes instead of predefining every dataset type.

I also propose making fake sources a new ticket, which means we will lose the capability temporarily: after merging the new calibrate task and before adding the fake source functionality back in.

jbosch · January 10, 2016, 6:24am

I personally have a slight preference for totally new names, but I’m also not strongly opposed to using the old ones (with icexp).

I agree with making fake sources a new ticket.

rowen · January 11, 2016, 6:07pm

Here is a heads-up for a possibly controversial plan.

First of all, for cameras that load pre-existing data instead of performing ISR, I am proposing we use a custom ISR task to perform the loading, rather than subclassing ProcessCcdTask. This seems cleaner and more natural to me. Affected packages include:

obs_sdss
obs_decam will presumably have two ISR tasks: one that loads the ISR output from community pipeline and one that actually performs ISR. Users will choose by retargeting.
obs_lsstSim’s ProcessEimageTask

Also, I am reluctant to offer doIsr, doImageChar, doCalib flags in the new ProcessCcdTask, because the meaning is ambiguous. Not performing a subtask could mean load the persisted results from an earlier run, or skip the stage entirely, and in some cases this leaves the next subtask without needed data. Instead of using boolean flags I propose we retarget the tasks in question. For instance if we want to skip image characterization and go straight from ISR to calibration then the retargeted image char class could add a PSF model to the post-ISR image, and create a blank background.

jbosch · January 12, 2016, 1:13am

I very strongly support this proposal (thanks for saving me from having to push this myself in the future).

Your proposal to drop these flags in favor of retargeting sounds sensible to me. If for some reason that doesn’t work, I strongly believe we should avoid the having any options to load persisted results from a previous run of the same Task. That really complicates the provenance and makes it hard to track down bugs. In fact, as long as we avoid having config options that do that, I’m okay with any approach to the ambiguity here.

rowen · January 12, 2016, 6:42pm

Upon further reflection, it may make sense to offer a “doIsr” flag that, if false, makes the task load “postIsrCCD” for image characterization. There are two issues at play:

First, ISR is a fairly stable operation. Users may prefer to run it once, then run characterization and calibration multiple times with different settings.

On the other hand, we don’t even really do ISR for some cameras. In particular:

For SDSS we load other files to generate the post-ISR data. To avoid duplicating data we don’t save this as “postISRCCD”, but regenerate it as needed.
For DECam we may want to perform our own ISR or load the results from the “community pipeline”.

There are several ways to support all this, including:

Make ISR an optional stage of ProcessCcdTask. For SDSS and DECam-with-community-pipeline use a special ISR stage that loads pre-existing data, and enable this flag. For DECam starting from raw data, use a different ISR stage.
Make ISR a separate command-line task and never have it be part of ProcessCcdTask. This requires handling SDSS and DECam differently, since the data product “postISRCCD” may not exist. I have heard the suggestion to handle this with special code in the camera camera mapper (e.g. when one requests “postISRCCD” in SDSS one runs the code that assembles the exposure from other data). DECam would require more care, since we may want to generate our own post-ISR data or use the community pipeline, but we could handle that by having two post-ISR data products. That requires adding a new config parameter to ProcessCcdTask.
Do both. Offer ISR as a command-line task and an optional component of ProcessCcdTask.

I have a slight preference for (1), but can see how (2) or (3) would also be appealing.

jbosch · January 13, 2016, 1:06am

I have a fairly strong preference for (3), but my reading is that this (along with (1)) wouldn’t need a separate doIsr option, since that would be fully specified by whether the IsrTask subclass that’s being used is a real one or a dummy one.

This sort of warm-start behavior is a very valid feature request in general that I think we want to find better ways to support, but it also adds a lot of complication to provenance tracking (and config persistence in particular). I don’t want to get into to too many of the details right now, but the root of the problem is that using configuration to change how much processing to do (rather than just how to do it) conflicts with our goal of having a consistent configuration tree within each output repository; if you run once with one configuration, and then with another (even just to do a warm-start), you’re forced to use --clobber-config, which disables provenance tracking and opens the user up to a multitude of subtle bugs in the Butler. There are discussions underway in how to fix it, but until that’s addressed, configuration options like this are fraught with peril.u

And in this particular case, ISR is so fast that I don’t think re-running it will be considered a practical burden for users, so I think it’s much safer to just avoid supporting this request for now.

rowen · January 14, 2016, 9:30pm

jbosch:

This sort of warm-start behavior is a very valid feature request in general that I think we want to find better ways to support, but it also adds a lot of complication to provenance tracking (and config persistence in particular). I don’t want to get into to too many of the details right now, but the root of the problem is that using configuration to change how much processing to do (rather than just how to do it) conflicts with our goal of having a consistent configuration tree within each output repository; if you run once with one configuration, and then with another (even just to do a warm-start), you’re forced to use --clobber-config, which disables provenance tracking and opens the user up to a multitude of subtle bugs in the Butler. There are discussions underway in how to fix it, but until that’s addressed, configuration options like this are fraught with peril.u

Thank you for the clear explanation of your concern about boolean flags. However, there is also a problem with proliferation of tasks: every task that is going to save provenance requires a new entry in all the obs_* mapper configuration files. That problem will go away when the butler is enhanced to support dataset prototypes (or whatever they will be called). Meanwhile it is not very practical to add a bunch of new tasks to the .paf files. I would rather live with boolean flags for now, despite the pain. Others may disagree.

timj · January 14, 2016, 9:48pm

@gpdf may want to comment further but one of the aims of the “superTask” work is to enable the provenance handling to be handled outside of the algorithmic code when the tasks are chained.