Missing from the DRP Plan

While @jdswinbank works on finishing a first complete version of the DRP long-term plan, I’m going to spend my effort trying to record what’s not in that plan:

  • areas where we lack sufficient detail to do a good job estimating resources (well, that’s probably everywhere, but I want to highlight areas where it’s particularly egregious);
  • areas where I’m already worried that our baseline algorithms are not going to work;
  • dependencies on other teams where our requirements are likely more demanding than they probably expect;
  • requirements we simply haven’t included in the plan.

I’ll be posting individual replies to this thread myself on individual topics; comments/questions welcome.

First up: some long-desired projects that affect all of DM (perhaps mostly AP and DRP) that have the potential to have a big effect on what other software development we can get done concurrently:

  • API improvements, Pythonification, potential Deswigification
  • Package reorganization
  • Conversion to more lightweight threading parallelization (I’m now pretty convinced this is when, not if).

I think it’s likely the combined cost for all of these is at least a year of full-team effort. It could easily be more if we don’t handle the transitions well. And if we don’t make at least some progress in these areas then I think we’ll have failed to provide a pipeline capable of supporting Level 3 at the level we ought to, because it’ll either be too unfriendly or too inefficient on non-LSST machines.

There is also the elephant in the room of Astropy integration. You can imagine that in 2020 Astropy will be quite an entrenched part of the astronomer toolkit.

Single Frame Processing

While this is formally a dependency from DRP on AP, the relationship in terms of who “owns” what in the codebase or a particular pipeline is quite a bit more complicated, and I think that complication makes this difficult to plan:

  • Doing ISR and photometric/astrometric calibration successfully (AP responsibilities) depends very much on the details of the calibration products pipeline (DRP responsibility). I’m fairly certain there are a ton of requirements from the calibration team that haven’t really been fully developed, let alone announced or included in any plans.

  • PSF Estimation (DRP responsibility) is a subcomponent of Single Frame Processing (AP responsibility). But we’ll actually have at least two rounds of PSF estimation in Level 2 (see here), and I think the most naive reading of the current LDM-151 is that the first one is AP’s responsibility as part of SFM, and the second one is DRP’s responsibility. But we’ve been putting plans for both in DRP’s plan on JIRA. And there will probably be another round in the actual Level 1 processing.

  • There are two versions of Single Frame Processing, one run during AP and one run during DRP. It’s not clear how different these need to be, but I worry that in LDM-151 we effectively consider them identical, and that the DRP variant may be falling between the cracks. We had a preliminary discussion to clarify what we want to do here, but that’s not been accounted for yet in the plan we have right now.

I also think that there are a lot of requirements that are nominally on higher-level products where the low-level work is actually going to be in single-frame processing (especially calibration requirements that depend quite a bit on ISR), so there may be a lot of emergent work here.

Something related to DRP planning: as far as I can tell, plans for the internal DRP database are not documented anywhere. My first guess is that they should be in LDM-135; I am sure @ktl or @timj will correct me if I am wrong.

Good point. I’m guessing we’re responsible for providing some requirements on that at least.

yes, that would help :). Related near-term epic: DM-2038

Crowded Fields

As @nidever has recently emphasized, our current stack isn’t really capable of handling crowded stellar fields. Improving this essentially isn’t in our recorded plans at all right now, and the vague mentions we have of it in e.g. the DPDD essentially put the burden all on measurement (02C.04.06), while the work will really have to happen all over the stack. Even our discussions of the algorithms we might use haven’t really progressed past hand-waving and reference to prior art (though I think prior art is in pretty good shape, as is the experience we have on the team).

Some of the areas where crowded field processing will require potentially big changes:

  • PSF estimation (especially the initial version that may be in 02C.03.01 or 02C.04.03; see post above on Single-Frame Processing) will need to work when there are very few (or even no) isolated stars. That may in turn require integrating initial PSF estimation with detection and deblending, since we may not even be able to do those effectively before we have a PSF model. This may also make it necessary to do full-visit or physical-model PSF estimation earlier than we otherwise would, as both of these could provide additional constraints on the PSF that could help when we have a small number of usable stars per CCD.

  • Background estimation (again, especially the first pass) will have to work when there are relatively few background pixels; we may have to rely less on masking and more on source subtraction.

  • Astrometric and photometric matching - potentially lots of false positives.

  • Deep detection and deblending: I think this is actually where we’ve started to think about this the most, but I’m still worried that processing crowded fields effectively will break the “clean” split we have between detection, peak association, and deblending; we may need to just merge all of these operations together, which could in turn require having a larger number of images (probably different flavors of coadds) in memory at once than we’d anticipated.

These are all great points and I’m glad this is being discussed.

Bob Benjamin of the Spitzer GLIMPSE team recently told me that for the GLIMPSE (right in the MW midplane) data processing it was important for background estimation to take into account and subtract all sources, even 1-2 sigma detections (that were consistent with being point stars). They didn’t put those marginal detections into their final output catalog, but they found it important to subtract them out to find the background otherwise their fluxes were wrong. The expert on this is Brian Babler at UW-Madison. Just thought I’d pass that bit of info along.
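As a rough illustration of the claim above, here is a noise-free 1-D toy in plain Python. Everything in it is made up for illustration (the Gaussian profile, the source spacing, the fluxes, and the helper names `gaussian_profile`, `render` and `background_after_subtraction`); it is not GLIMPSE code or stack code. It only shows that, when the marginal sources are real, leaving them in biases a median background estimate high. It says nothing about what happens when some of those marginal detections are false positives.

```python
import math
from statistics import median


def gaussian_profile(n_pix, center, flux, sigma=1.5):
    """Unit-integral Gaussian scaled to `flux`, sampled on pixels 0..n_pix-1."""
    norm = flux / (sigma * math.sqrt(2.0 * math.pi))
    return [norm * math.exp(-0.5 * ((x - center) / sigma) ** 2)
            for x in range(n_pix)]


def render(n_pix, sources, background=0.0):
    """Sum the profiles of all (center, flux) sources on a flat background."""
    image = [background] * n_pix
    for center, flux in sources:
        for x, value in enumerate(gaussian_profile(n_pix, center, flux)):
            image[x] += value
    return image


def background_after_subtraction(image, sources):
    """Median of the image after subtracting a model of the given sources."""
    model = render(len(image), sources)
    return median(p - m for p, m in zip(image, model))


# Hypothetical crowded field: a bright star every 10 pixels, with a
# marginal (faint but real) star halfway between each pair.
n_pix = 200
true_bg = 10.0
bright = [(c, 50.0) for c in range(5, n_pix, 10)]
faint = [(c, 2.0) for c in range(0, n_pix, 10)]
image = render(n_pix, bright + faint, true_bg)

bg_naive = median(image)                                      # no subtraction
bg_bright_only = background_after_subtraction(image, bright)  # faint left in
bg_all = background_after_subtraction(image, bright + faint)  # all removed

print(bg_naive, bg_bright_only, bg_all)
```

In this toy only `bg_all` recovers the input background; `bg_bright_only` stays above it because the faint stars cover most of the "empty" pixels.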

For PSF estimation/definition in crowded fields, it is probably best to use an iterative procedure. At the beginning, use a simple analytic PSF model for the brightest stars and only use pixels close to the centroids. Then use that plus the list of detected sources to subtract off fainter stars, but leave in the bright ones and use them to get a better estimate of the PSF. Iterate until you converge. Since some faint stars are undetected and/or heavily blended in the original image, you probably need iterative detection/subtraction to find all of those as well. Therefore, you might need two nested iterative loops: one for the PSF determination (the outer loop), and one for detection/subtraction (the inner loop).
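The control flow of those two nested loops might look roughly like the sketch below. This is only a skeleton under stated assumptions: the PSF is reduced to a single number (e.g. a width), and `detect`, `subtract`, `fit_psf` and `is_bright` are hypothetical callables supplied by the caller, not existing stack APIs. The toy stand-ins at the bottom exist only so the skeleton runs; in particular the toy `fit_psf` just moves halfway toward a made-up "true" width of 2.0 instead of fitting anything.

```python
def estimate_psf_iteratively(image, initial_psf, detect, subtract, fit_psf,
                             is_bright, tol=1e-3, max_outer=20, max_inner=10):
    """Outer loop refines the PSF; inner loop alternates detect/subtract."""
    psf = initial_psf
    sources = []
    n_outer = 0
    for n_outer in range(1, max_outer + 1):
        # Inner loop: detect on the residual, subtract everything found so
        # far, and repeat until nothing new turns up.
        sources = []
        residual = image
        for _ in range(max_inner):
            new_sources = detect(residual, psf)
            if not new_sources:
                break
            sources.extend(new_sources)
            residual = subtract(image, sources, psf)

        # Remove only the faint stars, keep the bright ones, and refit the
        # PSF on the cleaned image.
        bright = [s for s in sources if is_bright(s)]
        faint = [s for s in sources if not is_bright(s)]
        cleaned = subtract(image, faint, psf)
        new_psf = fit_psf(cleaned, bright, psf)

        converged = abs(new_psf - psf) < tol
        psf = new_psf
        if converged:
            break
    return psf, sources, n_outer


# --- Toy stand-ins (hypothetical): stars are single-pixel spikes. ---
def toy_detect(residual, psf):
    """Return the brightest remaining pixel if it is above a threshold."""
    idx = max(range(len(residual)), key=residual.__getitem__)
    return [(idx, residual[idx])] if residual[idx] > 1.0 else []


def toy_subtract(image, sources, psf):
    """Subtract each (pixel, flux) source from a copy of the image."""
    out = list(image)
    for idx, flux in sources:
        out[idx] -= flux
    return out


def toy_fit_psf(cleaned, bright, psf):
    """Pretend fit: move halfway toward a made-up true width of 2.0."""
    return 0.5 * (psf + 2.0)


toy_image = [0.0, 9.0, 0.0, 2.0, 0.0, 7.0, 0.0]
psf, sources, n_outer = estimate_psf_iteratively(
    toy_image, 3.0, toy_detect, toy_subtract, toy_fit_psf,
    is_bright=lambda s: s[1] > 5.0)
print(psf, sources, n_outer)
```

Passing the steps in as callables is a deliberate choice in this sketch: it keeps the loop structure separate from any particular detection or PSF-fitting algorithm.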

Having PSF estimation/determination, detection, measurement and deblending more closely linked is probably a good idea. To facilitate that it would be nice if these steps weren’t all so baked-in to processCcd, and you could more easily slap them together as needed.

When @jbosch refers to “API improvements”, I imagine he’s talking about us improving the API that we offer to end users, particularly in Level 3, and our own developers.

However, we’ll also need to allocate effort to handling change in APIs and infrastructure that the DRP uses itself. In particular, it seems unlikely that the pipe_base framework will survive unchanged into production: we already know that work is underway on a “SuperTask” system (although I’m not clear exactly what that involves), and I imagine that it will continue to evolve into the future.

We should also anticipate spending time on building out end-to-end systems for testing and commissioning. The boundaries between SQuaRE, NCSA and DRP/AP are perhaps fuzzy here, but at a minimum we should expect DRP to provide detailed descriptions of the required processing flow and support for deploying and testing the algorithms we’ve created.

Note that “SuperTask” is kind of a misnomer I think. Task2.0 might be more appropriate. It’s a reinvention/reimagination of tasks that will allow for them to be more easily stuck together in a workflow.

I think there are two separate things going on here:

  • I suspect @jdswinbank’s “super tasks” refers to what’s being brought over from the HSC side. It’s probably fair to call those “super tasks”, because they aggregate several existing tasks and allow them to be submitted easily to batch systems. They make running at scale a lot easier, but they’re not really a complete workflow solution. And I think it’s fair to say that from the perspective of people thinking seriously about our future workflow management system, this may actually be a step in the wrong direction, because these don’t provide enough information to allow the workflow system to manage jobs effectively.

  • I’m guessing @nidever’s “Task2.0” refers to the work @gpdf and @mgckind are thinking about, which is a change to the way existing tasks are run that will (as he said) allow them to be used more effectively in other contexts, including (I think) a true workflow management system.

On the contrary, my point was precisely that one of the sources of uncertainty in the DRP plan is that we’ll have to adapt to changes elsewhere in the stack which are not under our control and which haven’t yet been clearly specified. One of those changes is likely to be a rethought workflow system, and the first stirrings of that are the SuperTask effort from @gpdf and @mgckind.

It’s reasonable to suppose that we will have to allocate effort at some point to support this kind of work. Arguably, amortized over the length of the project, we could simply build in some plausible constant factor to account for this. At the moment, though, it’s not even clear how much effort we’ll have to devote to adapting to SuperTasks in the next few months: that puts a huge uncertainty on any short- to medium-term planning.

Interesting. I imagine I’ll have to talk to them directly, but while I understand the problem, I don’t see how subtracting the 1-2 sigma detections could be a solution; it seems like enough of those are going to be false positives that they’ll just end up biasing their background low instead. I’m more inclined to think of the background as being defined as the mean of everything you don’t detect, and then treat the effect of undetected objects as a source of additional (correlated) noise, but that’s not based on any practical experience.

I think in crowded regions there is a high probability that even 1-2 sigma detections are real objects. So, I think, the idea was that subtracting them out gets you closer to the “true” background. But I agree with your point as well. It would be good to get the full scoop from Brian/Bob on this matter.


I’m worried that we haven’t really found all the ways that including various wavelength-dependent effects will affect the pipeline, including:

  • wavelength-dependent PSFs and WCSs
  • detailed wavelength-dependent photometric calibration
  • different transmission curves for different epochs

To my knowledge, all of these are completely new territory for a large-scale pipeline, and while we’ve tried to account for them in several components already, I don’t think we’ve captured all of the components they will touch, and the degree to which they affect certain components probably won’t be well known until we’re much further along in building them.