We’ve currently got two projects in the works that involve aggregating CmdLineTasks into higher-level tasks:

- @rowen’s overhaul of ProcessCcdTask on DM-4255 (see related discussion in this topic)
- the transfer of the HSC BatchPoolTasks that aggregate coaddition and multi-band coadd processing.

One consequence of this is that the configuration options for all of these tasks now have two logical locations, one relative to the outermost
Task's config root, and one relative to the lower-level config root. For instance, in the HSC side’s stack.py, we have an option config.assembleCoadd.doMatchBackgrounds, which is addressed relative to the task's own config root, without the assembleCoadd prefix, when AssembleCoaddTask is run directly.
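
To make the two locations concrete, here is a minimal sketch that uses plain Python dataclasses as stand-ins for config classes. It deliberately does not use lsst.pex.config; the class names, the get_option helper, and the default value are all hypothetical and only illustrate how the same option ends up with two dotted names.

```python
from dataclasses import dataclass, field


@dataclass
class AssembleCoaddConfigSketch:
    """Hypothetical stand-in for the config of AssembleCoaddTask."""
    doMatchBackgrounds: bool = True  # arbitrary default, for illustration only


@dataclass
class StackConfigSketch:
    """Hypothetical stand-in for the config of an aggregator such as stack.py."""
    assembleCoadd: AssembleCoaddConfigSketch = field(
        default_factory=AssembleCoaddConfigSketch
    )


def get_option(config, dotted_name):
    """Look up an option by its dotted path relative to the given config root."""
    value = config
    for part in dotted_name.split("."):
        value = getattr(value, part)
    return value


# Run through the aggregator: the root is the aggregator's config.
aggregate_root = StackConfigSketch()
via_aggregator = get_option(aggregate_root, "assembleCoadd.doMatchBackgrounds")

# Run the task directly: the root is the task's own config,
# so the same option is reached without the "assembleCoadd" prefix.
direct_root = AssembleCoaddConfigSketch()
direct = get_option(direct_root, "doMatchBackgrounds")
```

Both lookups reach the same logical option, but nothing ties the two names together, which is the bookkeeping problem described above.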
It’s obviously highly desirable that we get the same results regardless of whether we run a
CmdLineTask directly or through a higher-level
Task aggregator, but these different names make it difficult to ensure that.
There’s also a complication: we don’t actually want all of the configuration options to be the same in both cases, because some of them control I/O or warm-start details that we may in fact want to be different:

- When we run lower-level CmdLineTasks directly, they may need to read and write intermediates that could simply be kept in memory when they’re run as part of an aggregator.
- When output products are already present for some data IDs, we may want to skip those data IDs rather than reprocess them.

The configuration options that relate to these questions are hence not about what the processing does, and they should have no effect on the ultimate science outputs; they’re strictly about how and when to execute the code. I think we need to split our current configuration for each
Task into two clear categories, and stop requiring that the “how and when to execute the code” configurations be consistent between runs.
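
One way the split could look, sketched with plain dataclasses rather than our real config classes: the option names writeIntermediates and skipExisting are made up for illustration, as is the configs_are_consistent helper. The point is only that a consistency check between runs would look at the "what" category and ignore the "how and when" category.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class ScienceConfig:
    """Options that determine what the processing does."""
    doMatchBackgrounds: bool = True  # arbitrary default, for illustration only


@dataclass
class ExecutionConfig:
    """Options that only control how and when the code runs (hypothetical names)."""
    writeIntermediates: bool = True
    skipExisting: bool = False


@dataclass
class TaskConfigSketch:
    """Hypothetical task config split into the two categories."""
    science: ScienceConfig = field(default_factory=ScienceConfig)
    execution: ExecutionConfig = field(default_factory=ExecutionConfig)


def configs_are_consistent(a, b):
    """Compare the configs of two runs, ignoring the execution category."""
    return asdict(a.science) == asdict(b.science)


# A direct run and an aggregated run that differ only in execution details.
direct_run = TaskConfigSketch()
aggregated_run = TaskConfigSketch(
    execution=ExecutionConfig(writeIntermediates=False, skipExisting=True)
)

# A run whose science configuration really is different.
changed_science = TaskConfigSketch(science=ScienceConfig(doMatchBackgrounds=False))
```

Here direct_run and aggregated_run would count as consistent, while changed_science would not.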
At present, the choice of whether to run a
CmdLineTask directly or as part of a higher-level aggregate falls into the same category; it’s just about how to run the code, not what we actually want to run. That means we could identify exactly one aggregator for each
CmdLineTask, and add some mechanism to always include that in the configuration hierarchy (i.e. we’d always use
config.assembleCoadd.doMatchBackgrounds). In other words, for any
CmdLineTask, we have only one outer context in which we’d run it, so it’s safe to assume that context and include it in the configuration hierarchy.
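
As a rough sketch of what "some mechanism" might be, assuming a registry that records the single aggregator prefix for each task: the CANONICAL_PREFIX table and both helper functions are hypothetical, and the config is modeled as a flat dict keyed by option name instead of a real config object.

```python
# Hypothetical registry: each directly runnable task has exactly one
# aggregator context, recorded as its prefix in that aggregator's config.
CANONICAL_PREFIX = {
    "AssembleCoaddTask": "assembleCoadd",
}


def to_local_name(task_name, canonical_name):
    """Translate a canonical (aggregator-relative) option name into the
    name relative to the task's own config root, for a direct run."""
    prefix = CANONICAL_PREFIX[task_name] + "."
    if not canonical_name.startswith(prefix):
        raise KeyError(f"{canonical_name!r} does not belong to {task_name}")
    return canonical_name[len(prefix):]


def apply_overrides(task_name, config, overrides):
    """Apply overrides keyed by canonical names to a task-level config dict."""
    for name, value in overrides.items():
        config[to_local_name(task_name, name)] = value
    return config


# The user always writes the canonical name, even for a direct run.
config = apply_overrides(
    "AssembleCoaddTask",
    {"doMatchBackgrounds": True},
    {"assembleCoadd.doMatchBackgrounds": False},
)
```

With something like this, overrides and persisted configs could use the same option names whether the task runs directly or inside its aggregator.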
Eventually, that won’t be the case for many of our current top-level tasks. For example, we’ll be building multiple kinds of coadds, so
AssembleCoaddTask may well be run in many different contexts and hence it won’t be able to assume what its outermost context is. But that implies that we won’t be able to run it at all without an outer context, and hence it’s not really a top-level
Task anymore. So maybe that idea still holds up.
But I think this probably needs some more creative ideas than I’ve come up with so far, and my experience on the HSC side suggests that this will cause us a significant amount of pain relatively soon if we don’t deal with it. We avoid it on the HSC side mostly by just using the aggregator tasks almost all the time, but it’s still a big problem there when we use
CmdLineTasks directly for debugging, and it will only get bigger on the LSST side where running
CmdLineTasks directly is what everyone is used to, and what’s supported by
ctrl_orca, and what all the
SuperTask work is focusing on.