We’ve currently got two projects in the works that involve aggregating CmdLineTasks into higher-level Tasks:
- @rowen’s overhaul of ProcessCcdTask on DM-4255 (see related discussion in this topic).
- The transfer of the HSC BatchPoolTasks that aggregate coaddition and multi-band coadd processing.
One consequence of this is that the configuration options for all of these tasks now have two logical locations: one relative to the outermost Task's config root, and one relative to the lower-level config root. For instance, in the HSC side’s stack.py, we have an option config.assembleCoadd.doMatchBackgrounds, which is just config.doMatchBackgrounds when AssembleCoaddTask is run directly.
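To make the two locations concrete, here is a minimal sketch that uses plain Python dataclasses as stand-ins for the real config classes. The class names `AssembleCoaddConfig` and `StackConfig` and the default value are assumptions for illustration; only the two option paths come from the example above.

```python
from dataclasses import dataclass, field


@dataclass
class AssembleCoaddConfig:
    # Hypothetical stand-in for the config of AssembleCoaddTask.
    doMatchBackgrounds: bool = True  # default chosen arbitrarily for this sketch


@dataclass
class StackConfig:
    # Hypothetical stand-in for the aggregator config in stack.py.
    assembleCoadd: AssembleCoaddConfig = field(default_factory=AssembleCoaddConfig)


# Run directly: the option sits at the config root.
config = AssembleCoaddConfig()
config.doMatchBackgrounds = False
direct_value = config.doMatchBackgrounds

# Run through the aggregator: the same option sits one level down.
config = StackConfig()
config.assembleCoadd.doMatchBackgrounds = False
nested_value = config.assembleCoadd.doMatchBackgrounds
```

An override written against one of these roots cannot be reused as-is against the other, because the attribute path differs even though the option is the same.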
It’s obviously highly desirable that we get the same results regardless of whether we run a CmdLineTask directly or through a higher-level Task aggregator, but these different names make it difficult to ensure that.
There’s also a complication: we don’t actually want all of the configuration options to be the same in both cases, because some of these options control I/O or warm-start details that we may in fact want to be different:
- When we run lower-level CmdLineTasks directly, they may need to read and write intermediates that could simply be kept in memory when they’re run as part of an aggregator.
- When output products are already present for some data IDs, we may want to skip those data IDs rather than reprocess them.
The configuration options that relate to these questions are hence not about what the processing does, and they should have no effect on the ultimate science outputs; they’re strictly about how and when to execute the code. I think we need to split our current configuration for each Task into two clear categories, and stop requiring that the “how and when to execute the code” configurations be consistent between runs.
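One way to picture the split is sketched below, again with plain dataclasses rather than the real config machinery. All class and field names here (`ScienceConfig`, `ExecutionConfig`, `writeIntermediates`, `skipExisting`) are hypothetical; they just illustrate the two categories and a consistency check that looks only at the first one.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class ScienceConfig:
    # "What the processing does": must match between runs.
    doMatchBackgrounds: bool = True


@dataclass
class ExecutionConfig:
    # "How and when to execute the code": free to differ between runs.
    # Both field names are hypothetical examples of the two cases listed above.
    writeIntermediates: bool = True
    skipExisting: bool = False


@dataclass
class TaskConfig:
    science: ScienceConfig = field(default_factory=ScienceConfig)
    execution: ExecutionConfig = field(default_factory=ExecutionConfig)


def is_consistent(a: TaskConfig, b: TaskConfig) -> bool:
    """Compare only the science half; execution options are ignored."""
    return asdict(a.science) == asdict(b.science)


direct = TaskConfig()
aggregated = TaskConfig(
    execution=ExecutionConfig(writeIntermediates=False, skipExisting=True)
)
changed = TaskConfig(science=ScienceConfig(doMatchBackgrounds=False))
```

With this layout, `is_consistent(direct, aggregated)` holds even though the I/O and skip-existing settings differ, while `is_consistent(direct, changed)` does not.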
At present, the choice of whether to run a CmdLineTask directly or as part of a higher-level aggregate falls into the same category; it’s just about how to run the code, not what we actually want to run. That means we could identify exactly one aggregator for each CmdLineTask, and add some mechanism to always include that in the configuration hierarchy (i.e. we’d always use config.assembleCoadd.doMatchBackgrounds). In other words, for any CmdLineTask, we have only one outer context in which we’d run it, so it’s safe to assume that context and include it in the configuration hierarchy.
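As a rough sketch of what such a mechanism might look like, the snippet below keeps a registry from each directly runnable task to the path it occupies inside its one designated aggregator, and rewrites override keys to that canonical path. The registry, the function names, and the dict-of-overrides representation are all assumptions made for illustration, not an existing API.

```python
# Hypothetical registry: each directly runnable task maps to the attribute
# path it occupies inside its single designated aggregator.
CANONICAL_PREFIX = {
    "AssembleCoaddTask": "assembleCoadd",
}


def canonical_key(task_name: str, key: str) -> str:
    """Return the option path relative to the aggregator's config root.

    Keys that already start with the canonical prefix are returned
    unchanged, so the same override works from either entry point.
    """
    prefix = CANONICAL_PREFIX[task_name]
    if key == prefix or key.startswith(prefix + "."):
        return key
    return f"{prefix}.{key}"


def canonical_overrides(task_name: str, overrides: dict) -> dict:
    """Rewrite a dict of overrides so every key uses the canonical path."""
    return {canonical_key(task_name, k): v for k, v in overrides.items()}


from_direct = canonical_overrides(
    "AssembleCoaddTask", {"doMatchBackgrounds": False}
)
from_aggregate = canonical_overrides(
    "AssembleCoaddTask", {"assembleCoadd.doMatchBackgrounds": False}
)
```

Both calls yield the same key, `assembleCoadd.doMatchBackgrounds`, which is the property we would want when comparing or persisting configurations from the two ways of running the task.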
Eventually, that won’t be the case for many of our current top-level tasks. For example, we’ll be building multiple kinds of coadds, so AssembleCoaddTask may well be run in many different contexts and hence it won’t be able to assume what its outermost context is. But that implies that we won’t be able to run it at all without an outer context, and hence it’s not really a top-level Task anymore. So maybe that idea still holds up.
But I think this probably needs some more creative ideas than I’ve come up with so far, and my experience on the HSC side suggests that this will cause us a significant amount of pain relatively soon if we don’t deal with it. We avoid it on the HSC side mostly by just using the aggregator tasks almost all the time, but it’s still a big problem there when we use CmdLineTasks directly for debugging, and it will only get bigger on the LSST side, where running CmdLineTasks directly is what everyone is used to, what’s supported by ctrl_orca, and what all the SuperTask work is focusing on.