Levels of top-level Tasks and the configuration hierarchy

We’ve currently got two projects in the works that involve aggregating CmdLineTasks into higher-level Tasks:

  • @rowen’s overhaul of ProcessCcdTask on DM-4255 (see related discussion in this topic).
  • the transfer of the HSC BatchPoolTasks that aggregate coaddition and multi-band coadd processing.

One consequence of this is that the configuration options for all of these tasks now have two logical locations, one relative to the outermost Task's config root, and one relative to the lower-level config root. For instance, in the HSC side’s stack.py, we have an option config.assembleCoadd.doMatchBackgrounds, which is just config.doMatchBackgrounds when AssembleCoaddTask is run directly.
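To make the two-names problem concrete, here’s a minimal sketch. It uses plain dataclasses as stand-ins for our real config classes, so the class names, the default value, and the get_option helper are all assumptions for illustration, not the actual stack.py code:

```python
from dataclasses import dataclass, field


@dataclass
class AssembleCoaddConfig:
    # Stand-in for the real config; the default value here is assumed.
    doMatchBackgrounds: bool = True


@dataclass
class StackConfig:
    # Stand-in for the aggregator's config, which nests the lower-level one.
    assembleCoadd: AssembleCoaddConfig = field(default_factory=AssembleCoaddConfig)


def get_option(config, dotted_name):
    """Walk a dotted option name starting from the given config root."""
    obj = config
    for part in dotted_name.split("."):
        obj = getattr(obj, part)
    return obj


direct = AssembleCoaddConfig()   # root when the task is run directly
aggregate = StackConfig()        # root when the task is run via the aggregator

# The same logical option has a different name under each root.
direct_value = get_option(direct, "doMatchBackgrounds")
aggregate_value = get_option(aggregate, "assembleCoadd.doMatchBackgrounds")
```

An override written against one root does nothing for the other, which is exactly how the two ways of running can drift apart.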

It’s obviously highly desirable that we get the same results regardless of whether we run a CmdLineTask directly or through a higher-level Task aggregator, but these different names make it difficult to ensure that.

There’s also a complication that we don’t actually want all of the configuration options to be the same in both cases, because some of these options control I/O or warm-start details that we may in fact want to be different:

  • When we run lower-level CmdLineTasks directly, they may need to read and write intermediates that could simply be kept in memory when they’re run as part of an aggregator.

  • When output products are already present for some data IDs, we may want to skip those data IDs rather than reprocess them.

The configuration options that relate to these questions are hence not about what the processing does, and they should have no effect on the ultimate science outputs; they’re strictly about how and when to execute the code. I think we need to split our current configuration for each Task into two clear categories, and stop requiring that the “how and when to execute the code” configurations be consistent between runs.
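As a rough sketch of what that split might look like (again with plain dataclasses, and with hypothetical option names like writeIntermediates and skipExisting that I’ve made up for the two bullet points above), the consistency check would only look at the science half:

```python
from dataclasses import asdict, dataclass, field


@dataclass
class ScienceConfig:
    # Options that control what the processing does.
    doMatchBackgrounds: bool = True  # assumed default


@dataclass
class ExecutionConfig:
    # Options that only control how and when the code runs (hypothetical names).
    writeIntermediates: bool = True
    skipExisting: bool = False


@dataclass
class TaskConfig:
    science: ScienceConfig = field(default_factory=ScienceConfig)
    execution: ExecutionConfig = field(default_factory=ExecutionConfig)


def science_configs_match(a, b):
    """Compare two configs, ignoring the execution-only options."""
    return asdict(a.science) == asdict(b.science)


direct_run = TaskConfig()
aggregated_run = TaskConfig(
    execution=ExecutionConfig(writeIntermediates=False, skipExisting=True)
)

# Different I/O and warm-start settings, but the runs still count as consistent.
consistent = science_configs_match(direct_run, aggregated_run)
```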

At present, the choice of whether to run a CmdLineTask directly or as part of a higher-level aggregate falls into the same category; it’s just about how to run the code, not what we actually want to run. That means we could identify exactly one aggregator for each CmdLineTask, and add some mechanism to always include that in the configuration hierarchy (i.e. we’d always use config.assembleCoadd.doMatchBackgrounds). In other words, for any CmdLineTask, we have only one outer context in which we’d run it, so it’s safe to assume that context and include it in the configuration hierarchy.
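One possible mechanism, sketched under the assumption that we keep a simple registry mapping each CmdLineTask to its single aggregator prefix (the registry and the canonical_name helper are hypothetical):

```python
# Hypothetical registry: each CmdLineTask has exactly one outer context.
CANONICAL_PREFIX = {
    "AssembleCoaddTask": "assembleCoadd",
}


def canonical_name(task_name, option):
    """Return the option name relative to the aggregator's config root."""
    prefix = CANONICAL_PREFIX[task_name]
    if option.startswith(prefix + "."):
        return option  # already fully qualified
    return f"{prefix}.{option}"


name = canonical_name("AssembleCoaddTask", "doMatchBackgrounds")
```

With something like this, overrides and persisted configs would always use the aggregator-relative name, whichever way the task was launched.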

Eventually, that won’t be the case for many of our current top-level tasks. For example, we’ll be building multiple kinds of coadds, so AssembleCoaddTask may well be run in many different contexts and hence it won’t be able to assume what its outermost context is. But that implies that we won’t be able to run it at all without an outer context, and hence it’s not really a top-level Task anymore. So maybe that idea still holds up.

But I think this needs more creative ideas than I’ve come up with so far, and my experience on the HSC side suggests that it will cause us significant pain relatively soon if we don’t deal with it. On the HSC side we mostly avoid it by using the aggregator tasks almost all the time, but it’s still a big problem there when we use CmdLineTasks directly for debugging. It will only get bigger on the LSST side, where running CmdLineTasks directly is what everyone is used to, what ctrl_orca supports, and what all the SuperTask work is focusing on.

I think we want some kind of configuration database (which may just be a flat file, rather than a full-on MySQL database). It would contain the default values, overrides for different Task flavors, and overrides for different cameras. Instead of building a configuration tree for each Task separately, there would be one configuration tree living outside of Task land, and each Task could query into it by some symbolic name, have some sensible chain of overrides applied, and receive its configuration.

This would also help deal with the problem of maintaining repeated configurations in a tree (e.g., background subtraction in processCcd) which has long been an annoyance.
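A minimal sketch of the lookup I have in mind, using collections.ChainMap from the standard library. The option names, values, and the precedence order (camera over Task flavor over defaults) are all assumptions for illustration:

```python
from collections import ChainMap

# One flat store living outside of Task land (all values are hypothetical).
DEFAULTS = {
    "background.binSize": 128,
    "assembleCoadd.doMatchBackgrounds": True,
}
TASK_OVERRIDES = {
    "processCcd": {"background.binSize": 256},
}
CAMERA_OVERRIDES = {
    "hsc": {"assembleCoadd.doMatchBackgrounds": False},
}


def resolve(task, camera):
    """Return a view of the config with the override chain applied.

    Lookups check camera overrides first, then Task-flavor overrides,
    then defaults; that ordering is an assumption, not a settled design.
    """
    return ChainMap(
        CAMERA_OVERRIDES.get(camera, {}),
        TASK_OVERRIDES.get(task, {}),
        DEFAULTS,
    )


config = resolve("processCcd", "hsc")
bin_size = config["background.binSize"]
match_backgrounds = config["assembleCoadd.doMatchBackgrounds"]
```

Because a setting like the background bin size is stored once under one symbolic name, every Task that needs it reads the same value instead of carrying its own copy in a subtree.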

I had viewed this as one of several solutions to an orthogonal problem: that of tying together related configuration options scattered all over the hierarchy. I’d imagined that whatever we use to allow users to set high-level configuration options would still propagate down to the type of configuration we have right now, and that we’d still be persisting and comparing the latter. But if we re-architect the configuration system to the extent where we don’t have a low-level hierarchy organized by subtask anymore at all, then this problem does indeed go away.