Nobody is satisfied with the current CmdLineTasks. But they’re a marked improvement (at least in some important respects) over what we had before, and since some recent criticisms have suggested moving back in roughly the direction of what we had before, it was suggested at this week’s DMLT meeting that it would be useful to describe how we got to this point.
What Came Before
Once upon a time, the DM Stack was organized into Stages, using a package called pex_harness. These communicated using an object called a Clipboard, to which each Stage could attach its outputs and retrieve its inputs. This was all quite friendly to a workflow management system: a full pipeline could be described as just a sequence of Stages, and all the execution and I/O was fully abstracted. That description was in a custom configuration format called Policy (which you can still find in use in a few dark corners of the stack today).

This was a bit of a pain to run outside a workflow management system (maybe it was impossible? I don’t recall), but the real problem was that it made the algorithmic code impossible to follow: the logic was split across multiple Stages, and the links between those Stages were captured in policy files, not in the code itself. And it turned out those Stages weren’t really all that composable. The idea that they were distinct units of processing was in some sense a fallacy, because the algorithmic responsibilities of each Stage depended quite a bit on the details of what was done in the previous Stage.
The alternative that ultimately grew into Task started with a project called “pipette” that began life on the HSC side. This simply chained the same algorithmic code used in the Stage-based pipeline into a very small number of concrete Python scripts. That made things less pleasant for workflow systems, but it made life much easier for developers.
Task then started as a transfer of this idea back to the LSST side, with a common base class and some conventions for the callable objects that composed the pipeline. In particular, it added:

- The ability to construct a hierarchy of Tasks that delegate to each other in a consistent way, and a configuration hierarchy that mirrors it.
- Standardized access to a logger.
- Distinct setup and execution phases, allowing some code (e.g. schema definition) to be executed once before running on any data units.
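As a sketch of what those three conventions buy, here is a minimal Task-like base class. All the names here are invented for illustration; this is emphatically not the real lsst.pipe.base API.

```python
import logging
from dataclasses import dataclass, field


@dataclass
class Config:
    """Hypothetical config node; each subtask gets its own nested node."""
    doSmooth: bool = True
    children: dict = field(default_factory=dict)


class Task:
    """Toy illustration of the Task conventions listed above."""

    def __init__(self, config=None, name="task", parent=None):
        self.config = config if config is not None else Config()
        # The full name encodes the position in the task hierarchy.
        self.name = name if parent is None else f"{parent.name}.{name}"
        # Standardized logger access.
        self.log = logging.getLogger(self.name)
        self._subtasks = {}
        # Setup phase: runs exactly once, before any data units are
        # processed (e.g. schema definition in the real Tasks).
        self.setup()

    def setup(self):
        pass

    def makeSubtask(self, name, cls):
        # The child config lives at the matching spot in the parent
        # config, so the config hierarchy mirrors the task hierarchy.
        sub_config = self.config.children.setdefault(name, Config())
        subtask = cls(config=sub_config, name=name, parent=self)
        self._subtasks[name] = subtask
        setattr(self, name, subtask)
        return subtask


class DetectionTask(Task):
    def run(self, image):
        self.log.info("detecting sources on %s", image)
        return ["src1", "src2"]


class ProcessTask(Task):
    def setup(self):
        self.makeSubtask("detection", DetectionTask)

    def run(self, image):
        # Execution phase: called once per data unit.
        return self.detection.run(image)
```

With this shape, `ProcessTask().detection.name` comes out as `"task.detection"`, and its configuration sits at the corresponding node of the parent config.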
At the same time, we added CmdLineTask, a subclass of Task that added some convenience code to run a Task from the command line, assuming its inputs and outputs could be captured in a single data repository. But the philosophy of Task was very simple: we just wanted to write pipelines using regular Python language constructs, and Task itself is just a callable that adds some configuration and utilities.
Where Task Went Next
Since then, things have evolved in several ways. The most important was the introduction of ConfigurableField, which allows any subtask at any level of the hierarchy to be swapped out via a “retarget” option in a config file. This is an extremely powerful feature, and we’ve been using it as a key piece of our customization system without really stopping to define the interfaces of those swappable components. That has made the system quite fragile and hard to follow. In addition, in contrast to RegistryField, our other way of configuring plugin code, ConfigurableField doesn’t properly preserve configuration settings when components are swapped, which can lead to hard-to-find bugs. We’ve also evolved to a state in which low-level (non-command-line) Tasks do not use the Butler. This makes their inputs and outputs more strictly defined, and it also makes it possible to use them in contexts where a Butler is not available.
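To make the config-preservation pitfall concrete, here is a toy model of a retargetable component slot. This is deliberately not the real lsst.pex.config.ConfigurableField; the slot class, the PSF classes, and the dict-based configs are all invented for illustration.

```python
class ConfigurableSlot:
    """Toy stand-in for a retargetable subtask slot (NOT the real
    lsst.pex.config.ConfigurableField)."""

    def __init__(self, target):
        self.target = target
        self.config = dict(target.defaults)

    def retarget(self, new_target):
        # The pitfall described above: retargeting rebuilds the config
        # from the new target's defaults, silently discarding any
        # settings the user had already applied.
        self.target = new_target
        self.config = dict(new_target.defaults)

    def apply(self):
        return self.target(self.config)


class GaussianPsf:
    defaults = {"width": 3}

    def __init__(self, config):
        self.config = config


class PcaPsf:
    defaults = {"nComponents": 4}

    def __init__(self, config):
        self.config = config


slot = ConfigurableSlot(GaussianPsf)
slot.config["width"] = 5   # user customization...
slot.retarget(PcaPsf)      # ...lost here, with no warning
psf = slot.apply()
```

The customization disappears without any error, which is exactly the kind of hard-to-find bug described above.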
CmdLineTasks themselves have also evolved quite a bit, often in ways that highlight the lack of defined interfaces for specific concrete Tasks: many algorithmic features have added implicit dependencies across CmdLineTasks, making it harder to swap out components safely. That was of course an even bigger problem with Stages, but it’s clear CmdLineTasks have a similar problem, largely because their inputs and outputs are still too vaguely defined. It’s even true (to a lesser degree) of lower-level Tasks, because a Python signature doesn’t fully define an interface.
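A tiny invented example of what “a Python signature doesn’t fully define an interface” means in practice: both functions below accept and return a plain dict catalog, so as far as their signatures go they are interchangeable, but one silently requires the other to have run first.

```python
def run_measure(catalog):
    # Adds a "psfFlux" column; the signature says nothing about this.
    catalog["psfFlux"] = [1.0 for _ in catalog["id"]]
    return catalog


def run_classify(catalog):
    # Same signature shape, but an implicit dependency: it reads the
    # "psfFlux" column that run_measure produced, which is invisible
    # to anyone inspecting the call signature alone.
    catalog["isStar"] = [flux < 2.0 for flux in catalog["psfFlux"]]
    return catalog
```

Call them in the wrong order and you get a KeyError at runtime; nothing in the interface catches the mistake any earlier.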
We’ve also never really settled the problem of how to organize a Task.run() method. Having multiple methods that run() delegates to allows one to replace an aspect of a component more easily: you can inherit from the Task and reimplement only one of those methods instead of all of run(). One can do the same thing by delegating directly to a subtask in run() and then retargeting that subtask, and we haven’t really evaluated the pros and cons of the two approaches. The result is a lack of consistency across concrete Tasks.

CmdLineTasks have also become a lot less readable as we’ve struggled with code-reuse challenges when running similar algorithms on very different datasets, particularly in what’s arguably the most important task, ProcessCcdTask. Most of the work there is delegated to its base class, ProcessImageTask, so it can be shared with ProcessCoaddTask, and that makes it harder to follow. I actually think a lot of that ugliness will go away as the algorithms for single-frame and deep processing naturally diverge, but to the extent it doesn’t, we should look for other ways to share code.
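The two delegation styles described above can be contrasted in a small sketch. All the task and method names are invented, and “images” are reduced to plain floats to keep it self-contained.

```python
# Pattern 1: run() delegates to overridable methods; customize by
# subclassing and reimplementing just the step you care about.
class CalibrateTask:
    def run(self, image):
        background_subtracted = self.subtract_background(image)
        return self.measure(background_subtracted)

    def subtract_background(self, image):
        return image - 1.0

    def measure(self, image):
        return {"flux": image}


class MedianBackgroundCalibrateTask(CalibrateTask):
    # Only this step changes; run() and measure() are inherited.
    def subtract_background(self, image):
        return image - 2.0


# Pattern 2: run() delegates to a subtask attribute; customize by
# swapping ("retargeting") the subtask, no subclassing needed.
class DefaultBackground:
    def run(self, image):
        return image - 1.0


class MedianBackground:
    def run(self, image):
        return image - 2.0


class RetargetableCalibrateTask:
    def __init__(self, background=None):
        self.background = (
            background if background is not None else DefaultBackground()
        )

    def run(self, image):
        return {"flux": self.background.run(image)}
```

Both reach the same result; the trade-off is between an inheritance surface (pattern 1) and a composition/config surface (pattern 2), and we have concrete Tasks doing each.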
HSC Batch/Pool Tasks
We’re currently in the process of moving over some code from HSC that includes an even higher-level collection of Tasks, which aggregate multiple CmdLineTasks and run them in parallel on batch systems. This has exposed some subtleties in the differences between running these CmdLineTasks individually and all together, and a real deficiency in our ability to warm-start processing (we’re currently relying on config options that tell us whether to reuse data products that already exist on disk instead of regenerating them, which almost inevitably leads to confusion). The batch tasks are also difficult to read, because they’re mostly about setting up pools of processes and passing jobs to them, rather than about the algorithm flow. We might be able to fix this with some (fairly intense) Python syntactic sugar, but it might also be a sign that the code-readability advantage of using direct Python calls to link up pipeline units largely disappears once we get to parallel execution.
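As a sketch of why these drivers read the way they do, consider how little of the following (entirely invented) batch runner is actual algorithm. A thread pool stands in here just to keep the example self-contained; the real batch tasks manage pools of worker processes on batch systems, which adds even more plumbing.

```python
from multiprocessing.pool import ThreadPool


def process_ccd(ccd_id):
    # Stand-in for a full per-CCD pipeline run.
    return ccd_id * 2


def run_batch(ccd_ids, workers=2):
    # The algorithm flow is one line (the map); everything else is
    # plumbing for creating the pool, scattering jobs, and gathering
    # results -- and this toy omits error handling, load balancing,
    # and warm-start logic, which bury the flow further in practice.
    with ThreadPool(workers) as pool:
        return pool.map(process_ccd, ccd_ids)
```

Even in this minimal form, a reader looking for “what does the pipeline do?” has to see past the pool management to find it.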
Tasks were a big step up from what we had before in terms of readability and reusability (and hence developer efficiency) at a low level, precisely because they used standard Python constructs for calling units of code (i.e. the function call operator) and for defining inputs and outputs (function signatures). They were almost certainly a step back in terms of being able to connect to workflow systems. They were probably a step back for introspection too, but I think that was more an oversight in the concrete implementation than a flaw in the approach.
I think it’s clear that at the high level we’ll need more generic interfaces to allow the pipeline to be managed more effectively by workflow systems, and that probably needs to happen at or below the level at which we start doing parallelization - but we need to be very careful to avoid running into the same readability/hidden-dependencies/unusable-without-workflow-tools problems we had before we switched to Tasks.