A Brief History of pipe.base.Task

Nobody is satisfied with the current Task and CmdLineTask. But they’re a marked improvement (at least in some important respects) over what we had before, and because some recent criticisms have suggested moving in a direction perhaps similar to what we had before, it was proposed at this week’s DMLT meeting that it would be very useful to describe how we got to this point.

What Came Before

Once upon a time, the DM Stack was organized into Stages, using a package called pex_harness. These communicated via an object called a Clipboard, to which each Stage attached its outputs and from which it retrieved its inputs. This was all quite friendly to a workflow management system: a full pipeline could be described as just a sequence of Stages, and all the execution and I/O were fully abstracted. That description was in a custom configuration format called Policy (which you can still find in use in a few dark corners of the stack today).

This was a bit of a pain to run outside a workflow management system (maybe it was impossible? I don’t recall), but the real problem was that it made the algorithmic code impossible to follow, because the logic was split up across multiple Stages and the links between those Stages were captured in Policy files, not in the code itself. And it turned out those Stages weren’t really all that composable. The idea that they were distinct units of processing was in some sense a fallacy, because the algorithmic responsibilities of each Stage depended quite a bit on the details of what was done in the previous Stages.

Beginnings of Task

The alternative that ultimately grew into Task started with a project called “pipette” that began life on the HSC side. It simply wired the same algorithmic code used in the Stage-based pipeline together in a very small number of concrete Python scripts. This made things less pleasant for workflow systems, but it made life much easier for developers. Task then started as a transfer of this idea back to the LSST side, with a common base class and some conventions for the callable objects that composed the pipeline. In particular, it added the following (illustrated by the sketch after this list):

  • The ability to construct a hierarchy of Tasks that delegate to each other in a consistent way, and a configuration hierarchy that mirrors it.
  • Standardized access to a logger.
  • Distinct setup and execution phases, allowing some code (e.g. schema definition) to be executed once before running on any data units.
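
To make that concrete, here’s a minimal sketch of the pattern, assuming lsst.pipe.base and lsst.pex.config are importable; the task, config, and field names below are made up for illustration and are not real stack tasks.

```python
import lsst.pex.config as pexConfig
import lsst.pipe.base as pipeBase


class SubtractBackgroundConfig(pexConfig.Config):
    binSize = pexConfig.Field(dtype=int, default=128, doc="bin size for the background fit")


class SubtractBackgroundTask(pipeBase.Task):
    """A low-level Task: just a callable with a config, a logger, and a name."""
    ConfigClass = SubtractBackgroundConfig
    _DefaultName = "subtractBackground"

    def run(self, exposure):
        self.log.info("fitting background with binSize=%d" % self.config.binSize)
        ...  # do the actual work on the exposure
        return pipeBase.Struct(background=None)


class ProcessExposureConfig(pexConfig.Config):
    # Subtasks appear as ConfigurableFields, so the config hierarchy mirrors
    # the task hierarchy (e.g. config.subtractBackground.binSize).
    subtractBackground = pexConfig.ConfigurableField(
        target=SubtractBackgroundTask, doc="subtask that subtracts the background"
    )


class ProcessExposureTask(pipeBase.Task):
    ConfigClass = ProcessExposureConfig
    _DefaultName = "processExposure"

    def __init__(self, *args, **kwargs):
        # Construction is the one-time setup phase: subtasks (and things like
        # table schemas) are built here, before any data units are processed.
        pipeBase.Task.__init__(self, *args, **kwargs)
        self.makeSubtask("subtractBackground")

    def run(self, exposure):
        # The execution phase is just ordinary Python calls, run once per data unit.
        bg = self.subtractBackground.run(exposure)
        return pipeBase.Struct(background=bg.background)
```

Calling the pipeline is then an ordinary function call, e.g. ProcessExposureTask().run(exposure), which is where the readability gain over Stages comes from.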

At the same time, we added CmdLineTask, a subclass of Task with some convenience code for running a Task from the command line, assuming its inputs and outputs could be captured in a single Butler dataref.

But the philosophy of Task was very simple: we just wanted to write pipelines using regular Python language constructs, and Task itself is just a callable that adds some configuration and utilities.

Where Task Went Next

Since then, things have evolved in several ways. The most important was the introduction of ConfigurableField, which allows any subtask at any level of the hierarchy to be swapped out via a “retarget” option in a config file. This is an extremely powerful feature, and we’ve been using it as a key piece of our customization system without really stopping to define the interfaces of those swappable components. That has made the system quite fragile and hard to follow. In addition, in contrast to RegistryField, our other mechanism for configuring plugin code, ConfigurableField doesn’t properly preserve configuration settings when components are swapped, which can lead to hard-to-find bugs. We’ve also evolved to a state in which low-level, non-CmdLine Tasks do not use the Butler. This makes their inputs and outputs more strictly defined, and it also makes it possible to use them in contexts where a Butler is not available.
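
To make the “retarget” mechanism mentioned above concrete, here’s a sketch of what a config override file might look like, continuing the hypothetical ProcessExposureTask example from earlier; AlternateBackgroundTask and its module are made up, and the config variable is the one the framework provides when it executes an override file (e.g. one passed to a CmdLineTask via --configfile).

```python
# Hypothetical config override file; AlternateBackgroundTask and mypackage
# are made up for illustration, and `config` is supplied by the framework.
from mypackage.background import AlternateBackgroundTask

# Swap out a subtask anywhere in the hierarchy by retargeting its field.
config.subtractBackground.retarget(AlternateBackgroundTask)

# Settings applied to the previous target are not carried over by the retarget,
# so anything the new target needs must be set (again) after this point,
# assuming its config actually defines the field in question.
config.subtractBackground.binSize = 256
```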

The concrete Tasks and CmdLineTask themselves have also evolved quite a bit, often in ways that highlight the lack of defined interfaces for specific concrete Tasks - there are many algorithmic features that have added implicit dependencies across Tasks, especially CmdLineTasks, making it harder to swap out components safely. That of course was an even bigger problem with Stages, but it’s clear CmdLineTasks have a similar problem, largely because their inputs and outputs are still too vaguely defined. It’s even true (to a lesser degree) of lower-level Tasks as well, because a Python signature doesn’t fully define an interface.

We’ve also never really settled the question of how to organize a Task.run() method. Having multiple methods that run() delegates to makes it easier to replace one aspect of a component, because you can inherit from the Task and reimplement just one of those methods instead of all of run(). One can achieve the same thing by delegating directly to a subtask in run() and then retargeting that subtask, and we haven’t really evaluated the pros and cons of the two approaches (sketched below). The result is a lack of consistency across concrete Tasks.
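
Here’s a minimal sketch of the two styles, with made-up task names; neither is being presented as the preferred answer, which is exactly the open question.

```python
import lsst.pex.config as pexConfig
import lsst.pipe.base as pipeBase


# Style 1: run() delegates to overridable methods; customize by subclassing.
class CalibrateTask(pipeBase.Task):
    ConfigClass = pexConfig.Config
    _DefaultName = "calibrate"

    def run(self, exposure):
        self.repair(exposure)
        self.detect(exposure)
        return pipeBase.Struct(exposure=exposure)

    def repair(self, exposure):
        ...  # default cosmic-ray repair

    def detect(self, exposure):
        ...  # default detection


class MyCalibrateTask(CalibrateTask):
    def repair(self, exposure):
        ...  # replace just this step; everything else is inherited


# Style 2: run() delegates to subtasks; customize by retargeting in a config file.
class RepairTask(pipeBase.Task):
    ConfigClass = pexConfig.Config
    _DefaultName = "repair"

    def run(self, exposure):
        ...  # default cosmic-ray repair


class CalibrateConfig(pexConfig.Config):
    repair = pexConfig.ConfigurableField(target=RepairTask, doc="cosmic-ray repair subtask")


class RetargetableCalibrateTask(pipeBase.Task):
    ConfigClass = CalibrateConfig
    _DefaultName = "calibrate"

    def __init__(self, *args, **kwargs):
        pipeBase.Task.__init__(self, *args, **kwargs)
        self.makeSubtask("repair")

    def run(self, exposure):
        self.repair.run(exposure)  # swapped via config.repair.retarget(...)
        return pipeBase.Struct(exposure=exposure)
```

Style 1 keeps the algorithm flow in one class but requires subclassing (and hence new code) to customize; Style 2 allows swapping through configuration alone, at the price of the loosely defined interfaces discussed above.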

Some CmdLineTasks have also become a lot less readable as we’ve struggled with code reuse challenges when running similar algorithms on very different datasets - particularly in what’s arguably the most important task, ProcessCcdTask. Most of the work there is delegated to its base class ProcessImageTask, so it can be shared with ProcessCoaddTask, and that makes it harder to follow. I actually think a lot of that ugliness will go away as the algorithms for single-frame and deep processing naturally diverge, but to the extent it doesn’t we should look for other ways to share code.

HSC Batch/Pool Tasks

We’re currently in the process of moving over some code from HSC that includes an even higher-level collection of Tasks, which aggregate multiple CmdLineTasks and run them in parallel on batch systems. This has exposed some subtle differences between running these CmdLineTasks individually and all together, and a real deficiency in our ability to warm-start processing (we’re currently relying on config options that tell us whether to reuse data products that already exist on disk instead of regenerating them, which almost inevitably leads to confusion). The batch tasks are also difficult to read, because they’re mostly about setting up pools of processes and passing jobs to them, rather than about the algorithm flow. We might be able to fix this with some (fairly intense) Python syntactic sugar, but it might also be a sign that the code-readability advantage of using direct Python calls to link up pipeline units largely disappears once we get to parallel execution.
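
As a rough illustration of why these drivers read as process management rather than algorithm flow, here’s a generic scatter/gather sketch using only the standard library; this is not the actual HSC pool code, and the function names are made up.

```python
import multiprocessing


def process_one_ccd(ccd_id):
    # Stand-in for running an entire per-CCD pipeline (e.g. a CmdLineTask).
    return ccd_id, "processed"


def run_visit(ccd_ids):
    # Scatter: farm the per-CCD work out to a pool of worker processes.
    with multiprocessing.Pool() as pool:
        per_ccd_results = pool.map(process_one_ccd, ccd_ids)
    # Gather: combine the per-CCD results back in the parent process.
    return dict(per_ccd_results)


if __name__ == "__main__":
    print(run_visit(range(4)))
```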

Bottom Line

Tasks were a big step up from what we had before in terms of readability and reusability (and hence developer efficiency) at a low level, precisely because they used standard Python constructs for calling units of code (i.e. the function call operator) and for defining their inputs and outputs (function signatures). They were almost certainly a step back in terms of being able to connect to workflow systems. They were probably a step back for introspection as well, but I think that was more an oversight in the concrete implementation than a flaw in the approach.

I think it’s clear that at the high level we’ll need more generic interfaces to allow the pipeline to be managed more effectively by workflow systems, and that this probably needs to happen at or below the level at which we start doing parallelization - but we need to be very careful to avoid running into the same readability/hidden-dependencies/unusable-without-workflow-tools problems we had before we switched to Task.


The batch tasks are much better than they used to be. It used to be that MPI objects were passed around, and the code was full of if comm.rank == comm.root: foo = bar; else: foo = None. I think the addition of the process pool has cleaned this up, and allows the developer to concentrate on the algorithm (which includes the scatter/gather).

Thank you. I remember most of that, including pipette, but it is really helpful to have it written out. We’ll want to follow up with you on some of the detailed use cases that you mentioned.