Changes to PipelineTask connection definitions and skip/clobber behavior


Last night (prior to the w_2021_25 release) I merged DM-30649, which makes a number of changes to PipelineTask execution and QuantumGraph generation, some of which are slightly backwards incompatible.

Our code now defines a Quantum to have been run successfully based on whether its <task>_metadata dataset was successfully written, regardless of whether all other predicted outputs were also written. This opens the door to a lot of useful functionality:

  • A PipelineTask can now succeed even if it has no work to do, which is a common occurrence in cases where the butler’s conservative spatial relationships overpredict overlaps between skymap regions and observations. A task with no work to do can simply return without writing anything, or it can raise pipe.base.NoWorkFound, which the execution system catches with the same effect: only metadata is written. A task can also return early or raise NoWorkFound after writing only some of its predicted outputs.

  • On the input side, Input and PrerequisiteInput connections now have a minimum parameter (default 1) that sets a lower bound on the number of datasets per quantum in that connection. It can be set to zero only for PrerequisiteInputs (we already never make quanta with zero datasets for regular Input connections), and to greater than one only for connections with multiple=True. When a quantum fails this check because of a missing PrerequisiteInput during QuantumGraph generation, we raise FileNotFoundError, addressing the long-standing confusing behavior of prerequisites not actually being required. And when a quantum fails this check because of a missing Input during execution (which can only happen because some upstream task did not produce one of its predicted outputs), the execution harness will skip it while still writing its metadata dataset (and logging accordingly), in effect raising NoWorkFound on its behalf.

  • The options to skip or clobber existing outputs during execution now behave more consistently, and --clobber-partial-outputs has been renamed to --clobber-outputs to better reflect its new behavior. Both --clobber-outputs and --skip-existing apply only to datasets in the output RUN collection (so they’re only useful with --extend-run), and they can now be used in both QuantumGraph generation and execution.

    • During execution, passing --skip-existing and --clobber-outputs will cause successfully-run quanta (those with a <task>_metadata dataset in the output RUN collection) to be skipped, and incomplete quanta (those with other datasets, but no metadata, in the output RUN collection) to be run again after their existing outputs are first deleted. Passing --skip-existing alone makes incomplete quanta an error, and passing --clobber-outputs alone will clobber and re-run even successful quanta. Passing neither makes any existing quanta, successful or incomplete, an error.

    • During QuantumGraph generation, passing --skip-existing will cause successfully-run quanta to be left out of the graph entirely, essentially skipping them in advance and allowing downstream quanta to use their existing outputs (if there are any). Passing --clobber-outputs during QuantumGraph generation just informs the algorithm that it can expect that option to be passed during execution, and hence it should not raise exceptions when it sees existing datasets that will need to be clobbered in the output collection.
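To make the NoWorkFound behavior described above concrete, here is a minimal sketch of a task-style run function. The function body and names are hypothetical, and the stand-in exception class exists only so the example runs without the LSST stack installed:

```python
# Illustrative sketch only: in a real pipeline, NoWorkFound comes from
# lsst.pipe.base; the stand-in class below lets this run standalone.
try:
    from lsst.pipe.base import NoWorkFound
except ImportError:
    class NoWorkFound(Exception):
        """Stand-in for lsst.pipe.base.NoWorkFound."""


def run(overlapping_inputs):
    """Hypothetical run method for a task whose predicted inputs may not
    actually overlap its output region."""
    if not overlapping_inputs:
        # The execution harness catches this, still writes the
        # <task>_metadata dataset, and counts the quantum as a success.
        raise NoWorkFound("no inputs overlap the target region")
    return [f"processed-{name}" for name in overlapping_inputs]
```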
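The minimum-datasets check on input connections can be illustrated with a small helper; check_minimum and its arguments are invented for this sketch and are not real middleware API:

```python
# Hypothetical sketch of the `minimum` check on a connection, per the
# behavior described above; not the actual middleware implementation.

def check_minimum(n_found, minimum, is_prerequisite, building_graph):
    """Decide what happens to one quantum's connection.

    Returns "ok" or "skip", or raises FileNotFoundError.
    """
    if n_found >= minimum:
        return "ok"
    if is_prerequisite and building_graph:
        # Missing PrerequisiteInputs are caught at QuantumGraph
        # generation time and are a hard error.
        raise FileNotFoundError(
            f"found {n_found} datasets, but {minimum} are required"
        )
    # A missing regular Input is only noticed during execution (an
    # upstream task didn't write a predicted output); the harness skips
    # the quantum while still writing its metadata dataset.
    return "skip"
```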
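The execution-time skip/clobber rules above amount to a small decision table, sketched here as a hypothetical helper (decide is not a real middleware function):

```python
# Hypothetical decision table for one quantum's existing outputs in the
# output RUN collection, following the flag combinations described above.

def decide(skip_existing, clobber_outputs, has_metadata, has_other_outputs):
    """Return "run", "skip", or "clobber-and-run"; raise on an error case."""
    if not has_metadata and not has_other_outputs:
        return "run"  # nothing exists yet
    if has_metadata:
        # Previously successful quantum (metadata dataset exists).
        if skip_existing:
            return "skip"
        if clobber_outputs:
            return "clobber-and-run"
        raise RuntimeError("existing successful quantum is an error")
    # Incomplete quantum: other outputs exist, but no metadata.
    if clobber_outputs:
        return "clobber-and-run"
    raise RuntimeError("existing incomplete quantum is an error")
```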

In addition to NoWorkFound, there are two more new exceptions that can be raised by a PipelineTask to invoke special behavior:

  • pipe.base.RepeatableQuantumError can be raised to indicate a true failure that should block execution of all downstream quanta, but one that should be repeatable given the same software versions, configuration, and data, and hence never automatically retried by the workflow system. It should not be used for environmental issues like out-of-memory conditions, or for cases where we think downstream tasks should proceed but skip the output of this task (that’s actually best described as a NoWorkFound condition right now, though we plan to add support for more subtle conditional successes in the future). It is the right exception to raise for algorithmic problems that we hope to fully eliminate from the kinds of data we regularly process; we want to be forced to track down and investigate these when they occur. A failure to fit a PSF model or an astrometry matching problem are probably good examples.

  • pipe.base.InvalidQuantumError can be raised to indicate a logic problem in the configuration or construction of the pipeline. If possible, workflow systems will kill entire submissions (not just fail downstream quanta) when this exception is encountered, as it’s the kind of thing that will probably force the user to reconfigure and re-submit. Whenever possible, checks that could raise this error should instead be performed during configuration validation (in Config.validate methods) or during QuantumGraph generation (by overriding PipelineTaskConnections.adjustQuantum), and in those contexts any exception can be raised; InvalidQuantumError is only for the hopefully rare cases where we cannot perform those checks until execution is underway.
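As a rough illustration of when each exception applies, here is a hypothetical algorithm step; fit_psf and its arguments are invented, and the stand-in classes exist only so the sketch runs without the LSST stack installed:

```python
# Illustrative sketch: in a real pipeline these exceptions come from
# lsst.pipe.base; the stand-ins below let the example run standalone.
try:
    from lsst.pipe.base import RepeatableQuantumError, InvalidQuantumError
except ImportError:
    class RepeatableQuantumError(Exception):
        """Stand-in for lsst.pipe.base.RepeatableQuantumError."""

    class InvalidQuantumError(Exception):
        """Stand-in for lsst.pipe.base.InvalidQuantumError."""


def fit_psf(stars, configured_bands, band):
    """Hypothetical algorithm step showing when each exception applies."""
    if band not in configured_bands:
        # A pipeline configuration/construction logic error: ideally
        # caught in Config.validate or adjustQuantum, raised here only
        # if it can't be detected before execution.
        raise InvalidQuantumError(f"band {band!r} is not configured")
    if len(stars) < 3:
        # Deterministic algorithmic failure: the same software,
        # configuration, and data will always fail, so the workflow
        # system should not retry automatically.
        raise RepeatableQuantumError("too few stars to fit a PSF model")
    return {"n_stars": len(stars)}
```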

Finally, it’s worth noting that this metadata-dataset definition of success is probably just a placeholder; eventually we will want to record more detailed status for each quantum, beyond what can be represented by the presence or absence of a dataset (in particular, we will want to save different error states, and we can’t save those in the metadata if we use the absence of metadata to indicate an error). It may be a placeholder that lasts for a long time, because the middleware team has a lot of other priorities, but if you’re interested in building anything new on top of this success definition, please come talk to us in #dm-middleware-dev on Slack; we’ll probably want to define some more stable APIs for checking quantum status for which metadata existence is just the current implementation.


While the secondary features sound very nice, I’m a bit confused about the “definition of success” aspect of it. If a pipeline task fails (and raises a generic exception rather than RepeatableQuantumError), does this error now get suppressed? If so, how do we determine there was a problem?

The behavior when other exceptions are raised is actually unchanged: pipetask will mark all downstream quanta as failed, while BPS will consider them “not ready”. The traceback is printed immediately, and processing of unaffected quanta may continue (if --fail-fast is not passed). This is actually the behavior with RepeatableQuantumError, too - it just causes pipetask run --fail-fast to emit a special exit code that BPS may recognize and act on in the future when it is deciding whether to retry automatically.

The difference in behavior in our concrete pipelines on this ticket comes from actually changing a few of our Tasks to raise NoWorkFound (or not raise at all) instead of raising some other exception, hence reclassifying some failures as successes so downstream quanta can be run (or skipped).

A subtle point that came up in the RFC-788 discussion, that I’m reposting here for the record:

Also note that using PrerequisiteInput will also block the [task] from being run in the same pipeline as the task that produces that dataset

In other words, optional inputs must be external to the pipeline(s) in which that task might participate.