Last night (prior to the w_2021_25 release) I merged DM-30649, which makes a number of changes to
PipelineTask execution and
QuantumGraph generation, some of which are slightly backwards incompatible.
Our code now defines a
Quantum to have been run successfully based on whether its
<task>_metadata dataset was successfully written, regardless of whether all other predicted outputs were also written. This opens the door to a lot of useful functionality:
A PipelineTask can now succeed even if it has no work to do, which is a common occurrence in cases where the butler’s conservative spatial relationships overpredict overlaps between skymap regions and observations. A task with no work to do can simply exit without error and without writing anything, or it can raise
pipe.base.NoWorkFound, which will be caught by the execution system with the same effect: only metadata will be written. A task can also exit early or raise
NoWorkFound after writing only some of its predicted outputs.
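The contract above can be sketched with simplified stand-ins (the real classes live in lsst.pipe.base and the execution harness in ctrl_mpexec; everything below, including the dict-as-butler, is an invented illustration of the success definition, not the actual API):

```python
# Simplified stand-ins for the pipe.base machinery; this only illustrates
# the "metadata written == success" contract, not the real harness.

class NoWorkFound(Exception):
    """Stand-in for pipe.base.NoWorkFound: this quantum has nothing to do."""


def run_quantum(task_fn, butler):
    """Hypothetical harness: run one quantum, catching NoWorkFound.

    Whether or not the task writes all (or any) of its predicted outputs,
    the quantum counts as successful as long as <task>_metadata is written.
    """
    try:
        task_fn(butler)
    except NoWorkFound:
        pass  # same effect as the task returning early: not an error
    # Success is defined by this write, not by the other outputs.
    butler["task_metadata"] = {"status": "success"}


def overlap_task(butler):
    # A task whose conservative spatial overlap turned out to be empty.
    if not butler.get("input_catalog"):
        raise NoWorkFound("no sources overlap this patch")
    butler["output_catalog"] = butler["input_catalog"]


butler = {"input_catalog": []}   # dict standing in for a data butler
run_quantum(overlap_task, butler)
print("output_catalog" in butler)  # False: no science outputs written...
print("task_metadata" in butler)   # True: ...but the quantum still succeeded
```

Note that the harness writes the metadata dataset unconditionally on the non-exception paths; a quantum that raised any other exception would never reach that write, which is exactly how downstream tooling distinguishes success from failure.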
On the input side,
PrerequisiteInput connections now have a minimum parameter (defaulting to 1) that sets a lower bound on the number of datasets per quantum in that connection. It can be set to zero only for
PrerequisiteInput connections (we already never make quanta with zero datasets in a regular
Input connection), and greater than one only for connections with
multiple=True. When a quantum does not satisfy this condition due to a missing
PrerequisiteInput when building a
QuantumGraph, we’ll raise
FileNotFoundError, addressing the long-standing confusing behavior of prerequisites not actually being required. And when a quantum does not satisfy this condition due to a missing
Input during execution (which can only happen because some upstream task did not produce one of its predicted outputs), the execution harness will skip it while still writing its metadata dataset (and logging accordingly), in effect raising
NoWorkFound on its behalf.
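A stand-in sketch of the minimum check may help (the real connection classes are in lsst.pipe.base.connectionTypes; this dataclass and its check method are invented for illustration):

```python
# Stand-in modeling the new minimum= parameter on PrerequisiteInput
# connections; not the real lsst.pipe.base.connectionTypes API.

from dataclasses import dataclass


@dataclass
class PrerequisiteInput:
    name: str
    multiple: bool = False
    minimum: int = 1  # lower bound on datasets per quantum in this connection

    def __post_init__(self):
        # minimum > 1 only makes sense when multiple datasets are expected;
        # minimum = 0 is allowed here (only for prerequisites).
        if self.minimum > 1 and not self.multiple:
            raise ValueError(f"{self.name}: minimum > 1 requires multiple=True")

    def check(self, found_datasets):
        """Raise FileNotFoundError at graph-build time if too few are found."""
        if len(found_datasets) < self.minimum:
            raise FileNotFoundError(
                f"{len(found_datasets)} dataset(s) found for {self.name!r}, "
                f"need at least {self.minimum}"
            )


refcat = PrerequisiteInput("ref_cat", multiple=True, minimum=1)
refcat.check(["ref_cat/shard1"])  # OK: one dataset satisfies minimum=1

optional = PrerequisiteInput("pretrained_model", minimum=0)
optional.check([])  # OK: minimum=0 makes this prerequisite truly optional
```

The key design point is that the prerequisite check happens at QuantumGraph generation time (hence FileNotFoundError), while missing regular Inputs can only be discovered at execution time, where the quantum is skipped instead.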
The options to skip or clobber existing outputs during execution now behave more consistently, and
--clobber-partial-outputs has been renamed to
--clobber-outputs to better reflect its new behavior. Both
--skip-existing and --clobber-outputs apply only to datasets in the output
RUN collection (so they’re only useful with
--extend-run), and they can now be used both in
QuantumGraph generation and execution.
During execution, passing both
--skip-existing and --clobber-outputs will cause successfully-run quanta (those with a
<task>_metadata dataset in the output
RUN collection) to be skipped and incomplete quanta (those with other datasets, but no metadata, in the output
RUN collection) to be run again after first deleting their existing outputs. Passing
--skip-existing alone makes incomplete quanta an error, and passing
--clobber-outputs alone will clobber and re-run even successful quanta. Passing neither makes the existence of both successful and incomplete quanta an error.
During QuantumGraph generation, passing
--skip-existing will cause successfully-run quanta to be left out of the graph entirely, essentially skipping them in advance and allowing downstream quanta to use their existing outputs (if there are any). Passing
--clobber-outputs during
QuantumGraph generation just informs the algorithm that it can expect that option to be passed during execution, and hence that it should not raise exceptions when it sees existing datasets that will need to be clobbered in the output collection.
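The execution-time combinations can be summarized in a small decision table. The function below is an invented sketch of that logic (the real implementation lives in the execution harness in ctrl_mpexec), written only to make the four flag combinations explicit:

```python
# Hypothetical decision table for existing outputs at execution time,
# summarizing the flag behavior described above (not real ctrl_mpexec code).

def plan_quantum(has_metadata, has_other_outputs, skip_existing, clobber_outputs):
    """Return what the harness does with one quantum's existing outputs.

    has_metadata:      <task>_metadata exists in the output RUN collection
    has_other_outputs: other predicted outputs exist there
    """
    if not has_metadata and not has_other_outputs:
        return "run"  # nothing there yet: just run it
    if has_metadata:
        # Previously successful quantum.
        if skip_existing:
            return "skip"
        if clobber_outputs:
            return "clobber-and-run"  # --clobber-outputs alone re-runs even successes
        return "error"
    # Incomplete quantum: some outputs exist, but no metadata.
    if clobber_outputs:
        return "clobber-and-run"  # delete existing outputs, run again
    return "error"  # --skip-existing alone (or neither flag) makes this an error


# Passing both flags: successful quanta skipped, incomplete ones re-run.
print(plan_quantum(True, False, skip_existing=True, clobber_outputs=True))   # skip
print(plan_quantum(False, True, skip_existing=True, clobber_outputs=True))   # clobber-and-run
```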
In addition to
NoWorkFound, there are two more new exceptions that can be raised by a
PipelineTask to invoke special behavior:
pipe.base.RepeatableQuantumError can be raised to indicate a true failure that should block execution of all downstream quanta, but one that should be repeatable given the same software versions, configuration, and data, and hence never automatically retried by the workflow system. It should not be used for environmental issues like out-of-memory conditions, or for cases where we think downstream tasks should proceed but skip the output of this task (that’s actually best described as a
NoWorkFound condition right now, though we plan to add support for more subtle conditional successes in the future). It is the right exception to raise for algorithmic problems that we hope to fully eliminate from the kinds of data we regularly process; we want to be forced to track down and investigate these when they occur. A failure to fit a PSF model or an astrometry matching problem are probably good examples.
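As a toy illustration of when a task should raise it (the exception class below is a stand-in for pipe.base.RepeatableQuantumError, and fit_psf is an invented placeholder, not a real pipeline task):

```python
# Stand-in for pipe.base.RepeatableQuantumError: a deterministic failure
# that the workflow system should never automatically retry.

class RepeatableQuantumError(Exception):
    """Same software versions, config, and data will fail the same way."""


def fit_psf(star_fluxes, min_stars=3):
    """Toy 'PSF fit' (invented): fails repeatably with too few stars.

    This is an algorithmic problem, not an environmental one -- retrying
    with the same inputs cannot succeed, and downstream quanta cannot
    proceed without a PSF model, so RepeatableQuantumError is appropriate.
    """
    if len(star_fluxes) < min_stars:
        raise RepeatableQuantumError(
            f"only {len(star_fluxes)} PSF star(s), need {min_stars}"
        )
    return sum(star_fluxes) / len(star_fluxes)  # placeholder "model"


print(fit_psf([1.0, 2.0, 3.0]))  # 2.0
```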
pipe.base.InvalidQuantumError can be raised to indicate a logic problem in the configuration or construction of the pipeline. If possible, workflow systems will kill entire submissions (not just fail downstream quanta) when this exception is encountered, as it’s the kind of thing that will probably force the user to reconfigure and re-submit. Whenever possible, checks that could raise this error should instead be performed during configuration validation (in
Config.validate methods) or during
QuantumGraph generation (by overriding
PipelineTaskConnections.adjustQuantum), and in those contexts any exception can be raised; this error is only for the hopefully-rare cases where we cannot perform those checks until execution is underway.
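For the preferred configuration-validation route, a sketch might look like this (the real base class is lsst.pex.config.Config with a validate method to override; the class and field names here are invented):

```python
# Stand-in config class; a real one would subclass lsst.pex.config.Config
# and override its validate() method. Names here are invented examples.

class SubtractBackgroundConfig:
    """Hypothetical config with an internal consistency requirement."""

    def __init__(self, bin_size=128, smoothing_scale=256):
        self.bin_size = bin_size
        self.smoothing_scale = smoothing_scale

    def validate(self):
        # Catch the logic error up front, at config-validation time,
        # instead of raising InvalidQuantumError deep inside execution.
        if self.smoothing_scale < self.bin_size:
            raise ValueError(
                "smoothing_scale must be >= bin_size "
                f"({self.smoothing_scale} < {self.bin_size})"
            )


SubtractBackgroundConfig().validate()  # defaults are consistent
```

Failing here, before any quanta run, means the user finds out about the bad configuration immediately rather than after a submission is already underway.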
Finally, it’s worth noting that this metadata-dataset definition of success is probably just a placeholder; eventually we will want to record more detailed status for each quantum, beyond what can be represented by the presence or absence of a dataset (in particular, we will want to save different error states, and we can’t save those in the metadata if we use the absence of metadata to indicate an error). It may be a placeholder that lasts for a long time, because the middleware team has a lot of other priorities, but if you’re interested in building anything new on top of this success definition, please come talk to us in #dm-middleware-dev on Slack; we’ll probably want to define some more stable APIs for checking quantum status for which metadata existence is just the current implementation.