I am trying to use some combination of --skip-existing-in, --extend-run, and --clobber-outputs in pipetask quantum graph generation so that I can skip quanta associated with failed tasks from previous executions of the same pipeline.
Say I run the CharacterizeImageTask on a set of images and the task fails for one detector. If I re-run this pipeline without changing any configuration, I expect the task to fail again, so I would want to skip it. In my current processing setup, I have collections and runs set up like so:
parent CHAINED
parent/characterize/time RUN
parent/calib CALIBRATION
parent/raw RUN
instrument/calib CALIBRATION
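For concreteness, the kind of command I have been experimenting with looks roughly like this (a sketch only: the repo path, pipeline file, and data query are placeholders, and the flag combination is just what I have been trying so far):

```
# Sketch only: repo path, pipeline file, and data query stand in for
# whatever the real processing uses.
pipetask qgraph \
    -b /path/to/repo \
    -p characterizeImage.yaml \
    -i parent \
    --output-run parent/characterize/time \
    --extend-run \
    --skip-existing-in parent \
    --clobber-outputs \
    -d "instrument = 'MyCam' AND exposure = 12345"
```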
When I re-run with these options, quantum graph generation fails with an error like: "Potential output dataset characterizeImage_log...already exists in the output run parent/characterize/time, but clobbering outputs was not expected to be necessary."
I think in this case, quantum graph generation sees that an icExp (etc.) does not exist for a given data ID, so it tries to generate a quantum for it. However, some outputs (the characterizeImage_log) do exist for that data ID, and it doesn’t expect to have to clobber them.
To skip failures, what you’ll want to do is specify a pipeline that doesn’t include the tasks whose failures you want to accept. I’m not quite sure what that would look like in this case, because from the name of your pipeline YAML file it seems you might just be running characterizeImage, and in that case there’s actually nothing more to do: pipetask will process all of the “unblocked” quanta even if it sees some failures, unless you tell it not to.
If you are actually running this with other tasks, I’m afraid you’ll have to list the other tasks you do want to run in your -p argument, e.g.:
-p some-pipeline.yaml#calibrate
It would be nicer if there was a way to say “everything in this pipeline except characterizeImage”, but that’s a feature we haven’t had time to add.
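In the meantime, a sketch of what the explicit-list version could look like (pipeline path, labels, and collections are all placeholders here); as far as I recall, the `#` syntax accepts a comma-separated list of labels:

```
# Sketch: run only the listed task labels (comma-separated after '#');
# characterizeImage is simply left out, so its failures are never reattempted.
# Output/collection options are whatever you normally use.
pipetask run \
    -b /path/to/repo \
    -p some-pipeline.yaml#isr,calibrate \
    -i parent \
    -o u/someone/rerun \
    -d "instrument = 'MyCam' AND exposure = 12345"
```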
Finally, while using --extend-run and explicit --output-run names like this to keep your collection chain small and tidy can work, it is not something we’d recommend and it’s not the convention we use in our own processing campaigns. That’s what makes clobbering logs necessary here.
I’m running something like DRP step1, step2, etc., so characterize is grouped with isr and calibrate.
That’s interesting to hear this issue is related to how I am organizing collections, as yes, I am running this way explicitly to keep collections/chains tidy. A lot of our processing monitoring/follow-up is done by walking through a filesystem, opening and looking at pipetask log files, among other things. This led me to decide to organize “group” processing (i.e. a data subset) as a parent collection chain with step-wise runs inserted into the chain, with the step number/task in the name of the child run, e.g. parent/step1 or parent/calibrate, because these lead naturally to hierarchical, human-readable directories in the datastore and in the bps submit directories.
I’m also concerned about skipping failures because I’m executing on a system where processing can be interrupted: some tasks in a workflow get executed, some never execute, and (importantly) the final_job.bash of a bps submit may never be executed. I noticed that during an interruption like this, datasets for the interrupted run are still put in the datastore, but they aren’t associated with any run in the registry. I’m concerned about storage space, so I decided to restore access to these datasets by executing final_job.bash manually, since I don’t want to unnecessarily execute and store the datasets associated with a single quantum twice. In this paradigm, I can’t use the existence of a run collection with a certain step label to indicate that the step does not need to be re-run, since multiple runs of a single step might exist: one that was interrupted and recovered by running final_job.bash manually, and one that finished whatever never executed in the previous run.
I could just choose to take the hit on storage in the event of an interruption and only worry about runs where the workflow completed through final_job.bash, i.e. the run is in the registry, and its existence indicates that the pipeline executed everything it could, minus failures. I could then clean up the datastore later by comparing paths that exist in the datastore to runs that exist in the registry. In this paradigm, though, I’m effectively ignoring failures in execution regardless of their cause. Perhaps I do not care about failures due to pipeline configuration (I lose what I lose to deterministic failures), but in the execution/computing environment I am using there can be non-deterministic failures, such as a node shutting down mid-execution or (in the case of a remote datastore plus quantum-backed butler, or any datastore plus execution repo) network errors. Perhaps that is a rare enough event to ignore; I’m not sure.
How is step-wise processing organized in DM’s processing campaigns? Any references you can point to?
A lot of our campaign collection management is focused on splitting up extremely wide processing (one step, many data IDs), but if you strip that out, it’s more or less:
Run a pipeline step with an explicit --output (CHAINED) collection and the --output-run collection name constructed automatically from the CHAINED collection name and a timestamp. If a hard failure prevents the final job from running, run it manually.
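A sketch of that first attempt, with the repo, pipeline path, collection names, and data query all as placeholders (in our campaigns this is typically wrapped in a bps submit rather than a bare pipetask run, but the collection arguments are the same):

```
# Sketch: explicit CHAINED --output, no --output-run, so the RUN collection
# name is constructed automatically as <output>/<timestamp>.
pipetask run \
    -b /repo/main \
    -p pipelines/DRP.yaml#step1 \
    -i MyCam/defaults \
    -o u/someone/campaign \
    --register-dataset-types \
    -d "instrument = 'MyCam' AND exposure IN (100..200)"
```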
If there are any failures and it’s not clear from the error messages whether some of them are potentially transient, run the same step again with the same --output argument and --output-run still defaulted, using --skip-existing-in <output> (without --extend-run) to make a QG that skips quanta that were already run successfully and reattempts those that failed or were blocked by those that failed. We call these “rescues”. This frequently also involves bumping up the BPS requestMemory parameter to deal with out-of-memory errors. In rare cases it can be useful to run a rescue with a subset of the pipeline (as in my last post on this thread) to explicitly skip some failed tasks but not others, but if you run a whole step at a time, more often there’s nothing for a downstream task to do if a preceding one failed (the exceptions are when the downstream task gathers results across the detectors in a visit, across bands, or across all patches in a tract).
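A rescue sketch with the same placeholders as above; note that there is no --extend-run, so a new timestamped RUN collection is created inside the same chain:

```
# Sketch of a rescue: same --output, --output-run still defaulted, and
# --skip-existing-in pointed at the output chain so quanta whose outputs
# already exist are skipped and only failed/blocked ones are reattempted.
pipetask run \
    -b /repo/main \
    -p pipelines/DRP.yaml#step1 \
    -o u/someone/campaign \
    --skip-existing-in u/someone/campaign \
    -d "instrument = 'MyCam' AND exposure IN (100..200)"
```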
Continue running rescues for that step until either everything is successful or we’re convinced that what remains is a repeatable error we can’t recover from.
Proceed to the next step. At this point one can choose to use a new --output collection with the previous one as input, if you want the collection/directory structure to tell you directly which step was run; but it’s also just fine to keep using the same --output (which at this point has the original inputs baked in, so they don’t need to be provided anymore).
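Both options for the next step, again as sketches with placeholder names:

```
# Option A (sketch): keep using the same output chain; the original inputs are
# already baked in, so -i is not needed.
pipetask run \
    -b /repo/main \
    -p pipelines/DRP.yaml#step2 \
    -o u/someone/campaign \
    -d "instrument = 'MyCam'"

# Option B (sketch): start a new chain for this step, with the previous chain
# as input, so the collection names record which step produced what.
pipetask run \
    -b /repo/main \
    -p pipelines/DRP.yaml#step2 \
    -i u/someone/campaign \
    -o u/someone/campaign-step2 \
    -d "instrument = 'MyCam'"
```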
There is very little storage inefficiency here, because tasks almost always defer all writes to the end, and hence failed tasks almost never write anything but logs, and those are the only things that end up duplicated. It can add a lot more RUN collections to the output CHAINED collection than what you’ve been doing, and it sounds like that might be a problem for your filesystem-based tools (we try to avoid filesystem tools, but I understand that they can be hard to replace). It might work to do rescues with --extend-run --clobber-outputs to put all of those rescues into the same output RUN collection and hence keep the directory structure simpler, especially if you’re invoking pipetask directly, but it’s not a well-trodden path, especially with BPS.
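If you do want to experiment with that, the invocation would look roughly like this (untested sketch; the explicit --output-run here names the existing timestamped RUN from the first attempt):

```
# Untested sketch: rescue into the original RUN collection instead of creating
# a new one, which keeps the datastore directory structure flat but requires
# clobbering the logs (and any other partial outputs) the failed quanta wrote.
pipetask run \
    -b /repo/main \
    -p pipelines/DRP.yaml#step1 \
    -o u/someone/campaign \
    --output-run u/someone/campaign/20240101T000000Z \
    --extend-run \
    --skip-existing-in u/someone/campaign \
    --clobber-outputs \
    -d "instrument = 'MyCam' AND exposure IN (100..200)"
```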
Hi @stevenstetzler – it was brought to my attention that we never confirmed whether the first reply is a solution that solved your problem. I marked the solution for this topic since it sounded like it did address your initial issue, but if that’s not right, let us know!