Data ID tables and task expressions in Quantum Graph builds

jbosch · April 16, 2025, 3:16pm

I’m happy to announce the availability of two bits of long-promised pipeline functionality (available since the weekend, actually, but some docs are only landing in today’s d_2025_04_16):

- You can now pass --data-id-table <filename> when building a quantum graph, where <filename> is a file in any format supported by astropy.table (ECSV is recommended). Columns are data ID keys and each row gives the values for those dimensions; dimensions that are fully specified in the --data-query don’t need to be included (e.g. if you say --data-query "instrument='LSSTCam'", you don’t need an instrument column in the table). This is the recommended path for building QGs that filter on quantities that are in the ConsDB but not the butler metadata.
- When building pipeline graphs or quantum graphs, you can now pass --select-tasks "<expression>" to filter the tasks based on their dependency graph. This mini expression language can do general set operations on tasks and subsets via |, &, and ~, as well as ancestor and descendant traversal starting from task labels and dataset type names with <, >, <=, and >= (e.g. <X means “all tasks that must run before X”).

You can find more docs and some examples here. Note that task expressions act on the pipeline after it has been turned into a graph, which is after the *.yaml#<labels> subsetting has been applied, and hence it won’t work well if those labels specify a bunch of scattered tasks rather than a full pipeline or step. I think it’d be good practice for us to start using --select-tasks to select individual tasks and use *.yaml#<labels> only for step-like subsets, and eventually we may deprecate the more limited *.yaml#a..b syntax.
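
To make the ancestor/descendant operators a bit more concrete, here is a toy Python model of what they mean on a small dependency graph. This is not the middleware's implementation or its parser: the task labels are hypothetical, the graph is hand-written, and treating ~ as "every task in the pipeline except these" is my assumption about the complement. The exact expression syntax should be checked against the docs.

```python
# Toy pipeline: each (hypothetical) task label maps to the labels it
# directly depends on.
DEPENDS_ON = {
    "isr": set(),
    "calibrate": {"isr"},
    "warp": {"calibrate"},
    "coadd": {"warp"},
    "diffim": {"calibrate", "coadd"},
}
ALL_TASKS = set(DEPENDS_ON)


def ancestors(label):
    """Tasks that must run before `label` (the meaning of `<label`)."""
    found = set()
    stack = list(DEPENDS_ON[label])
    while stack:
        task = stack.pop()
        if task not in found:
            found.add(task)
            stack.extend(DEPENDS_ON[task])
    return found


def descendants(label):
    """Tasks that can only run after `label` (the meaning of `>label`)."""
    return {task for task in ALL_TASKS if label in ancestors(task)}


def complement(tasks):
    """Assumed meaning of `~`: all pipeline tasks not in `tasks`."""
    return ALL_TASKS - tasks


before_coadd = ancestors("coadd")            # like "<coadd"
through_coadd = before_coadd | {"coadd"}     # like "<=coadd"
after_calibrate = descendants("calibrate")   # like ">calibrate"
between = after_calibrate & through_coadd    # like ">calibrate & <=coadd"
not_through_coadd = complement(through_coadd)  # like "~(<=coadd)"
```

In this toy graph, between contains only warp and coadd, which is the kind of selection that is awkward to express with label subsetting alone.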

One of the places I hope --select-tasks will be useful is in recovering from task execution failures with new runs, and I’ve added a new entry to the middleware FAQ on that subject.