Pairwise processing in CPP in Gen3

jbosch · August 14, 2020, 3:44pm

Discussions with @czw have revealed that CPP needs to do pairwise processing of raw calibration frames. This wasn’t completely surprising to me, but it’s not something I had a fully-formed plan ready for.

My understanding is that the steps are basically:

1. define pairs for all or many raws within some validity epoch;
2. process each of those pairs independently;
3. aggregate results from all pairwise processing into a single per-validity-epoch dataset;
4. run other (well-understood and already-supported) CPP tasks with that single dataset as an input.
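
To make the first three steps concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the `raws` dictionary, the "pair consecutive exposure IDs" rule, and the function names `define_pairs`, `process_pair`, and `aggregate` are hypothetical and do not correspond to any existing CPP or middleware API.

```python
from statistics import mean

# Hypothetical stand-in for raw calibration frames: exposure ID -> pixel values.
raws = {
    101: [10.0, 12.0],
    102: [11.0, 13.0],
    103: [20.0, 22.0],
    104: [21.0, 23.0],
}


def define_pairs(exposure_ids):
    """Step 1: pair consecutive exposures (one of many possible rules)."""
    ids = sorted(exposure_ids)
    return [(ids[i], ids[i + 1]) for i in range(0, len(ids) - 1, 2)]


def process_pair(first, second):
    """Step 2: per-pair measurement; here, the mean absolute pixel difference."""
    return mean(abs(a - b) for a, b in zip(first, second))


def aggregate(per_pair_results):
    """Step 3: collapse all per-pair results into one per-epoch value."""
    return mean(per_pair_results)


pairs = define_pairs(raws)
per_pair = [process_pair(raws[a], raws[b]) for a, b in pairs]
epoch_dataset = aggregate(per_pair)
```

The options below differ mainly in where each of these three functions would run: inside one task, or split between graph generation and separate tasks.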

There are a few different ways to map this to the PipelineTask framework; each is described in its own section below.

Option A

Do pair definition (1), pairwise processing (2) and aggregation (3) in the run method of a single PipelineTask.

Advantages:

No additional middleware work needed.
Pairs can be determined fully at runtime, using the properties of the data, if desired. All other approaches would require each image to be read at least twice (once during group-definition, once during processing) if the images themselves are used to determine pairs.

Disadvantages:

No QuantumGraph-based parallelism.
No persistence of per-pair data products.
Number of pairs aggregated together is constrained by available memory.

It’s doable but a bit awkward to pass in a complete list of pairs from an external file in this scheme: the file containing the pairs could be made a (butler-managed) input dataset, or it could be a config override file that sets (e.g.) a list option in the task’s config class.
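
A rough sketch of what Option A could look like, assuming a hypothetical `PairwiseConfig` with a `pairs` list option and a hypothetical `SingleTaskPairwise` class; neither is a real PipelineTask, and the "match on exposure time" rule is only a placeholder for whatever property of the data would actually be used.

```python
from dataclasses import dataclass, field


@dataclass
class PairwiseConfig:
    """Hypothetical config; an empty list means "decide pairs at runtime"."""

    pairs: list = field(default_factory=list)


class SingleTaskPairwise:
    """Option A sketch: steps (1)-(3) all live inside run()."""

    def __init__(self, config):
        self.config = config

    def run(self, raws):
        # (1) Use externally supplied pairs if given; otherwise pair frames
        # by a property of the data itself (here: equal exposure time).
        pairs = [tuple(p) for p in self.config.pairs] or self._pairs_from_data(raws)
        # (2) Process each pair serially; none of this is visible to a
        # QuantumGraph, so there is no graph-level parallelism.
        results = [raws[a]["flux"] - raws[b]["flux"] for a, b in pairs]
        # (3) Aggregate in memory; per-pair results are never persisted.
        return sum(results) / len(results)

    @staticmethod
    def _pairs_from_data(raws):
        by_exptime = {}
        for exp_id, raw in sorted(raws.items()):
            by_exptime.setdefault(raw["exptime"], []).append(exp_id)
        return [tuple(ids[:2]) for ids in by_exptime.values() if len(ids) >= 2]


raws = {
    1: {"exptime": 5.0, "flux": 100.0},
    2: {"exptime": 5.0, "flux": 98.0},
    3: {"exptime": 10.0, "flux": 200.0},
    4: {"exptime": 10.0, "flux": 197.0},
}
runtime_result = SingleTaskPairwise(PairwiseConfig()).run(raws)
override_result = SingleTaskPairwise(PairwiseConfig(pairs=[[1, 3]])).run(raws)
```

The config-override path works, but the pair list ends up living in configuration rather than in anything the data repository knows about, which is the awkwardness mentioned above.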

Option B

Do pair definition (1) during QuantumGraph generation, pairwise processing (2) in the run method of one PipelineTask, and aggregation over pairs (3) in the run method of another PipelineTask.

The advantages and disadvantages are mirrors of those for Option A.

Advantages:

Natural QuantumGraph-based parallelization for step (2).
Persistence of per-pair data products.
Number of pairs aggregated over is unconstrained (or constrained only by disk space).

Disadvantages:

Needs additional middleware work.
Pairs must be defined before execution, using Registry or external metadata, not the data itself; during execution, pairs may be dropped, but not created or otherwise changed.
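
As an illustration of the split, here is a sketch that uses a thread pool as a stand-in for QuantumGraph execution. The names `pair_quanta`, `process_pair`, and `aggregate` are hypothetical, and the real execution system is of course not a `ThreadPoolExecutor`; the point is only that pairs are fixed before anything runs, each pair is an independent unit of work, and a pair can be dropped during execution but not added.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-exposure values standing in for raw frames; exposure 6
# turns out to be unusable at execution time.
raws = {1: 100.0, 2: 98.0, 3: 200.0, 4: 197.0, 5: 50.0, 6: None}

# (1) Pairs are fixed before execution, from metadata rather than pixels.
pair_quanta = [(1, 2), (3, 4), (5, 6)]


def process_pair(pair):
    """(2) One quantum per pair; returns None to signal a dropped pair."""
    a, b = (raws[exp_id] for exp_id in pair)
    if a is None or b is None:
        return None
    return a - b


def aggregate(per_pair):
    """(3) A separate quantum that consumes whatever pairs survived."""
    kept = [r for r in per_pair.values() if r is not None]
    return sum(kept) / len(kept)


# Independent quanta can run concurrently; each result is keyed by its pair,
# which is what would let it be persisted as its own dataset.
with ThreadPoolExecutor() as pool:
    per_pair = dict(zip(pair_quanta, pool.map(process_pair, pair_quanta)))

epoch_value = aggregate(per_pair)
```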

Some of the middleware work needed to enable this approach is giving individual PipelineTasks more control over QuantumGraph generation. This work is already underway on DM-21904; the plan for that work was not put together with this use case in mind, but it should nevertheless address most of the problem.

The rest of the middleware work involves a design choice on how to label pairwise quanta and datasets: data ID keys in Gen3 are dimensions, and are generally pre-declared to the system when a repository is created (because dimensions are often associated with tables, adding or changing the set of dimensions is in general a schema change, and not something we want to do often). The data ID values (rows of those tables) are also pre-declared, but later, in steps that run after repository creation but before processing that uses those data ID values. We do not currently have any dimension that could be used to label pairs of raw calibrations.

We do, however, have a dimension, visit, that can already be used to label pairs of other raws. At present, visits are explicitly and only for back-to-back on-sky science images with identical pointing and sufficiently similar observing conditions that it is reasonable in many contexts to consider them a single observation. This is restrictive, but that is part of what makes them useful.

Option B1

Add a new dimension analogous to visit, but distinct from it.

This would make group-definition (i.e. pair-definition, but the system could support groups of more than two with no additional work) a separate step run before QuantumGraph generation (let alone execution). Exposures could belong to multiple groups, and we could (as with visit) provide group systems in which each exposure is in only one group.
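
One way to picture the relationships is the following sketch, which uses an in-memory SQLite database. The table and column names (`group_system`, `group_member`, and so on) are assumptions made for illustration and are not the actual Registry schema. For simplicity the sketch enforces "one group per exposure" within every system, whereas the proposal only requires that some systems have that property.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE group_system (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    -- The primary key keeps each exposure in at most one group per system,
    -- while still allowing it to appear in several systems.
    CREATE TABLE group_member (
        system_id INTEGER NOT NULL REFERENCES group_system (id),
        group_id INTEGER NOT NULL,
        exposure_id INTEGER NOT NULL,
        PRIMARY KEY (system_id, exposure_id)
    );
""")
db.execute("INSERT INTO group_system VALUES (0, 'consecutive-pairs')")
db.execute("INSERT INTO group_system VALUES (1, 'by-exptime')")
db.executemany(
    "INSERT INTO group_member VALUES (?, ?, ?)",
    [
        (0, 1, 101), (0, 1, 102), (0, 2, 103), (0, 2, 104),
        (1, 1, 101), (1, 1, 103), (1, 2, 102), (1, 2, 104),
    ],
)


def exposures_in(system_id, group_id):
    """Return the exposure IDs belonging to one group of one system."""
    rows = db.execute(
        "SELECT exposure_id FROM group_member"
        " WHERE system_id = ? AND group_id = ? ORDER BY exposure_id",
        (system_id, group_id),
    )
    return [row[0] for row in rows]


pairs_system_0 = exposures_in(0, 1)
pairs_system_1 = exposures_in(1, 1)
```

Because the rows are written once, before any QuantumGraph is built, a data ID that names a group resolves to the same exposures in every later processing run.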

The advantage of this approach (relative to B3, primarily) is that group-definition can be done up front, and groups are then not just reusable, but consistent - data IDs that identify per-group data products are guaranteed to have the same meaning in different processing runs, avoiding a lot of potential confusion.

The disadvantage of this approach is that group definition must be done up front, in a separate, pre-QuantumGraph-generation step. And while that could be run very frequently (even before every processing run), it’s not really designed for that - this would lead to a lot of redundancy in the database representation of things, because the stuff intended to enforce consistency across processing runs isn’t being used; a relationship designed to be many-to-one would be nearly one-to-one in practice.

Note that there is nothing here about groups being defined from header information. That is of course a possibility, but it is in no way a requirement (well, except maybe in OCPS or something, but then that’s an OCPS constraint, not a middleware one).

In particular, passing complete group definitions from a file is easy in this scenario, because the group-definition command is not a PipelineTask and has complete control over its command-line UI. Using config override files would make this particularly easy, but may be a somewhat awkward fit.

Option B2

Extend the visit concept to include calibration frame groups, removing the requirement that visits represent on-sky science observations.

This is a lot like B1, but probably worse in several respects:

it means code that currently relies on the visit concept as a way to declare that it works on on-sky science images would need a different way to do that;
it means the database columns that hold regions for visits need to have a lot of nulls, or we need to define another dimension/table to hold those regions (and probably deal with the messiness of some kind of “inheritance” system for dimensions);
visit IDs are expected to be defined in such a way that they are unique across all visit systems (though I think the LSST visit IDs we generate do not obey this property, and are hence a looming problem), and hence one does not need to include a visit_system key as well in data IDs. That choice is appropriate in the limit that group definition is very rare compared to processing of already-defined groups - a limit that I think visit is in, but CPP pairs may not be. In contrast, with B1 we could instead allow group IDs to be unique only within a system (while requiring system IDs in data IDs), hence making it easier to define new systems.
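
The last point can be illustrated with a small sketch of the two keying schemes. The key names `group_system` and `group`, and both helper functions, are hypothetical; data IDs are modeled as plain tuples rather than real middleware objects.

```python
def data_id_global(visit):
    """Visit-style key: the ID alone must identify the group."""
    return (("visit", visit),)


def data_id_scoped(system, group):
    """B1-style key: the system is part of the data ID."""
    return (("group_system", system), ("group", group))


# Two independently defined systems that both start numbering at 1.
system_a = {1: (101, 102)}
system_b = {1: (101, 103)}

# With globally scoped IDs the two definitions collide on the same key...
global_keys = {data_id_global(g) for g in system_a} | {
    data_id_global(g) for g in system_b
}
# ...while system-scoped IDs keep them distinct without any coordination.
scoped_keys = {data_id_scoped("a", g) for g in system_a} | {
    data_id_scoped("b", g) for g in system_b
}
```

Avoiding the collision in the first scheme requires whoever defines a new system to know about every ID already in use, which is cheap when new systems are rare and costly when they are frequent.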

Overall, the only context in which I think this option is worth considering would be if we had PipelineTasks we wanted to run on both on-sky visits and CPP pairs (I can’t think of any). And even then, I’d lean towards a variant of B1 where we extend the CPP group concept to permit on-sky images instead of extending the visit concept to permit CPP pairs.

As with B1 (and with visits!) there is no requirement (from the middleware, at least) that the group definitions in this scenario be based on headers.

Passing complete group definitions from a file in this scenario is the same as in B1 - we have a define-visits command-line UI already, it doesn’t provide that functionality right now (except via config), and there’s nothing preventing us from adding it other than just doing the work.

@timj has expressed an interest in expanding the visit concept to all exposures in the past, so I’m hoping the above will either explain why it’s not a good idea or give him an opportunity to explain why it might still be a good idea.

Option B3

Add and use support for free-form, no-relationship identifiers in data IDs.

PipelineTasks would be able to declare and use new dimensions, with the restriction that these dimensions never have built-in relationships or predefined values, and hence don’t have database tables and don’t guarantee any kind of data ID consistency across processing runs. Using them would also require the concrete PipelineTask to provide an implementation of the QuantumGraph-generation hook that most other PipelineTasks can inherit from the base class.
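
A sketch of how such a hook might look, with the caveat that none of these names exist today: `Quantum`, `BaseTask`, `build_quanta`, and the free-form `pair` key are all hypothetical, and the real hook would presumably take richer arguments than a list of data IDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantum:
    """Hypothetical unit of work: input data IDs and one output data ID."""

    inputs: tuple
    output: tuple


class BaseTask:
    def build_quanta(self, data_ids):
        """Default hook: one quantum per input data ID, same ID on output."""
        return [Quantum(inputs=(d,), output=d) for d in data_ids]


class PairwiseTask(BaseTask):
    def __init__(self, pairs):
        self.pairs = pairs  # e.g. read from a user-supplied text file

    def build_quanta(self, data_ids):
        """Custom hook: one quantum per pair, labeled by a free-form key."""
        known = {dict(d)["exposure"]: d for d in data_ids}
        quanta = []
        for label, (a, b) in enumerate(self.pairs):
            if a in known and b in known:
                # "pair" has no table behind it; its value means something
                # only within this run.
                quanta.append(Quantum((known[a], known[b]), (("pair", label),)))
        return quanta


data_ids = [(("exposure", e),) for e in (101, 102, 103, 104)]
task = PairwiseTask(pairs=[(101, 102), (103, 104), (105, 106)])
quanta = task.build_quanta(data_ids)
```

The third pair is silently skipped because its exposures are not among the inputs, and nothing outside the task ever learns what `pair = 0` meant.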

This option requires more work in the middleware system than the others (so we’d probably need to live with an Option A approach for longer), but it would add a lot of flexibility to the system that would probably be used in other ways in the future.

This is much more natural than B1 or B2 if pair-definition is expected to happen essentially every processing run and pairs are rarely the same. Whether it’s sufficiently better than A to justify the extra work probably depends on whether we can provide intra-PipelineTask parallelization primitives to speed up A (or whether we even need to) and whether we actually do have other use cases (e.g. pairwise diffim?) that would take advantage of B3 functionality.

Passing group definitions from files would hopefully fit naturally into this scheme as a way of passing data IDs from files (which we need for other reasons), and would just be part of the work to get it up and running at all. But some design work is needed on that front, and it’s hard to say exactly how it would work.

B3 most resembles the approach @RHL has advocated in similar contexts; I’m hoping this makes the tradeoffs more clear so we can see whether it’s actually the right choice for CPP pairs (as opposed to just something it would be good for the system to be able to do - which I agree with, but don’t otherwise consider a high priority right now). I’m also hoping this makes it clear why I think B1/B2-style pre-definition of groups is the right approach for visit, and would be even if we had B3-style functionality available.

ktl · August 14, 2020, 6:50pm

It’s not clear to me if you are expecting the B3 dimensions to be persisted forever. I would think they would have to be for provenance purposes, even if, unlike B1 groups, they are never reused. I’m not sure why it’s up to the PipelineTask to declare (and presumably generate) these dimensions; why can’t they be generated by an external utility like define-visits?

timj · August 14, 2020, 6:52pm

I’m fine with a generic grouping label that can be used to arbitrarily combine exposures. Presumably a task would then have to exist for combining all the things in that exposure group. The only issue is the permanence of the grouping. If it really is arbitrary and different each time the processing runs then that’s going to be hard to keep track of. I don’t mind this being distinct from the more formal visit grouping that requires on-sky. It’s clear that all observations could benefit from some grouping concept, not just on-sky.

B2 and B3 seem fine to me. B3 is a transient run time thing that is an external override of quantum graph generation and B2 is a more generic approach that is semi-permanent.

I know that the calibration plan will involve external grouping definitions from text files.

timj · August 14, 2020, 6:52pm

Why? The provenance knows what the inputs were and doesn’t really care how that grouping was determined.

jbosch · August 14, 2020, 7:27pm

I was imagining that the values would be persisted in the dataset table (to connect the integer dataset ID to the data ID and DatasetType), but nowhere else. The quantum tables would still provide full provenance - those link datasets to task executions purely via the integer dataset IDs, so they don’t care about data IDs at all. Those would contain sufficient information to reconstruct the group definitions, but it would look quite different from how other dimension relationships are represented in the Registry, and that might limit (or at least add difficulty to) support for some “run almost but not quite the same processing” use cases.
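
As a sketch of what "sufficient information to reconstruct the group definitions" might mean in practice, assuming a hypothetical dataset table and hypothetical quantum-provenance rows (the layouts below are invented for illustration and are not the real Registry tables):

```python
from collections import defaultdict

# Hypothetical rows of a dataset table: integer ID -> (dataset type, data ID).
datasets = {
    1: ("raw", {"exposure": 101}),
    2: ("raw", {"exposure": 102}),
    3: ("raw", {"exposure": 103}),
    4: ("raw", {"exposure": 104}),
    5: ("pairResult", {"pair": 0}),
    6: ("pairResult", {"pair": 1}),
}
# Hypothetical quantum provenance rows: (quantum ID, dataset ID, role).
# Note that these refer to datasets only by integer ID.
quantum_links = [
    (10, 1, "input"), (10, 2, "input"), (10, 5, "output"),
    (11, 3, "input"), (11, 4, "input"), (11, 6, "output"),
]

inputs, outputs = defaultdict(list), {}
for quantum_id, dataset_id, role in quantum_links:
    if role == "input":
        inputs[quantum_id].append(dataset_id)
    else:
        outputs[quantum_id] = dataset_id

# Recover "pair label -> exposures" by walking output -> quantum -> inputs.
groups = {
    datasets[outputs[q]][1]["pair"]: sorted(
        datasets[d][1]["exposure"] for d in inputs[q]
    )
    for q in outputs
}
```

The grouping is recoverable, but only by joining through the quanta, which is quite unlike looking up a row in a dimension table.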

jbosch · August 14, 2020, 7:33pm

Did you mean to say B1 here, too, or instead of B2? I think of B1 as the “define a visit-like grouping concept in addition to letting visit stay as it is” option.

ktl · August 14, 2020, 8:02pm

I guess I’m seeing B3 as a kind of “ingest” of a new dataset type which happens to be a list of data IDs of other datasets. Does that sound correct? The only “dimension” is the arbitrary unique IDs for each of the new “list” datasets.

jbosch · August 14, 2020, 8:25pm

I don’t think that’s the same as what I was thinking; my model was that the associations would flow directly from the user (in a text file, or embedded logic) to a PipelineTask method that’s responsible for returning (simplifying a bit here) the Quanta it wants to run, given information about the data IDs the user indicated on the command line. The returned Quanta would contain the input and output data IDs, and hence relate them, but the fact that there was some kind of grouping used to create them would be completely internal to the PipelineTask - Butler would treat those data IDs as totally opaque and ask no questions about how the Quanta came to be when it saves them.

That doesn’t mean what you have in mind wouldn’t work; I don’t have a complete picture of how it might work in my head by any means, but uploading that list into a temporary table and (temporarily) treating that like a regular dimension seems plausible as a way to implement it. It seems like it’s a bit more explicit about the grouping instead of encapsulating it in the PipelineTask (which is probably good and bad, with no obvious lean right now).

timj · August 17, 2020, 3:41pm

Ok yes. I was saying that a generic grouping concept distinct from on-sky visit is fine with me. The more specific we can be with what “visit” means the better from my point of view.