Discussions with @czw have revealed that CPP needs to do pairwise processing of raw calibration frames. This wasn’t completely surprising to me, but it’s not something I had a fully-formed plan ready for.
My understanding is that the steps are basically:
1. define pairs for all or many raws within some validity epoch;
2. process each of those pairs independently;
3. aggregate results from all pairwise processing into a single per-validity-epoch dataset;
4. run other (well-understood and already-supported) CPP tasks with that single dataset as an input.
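The first three steps above can be sketched in plain Python. This is only an illustration under simplifying assumptions, not real CPP or middleware code: `Raw`, `define_pairs`, `process_pair`, `aggregate` and `run_epochs` are hypothetical names, a single float stands in for the pixel data, and pairing consecutive exposures is just one possible pairing rule.

```python
from dataclasses import dataclass
from itertools import groupby
from statistics import mean


@dataclass(frozen=True)
class Raw:
    """Hypothetical stand-in for a raw calibration frame."""
    exposure_id: int
    epoch: str    # label of the validity epoch (assumed to be known)
    level: float  # stand-in for the pixel data


def define_pairs(raws):
    """Step 1: pair consecutive raws within each validity epoch."""
    ordered = sorted(raws, key=lambda r: (r.epoch, r.exposure_id))
    pairs = {}
    for epoch, group in groupby(ordered, key=lambda r: r.epoch):
        members = list(group)
        # An unpaired trailing exposure is simply left out.
        pairs[epoch] = list(zip(members[0::2], members[1::2]))
    return pairs


def process_pair(first, second):
    """Step 2: process one pair independently (here, a difference)."""
    return first.level - second.level


def aggregate(results):
    """Step 3: combine all per-pair results for one epoch."""
    return mean(results)


def run_epochs(raws):
    """Run steps 1-3 and return one aggregated value per epoch."""
    return {
        epoch: aggregate([process_pair(a, b) for a, b in pairs])
        for epoch, pairs in define_pairs(raws).items()
        if pairs  # skip epochs in which no pair could be formed
    }
```

Step 4 is not shown; it would consume the per-epoch value returned by `run_epochs`.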
There are a few different ways to map this to the PipelineTask framework. These are presented in various sections below.
Option A
Do pair definition (1), pairwise processing (2) and aggregation (3) in the run method of a single PipelineTask.
Advantages:
- No additional middleware work needed.
- Pairs can be determined fully at runtime, using the properties of the data, if desired. All other approaches would require each image to be read at least twice (once during group-definition, once during processing) if the images themselves are used to determine pairs.
Disadvantages:
- No QuantumGraph-based parallelism.
- No persistence of per-pair data products.
- Number of pairs aggregated together is constrained by available memory.
It’s doable but a bit awkward to pass in a complete list of pairs from an external file in this scheme; this could be done by making the file containing the pairs a (butler-managed) input dataset, or by making the file a config override that sets (e.g.) a list option in the task’s config class.
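A rough sketch of what Option A looks like from the task's point of view, assuming a list-valued config option for externally supplied pairs. Everything here is hypothetical and deliberately avoids the real PipelineTask base classes: `PairwiseConfig`, `PairwiseTask`, the `pairs` option and the "pair by similar level" rule are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PairwiseConfig:
    """Hypothetical config; `pairs` could be set by a config override file."""
    pairs: list = field(default_factory=list)


class PairwiseTask:
    """Does pair definition, pairwise processing and aggregation in run()."""

    def __init__(self, config=None):
        self.config = config or PairwiseConfig()

    @staticmethod
    def pair_by_level(raws):
        # Runtime pairing from the data itself: pair frames of similar level.
        ordered = sorted(raws, key=raws.get)
        return list(zip(ordered[0::2], ordered[1::2]))

    def run(self, raws):
        """`raws` maps exposure ID to a float standing in for the image."""
        pairs = self.config.pairs or self.pair_by_level(raws)
        # Every per-pair result is held in memory until aggregation,
        # and nothing per-pair is persisted.
        results = [raws[a] - raws[b] for a, b in pairs]
        return sum(results) / len(results)
```

With an empty `pairs` option the task pairs frames from the data at runtime; setting the option replaces that logic with the externally supplied list.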
Option B
Do pair definition (1) during QuantumGraph generation, pairwise processing (2) in the run method of one PipelineTask, and aggregation over pairs (3) in the run method of another PipelineTask.
The advantages and disadvantages are mirrors of those for Option A.
Advantages:
- Natural QuantumGraph-based parallelization for step (2).
- Persistence of per-pair data products.
- Number of pairs aggregated over is unconstrained (or constrained only by disk space).
- Number of pairs aggregated over is unconstrained (or constrained only by disk space).
Disadvantages:
- Needs additional middleware work.
- Pairs must be defined before execution, using Registry or external metadata, not the data itself; during execution, pairs may be dropped, but not created or otherwise changed.
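The shape of Option B can be illustrated with a small sketch in which the pair list is fixed before execution, each pair is processed as an independent unit of work, and aggregation is a separate step over whatever per-pair results exist. This is not QuantumGraph code: a thread pool stands in for graph-based parallelism, a dict stands in for persisted per-pair datasets, and `process_pair` and `execute` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def process_pair(pair, raws):
    """Process one pair; return None to drop it during execution."""
    first, second = pair
    if first not in raws or second not in raws:
        return None  # a pair may be dropped, but no new pair is created
    return raws[first] - raws[second]


def execute(pairs, raws):
    """Run predefined pairs independently, then aggregate the survivors."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda p: process_pair(p, raws), pairs))
    # Stand-in for per-pair data products written to a datastore.
    per_pair = {p: r for p, r in zip(pairs, results) if r is not None}
    if not per_pair:
        raise RuntimeError("all pairs were dropped during execution")
    aggregated = sum(per_pair.values()) / len(per_pair)
    return per_pair, aggregated
```

Because `pairs` is an input rather than something computed from the images, the pairing logic has to live elsewhere (Registry queries or external metadata), which is the trade-off listed above.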
Some of the middleware work needed to enable this approach is giving individual PipelineTasks more control over QuantumGraph generation. This work is already underway on DM-21904; the plan for that work was not put together with this use case in mind, but it should nevertheless address most of the problem.
The rest of the middleware work involves a design choice on how to label pairwise quanta and datasets: data ID keys in Gen3 are dimensions, and are generally pre-declared to the system when a repository is created (because dimensions are often associated with tables, adding or changing the set of dimensions is in general a schema change, and not something we want to do often). The data ID values (rows of those tables) are also pre-declared, but later, in steps that run after repository creation but before processing that uses those data ID values. We do not currently have any dimension that could be used to label pairs of raw calibrations.
We do, however, have a dimension, visit, that can already be used to label pairs of other raws. At present, visits are explicitly and only for back-to-back on-sky science images with identical pointing and sufficiently similar observing conditions that it is reasonable in many contexts to consider them a single observation. This is restrictive, but that is part of what makes them useful.
Option B1
Add a new dimension analogous to visit, but distinct from it.
This would make group-definition (i.e. pair-definition, but the system could support groups of more than two with no additional work) a separate step run before QuantumGraph generation (let alone execution). Exposures could belong to multiple groups, and we could (as with visit) provide group systems in which each exposure is in only one group.
The advantage of this approach (relative to B3, primarily) is that group-definition can be done up front, and groups are then not just reusable, but consistent - data IDs that identify per-group data products are guaranteed to have the same meaning in different processing runs, avoiding a lot of potential confusion.
The disadvantage of this approach is that group definition must be done up front, in a separate, pre-QuantumGraph-generation step. And while that could be run very frequently (even before every processing run), it’s not really designed for that - this would lead to a lot of redundancy in the database representation of things, because the stuff intended to enforce consistency across processing runs isn’t being used; a relationship designed to be many-to-one would be nearly one-to-one in practice.
Note that there is nothing here about groups being defined from header information. That is of course a possibility, but it is in no way a requirement (well, except maybe in OCPS or something, but then that’s an OCPS constraint, not a middleware one).
In particular, passing complete group definitions from a file is easy in this scenario, because the group-definition command is not a PipelineTask and has complete control over its command-line UI. Using config override files would make this particularly easy, but may be a somewhat awkward fit.
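As a sketch of what file-based group definition might involve, the snippet below parses group definitions from JSON and checks whether a given group system puts each exposure in only one group. The file layout (`{system: {group_id: [exposure IDs]}}`) and the names `load_groups` and `is_partition` are assumptions made for illustration; no such command or format exists in the middleware today.

```python
import json
from collections import Counter


def load_groups(text):
    """Parse {system: {group_id: [exposure IDs]}} from JSON text."""
    return {
        system: {int(group_id): list(members)
                 for group_id, members in groups.items()}
        for system, groups in json.loads(text).items()
    }


def is_partition(groups):
    """True if every exposure belongs to exactly one group in the system."""
    counts = Counter(exp for members in groups.values() for exp in members)
    return all(n == 1 for n in counts.values())
```

A group-definition command could run a check like `is_partition` only for systems that promise one group per exposure, and skip it for systems that allow overlapping groups.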
Option B2
Extend the visit concept to include calibration frame groups, removing the requirement that visits represent on-sky science observations.
This is a lot like B1, but probably worse in several respects:
- it means code that currently relies on the visit concept as a way to declare that it works on on-sky science images would need a different way to do that;
- it means the database columns that hold regions for visits need to have a lot of nulls, or we need to define another dimension/table to hold those regions (and probably deal with the messiness of some kind of “inheritance” system for dimensions);
- visit IDs are expected to be defined in such a way that they are unique across all visit systems (though I think the LSST visit IDs we generate do not obey this property, and are hence a looming problem), and hence one does not need to include a visit_system key as well in data IDs. That choice is appropriate in the limit that group definition is very rare compared to processing of already-defined groups - a limit that I think visit is in, but CPP pairs may not be. In contrast, with B1 we could instead allow group IDs to be unique only within a system (while requiring system IDs in data IDs), hence making it easier to define new systems.
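The difference between the two uniqueness rules in the last point can be made concrete with a toy registry. This is a sketch only: `define_global` and `define_scoped` are hypothetical helpers, and plain dicts stand in for the dimension tables.

```python
def define_global(registry, system, group_id, exposures):
    """IDs unique across all systems: the data ID needs only the group ID."""
    if group_id in registry:
        raise ValueError(f"ID {group_id} is already used by another system")
    registry[group_id] = (system, tuple(exposures))
    return {"group": group_id}


def define_scoped(registry, system, group_id, exposures):
    """IDs unique only within a system: the data ID must name the system."""
    key = (system, group_id)
    if key in registry:
        raise ValueError(f"ID {group_id} is already used in system {system}")
    registry[key] = tuple(exposures)
    return {"group_system": system, "group": group_id}
```

Under the scoped rule a new system can start numbering from any value without coordinating with existing systems, at the cost of a longer data ID.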
Overall, the only context in which I think this option is worth considering would be if we had PipelineTasks we wanted to run on both on-sky visits and CPP pairs (I can’t think of any). And even then, I’d lean towards a variant of B1 where we extend the CPP group concept to permit on-sky images instead of extending the visit concept to permit CPP pairs.
As with B1 (and with visits!) there is no requirement (from the middleware, at least) that the group definitions in this scenario be based on headers.
Passing complete group definitions from a file in this scenario is the same as in B1 - we have a define-visits command-line UI already, it doesn’t provide that functionality right now (except via config), and there’s nothing preventing us from adding it other than just doing the work.
@timj has expressed an interest in expanding the visit concept to all exposures in the past, so I’m hoping the above will either explain why it’s not a good idea or give him an opportunity to explain why it might still be a good idea.
Option B3
Add and use support for free-form, no-relationship identifiers in data IDs.
PipelineTasks would be able to declare and use new dimensions, with the restriction that these dimensions never have built-in relationships or predefined values, and hence don’t have database tables and don’t guarantee any kind of data ID consistency across processing runs. Using them would also require the concrete PipelineTask to provide an implementation of the QuantumGraph-generation hook that most other PipelineTasks can inherit from the base class.
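To illustrate the idea (and the consistency caveat), here is a toy version of a base class with a default graph-generation hook and a task that overrides it to invent a free-form `pair` key. None of this is the real middleware interface: `Task`, `PairTask`, `build_quanta` and the `pair` key are hypothetical.

```python
class Task:
    """Toy base class: the default hook makes one quantum per exposure."""

    def build_quanta(self, exposures):
        return [{"exposure": exp} for exp in exposures]


class PairTask(Task):
    """Overrides the hook to label quanta with a free-form 'pair' key."""

    def build_quanta(self, exposures):
        ordered = sorted(exposures)
        pairs = zip(ordered[0::2], ordered[1::2])
        # 'pair' values are made up on the spot; no table backs them.
        return [{"pair": index, "exposures": pair}
                for index, pair in enumerate(pairs)]
```

Running `PairTask().build_quanta` on exposures 1-4 and then on exposures 3-6 yields a quantum with `pair` equal to 0 in both runs, but it refers to (1, 2) in the first and (3, 4) in the second. That is the lack of cross-run data ID consistency described above.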
This option requires more work in the middleware system than the others (so we’d probably need to live with an Option A approach for longer), but it would add a lot of flexibility to the system that would probably be used in other ways in the future.
This is much more natural than B1 or B2 if pair-definition is expected to happen essentially every processing run and pairs are rarely the same. Whether it’s sufficiently better than A to justify the extra work probably depends on whether we can provide intra-PipelineTask parallelization primitives to speed up A (or whether we even need to) and whether we actually do have other use cases (e.g. pairwise diffim?) that would take advantage of B3 functionality.
Passing group definitions from files would hopefully fit naturally into this scheme as a way of passing data IDs from files (which we need for other reasons), and would just be part of the work to get it up and running at all. But some design work is needed on that front, and it’s hard to say exactly how it would work.
B3 most resembles the approach @RHL has advocated in similar contexts; I’m hoping this makes the tradeoffs more clear so we can see whether it’s actually the right choice for CPP pairs (as opposed to just something it would be good for the system to be able to do - which I agree with, but don’t otherwise consider a high priority right now). I’m also hoping this makes it clear why I think B1/B2-style pre-definition of groups is the right approach for visit, and would be even if we had B3-style functionality available.