A recent HipChat discussion on Star Selectors brought up some design rules for Tasks that I’ve long assumed (and I think others have as well) but have never really formally communicated or tried to write up. As these are just my opinions (for now), please feel free to push back if you disagree with any of this.
1. Tasks should be considered frozen after construction. No new attributes should be added and no attributes should be modified. This further implies that config objects and schemas cannot be modified after `Task` construction. This prevents `Task` state from changing in the course of processing different data units, and hence helps ensure consistent, reproducible processing. I personally think that this restriction should extend to caching and lazy operations (if we need a cache, perhaps we need a better place to put it than the `Task`?), but I could be convinced that we should have an exception for these.

2. It must be possible to construct all `Task`s to be used in a production before actually running any of them. Together with (1), this ensures that all schemas and configuration can be realized without having access to any data. Note that this does not require that all `Task`s actually be constructed before running any of them, but that's often a good idea anyway, if for no other reason than to verify that it is possible.

3. Operations should be performed at the highest level possible: calculations that do not require access to data should be performed in `Task` constructors (or earlier), and calculations that apply to several units of related data (such as CCDs within a visit) should be performed once at the higher level (e.g. visit). While our `Task` framework does not currently provide all the hooks necessary to do this as much as we'd like, we should expect that it ultimately will, and structure our code accordingly (e.g. by putting conceptually higher-level code in separate methods).
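To make rule (3) concrete, here is a minimal sketch of hoisting shared work out of the per-data-unit code path. All names here (`VisitProcessor`, `run_visit`, `run_ccd`, the dict-based "data") are illustrative stand-ins, not real stack APIs:

```python
# Hypothetical sketch of rule (3): work that needs no data lives in the
# constructor; work shared by all CCDs in a visit runs once per visit;
# only truly per-CCD work runs per CCD.

class VisitProcessor:
    def __init__(self, config):
        # Needs no data at all, so it belongs in the constructor (or earlier).
        self.threshold = config["threshold"] * config["gain"]

    def run_visit(self, visit_ccds):
        # Shared by every CCD in the visit: compute once, at the visit level.
        visit_background = (
            sum(ccd["background"] for ccd in visit_ccds) / len(visit_ccds)
        )
        # Conceptually lower-level code lives in its own method.
        return [self.run_ccd(ccd, visit_background) for ccd in visit_ccds]

    def run_ccd(self, ccd, visit_background):
        # Only the per-CCD work happens here.
        return [v - visit_background for v in ccd["pixels"] if v > self.threshold]
```

The point is structural: even before the framework provides the hooks to run `run_visit` and `run_ccd` at different levels, keeping them as separate methods makes the eventual split cheap.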
These principles frequently require the schemas of input catalogs to be passed to `Task` constructors, to allow them to compute their output schemas. When this happens at the `CmdLineTask` level, `lsst.pipe.base.ButlerInitializedTaskRunner` should be used to pass a `Butler` to the `CmdLineTask` constructor, which can then access the `*_schema` for the catalog dataset it needs. See `MeasureMergedCoaddSourcesTask` in `pipe_tasks`' `multiBand.py` for an example. The `CmdLineTask` should also take the schema(s) it needs as arguments directly, allowing the `butler` argument to default to `None` when those are provided.
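The shape of that constructor pattern can be sketched as follows. This is a hedged illustration only: the class name, the `src_schema` dataset name, and the toy list-based "schema" are hypothetical, and the real `Butler` and `lsst.pipe.base` interfaces are richer than this:

```python
# Hypothetical sketch of a constructor that accepts either an input schema
# directly or a butler from which to fetch it. Not the real stack API.

class MergeTask:
    def __init__(self, schema=None, butler=None):
        if schema is None:
            if butler is None:
                raise ValueError("either schema or butler must be provided")
            # In production, the task runner passes a butler and we fetch
            # the *_schema dataset for our input catalog (name is made up).
            schema = butler.get("src_schema")
        # Input and output schemas are fixed here, at construction, before
        # any data is seen, and are never modified afterwards.
        self.schema = schema
        self.output_schema = schema + ["merge_flag"]  # toy "extended" schema
```

User code that already has the schema in hand can then construct the task without any butler at all, which is exactly the flexibility discussed below.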
These rules can make our Tasks harder to use outside of a production setting, and frequently the best way to avoid making them harder to use is to provide a lot of flexibility in how a `Task` constructor receives information about its predecessors (such as the separate schema and butler arguments mentioned above). That flexibility comes with its own costs, and I hope that eventually an improved `Task` framework can take up some of this burden. For now, however, I think our policy must be that the burden should be borne by the Tasks themselves, not by user code.
The fact that Python doesn’t make it easy to enforce `Task` freezing after construction at a language level is also problematic; thinking they can modify config values after `Task` construction is one of the most common gotchas for new users of the stack, partly because we don’t presently raise any exceptions when they do so. This, together with the performance-focused, “no unnecessary operations” philosophy behind (3), does make me worry a bit that I might be trying to apply C++ design principles to a language where they simply don’t work as well, and that there might be more Pythonic principles we could be using instead that would achieve the same results. But I suspect that the reality is that having strongly-enforced rules to ensure consistent processing is just inherently unpythonic, and we’ll always have to balance Pythonic-ness and rigorous consistency.
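For what it's worth, Python can at least raise on post-construction mutation even if it can't freeze objects at the language level. A minimal sketch of one way to do it, assuming a hypothetical `FrozenTask` base class (this is not current stack behavior):

```python
# Hypothetical sketch: raise AttributeError on any attribute assignment
# after __init__ completes, by overriding __setattr__.

class FrozenTask:
    def __init__(self, config):
        self.config = config
        self._frozen = True  # every assignment after this line will fail

    def __setattr__(self, name, value):
        # During __init__, _frozen doesn't exist yet, so assignments pass.
        if getattr(self, "_frozen", False):
            raise AttributeError(
                f"cannot set {name!r}: task is frozen after construction"
            )
        super().__setattr__(name, value)
```

This wouldn't stop a determined user (nothing in Python does), but it would turn the most common gotcha into an immediate, explicit error instead of silently inconsistent processing.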