Task API changes involving dataIds/butlers

RHL · June 23, 2016, 12:47pm

I took a look at using the tasks that ProcessFile uses to reduce some data in a workshop at LSST@Europe2 but I found that there were a number of changes that contravened the rules for using Tasks. Clearly putting ProcessFile under CI would have caught this, but we really shouldn’t have made these changes in the first place. Maybe the confusion is that very few cmdLineTasks are only used from the command line, so the no-butler rules apply to all tasks — they may add methods to allow their use from the command line, but they must still work directly from python.

With some pretty-trivial changes everything worked nicely (those are the workbooks I mentioned in http://community.lsst.org/t/notebooks-for-lsst-europe2-in-belgrade).

In more detail:

CalibrateTask.run() is no longer an interface that can be used to build pipelines – it takes a dataId-related ExposureIdInfo
CalibrateTask.calibrate() cannot be used either as it takes an ExposureIdInfo
CharacterizeImageTask.__init__() has a compulsory butler object

I can work around this (import ExposureIdInfo and build one with meaningless values; pass None to CharacterizeImageTask).

My reading of https://jira.lsstcorp.org/browse/RFC-173 did not permit “functional” tasks to depend on anything to do with butlers or dataIds, although I (now) note that Russell said that he was going to add this butler argument in an RFC-173 comment so I should have known this. The short-term fix is to default these objects to be None (e.g. https://jira.lsstcorp.org/browse/DM-6631) and maybe that’s OK.

Additionally, CalibrateTask assumes that a Psf is present, and does not provide a way to pass sigma through to the detection task. This implies that you /must/ run CharacterizeImageTask first, or add a fake Psf to the Exposure. I don’t like this, and at the very least it should be documented as a pre-condition for the task.

Finally, the packaging is a bit odd:

from lsst.pipe.tasks.characterizeImage import CharacterizeImageTask
from lsst.meas.algorithms.detection import SourceDetectionTask
from lsst.meas.deblender import SourceDeblendTask
from lsst.meas.base import SingleFrameMeasurementTask

Why is CharacterizeImageTask in pipe_tasks but the others in meas? Should it be in meas_base? I don’t know the correct answer here, but surely we can come up with a consistent solution.

swinbank · June 23, 2016, 1:07pm

I certainly don’t disagree with the general point, but: can we be clearer about exactly what these rules are? Looking at e.g. https://lsst-web.ncsa.illinois.edu/doxygen/x_masterDoxyDoc/pipe_tasks_write_task.html I don’t see anything about “no-butler rules”. Are they written down somewhere that I’m forgetting?

jbosch · June 23, 2016, 2:04pm

On this subject, I think part of the problem is that we have no real conventions for how packages should be defined, so we have all inferred our own definitions of each of our current packages from its (historical) content. A key question (but not the only one) is whether to organize packages by conceptual namespaces (which would require larger packages) or to minimize the dependencies of any particular piece of functionality (which I believe would generally result in smaller packages).

In any case, I do not think there is any hope of resolving this particular bit of packaging inconsistency in isolation.

swinbank · June 23, 2016, 2:28pm

It won’t come as a surprise to @jbosch, but for completeness it’s worth noting that addressing this is on the DMLT’s agenda: see DM-6337.

RHL · June 23, 2016, 2:30pm

I can’t find a clear statement, although it does appear in various emails over the last two years; this should be remedied.

However, the referenced page says:

The input description may not just be something like “sensorRef of the needed inputs”. As most of you know, I am unhappy about using blobs/kwargs/sensorRefs as inputs to Tasks, but this mail is not a request to fix this. However, if your task does use a blob then the documentation must list all of its members, both optional and required (and enough information for the user to decide which is which for their application). Also, some tasks don’t have a run method (e.g. SourceMeasurementTask.measure), but there’s no need to fix this now.

The “complete example” may be difficult. I’m mostly concerned about constructor arguments and the case that the inputs include a blob, in which case I want the docs to include how that blob should be built in the calling code – not from the pipe_base argument parser. If important information is provided by the config, then the user should be told how to build that config in their code; if the arguments are a blob of some sort the example must tell the user how to build it. Examples may not assume the use of the cmdLineTask to build sensorRefs as that can’t be used by generic user python scripts.

If you pass in a butler you’re going to find it hard to satisfy those conditions.

swinbank · June 23, 2016, 3:59pm

Agreed: I filed DM-6648.

KSK · June 23, 2016, 5:37pm

I think I understand the issue, but I don’t understand how to solve it. Low level tasks need to do I/O. The example here is the astrometry calibration task. This has always been true. We hid it before by using environment variable magic, but the proper way to do I/O is with the butler (whether it’s hidden in an object that’s passed down or not).

I was hoping that the supertask would help out with some of the messiness of constructing the reference loader, but it will not solve the low level I/O question.

RHL · June 24, 2016, 7:52am

We discussed the passing of a reference catalogue loader that wrapped up a butler, and that seemed to be an acceptable solution.

I don’t see why something like superTask comes in here; surely it’ll just use the code that creates a reference loader (it may happen that it’s written by the same team)