How to get a butler to the new reference object loading code?

rowen · April 25, 2016, 11:42pm

We would like advice on the best way to get a butler to @KSK’s new indexing code.

@KSK implemented his code as task LoadIndexedReferenceObjectTask. This uses a butler and a config parameter for the name of a dataset type to set up loading of index files. When a user requests all objects in a given region the task figures out which index files to read, reads them with the butler and returns the relevant data. At present the task performs this initial setup in its constructor (thus reporting certain kinds of errors as early as possible). However, this means its constructor needs a butler and tasks are not set up to do this. The question is how to solve this problem.

Choices we have considered:

Pass a butler to the loader task’s constructor.

This is a natural thing to do from the loader task’s perspective. The index file loader can be set up as the task is constructed, any errors reported, and the task is then ready to provide reference objects on demand.

However, it is a challenge because no other task is constructed with a butler at present, and we don’t see that changing. Only tasks that do I/O even need a butler, and they receive one (via a data ref) to their run method. Nonetheless, we could pass the butler to all task constructors and most would ignore it (except for constructing subtasks).

If we go this route then we should consider whether tasks that perform I/O should receive and store the butler at construction time, rather than receiving it in the run method. I doubt this is wise because even command-line tasks only do I/O in a small subset of methods, and those are made explicit by passing in the butler.
Pass a butler to the loader task’s run method and all other puplic methods that retrieve data.

I feel it clutters up the loader task and calling tasks, including several methods of reference loaders. Note that the butler argument would be ignored after the first call, since the reference loading code caches it.
Make the butler load reference objects, given a region.

This is our preferred solution. It makes CameraMapper more complicated, but it basically means moving the new code into the camera mapper.

In theory this also provides future expandability. Given a dataset name, the camera mapper can figure out which kind of reference object loader is required and instantiate the correct one. This is not necessarily difficult; a small text file can contain the required class.
Do not use the butler.

We dislike this solution because it breaks the abstraction of I/O and because we would need some other way than the butler to locate the reference data. (The current astrometr.net-based solution uses an environment variable, but we don’t consider that a good long-term solution).

jbosch · April 26, 2016, 1:02pm

We do in fact have several CmdLineTasks that take a Butler argument in their constructor; see multiband.py in pipe_tasks for some examples. I would not be opposed to passing all CmdLineTasks a Butler at construction, actually, to remove the need for a special TaskRunner; I think it’s a fairly common pattern that will only get more common.

That said, I also prefer (3) for this particular problem. Maybe we need some other class besides CameraMapper behind the Butler to do this, but I do not think spatial indexing belongs in science pipelines Task code, and I don’t understand why we’d want to hide other I/O implementation details behind the Butler but not this one.

RHL · April 26, 2016, 3:30pm

Let me add my voice to the consensus. Loading reference files is not a scientific task, it’s an unfortunate necessity (like reading raw data from disk), so it should be hidden from the science code, and the butler is the interface we use for things like this. So solution 3, please.

rowen · April 26, 2016, 5:17pm

If we do adopt solution 3 (having the butler return the catalog), as I hope we do, there are some subtleties of the interface to work out.

Our current reference loaders support two means of describing the desired region on the sky: as a center coordinate and a radius or as a WCS and a bounding box. We may want to support other means in the future, such as arbitrary regions. Given that there are multiple ways to describe a region, I think it would be difficult and unnatural to make the region part of the data ID. If so, the region information could be specified as additional keyword arguments to butler.get, or we could have the butler return a loader, which has the current standard methods for returning a catalog and might gain more in the future.

What is the data ID used for? To identify the catalog, the region, or what? At present the dataset name specifies the catalog and the data ID is used internally by the loader to locate “shard” files. If we stick with that model then the data ID is not used to retrieve the loader.

The other question is timing. If we start passing a butler to all task constructors (which @jbosch suggested would be useful) this is a very quick change. We could then start using @KSK’s code right now, as is, and transition to solution 3 as we have time to work out the details. The other option is to focus on moving @KSK’s code into the butler now – which I am also happy to do as long as I understand the details of how it will be referenced (hence this posting). It’s not that I think moving the code to the butler will be very difficult; the hard part will probably be coming up with a suitable interface.

RHL · April 26, 2016, 11:14pm

I think Jim proposed passing a butler to all CmdLineTasks; that’s not at all the same as to all task constructors. I would be very unhappy about all tasks, but could live with CmdLineTask so long as they are just drivers with no scientific content.

rowen · April 26, 2016, 11:35pm

@RHL and @jbosch the issue with only passing a butler to all CmdLineTask is getting it down to lower-level tasks. If we have a subtask hierarchy such as A -> B -> C and A and C are command-line tasks then how does C get a butler unless B passes it along?

I agree that non-command-line tasks should never use a butler. I just don’t see how we can avoid having one temporarily use one to make its subtasks.

jbosch · April 26, 2016, 11:44pm

Do we actually have a case where this could happen? I’d expect subtask hierarchies could start with CmdLineTasks and then switch to regular Tasks, but I can’t magine why they would switch back.

rowen · April 26, 2016, 11:46pm

I think reading reference catalog data could easily produce this sort of hierarchy. We use the butler to get the reference objects, but we don’t use it for anything else. All unpersistence and persistence of images and catalogs may be done far lower down.

jbosch · April 26, 2016, 11:58pm

I think that just means that any task that gets reference catalog data (or uses a subtask that does) needs to accept a butler argument, regardless of what kind of Task it is, or we need to get the reference catalog lower down and pass it through. I don’t think there’s any kind of design that could prevent that (at least not without doing something evil, like hiding it in a singleton).

KSK · April 27, 2016, 4:29pm

What I’m hearing is that we should require all CmdLineTasks take a butler as an argument to the constructor. They can then pass it on as needed to subtasks. This will solve my immediate problem, I think.

We can then look longer term at the possibility of having the butler return a reference catalog subset directly. I think we will want this and there have already been some cursory discussions at JTM and since with @npease. That will, of course, still require passing the butler down, so this is still relevant.

natepease · April 27, 2016, 4:46pm

I thought dataId was more generic than that. I think the goal is as basic as: “key-value pairs that describe the dataset in a scientifically meaningful way”.

Couldn’t the region could be passed as the single value in the dataId? I think the region value could be any type so long as the mapper can figure out how to deal with it (which it would have to be able to do for a separate keyword arg, too).

jbosch · April 27, 2016, 5:08pm

+1 on @KSK’s near-term solution, and @natepease’s suggestion that we just pass a region as the dataId in the long run.

rowen · April 27, 2016, 6:28pm

This sounds perfect to me. I will file a ticket to pass a butler to all CmdLineTasks and hope we don’t run into A -> B (non-cmd-line-task) -> C. At least if C needs a butler, this will fail at construction time.

Do we need an RFC for this change?

ktl · April 27, 2016, 6:32pm

Please wait a few hours; I’m preparing a perhaps long-winded comment on this thread that I have been unable to get to until today.

rowen · April 27, 2016, 11:16pm

OK. If you have any insights into the new SuperTask and whether it makes any sense to pass the butler to the constructor instead of other methods, that would be useful.

timj · April 27, 2016, 11:19pm

The assumption is that the SuperTask run_quantum() method will always be given a butler. There is no requirement for that butler to be passed into the run() method that will be called from run_quantum().

rowen · April 27, 2016, 11:22pm

The question is: if we do end up passing butler to the constructor of all command line tasks (and thus, presumably, super tasks) then in theory we could omit the butler argument to run_quantum. This assumes there is only one butler, and it’s the same one when the task is constructed as when it is run. (If those assumptions are wrong then perhaps the existing command line tasks that do want a butler in their constructor will need modification.)

timj · April 28, 2016, 12:18am

We can’t do that. It’s fundamental to SuperTask design in terms of provenance tracking, freeze/thaw and doing the in memory get/put sequence.

ktl · April 28, 2016, 9:56am

First, it seems likely that in the not-too-distant future any task that can do I/O will be required to be able to report what input and output datasets it may access before it executes. While it would be nice if that reporting could be done with minimal state set up and passed in, I think this may well involve use of a Butler, due to our desire for dynamically-created output dataset types. So passing a Butler instance to the task constructor for any I/O-capable task seems like it will be required. It does seem possible for that instance to be saved for use during execution (although note that some of the Butler instance’s internal state may change between task construction time and task execution time). This seems concordant with my understanding of where this thread has been headed.

Second, I think it will be forbidden to call get() and put() while performing the above reporting of input and output datasets, since setting up for those is exactly why the reporting needs to happen. This means that task constructors should not do I/O. This seems in conflict with the design for at least this particular task. One solution might be to define a third method for tasks (perhaps called something like init()), separate from construction but invoked only once after construction and before execution.

Third, in support of having tasks report the datasets they will access, it seems sensible to me to separate the task of transforming dataIds from tasks that use the resulting dataIds. For example, a task that determines the list of calibrated exposures that overlap a given skymap patch would take a patch dataId and return a list of exposure dataIds; the latter would be passed to a coaddition task. In this case, one could think of a similar index lookup task that transforms a region dataId into a list of HTM index dataIds; the latter is then passed to a retrieval task that actually loads the reference catalog shards. But such a strategy is not required in this case because the shards never need to be accessed independently and because there may be reasons to pass the region information to the retrieval code (e.g. to trim the list of returned references).

Fourth, any change to a major interface, especially a compatibility-breaking one, is likely to require an RFC unless you can guarantee that all users of that interface are unaffected. Since you can’t know all the possible CmdLineTasks that have been created, I think an RFC would be prudent.

Fifth, any keyword arguments to butler.get() are isomorphic with dataId components, as NateP said, as long as the values are plain-old-data.

Sixth, NateP and I would generally like both the Butler and especially its underlying Mapper to get out of the business of doing I/O themselves; they should dispatch to external routines that do that (whether in a result type’s class, a separate Formatter-like class, or just free functions). So in the not-too-distant future, the reference catalog loading code will be invoked by the Butler but should not be part of it or the CameraMapper; the necessary information to invoke it will be part of the setup (in a general sense) of the Butler/Mapper. In the short term, it may make sense to have reference catalogs be a type of Storage that is handled appropriately in the Butler’s Storage type-switch, rather than put them into the CameraMapper.

RHL · April 28, 2016, 12:31pm

When I said I was OK with CmdLineTask needing a butler I thought that it was clear that that means that non-CmdLineTasks may not take butlers. In the future that KT outlines where tasks declare their I/O this may change, but in the same way that science tasks may not accept dataIds, they may not accept butlers.

So your A -> B -> C case can’t occur. When we use task-like objects to compose more complex processing flows maybe it can, but only if B is not a “science” task – it’d have to be an instance of the replacement for CmdLineTask.