Figuring out how to call the Python API

rknop · April 1, 2022, 9:06pm

I’m trying to connect the cookbooks you can find for command-line usage of “butler” and “pipetask” to the Python API. The Python API is documented, but I fear I find the documentation a bit impenetrable; it looks like it’s a reference for people who already have an overview of how it all works.

I’ve made some progress by looking at source code, but the level of abstraction from the click toolkit makes finding what I’m looking for very tangled.

This is where I’m stuck at the moment. I’m looking at a command like this:

pipetask --long-log run --register-dataset-types -j 12 \
-b $REPO --instrument lsst.obs.decam.DarkEnergyCamera \
-i DECam/raw/all,DECam/calib/curated/19700101T000000Z,DECam/calib/unbounded \
-o DECam/calib/merian/bias-01 \
-p $CP_PIPE_DIR/pipelines/cpBias.yaml \
-d "instrument='DECam' AND exposure IN $BIASEXPS" \
-c isr:overscan.fitType='MEDIAN_PER_ROW'

(from Merian Data Processing Using the LSST Science Pipelines - HackMD )

From looking at cpBias.yaml, I am pretty sure that the corresponding python class is lsst.ip.isr.IsrTask, which needs to be passed a lsst.ip.isr.IsrTaskConfig. However, I haven’t figured out how the various other arguments to the command line would correspond to anything I give to either of those objects: things like the --instrument argument, for instance. I believe in this case that the ISR task is figuring out which defect files to use from the input collections it’s given, but I am not sure how to specify that to the Python API.

Of course I’d like to figure this one question out, but in a more general sense, is there something I’m missing in terms of how to figure this kind of thing out? Is there a layer of API documentation that I haven’t found? Is there a methodology for going from the command line arguments to the corresponding parameters to use in the python API?

Thanks,

-Rob

natelust · April 5, 2022, 2:49pm

Hello, I’m not sure this will be a complete answer for you, as I am not 100% sure what you are trying to accomplish, but I want to direct you to some resources that might help you answer some questions, and explain how some of this stuff fits together.

First, check out creating a pipeline for some info about what pipelines are, how to create them, and some info on their execution.

Second, Creating a PipelineTask explains a bit about what each of the tasks in the pipeline file is, how they behave, and how to create one from scratch.

The Butler docs have some info on the underlying data storage model, and some low level apis. I don’t expect this is directly useful for you, but might be a good reference for terms you might find, or to satisfy your curiosity.

More directly to your question, pipetask itself was implemented with python, but was not designed to be an external interface point for those wishing to interact through python. That api is still under design and development.

However, we do have an early api available that is intended to solve simple needs (such as running bits of code through notebooks, etc) and may be subject to change. That said, people who have used it thus far have found it quite useful. The docs for it can be found here, and it is importable from the lsst.ctrl.mpexec package.

If you have specific questions on any part of this, feel free to ask and we will do our best to help.

Edit:
Some specifics of what I linked to may not be available on v23, but the concepts should all be applicable.

rknop · April 5, 2022, 3:17pm

Thank you, that is helpful, although there’s still a lot I’m trying to figure out.

What I’m trying to accomplish right now is just a small piece of something bigger. The bigger thing is just trying to understand the API so that I can do things with it. At the moment, the step I’m stuck on is just running the ISR stuff (really, just overscan subtraction) on some zero images. (Obviously this is not an end goal.) I can find various cookbooks for the command line (running “butler …” and “pipetask …”), and can find the yaml files (such as are described in “building a pipeline”). And, probably I could get through most of the data I’m working on right now with just running from the command line. But, there is a Python API, so I’d like to figure out how to interact with the code that way. I feel very thick as I struggle with doing the most basic stuff with this, but I know I’m not thick in general. (I was able, for example, to figure out how to build a small package for myself to make visualizations by writing OpenGL code in less time than it’s taken me to fail to figure out how to overscan correct an image with the DM stack Python API.)

For what I’m doing right now, from the command-line references I can find plus the yaml, I figured out that (probably) IsrTask is the thing I’m looking for. I found the documentation on it (source code and/or use of dir() and help()) here – ip_isr/isrTask.py at main · lsst/ip_isr · GitHub . (There’s no documentation on this here: lsst.ip.isr — LSST Science Pipelines .) I have a butler instance into which I’ve loaded the curated calibration information for DECam (the camera whose images I’m working with), and into which I’ve loaded the raw bias (zero) images for the dataset I’m working on. I have then queried the butler in Python to get a list of the actual zero images (either DatasetRef objects or the things you get by passing those to butler.get()). However, I have yet to figure out how to successfully pass this to IsrTask.run() to actually process the zero images (either individually or as a batch). I need (somehow) to pass the appropriate defect files. I have figured out how to pull those out of the butler as well, but the interfaces aren’t documented well enough for me to figure out what data structures I’m supposed to pass, how I’m supposed to pass them, etc. One way to figure out this kind of thing is to take a command-line task that does what I want and read the source code to figure out what it’s doing, but the “click” library interface for the command line is so abstracted away that I’ve found it very difficult to find the actual source code that parses the actual arguments for one of the command-line examples I can find. (I’ve tried stepping through the code running with -mpdb, but I get 10 or so calls deep and still haven’t found what I’m looking for.)

Is there any documentation out there for somebody who’s trying to figure these sorts of things out?

timj · April 5, 2022, 6:10pm

The pipeline execution infrastructure interfaces to the Task code via the runQuantum method:

github.com: lsst/ip_isr/blob/main/python/lsst/ip/isr/isrTask.py#L996

        super().__init__(**kwargs)
        self.makeSubtask("assembleCcd")
        self.makeSubtask("crosstalk")
        self.makeSubtask("strayLight")
        self.makeSubtask("fringe")
        self.makeSubtask("masking")
        self.makeSubtask("overscan")
        self.makeSubtask("vignette")
        self.makeSubtask("ampOffset")

    def runQuantum(self, butlerQC, inputRefs, outputRefs):
        inputs = butlerQC.get(inputRefs)

        try:
            inputs['detectorNum'] = inputRefs.ccdExposure.dataId['detector']
        except Exception as e:
            raise ValueError("Failure to find valid detectorNum value for Dataset %s: %s." %
                             (inputRefs, e))

        inputs['isGen3'] = True

This is the thing that pulls in the files from the butler and sets up the parameters to call the run method. You are of course allowed to call the run method yourself but that’s not how pipelines work. As @natelust mentions, the API we give you is to first create a quantum graph and then to pass that quantum graph to the SimpleExecutor.

For our batch processing we use the ctrl_bps package and the bps command line tool. This currently ends up running pipetask itself for each batch job.

We have the cp_pipe package for generating master biases etc. @czw can help you with that if you have questions.

I’m not sure, though, whether you are asking your questions because you are trying to learn how graph building and pipeline execution works or because you are trying to do something in a way that is very different to how we expect things to be done.

rknop · April 5, 2022, 6:31pm

I’m trying to figure out the right way to accomplish things using the Python API; I guess that means I want to figure out how graph building and pipeline execution works. However, from the documentation I’ve found, I hadn’t realized before now that that was what I was looking for.

natelust · April 5, 2022, 7:03pm

So, I think the disconnect here is that there are fundamentally two different APIs in play, layered on each other, not one complicated one.

Tasks (such as IsrTask) are algorithms which take in some data and return some processed result. This is done with the run method (excluding for the moment configuring the algorithm, which is done in __init__). This method takes in-memory data products and returns in-memory data products. This is the API to use if you have data (arrays, objects, etc.) that you have produced or loaded lying around.

Managing data that comes from other sources (other tasks, off a telescope, something) is a separate system, the so-called middleware (butler, registry, etc). The job of this is to identify what data you need based on what you specify (plus general relations between datasets, what is declared beforehand, etc).

PipelineTasks are specific algorithms that have a bridge that allows the middleware, which manages the data, to pass the appropriate data along to the run method of a (Pipeline)Task. This is done via the runQuantum method (a Quantum being the minimum unit of data needed for the algorithm to run: basically the arguments required for run and the outputs expected from run). runQuantum is not a user API, but the one the middleware uses to interact with the task.

The coordination of loading the appropriate known data from the butler, passing it into a task, retrieving output from the task, and saving it is done by some sort of executor. You have seen one of these already: pipetask is an executor designed to manage all that complexity for you, and is invoked from the command line. I pointed you at another, SimplePipelineExecutor, which is intended to be used from within python (in a script, notebook, whatever).
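The layering described above can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not LSST code: ToyOverscanTask, toy_executor, the dictionary standing in for the butler, and the tuple keys are all hypothetical names invented for this illustration. It shows run() working purely on in-memory data, runQuantum() acting as the bridge that fetches inputs and stores outputs, and an executor that drives runQuantum() once per quantum.

```python
import statistics


class ToyOverscanTask:
    """Hypothetical stand-in for a PipelineTask; not an LSST class."""

    def run(self, pixels, overscan):
        # Algorithm layer: in-memory inputs in, in-memory outputs out.
        level = statistics.median(overscan)
        return {"corrected": [p - level for p in pixels]}

    def runQuantum(self, store, input_keys, output_keys):
        # Bridge layer: fetch inputs by key, call run(), store the outputs.
        inputs = {name: store[key] for name, key in input_keys.items()}
        outputs = self.run(**inputs)
        for name, key in output_keys.items():
            store[key] = outputs[name]


def toy_executor(task, store, quanta):
    """Hypothetical stand-in for an executor: one runQuantum call per quantum."""
    for input_keys, output_keys in quanta:
        task.runQuantum(store, input_keys, output_keys)
    return len(quanta)


# Direct use of the algorithm API: the caller supplies the data itself.
task = ToyOverscanTask()
direct = task.run(pixels=[105, 110, 103], overscan=[100, 101, 99])

# Use through the toy "middleware": the caller only names datasets.
store = {("raw", 1): [105, 110, 103], ("overscan", 1): [100, 101, 99]}
quanta = [
    ({"pixels": ("raw", 1), "overscan": ("overscan", 1)},
     {"corrected": ("postISR", 1)}),
]
toy_executor(task, store, quanta)
```

Both paths produce the same corrected pixels; the difference is only in who is responsible for finding the inputs and saving the outputs.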

Where a Pipeline comes in is that once you have a system that is capable of loading data, executing a task, managing outputs, etc, it is a natural abstraction to start talking about doing this with more than one task at a time, chaining all the inputs and outputs together. Or conversely, you can think of executing a single task as a specialization of running a larger pipeline.

The other bit about pipelines is that they contain specific configurations for the tasks (algorithms) contained in them that specialize them for the specific context the pipeline is designed to run in (i.e. calibration with LSSTCam vs HSC would be two different pipelines with different configurations, though they may derive from some generic calibration pipeline).

So the question is really which API is required for what you are trying to do: the API to interact with everything at a high level, where you basically specify only constraints on inputs and outputs and what to run, or the APIs of individual algorithms, where you only care about processing in-memory objects and getting in-memory objects back. (Sometimes the former is easier to use if you are working with existing datasets, because you don’t need to do any managing of data yourself.)

rknop · April 5, 2022, 7:23pm

Aha, thank you, yes, I hadn’t appreciated that distinction.

Right now, my interest is probably in the higher-level API. I strongly suspect eventually I’m going to want to be able to understand and use the lower-level API as well, and I would like to have a clear mental model myself as to how the two interact. For now, though, my zeroth-order goal is to be able to use the Python API to accomplish the same sorts of things I could accomplish by using “butler” and “pipetask” from the command line, as is documented in various cookbooks and such.

natelust · April 5, 2022, 7:51pm

Then yes, likely the best place to start is SimplePipelineExecutor; it’s basically the same as what you would do on the command line, but from python. It does have a few ways to directly use lower-level middleware primitives (directly creating your own pipeline programmatically vs loading one), but I would not start there.

It should be enough to use its methods to do butler creation, creating an executor from a pipeline file, then executing from a few lines of python.

That said, reading the Pipeline document will still be very useful to you, because it will help you understand configuration, controlling execution, selecting individual tasks, that kind of thing.

czw · April 5, 2022, 8:11pm

Just to put the link here, cp_pipe does have documentation on constructing calibrations. ip_isr does need to have the documentation updated. I’ve filed a JIRA ticket (DM-34332) so that doesn’t continue being ignored.

rknop · April 5, 2022, 8:52pm

It looks like that documentation is all for using the command-line utility, though, rather than the Python API.

Where would I look to learn more about how one constructs a QuantumGraph to pass to SimplePipelineExecutor?

ktl · April 6, 2022, 9:43am

I hope this is the explicit example that you are missing.

In SimplePipelineExecutor — LSST Science Pipelines :

Most callers should use one of the classmethod factory functions (from_pipeline_filename, from_task_class, from_pipeline) instead of invoking the constructor directly; these guarantee that the Butler and QuantumGraph are created consistently.

Wherever you see pipetask run -b BUTLER_REPO -p PIPELINE_YAML -i INPUT_COLLECTIONS -o OUTPUT_COLLECTION -d DATA_QUERY -c LABEL:KEY=VALUE, you can instead do:

from lsst.ctrl.mpexec import SimplePipelineExecutor
from lsst.pipe.base import Pipeline

butler = SimplePipelineExecutor.prep_butler(BUTLER_REPO, inputs=[INPUT_COLLECTIONS], output=OUTPUT_COLLECTION)
pipeline = Pipeline.from_uri(PIPELINE_YAML)
pipeline.addConfigOverride(LABEL, KEY, VALUE)
spe = SimplePipelineExecutor.from_pipeline(pipeline, where=DATA_QUERY, butler=butler)
quanta = spe.run(True)

If you don’t need a config override, it’s simpler:

from lsst.ctrl.mpexec import SimplePipelineExecutor

butler = SimplePipelineExecutor.prep_butler(BUTLER_REPO, inputs=[INPUT_COLLECTIONS], output=OUTPUT_COLLECTION)
spe = SimplePipelineExecutor.from_pipeline_filename(PIPELINE_YAML, where=DATA_QUERY, butler=butler)
quanta = spe.run(True)

Handling the --long-log option to pipetask is a bit trickier because it relies on an internal API that is not documented in pipelines.lsst.io (but has reasonable docstrings):

from lsst.daf.butler.cli.cliLog import CliLog
CliLog.initLog(longlog=True)
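As a rough illustration of how the recipe above might be applied to the cpBias command at the top of the thread, here is a sketch in two parts. The helper names parse_pipetask_run and run_from_args are hypothetical and not part of any LSST package. The first function only splits a command line into the pieces the recipe needs and handles just the options that appear in this thread. The second feeds those pieces to the calls shown above; it additionally assumes that Pipeline.addInstrument accepts the instrument class name as a string, which is my understanding of lsst.pipe.base but is not confirmed anywhere in this thread. I am not aware of an equivalent of -j in this API, so that option is left unhandled.

```python
import os
import shlex


def parse_pipetask_run(command):
    """Split a `pipetask run` command into the pieces the recipe needs.

    Hypothetical helper: options not seen in this thread are collected
    under "unhandled" rather than guessed at.
    """
    tokens = shlex.split(command.replace("\\\n", " "))
    names = {"-b": "repo", "-o": "output", "-p": "pipeline",
             "-d": "where", "--instrument": "instrument"}
    args = {"inputs": [], "overrides": [], "unhandled": []}
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in names:
            args[names[tok]] = tokens[i + 1]
            i += 2
        elif tok == "-i":
            # Input collections are given as one comma-separated string.
            args["inputs"] = tokens[i + 1].split(",")
            i += 2
        elif tok == "-c":
            # LABEL:KEY=VALUE becomes a (label, key, value) triple.
            label, _, rest = tokens[i + 1].partition(":")
            key, _, value = rest.partition("=")
            args["overrides"].append((label, key, value))
            i += 2
        else:
            args["unhandled"].append(tok)
            i += 1
    return args


def run_from_args(args):
    """Run the parsed command with SimplePipelineExecutor (needs the stack)."""
    # Imported lazily so the parser above works without the LSST stack.
    from lsst.ctrl.mpexec import SimplePipelineExecutor
    from lsst.pipe.base import Pipeline

    butler = SimplePipelineExecutor.prep_butler(
        os.path.expandvars(args["repo"]),
        inputs=args["inputs"], output=args["output"])
    pipeline = Pipeline.from_uri(os.path.expandvars(args["pipeline"]))
    if "instrument" in args:
        pipeline.addInstrument(args["instrument"])  # assumed to accept a str
    for label, key, value in args["overrides"]:
        pipeline.addConfigOverride(label, key, value)
    spe = SimplePipelineExecutor.from_pipeline(
        pipeline, where=os.path.expandvars(args.get("where", "")),
        butler=butler)
    return spe.run(True)  # True plays the role of --register-dataset-types


example = parse_pipetask_run(
    "pipetask run -b $REPO -i DECam/raw/all,DECam/calib/unbounded "
    "-o DECam/calib/merian/bias-01 -c isr:overscan.fitType='MEDIAN_PER_ROW'"
)
```

With the stack set up, run_from_args(example) would then be expected to behave like the two snippets above, with the config override applied.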