makeCoaddTempExp.py - zero good pixels

I’ve recently been working on distributing various steps of the LSST pipeline (processCcd, the coaddition tasks, etc.) across a batch-processing system managed by Condor (specifically, CANFAR at the CADC). The goal is to access hundreds of processors, rather than ~10, for time-sensitive projects (moving-object tracking).

I have been able to distribute the processCcd.py task successfully. Each job processes either a single CCD or an entire HSC frame, then tarballs and copies the butler rerun/processCcdOutputs directory back to cloud storage. I am now trying to execute makeCoaddTempExp.py on the same set of frames. I am not copying back the butler registry, because it is technically different for each batch job (i.e. for each frame).
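
For context, each batch job is roughly a wrapper like the following. This is only a sketch: the paths, the visit number, and the copy-back command are placeholders for whatever your setup actually uses.

```bash
#!/bin/bash
set -e

VISIT=2974                                   # placeholder visit number
REPO=/scratch/${VISIT}/HSC_fullproc-lsst     # per-job butler root on the worker

# process one full HSC visit (all CCDs) in the usual Gen2 way
processCcd.py "$REPO" --rerun processCcdOutputs --id visit=${VISIT} -j 4

# tarball the outputs (but not the per-job registry) and copy them back
tar czf processCcdOutputs_${VISIT}.tar.gz -C "$REPO/rerun" processCcdOutputs
vcp processCcdOutputs_${VISIT}.tar.gz vos:project/outputs/   # placeholder VOSpace copy
```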

To get makeDiscreteSkyMap.py to work, I had to reconstruct the butler repo inside the batch job. Specifically, I copy over the raw frame(s) and the matching processCcd outputs, then ingest the raw frames to reconstruct the butler registry. I also recreate the CALIB registry (ref_cats files, transmission curves, BFK, etc.). Importantly, I had to modify the _root variable in processCcd/repositoryCfg.yaml to reflect the new location of the butler (which varies from batch job to batch job).
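
A rough sketch of that per-job reconstruction is below. The old and new roots and the raw-file location are placeholders, and the sed assumes the old root string only appears in path entries of the yaml.

```bash
set -e

OLDROOT=/media/NHproc/batch/HSC_fullproc-lsst   # root recorded when processCcd originally ran (placeholder)
NEWROOT=$PWD/HSC_fullproc-lsst                  # per-job location of the copied repo (placeholder)

# rebuild the butler registry by re-ingesting the raw frame(s)
ingestImages.py "$NEWROOT" "$NEWROOT"/raw/*.fits --mode=link

# rewrite the recorded _root (and any other absolute paths) to the per-job location
sed -i "s|${OLDROOT}|${NEWROOT}|g" \
    "$NEWROOT"/rerun/processCcdOutputs/repositoryCfg.yaml
```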

makeDiscreteSkyMap seems to work perfectly fine. See attached output.

Then with makeCoaddTempExp, it seems to recognize the correct files, reporting the right numbers, visits, IDs, filters, etc., but it reports that the images contain zero good pixels. See attached output.

I feel like I am missing some detail in creating the butler, but I can’t think of what it might be.

And before anyone asks: the distributed approach is totally worth the pain the Condor system requires; processCcd on the full 200-image dataset went from ~2 days to ~30 minutes. :slight_smile:

Thanks for the help again.

output.txt (56.7 KB)

Unfortunately, full relocatability of Gen2 Butler repos was never completely accomplished, but it should still be easier to hack things up than to reingest everything. One thing that sometimes helps is to fall back on Gen1 behavior: remove any/all repositoryCfg.yaml files and rely on _parent symlinks, if they are even required in your case.
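
Something along these lines (an untested sketch; the paths are placeholders):

```bash
REPO=/path/to/HSC_fullproc-lsst        # placeholder per-job butler root

# drop the Gen2 repository configs
find "$REPO" -name repositoryCfg.yaml -delete

# Gen1-style chaining: a _parent symlink in the rerun pointing at the parent repo,
# if one is not already there
ln -sfn "$REPO" "$REPO"/rerun/processCcdOutputs/_parent
```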

But you also have a different problem: `No locations for get: datasetType:jointcal_photoCalib`. This might be a pipeline configuration issue?

Thanks for the response! I’ll try nuking the repositoryCfg.yaml file, though I am not entirely certain what you mean by the _parent symlinks. The CORR files are real and don’t involve any symlinks.

As for the potential configuration issues, this all works correctly from the start if processCcd is run in place rather than being moved around like I am doing.

I tried nuking the repo yaml. It fails at makeDiscreteSkyMap (regardless of whether or not I ingest the raw images first), complaining that no data was found:

```
root INFO: Loading config overrride file '/home/fraserw/lsst_stack/stack/miniconda3-4.7.12-984c9f7/Linux64/obs_subaru/19.0.0-36-g4abb71f4+1/config/makeDiscreteSkyMap.py'
root INFO: Loading config overrride file '/home/fraserw/lsst_stack/stack/miniconda3-4.7.12-984c9f7/Linux64/obs_subaru/19.0.0-36-g4abb71f4+1/config/hsc/makeDiscreteSkyMap.py'
HscMapper WARN: Unable to find calib root directory
CameraMapper INFO: Loading Posix exposure registry from /media/NHproc/batch/HSC_fullproc-lsst/rerun/processCcdOutputs
HscMapper WARN: Unable to find calib root directory
root WARN: No data found for dataId=OrderedDict()
root INFO: Running: /home/fraserw/lsst_stack/stack/miniconda3-4.7.12-984c9f7/Linux64/pipe_tasks/19.0.0-46-g8a917ba3/bin/makeDiscreteSkyMap.py /media/NHproc/batch/HSC_fullproc-lsst --id --rerun processCcdOutputs:coadd --config skyMap.projection=TAN -j 4
HscMapper WARN: Unable to find calib root directory
CameraMapper INFO: Loading Posix exposure registry from /media/NHproc/batch/HSC_fullproc-lsst/rerun/processCcdOutputs
HscMapper WARN: Unable to find calib root directory
makeDiscreteSkyMap INFO: Extracting bounding boxes of 0 images
makeDiscreteSkyMap FATAL: Failed: No data found from which to compute convex hull
Traceback (most recent call last):
  File "/home/fraserw/lsst_stack/stack/miniconda3-4.7.12-984c9f7/Linux64/pipe_tasks/19.0.0-46-g8a917ba3/python/lsst/pipe/tasks/makeDiscreteSkyMap.py", line 97, in __call__
    result = task.runDataRef(butler, dataRefList)
  File "/home/fraserw/lsst_stack/stack/miniconda3-4.7.12-984c9f7/Linux64/pipe_base/19.0.0-17-g9c22e3c+1/python/lsst/pipe/base/timer.py", line 150, in wrapper
    res = func(self, *args, **keyArgs)
  File "/home/fraserw/lsst_stack/stack/miniconda3-4.7.12-984c9f7/Linux64/pipe_tasks/19.0.0-46-g8a917ba3/python/lsst/pipe/tasks/makeDiscreteSkyMap.py", line 154, in runDataRef
    raise RuntimeError("No data found from which to compute convex hull")
RuntimeError: No data found from which to compute convex hull
```

Well, it seems you might be right about the configuration issue, @ktl.

I went back and tried to run makeCoaddTempExp on 3 individual CCD images that I processed locally, all from the same local butler. Copying the images, ingesting, and processCcd all went fine, and makeDiscreteSkyMap also seemed to work.

But makeCoaddTempExp.py failed in the same way I first reported: “0 good pixels” for all patches. See the attached output for an example run of both steps.

This is strange, as the same pipeline was working on some CFHT data with a different mapper.

Any ideas what misconfigurations I should be looking for?

Thanks again.

output.txt (18.8 KB)

The warning before 0 good pixels is:

```
makeCoaddTempExp WARN: Calexp DataId(initialdata={'filter': 'HSC-R', 'pointing': 906, 'visit': 2974, 'ccd': 0, 'field': 'SP01', 'dateObs': '2014-06-25', 'taiObs': '2014-06-25', 'expTime': 90.0, 'tract': 0}, tag=set()) not found; skipping it: No locations for get: datasetType:jointcal_photoCalib dataId:DataId(initialdata={'filter': 'HSC-R', 'pointing': 906, 'visit': 2974, 'ccd': 0, 'field': 'SP01', 'dateObs': '2014-06-25', 'taiObs': '2014-06-25', 'expTime': 90.0, 'tract': 0}, tag=set())
```

It says it’s missing the jointcal_photoCalib dataset, and so the input is ignored. Have you run jointcal.py?
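
If not, the two options are roughly: produce the jointcal outputs, or configure the warping task not to look for them. The sketch below is only indicative; the dataId is a placeholder, and the config field names vary between stack versions (something along the lines of doApplyUberCal in older releases, doApplyExternalPhotoCalib/doApplyExternalSkyWcs in newer ones), so check what your version actually exposes with --show config.

```bash
# Option 1: produce the missing jointcal_photoCalib datasets
# (jointcal needs several overlapping visits; the dataId here is a placeholder)
jointcal.py "$REPO" --rerun processCcdOutputs:coadd --id tract=0 filter=HSC-R

# Option 2: inspect how the warping task is configured to apply external calibrations
makeCoaddTempExp.py "$REPO" --rerun coadd --show config | grep -iE "ubercal|external"
```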

Actually, that turned out to be an issue with my running one of the daily builds (April 22, I think), which was itself necessitated by the eups installation bug where a test failed because my username (fraserw) contains the letters “fr”, which also appear in one of the r-band filter names. Installing v19 of the pipeline under a different user and then moving it to my home directory solved that issue.

FWIW, it looks like I have figured out the entire original issue, i.e. how to handle immovable butlers in batch processing. It basically involves setting up the butler and then using symlinks to trick it into thinking it is running in the correct place. With that and a few careful file copies, I have been able to distribute the LSST pipeline across ~500 cores on separate physical hardware, covering the processCcd, makeDiscreteSkyMap, and makeCoaddTempExp steps. Since makeCoaddTempExp works, I suspect the stacking step will too; that is the last one to be checked.

I will write up a community post on exactly how I did this.
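
Until then, the core of the trick looks roughly like this. The paths are placeholders based on the logs above, and it assumes each job can create the original absolute path on its worker.

```bash
ORIG_ROOT=/media/NHproc/batch/HSC_fullproc-lsst   # absolute path baked into the repo (placeholder)
JOB_ROOT=$PWD/HSC_fullproc-lsst                   # where this batch job unpacked its copy (placeholder)

# recreate the expected path on the worker and point it at the job's copy,
# so none of the absolute paths recorded inside the repo need editing
mkdir -p "$(dirname "$ORIG_ROOT")"                # may need root, depending on the original path
ln -sfn "$JOB_ROOT" "$ORIG_ROOT"

# the usual commands then run against the original path (makeDiscreteSkyMap.py,
# makeCoaddTempExp.py, etc.), exactly as they would on the original machine
```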