I’ve recently been working on distributing various steps of the LSST pipeline (processCcd, makeCoadds, etc.) across a batch-processing system managed by HTCondor (specifically, CANFAR at the CADC). The goal is to access hundreds of processors, rather than ~10, for time-sensitive projects (moving-object tracking).
I have been able to distribute the processCcd.py task successfully: each job processes either a single CCD or an entire HSC frame, then tarballs the butler rerun/processCcdOutputs directory and copies it back to cloud storage, roughly as in the sketch below. I am not copying back the butler registry, because it is technically different for each batch job (i.e., for each frame). I am now trying to run makeCoaddTempExp.py on the same set of frames.
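For concreteness, here is a minimal sketch of one processCcd job, assuming a scratch-local butler root and the CANFAR VOSpace tools for the copy back; the data ID, rerun name, and vos: path are placeholders, not my exact script:

```bash
#!/bin/bash
# One batch job, Gen2 command-line tasks. Data ID, paths, and the
# cloud-storage destination are placeholders.
DATA=DATA                                   # per-job butler root on scratch
processCcd.py "$DATA" --rerun processCcdOutputs \
    --id visit=12345 ccd=42                 # hypothetical single-CCD data ID
# Ship the outputs (but not the registry) back to cloud storage.
tar czf processCcdOutputs.tar.gz -C "$DATA"/rerun processCcdOutputs
vcp processCcdOutputs.tar.gz vos:myproject/outputs/   # VOSpace copy
```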
To get makeDiscreteSkyMap.py to work, I had to reconstruct the butler repo inside each batch job. Specifically, I copy over the raw frame(s) and the matching processCcd outputs, then ingest the raws to rebuild the butler registry. I also recreate the CALIB registry (ref_cats files, transmission curves, brighter-fatter kernels, etc.). Importantly, I had to modify the _root variable in processCcd/repositoryCfg.yaml to reflect the new location of the butler (which varies from batch job to batch job).
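In script form, the reconstruction looks roughly like this. The HSC mapper, the per-job scratch layout, and the sed rewrite of repositoryCfg.yaml are assumptions based on my setup, and the exact yaml layout may differ between stack versions:

```bash
# Rebuild a minimal Gen2 butler repo inside the batch job (sketch).
JOB=/scratch/job_${FRAME}            # hypothetical per-job scratch dir
OLDROOT=/original/processing/root    # placeholder: root used when processCcd ran
mkdir -p "$JOB"/DATA/rerun
echo "lsst.obs.hsc.HscMapper" > "$JOB"/DATA/_mapper
# Re-ingest the raw frame(s) to rebuild registry.sqlite3.
ingestImages.py "$JOB"/DATA "$JOB"/raw/*.fits --mode=link
# Unpack the matching processCcd outputs and CALIB/ref_cats files.
tar xzf processCcdOutputs.tar.gz -C "$JOB"/DATA/rerun
tar xzf calib.tar.gz -C "$JOB"/DATA
# Rewrite the butler root recorded at processing time to this job's location.
sed -i "s|$OLDROOT|$JOB/DATA|g" \
    "$JOB"/DATA/rerun/processCcdOutputs/repositoryCfg.yaml
```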
makeDiscreteSkyMap seems to work perfectly fine. See attached output.
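For reference, the two invocations are essentially the tutorial-style ones; the rerun names, filter, and tract/patch IDs below are placeholders:

```bash
# Build a discrete sky map covering everything in the rerun.
makeDiscreteSkyMap.py DATA --id --rerun processCcdOutputs:coadd \
    --config skyMap.projection="TAN"
# Warp the calexps onto that sky map, one tract/patch at a time.
makeCoaddTempExp.py DATA --rerun coadd \
    --selectId filter=HSC-R \
    --id filter=HSC-R tract=0 patch=1,1 \
    --config doApplyUberCal=False
```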
Then with makeCoaddTempExp, it seems to recognize the correct files, reporting back the right number of visits, IDs, filters, etc., but it then reports that the images contain zero good pixels. See attached output.
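In case it helps narrow things down, one check would be to open a calexp through the reconstructed butler and count unmasked pixels directly. This is a sketch against the Gen2 Butler API, with a hypothetical rerun path and data ID:

```bash
python - <<'EOF'
import numpy as np
from lsst.daf.persistence import Butler

# Hypothetical rerun path and data ID; substitute values from the run.
butler = Butler('DATA/rerun/coadd')
calexp = butler.get('calexp', visit=12345, ccd=42)
mask = calexp.getMaskedImage().getMask()
bad = mask.getPlaneBitMask(['NO_DATA', 'BAD', 'SAT'])
ngood = np.sum((mask.getArray() & bad) == 0)
print('good pixels: %d of %d' % (ngood, mask.getArray().size))
EOF
```

If the calexps themselves show plenty of good pixels, that would point at the warping/selection step (e.g., the sky map or WCS) rather than the images.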
I feel like I am missing some detail in creating the butler, but I can’t think of what it might be.
And before anyone asks, the distributed work is totally worth the pain the condor setup requires: processCcd on the full 200-image dataset went from ~2 days to ~30 minutes.
Thanks for the help, again.

output.txt (56.7 KB)