CoaddDriver stuck after "Loading config overrride file"

Hi all,

I am trying to analyse HSC data with v20.0.0, but I am stuck at the coaddDriver step.
Basically, I ran

singleFrameDriver.py /datadec/cppm/HSC/HSC_v20_repro/DATA --rerun singleFrameOutputs --id visit=1164..1194:2^17898..17908:2^17926..17934:2^17944..17952:2^17962^1200..1222:2^23690..23694:2^23704^23706^23716^23718 --longlog --cores 48 --timeout 9999999 --batch-profile |tee all_singleFrameDriver_prints.log
makeDiscreteSkyMap.py DATA --id --rerun singleFrameOutputs:coadd --longlog --profile SkyMapProfile.dat |tee makeDiscreteSkyMap.log
jointcal.py DATA --id visit=1164..1194:2 --rerun coadd:jointcal_per_night --longlog -j 24 --profile Logs/JC_Z_1/jointcalProfile.dat --config writeChi2FilesInitialFinal=True
mv *csv Logs/JC_Z_1/
jointcal.py DATA --id visit=17898..17908:2^17926..17934:2^17944..17952:2^17962 --rerun coadd:jointcal_per_night --longlog -j 24 --profile Logs/JC_Z_2/jointcalProfile.dat --config writeChi2FilesInitialFinal=True
mv *csv Logs/JC_Z_2/

and two more jointcal runs for the two other nights.
I ran jointcal independently for each night because I have two nights in filter Z and two in R that are a year apart, so proper motion could be an issue.
All of this ran fine, but the next step, coaddDriver, gets stuck.
For instance for a simple case

coaddDriver.py DATA --rerun jointcal_per_night:coadd --selectId visit=1166 --id tract=9813 patch=5,4 filter=HSC-Z

I get

root INFO: Loading config overrride file '/cvmfs/sw.lsst.eu/linux-x86_64/lsst_distrib/v20.0.0/stack/miniconda3-py37_4.8.2-1a1d771/Linux64/obs_subaru/20.0.0/config/coaddDriver.py'
root INFO: Loading config overrride file '/cvmfs/sw.lsst.eu/linux-x86_64/lsst_distrib/v20.0.0/stack/miniconda3-py37_4.8.2-1a1d771/Linux64/obs_subaru/20.0.0/config/hsc/coaddDriver.py'

and it hangs there forever. The same happens if I add --show data. I am not sure the rerun chaining is right: I go coadd:jointcal_per_night in the jointcal job and then jointcal_per_night:coadd in the coaddDriver job. I also tried without chaining, using --rerun DATA/rerun/jointcal_per_night, and then I get:

lsst.daf.persistence.butlerExceptions.NoResults: No locations for get: datasetType:deepCoadd_skyMap dataId:DataId(initialdata={}, tag=set())

Am I doing something obviously wrong?
More generally, how would you recommend running these steps with regard to the --rerun options?

Thanks,

Ben

I think at least part of the problem here is that you’re creating an infinite loop of reruns. Recall that the syntax --rerun base:derived means "create derived rerun, based on base".

Now, see the pattern of reruns:

  • makeDiscreteSkyMap.py: --rerun singleFrameOutputs:coadd
  • jointcal.py: --rerun coadd:jointcal_per_night
  • coaddDriver.py: --rerun jointcal_per_night:coadd

So there’s a loop: singleFrameOutputs --> coadd --> jointcal_per_night --> coadd.

I would have thought the code would catch this kind of thing, but perhaps not. I’m not sure it’s causing your immediate problem, but it’s certainly not going to work.
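To make the problem concrete: the loop is just a cycle in the "parent" pointers of the reruns. Here is a minimal, purely illustrative sketch of what such a check could look like (the rerun names are taken from the commands above; this is not the actual butler code, which may or may not do anything like this):

```shell
# Parent of each rerun, mirroring "--rerun base:derived" from the commands above.
# After coaddDriver.py runs with jointcal_per_night:coadd, the chain closes on itself.
parent() {
  case "$1" in
    coadd) echo "jointcal_per_night" ;;   # coaddDriver: jointcal_per_night:coadd
    jointcal_per_night) echo "coadd" ;;   # jointcal: coadd:jointcal_per_night
    *) echo "" ;;                         # singleFrameOutputs: root of the chain
  esac
}

# Walk the parent chain, remembering every rerun we have visited.
cycle=""
seen=""
node="coadd"
while [ -n "$node" ]; do
  case " $seen " in
    *" $node "*) cycle="$node"; break ;;  # already visited: we have gone in a circle
  esac
  seen="$seen $node"
  node="$(parent "$node")"
done
echo "cycle detected at: ${cycle:-none}"
```

Walking the parents from coadd visits jointcal_per_night and then lands back on coadd, which is exactly the loop described above.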

There’s no need to make a new rerun for each step of the processing. I usually use a single rerun (usually based on the date, and also my username if it’s in a shared repository, e.g., --rerun price/foo-20200704) for all steps, only branching when there’s a clear need to keep the outputs of a step separate (e.g., re-running with a different configuration, --rerun price/foo-20200704:price/foo-20200704-experiment).
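Under that convention your pipeline could look something like the following. This is only a sketch: the rerun name is a placeholder, the IDs are abbreviated from your commands, and I haven't run this exact sequence.

```shell
# One rerun for the whole pipeline; branch only when an experiment
# needs its outputs kept separate from the main line of processing.
RERUN=price/foo-20200704

singleFrameDriver.py DATA --rerun "$RERUN" --id visit=1164..1194:2 --cores 48
makeDiscreteSkyMap.py DATA --rerun "$RERUN" --id
jointcal.py DATA --rerun "$RERUN" --id visit=1164..1194:2 -j 24
coaddDriver.py DATA --rerun "$RERUN" --selectId visit=1166 \
    --id tract=9813 patch=5,4 filter=HSC-Z --cores 20

# Branching for a re-run with a different configuration would then be:
# coaddDriver.py DATA --rerun "$RERUN":"$RERUN"-experiment ...
```

Every step reads from and writes to the same rerun, so there is no chain to get wrong, and the coadd step can find the deepCoadd_skyMap that makeDiscreteSkyMap.py wrote.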

You’d also be well-served by including --cores in your coaddDriver.py command line, but not too many if you’re on a cluster filesystem like GPFS: the I/O pattern of the first part of coaddDriver.py can really stress such filesystems, so something like 20 is usually my limit.