makeCoaddTempExp.py killed

Hello

While processing some HSC data with version 6.0 of the HSC pipeline, I have come across a problem when running makeCoaddTempExp.py. The only error message is “Killed”:

$ makeCoaddTempExp.py .  --rerun moz --id filter=HSC-Z field=SSP_DEEP_COSMOS tract=9813 --selectId visit=17942
root INFO: Loading config overrride file '/opt/lsst/6.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/obs_subaru/6.0-hsc/config/makeCoaddTempExp.py'
root INFO: Loading config overrride file '/opt/lsst/6.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/obs_subaru/6.0-hsc/config/hsc/makeCoaddTempExp.py'
CameraMapper INFO: Loading exposure registry from /data0/desprezg/HSC/registry.sqlite3
CameraMapper INFO: Loading calib registry from /data0/desprezg/HSC/CALIB/calibRegistry.sqlite3
CameraMapper INFO: Loading exposure registry from /data0/desprezg/HSC/registry.sqlite3
CameraMapper INFO: Loading calib registry from /data0/desprezg/HSC/CALIB/calibRegistry.sqlite3
CameraMapper INFO: Loading calib registry from /data0/desprezg/HSC/CALIB/calibRegistry.sqlite3
HscMapper WARN: Unable to find calib root directory
CameraMapper INFO: Loading calib registry from /data0/desprezg/HSC/CALIB/calibRegistry.sqlite3
root WARN: Unexpected ID field; guessing type is "str"

WARNING: version mismatch between CFITSIO header (v3.37) and linked library (v3.36).

root INFO: Running: /opt/lsst/6.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/pipe_tasks/6.0-hsc/bin/makeCoaddTempExp.py . --rerun moz --id filter=HSC-Z field=SSP_DEEP_COSMOS tract=9813 --selectId visit=17942
makeCoaddTempExp WARN: No exposures to coadd for patch DataId(initialdata={'patch': '0,0', 'filter': 'HSC-Z', 'field': 'SSP_DEEP_COSMOS', 'tract': 9813}, tag=set())
makeCoaddTempExp WARN: No exposures to coadd for patch DataId(initialdata={'patch': '1,0', 'filter': 'HSC-Z', 'field': 'SSP_DEEP_COSMOS', 'tract': 9813}, tag=set())
makeCoaddTempExp WARN: No exposures to coadd for patch DataId(initialdata={'patch': '2,0', 'filter': 'HSC-Z', 'field': 'SSP_DEEP_COSMOS', 'tract': 9813}, tag=set())
makeCoaddTempExp WARN: No exposures to coadd for patch DataId(initialdata={'patch': '3,0', 'filter': 'HSC-Z', 'field': 'SSP_DEEP_COSMOS', 'tract': 9813}, tag=set())
makeCoaddTempExp WARN: No exposures to coadd for patch DataId(initialdata={'patch': '4,0', 'filter': 'HSC-Z', 'field': 'SSP_DEEP_COSMOS', 'tract': 9813}, tag=set())
makeCoaddTempExp WARN: No exposures to coadd for patch DataId(initialdata={'patch': '5,0', 'filter': 'HSC-Z', 'field': 'SSP_DEEP_COSMOS', 'tract': 9813}, tag=set())
makeCoaddTempExp WARN: No exposures to coadd for patch DataId(initialdata={'patch': '6,0', 'filter': 'HSC-Z', 'field': 'SSP_DEEP_COSMOS', 'tract': 9813}, tag=set())
makeCoaddTempExp WARN: No exposures to coadd for patch DataId(initialdata={'patch': '7,0', 'filter': 'HSC-Z', 'field': 'SSP_DEEP_COSMOS', 'tract': 9813}, tag=set())
makeCoaddTempExp.select INFO: Selecting calexp {'visit': 17942, 'pointing': 1111, 'filter': 'HSC-Z', 'ccd': 49, 'field': 'SSP_DEEP_COSMOS', 'dateObs': '2015-01-16', 'taiObs': '2015-01-16', 'expTime': 270.0}
makeCoaddTempExp INFO: Selected 1 calexps for patch DataId(initialdata={'patch': '8,0', 'filter': 'HSC-Z', 'field': 'SSP_DEEP_COSMOS', 'tract': 9813}, tag=set())
makeCoaddTempExp INFO: Processing 1 existing calexps for patch DataId(initialdata={'patch': '8,0', 'filter': 'HSC-Z', 'field': 'SSP_DEEP_COSMOS', 'tract': 9813}, tag=set())
makeCoaddTempExp INFO: Processing 1 warp exposures for patch DataId(initialdata={'patch': '8,0', 'filter': 'HSC-Z', 'field': 'SSP_DEEP_COSMOS', 'tract': 9813}, tag=set())
makeCoaddTempExp INFO: Processing Warp 0/1: id=DataId(initialdata={'visit': 17942, 'field': 'SSP_DEEP_COSMOS', 'filter': 'HSC-Z', 'patch': '8,0', 'tract': 9813}, tag=set())
makeCoaddTempExp INFO: Processing calexp 1 of 1 for this Warp: id={'visit': 17942, 'pointing': 1111, 'filter': 'HSC-Z', 'ccd': 49, 'field': 'SSP_DEEP_COSMOS', 'dateObs': '2015-01-16', 'taiObs': '2015-01-16', 'expTime': 270.0}
makeCoaddTempExp.warpAndPsfMatch.psfMatch INFO: compute Psf-matching kernel
makeCoaddTempExp.warpAndPsfMatch.psfMatch INFO: Adjusted dimensions of reference PSF model from (23, 23) to (2663, 2663)
Killed

Does anyone know where the problem comes from? Could it be linked to an error I get when running coaddDriver.py on the same set of data:

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 24169 RUNNING AT piecld00.isdc.unige.ch
=   EXIT CODE: 9
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions 

Thanks a lot in advance for your answers and advice.

Guillaume

Given that you’re seeing the same killing with both coaddDriver.py and makeCoaddTempExp.py, it seems likely that you’re running out of memory. That seems strange, since warping hasn’t been associated with memory problems before, and all you’re doing is warping a single calexp. Does your machine have only a tiny amount of free memory? Otherwise, I guess it could be a large memory requirement in the (relatively recently-added) PSF matching.
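For a rough sense of scale, the log above shows the reference PSF model being adjusted from (23, 23) to (2663, 2663). A back-of-envelope estimate for a single image plane of that size (assuming float64 pixels; PSF matching will typically allocate several temporaries of comparable size):

```python
# Rough memory footprint of one 2663x2663 double-precision image plane.
nx = ny = 2663          # dimensions reported in the log
bytes_per_pixel = 8     # float64
mb = nx * ny * bytes_per_pixel / 1024**2
print(f"{mb:.1f} MB per plane")  # → about 54 MB
```

One plane alone is modest, so the blow-up, if memory is indeed the cause, would have to come from the number of intermediate products the matching kernel computation holds at once.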

Aside: field doesn’t belong in the --id. That’s the origin of the root WARN: Unexpected ID field; guessing type is "str".
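For reference, the same invocation with the stray field key dropped from --id would look like this (a sketch based on the command quoted above; all other arguments unchanged):

```shell
# Same command as in the original post, with field= removed from --id
makeCoaddTempExp.py . --rerun moz \
    --id filter=HSC-Z tract=9813 \
    --selectId visit=17942
```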

I’m not sure how this is happening. The LSST stack uses v3.36, so it’s not clear to me how you have some code that was built against v3.37.

The machine I use has a total memory of 8 GB. I don’t know if that is enough to process the calexp. I will try to monitor the process to see if it runs out of memory.
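One simple way to watch peak memory from Python is the standard-library resource module (a generic sketch, not pipeline-specific; note that on Linux ru_maxrss is reported in kilobytes):

```python
import resource

def peak_rss_mb():
    """Peak resident set size of this process, in MB (Linux: ru_maxrss is in kB)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

# Allocate ~80 MB so the counter visibly moves.
buf = bytearray(80 * 1024 * 1024)
print(f"peak RSS so far: {peak_rss_mb():.0f} MB")
```

Alternatively, running the task under `/usr/bin/time -v` reports “Maximum resident set size” without touching the code.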

It is indeed a memory problem. The 8 GB of RAM and the 5 GB of swap are not enough to process some files. I made the test with this file:
'filter': 'HSC-Z', 'field': 'SSP_DEEP_COSMOS', 'pointing': 1111, 'visit': 17942, 'ccd': 49, 'dateObs': '2015-01-16', 'taiObs': '2015-01-16', 'expTime': 270.0

@price is it expected that makeCoaddTempExp can use more than 8 GB for one CCD?

Our Slurm cluster at Princeton will, by default, kill jobs that exceed 4 GB/core, and though we can exceed that in some parts of the workflow for deep coadds with many inputs (like COSMOS), most datasets go through just fine. It’s incredibly strange that a single CCD should require more than double that limit.

I’ll see if I can reproduce this here. Could you post the set of commands you’ve run, please?

See Guillaume’s original post (visit 17942 was taken at the beginning of SSP)