I am receiving “Bus errors” when running ProcessCcd.yaml (IsrTask/CharacterizeImageTask/CalibrateTask) to process DECam images from the galactic bulge (crowded fields). I haven’t received these errors when processing images from less crowded fields. The logs are not very informative; here are the lines around the core dump, through to the end of the log file:
lsst.ctrl.mpexec.mpGraphExecutor ERROR: Task <TaskDef(CharacterizeImageTask, label=characterizeImage) dataId={instrument: 'DECam', detector: 40, visit: 1033749, ...}> failed; processing will continue for remaining tasks.
lsst.ctrl.mpexec.mpGraphExecutor ERROR: Upstream job failed for task <TaskDef(CalibrateTask, label=calibrate) dataId={instrument: 'DECam', detector: 40, visit: 1033749, ...}>, skipping this task.
lsst.characterizeImage INFO: PSF estimation initialized with 'simple' PSF
lsst.characterizeImage.repair INFO: Identified 100 cosmic rays.
py.warnings WARNING: /gscratch/astro/stevengs/lsst_stacks/stacks/w.2022.06/stack/miniconda3-py38_4.9.2-1.0.0/Linux64/meas_algorithms/gbe01a4569f+ccfec7bf50/python/lsst/meas/algorithms/detection.py:410: FutureWarning: Default position argument overload is deprecated and will be removed in version 24.0. Please explicitly specify a position.
sigma = psf.computeShape().getDeterminantRadius()
py.warnings WARNING: /gscratch/astro/stevengs/lsst_stacks/stacks/w.2022.06/stack/miniconda3-py38_4.9.2-1.0.0/Linux64/meas_algorithms/gbe01a4569f+ccfec7bf50/python/lsst/meas/algorithms/detection.py:447: FutureWarning: Default position argument overload is deprecated and will be removed in version 24.0. Please explicitly specify a position.
sigma = psf.computeShape().getDeterminantRadius()
/gscratch/dirac/stevengs/decam_ddf/code/bin/common.sh: line 137: 15593 Bus error (core dumped) pipetask run -b /gscratch/dirac/stevengs/decam_ddf/repo -i DECam/raw/science/210916/decaps_east,DECam/raw/science-crosstalk-sources/210916/decaps_east,master-bias-certified/210916,master-flat-certified/210916,refcats/gen2,DECam/calib -o DECam/process/calexp/210916/decaps_east -p /gscratch/astro/stevengs/lsst_stacks/stacks/w.2022.06/stack/miniconda3-py38_4.9.2-1.0.0/Linux64/ap_pipe/g442d3c5ed6+fd17e318b0/pipelines/DarkEnergyCamera/ProcessCcd.yaml -c isr:overscan.fitType='MEDIAN_PER_ROW' -c calibrate:connections.photoRefCat='ps1_pv3_3pi_20170110' -c calibrate:photoCal.photoCatName='ps1_pv3_3pi_20170110' -c calibrate:photoRefObjLoader.ref_dataset_name='ps1_pv3_3pi_20170110' -c calibrate:connections.astromRefCat='gaia_dr2_20200414' -c calibrate:astromRefObjLoader.ref_dataset_name='gaia_dr2_20200414' -c calibrate:astromRefObjLoader.anyFilterMapsToThis='phot_g_mean' -c calibrate:astromRefObjLoader.filterMap='{}' -j 28 --register-dataset-types --skip-existing --extend-run --clobber-outputs
lsst.calibrate.astrometry.matcher INFO: Matched 5063 sources
lsst.calibrate.astrometry.matcher INFO: Matched 5079 sources
lsst.calibrate.astrometry.matcher INFO: Matched 5079 sources
lsst.calibrate.astrometry INFO: Matched and fit WCS in 3 iterations; found 5079 matches with on-sky distance mean and scatter = 0.032 +- 0.025 arcsec
lsst.calibrate.photoCal.match.sourceSelection INFO: Selected 8565/13713 sources
lsst.calibrate INFO: Loading reference objects from ps1_pv3_3pi_20170110 in region bounded by [271.58883529, 271.96771310], [-29.45195834, -29.12150937] RA Dec
lsst.calibrate INFO: Loaded 8721 reference objects
lsst.calibrate WARNING: Found version 0 reference catalog with old style units in schema.
lsst.calibrate WARNING: run `meas_algorithms/bin/convert_refcat_to_nJy.py` to convert fluxes to nJy.
lsst.calibrate WARNING: See RFC-575 for more details.
lsst.calibrate INFO: Converted refcat flux fields to nJy (name, units): (g_flux, ''); (r_flux, ''); (i_flux, ''); (z_flux, ''); (y_flux, ''); (i_fluxSigma, ''); (y_fluxSigma, ''); (r_fluxSigma, ''); (z_fluxSigma, ''); (g_fluxSigma, '')
lsst.calibrate.photoCal.match.referenceSelection INFO: Selected 958/10441 references
lsst.calibrate.photoCal.match INFO: Matched 254 from 1966/4225 input and 958/10441 reference sources
lsst.calibrate.photoCal.reserve INFO: Reserved 0/254 sources
lsst.calibrate.photoCal INFO: Applying color terms for filter='r DECam SDSS c0002 6415.0 1480.0', config.photoCatName=ps1_pv3_3pi_20170110 because config.applyColorTerms is True
lsst.calibrate.photoCal INFO: Magnitude zero point: 29.555499 +/- 0.000053 from 229 stars
lsst.calibrate INFO: Photometric zero-point: 29.555499
lsst.calibrate.computeSummaryStats INFO: Measuring exposure statistics
lsst.ctrl.mpexec.singleQuantumExecutor INFO: Execution of task 'calibrate' on quantum {instrument: 'DECam', detector: 32, visit: 1033781, ...} took 934.703 seconds
lsst.ctrl.mpexec.singleQuantumExecutor INFO: Execution of task 'calibrate' on quantum {instrument: 'DECam', detector: 9, visit: 1033787, ...} took 982.710 seconds
EOF
You can see that some processes are still running and producing output after the dump. I am running pipetask with -j 28, so presumably one task produces the bus error while the others keep going. On the node where the bus error occurred, I’ve seen just ~4 processes still running after the dump, and the job exits once the output following the dump appears in the log. Presumably those processes were the calibrate quanta that appear before the end of the log.
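Since output from the 28 processes is interleaved, it can help to pull the failed quanta and the crashed PID out of the log mechanically. The following is a minimal sketch that assumes the log lines have exactly the format shown above; the names `find_failed_quanta`, `find_bus_error_pids`, and `FailedQuantum` are hypothetical helpers written for this post, not part of the LSST stack.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Matches mpGraphExecutor lines of the form
#   "... ERROR: Task <TaskDef(Name, label=xyz) dataId={...}> failed; ..."
# It does not match the "Upstream job failed for task" lines.
FAILED_RE = re.compile(
    r"ERROR: Task <TaskDef\((?P<task>\w+), label=(?P<label>\w+)\) "
    r"dataId=\{(?P<data_id>[^}]*)\}> failed"
)

# Matches the shell's report, e.g. "line 137: 15593 Bus error (core dumped)".
BUS_ERROR_RE = re.compile(r"(?P<pid>\d+) Bus error")


class FailedQuantum(NamedTuple):
    label: str
    detector: Optional[int]
    visit: Optional[int]


def _int_field(name: str, data_id: str) -> Optional[int]:
    """Return the integer value of `name: <digits>` in a dataId string, if present."""
    match = re.search(rf"\b{name}: (\d+)", data_id)
    return int(match.group(1)) if match else None


def find_failed_quanta(lines: Iterable[str]) -> List[FailedQuantum]:
    """Collect the task label, detector and visit of every quantum reported as failed."""
    failed = []
    for line in lines:
        match = FAILED_RE.search(line)
        if match:
            data_id = match.group("data_id")
            failed.append(
                FailedQuantum(
                    label=match.group("label"),
                    detector=_int_field("detector", data_id),
                    visit=_int_field("visit", data_id),
                )
            )
    return failed


def find_bus_error_pids(lines: Iterable[str]) -> List[int]:
    """Collect the PIDs that the shell reported as killed by a bus error."""
    pids = []
    for line in lines:
        match = BUS_ERROR_RE.search(line)
        if match:
            pids.append(int(match.group("pid")))
    return pids
```

Both functions accept any iterable of strings, so an open log file can be passed directly. Applied to the excerpt above, this should report the characterizeImage quantum for detector 40, visit 1033749 as the failed one.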
I can provide a sample core dump that the OS produced (760MB) if desired. Or it would be great to be pointed to debugging tools I can use to find a fix.
I am not sure what’s going on here; perhaps it is related to memory usage, since this doesn’t happen in non-crowded fields? Throwing this out there in case it’s related to an issue in pipeline processing. Thanks for any help with this.
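One way to test the memory hypothesis would be to rerun a single failing quantum under a small wrapper that records how the process ended and the peak resident memory of its children. This is only a sketch using the Python standard library on Linux; `describe_exit` and `run_and_measure` are hypothetical names, and the command passed in would be whatever `pipetask run ...` invocation is being tested.

```python
import resource
import signal
import subprocess
from typing import List, Tuple


def describe_exit(returncode: int) -> str:
    """Translate a subprocess return code into a readable description.

    subprocess reports a child killed by signal N as return code -N,
    so a bus error (SIGBUS) shows up as a negative value.
    """
    if returncode < 0:
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


def run_and_measure(cmd: List[str]) -> Tuple[int, int]:
    """Run `cmd` and return (return code, peak RSS of waited-for children).

    On Linux ru_maxrss is reported in kibibytes. The value is the largest
    resident set size among this process's terminated descendants, not the
    sum over all of them.
    """
    proc = subprocess.run(cmd)
    peak_kib = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return proc.returncode, peak_kib
```

Comparing the peak value for a bulge detector against one from a less crowded field would show whether memory use differs substantially between the two cases.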