Bus error when running pipelines

I am receiving “Bus errors” when running ProcessCcd.yaml (IsrTask/CharacterizeImageTask/CalibrateImageTask) to process DECam images from the galactic bulge (crowded fields). I haven’t received these errors when processing images from less crowded fields. The logs are not super informative…here are the lines around the dump leading to the end of the log file:

lsst.ctrl.mpexec.mpGraphExecutor ERROR: Task <TaskDef(CharacterizeImageTask, label=characterizeImage) dataId={instrument: 'DECam', detector: 40, visit: 1033749, ...}> failed; processing will continue for remaining tasks.
lsst.ctrl.mpexec.mpGraphExecutor ERROR: Upstream job failed for task <TaskDef(CalibrateTask, label=calibrate) dataId={instrument: 'DECam', detector: 40, visit: 1033749, ...}>, skipping this task.
lsst.characterizeImage INFO: PSF estimation initialized with 'simple' PSF
lsst.characterizeImage.repair INFO: Identified 100 cosmic rays.
py.warnings WARNING: /gscratch/astro/stevengs/lsst_stacks/stacks/w.2022.06/stack/miniconda3-py38_4.9.2-1.0.0/Linux64/meas_algorithms/gbe01a4569f+ccfec7bf50/python/lsst/meas/algorithms/detection.py:410: FutureWarning: Default position argument overload is deprecated and will be removed in version 24.0.  Please explicitly specify a position.
  sigma = psf.computeShape().getDeterminantRadius()

py.warnings WARNING: /gscratch/astro/stevengs/lsst_stacks/stacks/w.2022.06/stack/miniconda3-py38_4.9.2-1.0.0/Linux64/meas_algorithms/gbe01a4569f+ccfec7bf50/python/lsst/meas/algorithms/detection.py:447: FutureWarning: Default position argument overload is deprecated and will be removed in version 24.0.  Please explicitly specify a position.
  sigma = psf.computeShape().getDeterminantRadius()

/gscratch/dirac/stevengs/decam_ddf/code/bin/common.sh: line 137: 15593 Bus error               (core dumped) pipetask run -b /gscratch/dirac/stevengs/decam_ddf/repo -i DECam/raw/science/210916/decaps_east,DECam/raw/science-crosstalk-sources/210916/decaps_east,master-bias-certified/210916,master-flat-certified/210916,refcats/gen2,DECam/calib -o DECam/process/calexp/210916/decaps_east -p /gscratch/astro/stevengs/lsst_stacks/stacks/w.2022.06/stack/miniconda3-py38_4.9.2-1.0.0/Linux64/ap_pipe/g442d3c5ed6+fd17e318b0/pipelines/DarkEnergyCamera/ProcessCcd.yaml -c isr:overscan.fitType='MEDIAN_PER_ROW' -c calibrate:connections.photoRefCat='ps1_pv3_3pi_20170110' -c calibrate:photoCal.photoCatName='ps1_pv3_3pi_20170110' -c calibrate:photoRefObjLoader.ref_dataset_name='ps1_pv3_3pi_20170110' -c calibrate:connections.astromRefCat='gaia_dr2_20200414' -c calibrate:astromRefObjLoader.ref_dataset_name='gaia_dr2_20200414' -c calibrate:astromRefObjLoader.anyFilterMapsToThis='phot_g_mean' -c calibrate:astromRefObjLoader.filterMap='{}' -j 28 --register-dataset-types --skip-existing --extend-run --clobber-outputs
lsst.calibrate.astrometry.matcher INFO: Matched 5063 sources
lsst.calibrate.astrometry.matcher INFO: Matched 5079 sources
lsst.calibrate.astrometry.matcher INFO: Matched 5079 sources
lsst.calibrate.astrometry INFO: Matched and fit WCS in 3 iterations; found 5079 matches with on-sky distance mean and scatter = 0.032 +- 0.025 arcsec
lsst.calibrate.photoCal.match.sourceSelection INFO: Selected 8565/13713 sources
lsst.calibrate INFO: Loading reference objects from ps1_pv3_3pi_20170110 in region bounded by [271.58883529, 271.96771310], [-29.45195834, -29.12150937] RA Dec
lsst.calibrate INFO: Loaded 8721 reference objects
lsst.calibrate WARNING: Found version 0 reference catalog with old style units in schema.
lsst.calibrate WARNING: run `meas_algorithms/bin/convert_refcat_to_nJy.py` to convert fluxes to nJy.
lsst.calibrate WARNING: See RFC-575 for more details.
lsst.calibrate INFO: Converted refcat flux fields to nJy (name, units): (g_flux, ''); (r_flux, ''); (i_flux, ''); (z_flux, ''); (y_flux, ''); (i_fluxSigma, ''); (y_fluxSigma, ''); (r_fluxSigma, ''); (z_fluxSigma, ''); (g_fluxSigma, '')
lsst.calibrate.photoCal.match.referenceSelection INFO: Selected 958/10441 references
lsst.calibrate.photoCal.match INFO: Matched 254 from 1966/4225 input and 958/10441 reference sources
lsst.calibrate.photoCal.reserve INFO: Reserved 0/254 sources
lsst.calibrate.photoCal INFO: Applying color terms for filter='r DECam SDSS c0002 6415.0 1480.0', config.photoCatName=ps1_pv3_3pi_20170110 because config.applyColorTerms is True
lsst.calibrate.photoCal INFO: Magnitude zero point: 29.555499 +/- 0.000053 from 229 stars
lsst.calibrate INFO: Photometric zero-point: 29.555499
lsst.calibrate.computeSummaryStats INFO: Measuring exposure statistics
lsst.ctrl.mpexec.singleQuantumExecutor INFO: Execution of task 'calibrate' on quantum {instrument: 'DECam', detector: 32, visit: 1033781, ...} took 934.703 seconds
lsst.ctrl.mpexec.singleQuantumExecutor INFO: Execution of task 'calibrate' on quantum {instrument: 'DECam', detector: 9, visit: 1033787, ...} took 982.710 seconds
EOF

You can see that some processes are still running and producing output after the dump. I am running pipetask with -j 28 so presumably one task produces the bus error while the others keep going. For the node that the bus error occurred on, I’ve seen it running just ~4 processes after the dump and then the job exiting after the output following the dump appears in the log. Presumably those processes where the calibrate quanta that appear before the end of the log.

I can provide a sample core dump that the OS produced (760MB) if desired. Or it would be great to be pointed to debugging tools I can use to find a fix.

I am not sure what’s going on here, perhaps related to memory usage since this doesn’t happen in non-crowded fields? Throwing this out there in case it’s related to an issue in pipeline processing. Thanks for any help with this.

Some evidence that this could be from an Out-of-Memory condition:

If the kernel fails to page in a code page due to memory pressure (OOM killer must be disabled) or failed IO request, SIGBUS.

Another cause of bus errors (on Linux anyway) is when the operating system can’t back a virtual page with physical memory (e.g. low-memory conditions or out of huge pages when using huge page memory.) Typically mmap (and malloc) just reserve the virtual address space, and the kernel assigns the physical memory on demand (so called soft page faults.) Make a large enough malloc, and then write to enough of it and you’ll get a bus error.

I’m re-running the pipelines on nodes with 500GB and lowering -j 28 to -j 8. I’m finding that the pipeline processes are reaching memory requirements of ~30GB each. I was running on 200GB memory nodes before, and 200/30=6.6 processes, much less than 28, so it seems plausible I was just running out of memory.

I’ll update in case I am still running into this issue. It might be helpful to have a way to estimate how much memory the pipelines will require. Perhaps those estimates will just be empirically determined rules of thumb e.g. from this experiment…

To get a sense of the number of sources being measured/calibrated in some of the exposures:

  • deblending:
$ cat ${log} | grep 'sources' | grep 'lsst.calibrate.deblend'
lsst.calibrate.deblend INFO: Deblending 6400 sources
lsst.calibrate.deblend INFO: Deblending 4820 sources
lsst.calibrate.deblend INFO: Deblended: of 4820 sources, 1959 were deblended, creating 7300 children, total 12120 sources
lsst.calibrate.deblend INFO: Deblending 3190 sources
lsst.calibrate.deblend INFO: Deblended: of 6400 sources, 2387 were deblended, creating 8123 children, total 14523 sources
lsst.calibrate.deblend INFO: Deblended: of 3190 sources, 1158 were deblended, creating 3834 children, total 7024 sources
lsst.calibrate.deblend INFO: Deblending 4944 sources
lsst.calibrate.deblend INFO: Deblending 4589 sources
lsst.calibrate.deblend INFO: Deblended: of 4589 sources, 2006 were deblended, creating 7901 children, total 12490 sources
lsst.calibrate.deblend INFO: Deblending 3641 sources
lsst.calibrate.deblend INFO: Deblended: of 4944 sources, 2330 were deblended, creating 10238 children, total 15182 sources
lsst.calibrate.deblend INFO: Deblending 3488 sources
lsst.calibrate.deblend INFO: Deblended: of 3641 sources, 1502 were deblended, creating 5918 children, total 9559 sources
lsst.calibrate.deblend INFO: Deblended: of 3488 sources, 1228 were deblended, creating 3993 children, total 7481 sources
lsst.calibrate.deblend INFO: Deblending 6548 sources
lsst.calibrate.deblend INFO: Deblended: of 6548 sources, 2687 were deblended, creating 9615 children, total 16163 sources
lsst.calibrate.deblend INFO: Deblending 4578 sources
lsst.calibrate.deblend INFO: Deblended: of 4578 sources, 1727 were deblended, creating 6094 children, total 10672 sources
lsst.calibrate.deblend INFO: Deblending 6918 sources
lsst.calibrate.deblend INFO: Deblended: of 6918 sources, 2927 were deblended, creating 10485 children, total 17403 sources
lsst.calibrate.deblend INFO: Deblending 4514 sources
lsst.calibrate.deblend INFO: Deblended: of 4514 sources, 1938 were deblended, creating 7369 children, total 11883 sources
  • measurement:
$ cat ${log} | grep 'sources' | grep 'lsst.calibrate.measurement'
lsst.calibrate.measurement INFO: Measuring 12120 sources (4820 parents, 7300 children) 
lsst.calibrate.measurement INFO: Measuring 14523 sources (6400 parents, 8123 children) 
lsst.calibrate.measurement INFO: Measuring 7024 sources (3190 parents, 3834 children) 
lsst.calibrate.measurement INFO: Measuring 12490 sources (4589 parents, 7901 children) 
lsst.calibrate.measurement INFO: Measuring 15182 sources (4944 parents, 10238 children) 
lsst.calibrate.measurement INFO: Measuring 9559 sources (3641 parents, 5918 children) 
lsst.calibrate.measurement INFO: Measuring 7481 sources (3488 parents, 3993 children) 
lsst.calibrate.measurement INFO: Measuring 16163 sources (6548 parents, 9615 children) 
lsst.calibrate.measurement INFO: Measuring 10672 sources (4578 parents, 6094 children) 
lsst.calibrate.measurement INFO: Measuring 17403 sources (6918 parents, 10485 children) 
lsst.calibrate.measurement INFO: Measuring 11883 sources (4514 parents, 7369 children) 
  • astrometry:
$ cat ${log} | grep 'lsst.calibrate.astrometry'
lsst.calibrate.astrometry INFO: Purged 4341 sources, leaving 2683 good sources
lsst.calibrate.astrometry.referenceSelector INFO: Selected 56164/56164 references
lsst.calibrate.astrometry INFO: Purged 9140 sources, leaving 2980 good sources
lsst.calibrate.astrometry.referenceSelector INFO: Selected 56436/56436 references
lsst.calibrate.astrometry INFO: Purged 9274 sources, leaving 5249 good sources
lsst.calibrate.astrometry.referenceSelector INFO: Selected 57289/57289 references
lsst.calibrate.astrometry INFO: Purged 4226 sources, leaving 3255 good sources
lsst.calibrate.astrometry INFO: Purged 8449 sources, leaving 4041 good sources
lsst.calibrate.astrometry INFO: Purged 6419 sources, leaving 3140 good sources
lsst.calibrate.astrometry.referenceSelector INFO: Selected 58895/58895 references
lsst.calibrate.astrometry.referenceSelector INFO: Selected 57573/57573 references
lsst.calibrate.astrometry.referenceSelector INFO: Selected 57642/57642 references
lsst.calibrate.astrometry INFO: Purged 11048 sources, leaving 4134 good sources
lsst.calibrate.astrometry.referenceSelector INFO: Selected 54916/54916 references
lsst.calibrate.astrometry.matcher INFO: Matched 2847 sources
lsst.calibrate.astrometry.matcher INFO: Matched 2848 sources
lsst.calibrate.astrometry.matcher INFO: Matched 2849 sources
lsst.calibrate.astrometry INFO: Matched and fit WCS in 3 iterations; found 2849 matches with on-sky distance mean and scatter = 0.027 +- 0.017 arcsec
lsst.calibrate.astrometry INFO: Purged 10451 sources, leaving 5712 good sources
lsst.calibrate.astrometry.referenceSelector INFO: Selected 57846/57846 references
lsst.calibrate.astrometry.matcher INFO: Matched 3869 sources
lsst.calibrate.astrometry.matcher INFO: Matched 3874 sources
lsst.calibrate.astrometry.matcher INFO: Matched 3875 sources
lsst.calibrate.astrometry INFO: Matched and fit WCS in 3 iterations; found 3875 matches with on-sky distance mean and scatter = 0.034 +- 0.026 arcsec
lsst.calibrate.astrometry.matcher INFO: Matched 4849 sources
lsst.calibrate.astrometry.matcher INFO: Matched 4892 sources
lsst.calibrate.astrometry.matcher INFO: Matched 4889 sources
lsst.calibrate.astrometry INFO: Matched and fit WCS in 3 iterations; found 4889 matches with on-sky distance mean and scatter = 0.036 +- 0.029 arcsec
lsst.calibrate.astrometry.matcher INFO: Matched 3718 sources
lsst.calibrate.astrometry.matcher INFO: Matched 3717 sources
lsst.calibrate.astrometry.matcher INFO: Matched 3716 sources
lsst.calibrate.astrometry INFO: Matched and fit WCS in 3 iterations; found 3716 matches with on-sky distance mean and scatter = 0.038 +- 0.030 arcsec
lsst.calibrate.astrometry INFO: Purged 6500 sources, leaving 4172 good sources
lsst.calibrate.astrometry.referenceSelector INFO: Selected 53354/53354 references
lsst.calibrate.astrometry INFO: Purged 11442 sources, leaving 5961 good sources
lsst.calibrate.astrometry.referenceSelector INFO: Selected 58251/58251 references

I have little sense of how these translate to memory requirements though.

Yes. Seems right. If you have a slurm or condor cluster you should be able to use bps to submit batch jobs.

FYI, I’ve found a solution for the large memory usage in the calibrate task when processing very dense fields. The high memory usage comes from using very many (50k+) reference catalog sources in the reference matching portion of WCS fitting in AstrometryTask. There is a configurable present in the pipelines that solves this by limiting the number of reference sources used in the reference match:

tasks:
  calibrate:
    class: lsst.pipe.tasks.calibrate.CalibrateTask 
    config: 
      astrometry.matcher.maxRefObjects: 3000 # lower memory usage of matcher: limit the WCS matcher to 3000 reference objects, starting with the brightest

I am now seeing memory usage of ~2GB instead of ~30GB previously.

I’d suggest trying to set the magLimit.minimum and magLimit.maximum values of astrometry.referenceSelector, to limit the number of reference sources, and the similar values in astrometry.sourceSelector and/or the signalToNoise parameter of that, for the science sources. That should give you more fine-grained control of which sources are included in the astrometry fit.

Thanks, yes I actually found those configurables first and started with those. I found few tasks were failing when I applied magnitude cuts, but there were no issues when I used maxRefObjects, so I stuck with that.

I’ve also used maxRefObjects to decrease peak memory usage when running processCcd on a dense (DECam) field near the Galactic center…I’m curious whether there’s any information/documentation available about how exactly the pipeline downselects by default to arrive at the retained set of maxRefObjects reference objects…for instance, does the pipeline just select the brightest (unsaturated) maxRefObjects of the available astrometric reference sources, or are there other considerations like trying to achieve/maintain a uniform spatial distribution across the CCD and/or favoring isolated sources over potentially blended sources? I suppose I can dig around in the source code to try figuring this out, but I figured I’d ask here. Thanks very much.

It returns the brightest (highest flux) objects in the reference catalog: meas_astrom/matchPessimisticB.py at main · lsst/meas_astrom · GitHub

# Find the flux cut that gives us the desired number of objects.
if len(refCat) <= self.config.maxRefObjects:
    return refCat
fluxArray = refCat.get(refFluxField)
sortedFluxArray = fluxArray[fluxArray.argsort()]
minFlux = sortedFluxArray[-(self.config.maxRefObjects + 1)]

selected = (refCat.get(refFluxField) > minFlux)

outCat = afwTable.SimpleCatalog(refCat.schema)
outCat.reserve(self.config.maxRefObjects)
outCat.extend(refCat[selected])

return outCat

I wonder if MatchPessimisticBTask can be subclassed and _filterRefCat redefined to perform a different filtering.

1 Like

If you want to do filtering on the reference or science catalogs that go into AstrometryTask, use the astrometry.referenceSelector and astrometry.sourceSelector tasks to have more control over what is selected.

For problems with high memory usage, it’s worth checking the overlap of your input images and the reference catalogs, to see if you’re loading too large of a region. We pad out the regions of the detectors, but in the past we’ve made that padding far too large, so it’s worth overplotting those if you can. I believe the pessimistic matcher scales worse than quadratically with the number of sourcse.

1 Like