Lower memory usage for coadd creation

Hello. I am trying to create a coadd from ~52987 calexp datasets produced from DECam imaging, so 2k x 4k CCDs. The calexp FITS files generally take up ~53 MB each on disk. I keep encountering out-of-memory errors when running the tasks from ApTemplate.yaml. Presumably the lsst.pipe.tasks.assembleCoadd.CompareWarpAssembleCoaddTask task is the most memory-intensive. I am using a machine with 220 GB of RAM available; however, I notice when watching tasks execute that one process will eventually take up ~90% of the machine's RAM before my Slurm job is killed and an out-of-memory notice is issued.

Are there any parameters I can tweak to lower the memory usage?

Thanks,
Steven

It’s not entirely clear to me what you’re doing (e.g., how many of the 53k calexps go into each band+patch), nor that the coaddition step is solely responsible for the memory blowout (e.g., the deblender has in the past been notorious for being memory hungry), but I have noticed that the default for the subregionSize parameter is 2000x2000, while obs_subaru overrides it to 10000x200 [1], specifically to keep the memory footprint down on big stacks.

An HSC UltraDeep Cosmos stack I made recently (admittedly with the Gen2 middleware so there are slight differences) has a region with something approaching 200 overlapping visits in i-band. This ran on 20 cores on a single node with 192 GB of memory. It wasn’t fast, but it succeeded with no memory blowouts.


[1] Our patches are typically 4k square, so this ends up effectively being ~ 4000x200. We’re reading full rows, which is a bit more efficient than square regions with FITS.
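
For concreteness, an override along these lines is what I have in mind (just a sketch; wire it in however you normally apply per-task config overrides, e.g. a config file passed to pipetask):

# assembleCoadd config override (Python config file), sketch only.
# subregionSize is (width, height) in pixels of the chunk of the warp stack
# held in memory at once; a smaller height means a smaller footprint.
config.subregionSize = (10000, 200)
config.assembleStaticSkyModel.subregionSize = (10000, 200)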

Thanks, I can try tweaking that config, and also running pipetask run with -j 1 so that I can see from the logs exactly which task is running when the process runs away with memory.

For extra detail, here is a map of the skymap and the typical locations of the calexp CCDs as they overlap with the patches and with each other. COSMOS-{1,2,3} are the names of the on-sky targets for these DECam fields. This is the skymap that was generated with butler make-discrete-skymap. The patches are 4k by 4k.
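
(In case it is useful, this is roughly how I check the patch geometry from the registered skymap; the repo path and the "skymaps" collection name are placeholders for my setup:)

# Quick check of the patch size for the 'discrete' skymap; names are placeholders.
from lsst.daf.butler import Butler

butler = Butler("repo", collections="skymaps")
skymap = butler.get("skyMap", skymap="discrete")
patch_info = skymap[0].getPatchInfo(48)  # tract 0, patch 48
print(patch_info.getInnerBBox().getDimensions())  # inner patch dimensions, ~4k x 4k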

The per-target and per-band number of visits I am trying to coadd are:

{'cosmos_1': {'g': 115, 'r': 102, 'i': 103},
 'cosmos_2': {'g': 101, 'r': 88, 'i': 86},
 'cosmos_3': {'g': 93, 'r': 77, 'i': 79}}

It looks like the maximum number of calexps in a single patch where the targets do not overlap is ~4. So in one of those patches, the per-band number of (partial) calexps that go into that patch is at most 4*115 = 460. There are patches in the middle where all targets overlap with each other somewhat. Looking at the patch just to the left of RA=150 and in the middle at DEC=2, the number of partial calexps is 2 * cosmos_1 + 4 * cosmos_2 + 2 * cosmos_3, so the per-band number of visits is something like

{'g': 820, 'r': 710, 'i': 708}
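
(Spelling that arithmetic out, with the overlap multiplicities I read off the skymap plot:)

# Per-band count of partial calexps overlapping that central patch, using the
# per-target visit counts above and the overlap multiplicities I read off the
# skymap plot (2x cosmos_1, 4x cosmos_2, 2x cosmos_3).
visits = {'cosmos_1': {'g': 115, 'r': 102, 'i': 103},
          'cosmos_2': {'g': 101, 'r': 88, 'i': 86},
          'cosmos_3': {'g': 93, 'r': 77, 'i': 79}}
overlaps = {'cosmos_1': 2, 'cosmos_2': 4, 'cosmos_3': 2}
per_band = {band: sum(overlaps[t] * visits[t][band] for t in visits)
            for band in ('g', 'r', 'i')}
print(per_band)  # {'g': 820, 'r': 710, 'i': 708}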

Perhaps a plot of the number of calexps overlapping each patch would be helpful… It looks like I am trying to coadd several hundred partial calexps per patch in each band, though; perhaps that is too many!

Thanks for the help.

I think you’re confusing detectors and exposures [1]. You may have several hundred detectors overlapping a single patch, but per band you only have at most about 300 exposures. Coadds are constructed per band from warps, and warps are constructed per exposure. So what matters for coaddition is the number of exposures that overlap a patch, not the number of detectors.

I think you’ll be fine once you modify the subregionSize.


[1] And visits. In the Gen2 world I’m used to, “visit” just means an “exposure”. I believe that’s different in Gen3, but in this case my guess is that it comes out to the same thing.
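
If you want to check the numbers directly, something like this (a rough sketch; the repo path and collection name are placeholders) counts detector-level calexps versus distinct exposures overlapping one patch in one band via the Gen3 registry:

# Sketch: detector-level calexps vs. distinct exposures for one band+patch.
# Repo path and collection name are placeholders.
from lsst.daf.butler import Butler

butler = Butler("repo")
dataIds = set(butler.registry.queryDataIds(
    ["visit", "detector"],
    datasets="calexp",
    collections="DECam/runs/cosmos",
    where="skymap = 'discrete' AND tract = 0 AND patch = 48 AND band = 'g'",
))
print("partial calexps (detectors):", len(dataIds))
print("exposures (visits):", len({d["visit"] for d in dataIds}))

The second number is the one that matters for the size of the warp stack.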

Thanks, adjusting subregionSize seemed to do the trick. I was able to run the assembleCoadd task in ApTemplate.yaml by adding the config like so:

description: Coaddition
instrument: lsst.obs.decam.DarkEnergyCamera
imports:
  - location: $AP_PIPE_DIR/pipelines/ApTemplate.yaml
tasks:
  assembleCoadd:
    class: lsst.pipe.tasks.assembleCoadd.CompareWarpAssembleCoaddTask
    config:
      doSelectVisits: True
      doNImage: True
      assembleStaticSkyModel.doSelectVisits: True
      connections.outputCoaddName: parameters.coaddName
      # TODO: redundant connection definitions workaround for DM-30210
      connections.selectedVisits: parameters.selectedVisits
      connections.coaddExposure: parameters.template
      # TODO: end DM-30210 workaround
      subregionSize: (10000, 200)
      assembleStaticSkyModel.subregionSize: (10000, 200)

I was too quick to conclude this was working… it seems like there are still a few tasks failing and causing an out-of-memory error. Actually, I am not sure I understand the behavior of the stack given how I am seeing tasks fail/succeed.

To preface, when I say out-of-memory error, I mean the Slurm scheduler issues me a message like:

slurmstepd: error: Detected 624407 oom-kill event(s) in StepId=2658029.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: n2244: task 0: Out Of Memory

followed by me being kicked off of the node. When this happens, I see one or more core.<n> files created on disk – I think these contain (some of?) the memory contents of the process for debugging, and <n> is likely the process ID. I’ll call these “explicit” out-of-memory events.

I have also seen the core files being created when no message is issued to me from Slurm (or I somehow missed it) and I am not kicked off of my node. These I’ll call “inferred” out-of-memory events.

On my first run of ApTemplate.yaml, which contains four tasks:

  • lsst.pipe.tasks.postprocess.ConsolidateVisitSummaryTask
  • lsst.pipe.tasks.selectImages.BestSeeingQuantileSelectVisitsTask
  • lsst.pipe.tasks.makeCoaddTempExp.MakeWarpTask
  • lsst.pipe.tasks.assembleCoadd.CompareWarpAssembleCoaddTask

32/32918 quanta for these four tasks failed. From the logs, it seems these were mostly MakeWarpTask quanta:

lsst.ctrl.mpexec.mpGraphExecutor ERROR: Task <TaskDef(MakeWarpTask, label=makeWarp) dataId={instrument: 'DECam', skymap: 'discrete', tract: 0, patch: 48, visit: 984937, ...}> failed; processing will continue for remaining tasks.
lsst.ctrl.mpexec.mpGraphExecutor ERROR: Upstream job failed for task <TaskDef(CompareWarpAssembleCoaddTask, label=assembleCoadd) dataId={band: 'g', skymap: 'discrete', tract: 0, patch: 48}>, skipping this task.

I’m not sure if I saw core files for this run.

I then re-ran the pipeline with --skip-existing --extend-run --clobber-outputs to extend the run, skip existing completed quanta, and retry failed tasks. On the second run, 15/32 quanta were reported completed (actually 19/32, based on how many quanta needed to be completed in subsequent runs – there are 13 left) before the processes mysteriously stopped running. There are no errors in the logs. The last log message is:

lsst.ctrl.mpexec.singleQuantumExecutor INFO: Execution of task 'assembleCoadd' on quantum {band: 'r', skymap: 'discrete', tract: 0, patch: 65} took 2105.945 seconds

I saw some core files for this run but was not kicked off, i.e. an “inferred” out-of-memory event.

I re-ran again, thinking that since a few tasks were succeeding each time, I could just iteratively re-start and it would eventually succeed. This time, there are no reported successful tasks in the logs, but I can infer that 3/13 succeeded (there are 10 left) based on subsequent runs, and the processing stopped as before without errors. The last line in the logs is:

lsst.ctrl.mpexec.singleQuantumExecutor INFO: Execution of task 'assembleCoadd' on quantum {band: 'g', skymap: 'discrete', tract: 0, patch: 66} took 1980.689 seconds

I saw core files for this run but was not kicked off, i.e. an “inferred” out-of-memory event.

I re-ran again, this time with -j 1 to spawn just one process, thinking that I was hitting out-of-memory errors again and that having just one process requiring memory would fix it. This time 6/10 quanta completed before processing stopped. The last lines of the log are:

lsst.ctrl.mpexec.mpGraphExecutor INFO: Executed 6 quanta successfully, 0 failed and 4 remain out of total 10 quanta.
lsst.makeWarp.select INFO: Selecting calexp {instrument: 'DECam', detector: 31, visit: 984937, ...}
lsst.makeWarp.select INFO: Selecting calexp {instrument: 'DECam', detector: 38, visit: 984937, ...}
lsst.makeWarp INFO: Processing calexp 1 of 2 for this Warp: id={instrument: 'DECam', detector: 31, visit: 984937, ...}
lsst.makeWarp.warpAndPsfMatch.psfMatch INFO: compute Psf-matching kernel
py.warnings WARNING: /gscratch/astro/stevengs/lsst_stacks/stacks/w.2022.06/stack/miniconda3-py38_4.9.2-1.0.0/Linux64/ip_diffim/g5706f010af+0869991ead/python/lsst/ip/diffim/modelPsfMatch.py:419: FutureWarning: Default position argument overload is deprecated and will be removed in version 24.0.  Please explicitly specify a position.
  dimenR = referencePsfModel.getLocalKernel().getDimensions()

lsst.makeWarp.warpAndPsfMatch.psfMatch INFO: Adjusted dimensions of reference PSF model from (9, 9) to (12373, 12373)
lsst.ip.diffim.generateAlardLuptonBasisList INFO: PSF sigmas are not available or scaling by fwhm disabled, falling back to config values

I saw core files for this run but was not kicked off, i.e. an “inferred” out-of-memory event.

Finally, I ran again with -j 1 and found that no quanta completed: the very first task failed with an “explicit” out-of-memory error after I saw the memory usage spike above 220 GB. The only lines in the logs are:

lsst.ctrl.mpexec.cmdLineFwk INFO: QuantumGraph contains 4 quanta for 4 tasks, graph ID: '1647645520.2195563-14080'
conda.common.io INFO: overtaking stderr and stdout
conda.common.io INFO: stderr and stdout yielding back
lsst.makeWarp.select INFO: Selecting calexp {instrument: 'DECam', detector: 31, visit: 984937, ...}
lsst.makeWarp.select INFO: Selecting calexp {instrument: 'DECam', detector: 38, visit: 984937, ...}
lsst.makeWarp INFO: Processing calexp 1 of 2 for this Warp: id={instrument: 'DECam', detector: 31, visit: 984937, ...}
lsst.makeWarp.warpAndPsfMatch.psfMatch INFO: compute Psf-matching kernel
py.warnings WARNING: /gscratch/astro/stevengs/lsst_stacks/stacks/w.2022.06/stack/miniconda3-py38_4.9.2-1.0.0/Linux64/ip_diffim/g5706f010af+0869991ead/python/lsst/ip/diffim/modelPsfMatch.py:419: FutureWarning: Default position argument overload is deprecated and will be removed in version 24.0.  Please explicitly specify a position.
  dimenR = referencePsfModel.getLocalKernel().getDimensions()

lsst.makeWarp.warpAndPsfMatch.psfMatch INFO: Adjusted dimensions of reference PSF model from (9, 9) to (12373, 12373)
lsst.ip.diffim.generateAlardLuptonBasisList INFO: PSF sigmas are not available or scaling by fwhm disabled, falling back to config values

Long story short: it seems to me that there are just a few tasks with a very high memory footprint. I include all of these details just to offer more information about what these failures/out-of-memory events look like and why they are mysterious to me. For example, why do some tasks fail on the first run but succeed on subsequent runs? And why are some task failures explicit (they show up with an error message in the log) while other times the processing fails quietly?

And finally the question: can I tune subregionSize to a different (smaller?) value to further decrease the memory footprint?

Thanks,
Steven

This looks incredibly suspicious. @yusra, do you understand this?

Based on the detectors being used to make the warp (31 and 38 from visit 984937), I think this may be from a data quality issue: detector 31 of DECam produces some bad data across the CCD. However, most of the pipeline processing has worked okay – masking out this bad area – in the past, so I haven’t included any selections to exclude detector 31.

This is a plot of the 2 calexps used to produce the warp.

Here are the same images made with afw.display to show the data masks:

And the mask colors are:
BAD: red
CR: magenta
CROSSTALK: None
DETECTED: blue
DETECTED_NEGATIVE: cyan
EDGE: yellow
INTRP: green
NOT_DEBLENDED: None
NO_DATA: orange
SAT: green
SUSPECT: yellow
UNMASKEDNAN: None
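
(For reference, the overlays were made roughly like this; the repo path and collection name are placeholders for my setup:)

# Rough sketch of how the mask overlays above were made with afw.display.
import lsst.afw.display as afwDisplay
from lsst.daf.butler import Butler

afwDisplay.setDefaultBackend("matplotlib")
butler = Butler("repo", collections="DECam/runs/cosmos")

for frame, detector in enumerate((31, 38), start=1):
    calexp = butler.get("calexp", instrument="DECam", visit=984937, detector=detector)
    display = afwDisplay.Display(frame=frame)
    display.scale("asinh", "zscale")
    display.mtv(calexp)  # mask planes are drawn in the colors listed above
    for maskName in calexp.mask.getMaskPlaneDict():
        print(maskName, display.getMaskPlaneColor(maskName))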

If that’s in every epoch, then maybe that’s why warpCompare is blowing up. But regardless of it being the cause of the memory issue, there’s no way that including that data in the coadd is going to give you a good result (maybe it manages to mask all the awful parts, but the PSF model won’t be good). You either need to fix it or mask it or drop the CCD altogether.
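
If you go the “drop it” route, one option (a sketch only; how you fold it into your processing is up to you) is to exclude that detector with a data-ID constraint like detector != 31 when building your quantum graph, e.g. via pipetask’s -d/--data-query option. You can sanity-check what the expression selects against the registry first:

# Sketch only: confirm that "detector != 31" keeps the other DECam detectors.
# Repo path and collection name are placeholders.
from lsst.daf.butler import Butler

butler = Butler("repo")
detectors = sorted({
    dataId["detector"]
    for dataId in butler.registry.queryDataIds(
        ["detector"],
        datasets="calexp",
        collections="DECam/runs/cosmos",
        where="instrument = 'DECam' AND detector != 31",
    )
})
print(detectors)  # detector 31 should be absent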

Thanks, yeah, I think I’ll end up dropping the CCD from processing if the applied masks are not allowing processing to work, and since it won’t produce a good result anyway. I think this feature is present across visits.