Seg faults in coadd task

I am running into a frustrating issue while stacking archival Hyper Suprime-Cam images for a piece of sky I am looking at.

For a small fraction of my coadd runs I get a bad termination error: the process seg faults with exit code 139. In most cases, these stacks have four input visits, one of which covers more of the tile than the others (one visit covers pretty much the whole tile, and the other three cover smaller fractions). If I remove the larger visit from the coadd input, the run completes just fine.

I initially thought this might be a memory-related issue, but I have run plenty of deeper stacks with more input visits and larger input fractions (stacks with tens of input visits, several of which cover the full sky tile, on the same machine, with the same setup) without issue.

Is there any way to get more info out of the runs, or to run with higher verbosity or debug output, to get more of a handle on why this is happening? I have visually inspected the problem image and found nothing strange that jumps out at me, so I am running out of things to try.

Relevant log snippet:

coaddDriver.assembleCoadd.detectTemplate INFO: Detected 26588 positive peaks in 11404 footprints to 5 sigma
coaddDriver.assembleCoadd.detectTemplate INFO: Detected 26588 positive peaks in 11404 footprints to 5 sigma
coaddDriver.assembleCoadd.scaleWarpVariance INFO: Renormalizing variance by 0.988779
coaddDriver.assembleCoadd.scaleWarpVariance INFO: Renormalizing variance by 0.988779
coaddDriver.assembleCoadd.detect INFO: Detected 7686 positive peaks in 946 footprints and 3301 negative peaks in 794 footprints to 5 sigma
coaddDriver.assembleCoadd.detect INFO: Detected 7686 positive peaks in 946 footprints and 3301 negative peaks in 794 footprints to 5 sigma
coaddDriver.assembleCoadd.scaleWarpVariance INFO: Renormalizing variance by 1.119721
coaddDriver.assembleCoadd.scaleWarpVariance INFO: Renormalizing variance by 1.119721
coaddDriver.assembleCoadd.detect INFO: Detected 7124 positive peaks in 1671 footprints and 3707 negative peaks in 1122 footprints to 5 sigma
coaddDriver.assembleCoadd.detect INFO: Detected 7124 positive peaks in 1671 footprints and 3707 negative peaks in 1122 footprints to 5 sigma

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 23716 RUNNING AT ippc134
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

What version of the software are you running? The reference to coaddDriver implies to me that you are running the old gen2 version of the pipelines software.

Indeed, this is still using the gen2 version from v23, since this particular project was essentially set up using the old pipeline a few years back.

If you had a stack trace we might be able to give you some clues (and maybe point at a ticket where we fixed the problem) but at this point we are not planning to make any new v23 releases. If you get the segv in v26 using the gen3 pipelines then that’s a different story.

The old documentation might help with log levels:

https://pipelines.lsst.io/v/v23_0_0/modules/lsst.pipe.base/command-line-task-logging-howto.html

The usual way forward from here is to guess which patch it was operating on at the time and run coaddDriver.py on just that patch with --batch-type=none under gdb, and get a stack trace when (if) it segfaults. If you can’t figure out which patch it is, you could run them all under that mode, but it would run serially.
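
As a rough illustration of what that looks like in practice, here is a minimal sketch. The runnable part does not touch the LSST stack at all: it deliberately crashes a throwaway Python process and uses Python's standard `faulthandler` (enabled through the `PYTHONFAULTHANDLER` environment variable) to print a Python-level traceback before the process dies. That is a lighter-weight complement to gdb, not a replacement for it, since it shows the Python frames but not the C++ ones. The commented lines at the end are only an assumption about how the same idea would be applied to the failing patch; the elided arguments are whatever you normally pass, and `python3` being on your path is assumed.

    # Deliberately segfault a throwaway Python process (read from address 0)
    # with faulthandler enabled, capturing the traceback it writes to stderr.
    trace=$(mktemp)
    status=0
    PYTHONFAULTHANDLER=1 python3 -c 'import ctypes; ctypes.string_at(0)' 2> "$trace" || status=$?

    # A process killed by a signal is reported as 128 + signal number,
    # so a segfault (signal 11) shows up as exit status 139.
    echo "exit status: $status"
    cat "$trace"

    # Hypothetical application to the failing run (arguments elided):
    #   PYTHONFAULTHANDLER=1 coaddDriver.py ... --batch-type=none
    # and under gdb, to see the C++ frames as well:
    #   gdb --args python "$(which coaddDriver.py)" ... --batch-type=none
    #   (gdb) run
    #   (gdb) bt

The traceback file should begin with a "Fatal Python error: Segmentation fault" line followed by the Python call stack at the time of the crash, which may already be enough to tell which step of the coadd assembly was running.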

Because LSST v23 was cut in the process of ditching the Gen2 middleware, you might do better rewinding a bit further and using hscPipe 8.5.3, where the Gen2 middleware has been extensively tested. It’s been a long time since I’ve tried this, but the installation instructions are:

    wget https://tigress-web.princeton.edu/~HSC/hscPipe8/newinstall.sh
    bash newinstall.sh
    # Answer "yes" to installing Anaconda unless you really know what you're doing
    # Source the appropriate file it tells you to, and then proceed to the next step
    
    # Install this release:
    eups distrib install hscPipe 8.5.3

I certainly can’t fault that. This is a pretty old project, so if it wasn’t nearly done I would try to migrate to gen3.

Somehow I missed that page. I will see if anything pops up when running with better logging turned on. Thanks!