Debugging memory errors in the Science Pipelines

swinbank · July 19, 2019, 10:40pm

We recently spent some time tracking down a mysterious segmentation fault (hard crash) observed in Science Pipelines code. This turned out to be due to memory management issues in our C++ code: for the full story, see DM-20246. By request, here are some notes on the process used to track down the problem, in the hope that it might prove of general interest. I make no claims that this is “best practice”, but perhaps it’ll help others avoid stumbling into the same dead-ends as I did.

DM-20246 had the hallmarks of a memory issue: when the bug was triggered it had dramatic consequences, but triggering it was rather hit-or-miss — while there are instructions to reproduce on the ticket, they would lead to crash on some systems and not on others. That’s rather what you’d expect, as a lot depends on exactly how the compiler, the system libraries, and the DM code interact.
In this particular case, the problem was only seen on certain versions of macOS.

The standard approach for memory issues is to use Valgrind. Unfortunately, Valgrind isn’t well supported on macOS — we had to dig out an old version of the OS to get it running at all, and even then it ground to a halt in ISR without telling us much of interest.

While waiting for the rain to stop so I could cycle home, I had some vague recollection of a memory debugging tool being available in the LLVM compiler suite, and I wondered if that could help. After some searching, I found AddressSanitizer and decided to throw that at the problem since I had a few minutes downtime. (Worth noting that AddressSanitizer also works on GCC/Linux; it’s not limited to LLVM).

The basic trick to using AddressSanitizer is to build your code with the -fsanitize=address argument. When building the LSST stack, you’ll need to use a shared (rather than static) libasan library. On GCC, that’s the default, but on Mac/LLVM you’ll need to add the -shared-libasan argument as well.

I guessed that the error was almost certainly arising in afw, due to discussion about Footprints etc on the ticket (and because afw is where most of the hairiest code lives). I installed w_2019_28 using eups distrib, then build afw using AddressSanitizer. That means figuring out how to get the right command line arguments through the sconsUtils build system to the compiler. I did this through luck and blundering about rather than understanding what was going on — effectively, splattering them through sconsUtils until things started working. This is what I did, but it’s obviously not optimal (although it took three minutes and minimal understanding, so maybe it was an optimal use of time…).

With that change, I was able to build afw lib, but test failed because it still wasn’t getting the right arguments through the compiler & linker. Rather than argue further with sconsUtils, I built the C++ tests manually by simply copying, pasting & augmenting command lines. At that point, you can run the tests, and see the first failure: log here. That log is easy to read if you take a moment: it tells you what happened (container-overflow), where it happened (SpanSet_testFunctor_invoker()) and even where the memory being hit was allocated (SpanSet_testFunctor::test_method() test_spanSets.cc:723). The fix is fairly straightforward.

However, the steps-to-reproduce for this error involve Python code. There are two additional wrinkles:

You need to preload the address sanitizer library before python;
As soon as you import any LSST code in to Python, you’ll get a container-overflow from pybind11. (It’s beyond my ken as to whether that’s a real problem, or just because the pybind11 layer is full of magic).

You can work around these by doing something like:

$ ASAN_OPTIONS=detect_container_overflow=0:sleep_before_dying=120 DYLD_INSERT_LIBRARIES=/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/10.0.1/lib/darwin/libclang_rt.asan_osx_dynamic.dylib  processCcd.py ...

Note that on Linux you would use LD_PRELOAD rather than DYLD_INSERT_LIBRARIES. (We’ll talk about sleep_before_dying shortly).

You can now follow the instructions on the ticket to reproduce the problem. Note that even if you don’t get a segfault, AddressSanitizer will still notice the problem, tell you something is up, and cause the code to exit. A full log is here; the key parts are:

processCcd.calibrate.measurement INFO: Measuring 2745 sources (198 parents, 2547 children)
=================================================================
==33997==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x61000003bffc at pc 0x00011964a6dd bp 0x7ffee6d30970 sp 0x7ffee6d30968
WRITE of size 4 at 0x61000003bffc thread T0

    #0 0x11964a6dc in void lsst::afw::geom::SpanSet::applyFunctorImpl<void lsst::afw::geom::SpanSet::flatten<float, float, 2, 1, 1>(ndarray::Array<float, (2) - (1), 1> const&, ndarray::Array<float, 2, 1>
const&, lsst::geom::Point<int, 2> const&) const::'lambda'(lsst::geom::Point<int, 2> const&, float&, float&)&, lsst::afw::geom::details::FlatNdGetter<float, 1, 1>, lsst::afw::geom::details::ImageNdGetter<f
loat, 2, 1> >(float&&, lsst::afw::geom::details::FlatNdGetter<float, 1, 1>, lsst::afw::geom::details::ImageNdGetter<float, 2, 1>) const (/Users/jds/Projects/LSST/src/afw/lib/libafw.dylib:x86_64+0x1766dc)
    #1 0x119649acf in void lsst::afw::geom::SpanSet::applyFunctor<void lsst::afw::geom::SpanSet::flatten<float, float, 2, 1, 1>(ndarray::Array<float, (2) - (1), 1> const&, ndarray::Array<float, 2, 1> cons
t&, lsst::geom::Point<int, 2> const&) const::'lambda'(lsst::geom::Point<int, 2> const&, float&, float&)&, lsst::afw::geom::details::FlatNdGetter<float, 1, 1>, lsst::afw::geom::details::ImageNdGetter<float
, 2, 1> >(float&&, lsst::afw::geom::details::FlatNdGetter<float, 1, 1>&&, lsst::afw::geom::details::ImageNdGetter<float, 2, 1>&&) const (/Users/jds/Projects/LSST/src/afw/lib/libafw.dylib:x86_64+0x175acf)
    #2 0x119608d37 in void lsst::afw::geom::SpanSet::flatten<float, float, 2, 1, 1>(ndarray::Array<float, (2) - (1), 1> const&, ndarray::Array<float, 2, 1> const&, lsst::geom::Point<int, 2> const&) const
(/Users/jds/Projects/LSST/src/afw/lib/libafw.dylib:x86_64+0x134d37)
    #3 0x119627da3 in lsst::afw::detection::HeavyFootprint<float, int, float>::HeavyFootprint(lsst::afw::detection::Footprint const&, lsst::afw::image::MaskedImage<float, int, float> const&, lsst::afw::de
tection::HeavyFootprintCtrl const*) (/Users/jds/Projects/LSST/src/afw/lib/libafw.dylib:x86_64+0x153da3)
    #4 0x15c50557d in lsst::afw::detection::HeavyFootprint<float, int, float> lsst::afw::detection::makeHeavyFootprint<float, int, float>(lsst::afw::detection::Footprint const&, lsst::afw::image::MaskedIm
age<float, int, float> const&, lsst::afw::detection::HeavyFootprintCtrl const*) (/Users/jds/Projects/LSST/src/afw/python/lsst/afw/detection/heavyFootprint.so:x86_64+0x7d57d)

(.. elided ..)

0x61000003bffc is located 0 bytes to the right of 188-byte region [0x61000003bf40,0x61000003bffc)
allocated by thread T0 here:
    #0 0x10933ed22 in wrap__Znam (/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/10.0.1/lib/darwin/libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x68d22)
    #1 0x119645d16 in ndarray::Array<float, 1, 1> ndarray::detail::SimpleInitializer<1>::apply<ndarray::Array<float, 1, 1> >() const (/Users/jds/Projects/LSST/src/afw/lib/libafw.dylib:x86_64+0x171d16)
    #2 0x119627663 in lsst::afw::detection::HeavyFootprint<float, int, float>::HeavyFootprint(lsst::afw::detection::Footprint const&, lsst::afw::image::MaskedImage<float, int, float> const&, lsst::afw::de
tection::HeavyFootprintCtrl const*) (/Users/jds/Projects/LSST/src/afw/lib/libafw.dylib:x86_64+0x153663)
    #3 0x15c50557d in lsst::afw::detection::HeavyFootprint<float, int, float> lsst::afw::detection::makeHeavyFootprint<float, int, float>(lsst::afw::detection::Footprint const&, lsst::afw::image::MaskedIm
age<float, int, float> const&, lsst::afw::detection::HeavyFootprintCtrl const*) (/Users/jds/Projects/LSST/src/afw/python/lsst/afw/detection/heavyFootprint.so:x86_64+0x7d57d)

So in short, at some point during processCcd.calibrate.measurement, we are constructing a HeavyFootprint. The constructor allocates some memory, then overflows it.

At this point, AddressSanitizer would normally just exit. However, the sleep_before_dying=120 parameter causes it to wait for two minutes before doing so. That gives you time to check the PID of your process and then connect to it with your debugger to take a look at what’s happening:

$ lldb
(lldb) process attach --pid NNNN  # pid obtained by running `ps`.

At this point, though, I switched to debugging in Python. The error is clearly happening just after this log message in meas/base/sfm.py. My first step was just to put in some calls to print to help narrow down where the issue is. I ended up with something like this:

        for parentIdx, measParentRecord in enumerate(measParentCat):
            self.log.info(f"  Processing parent {parentIdx}")
            # first get all the children of this parent, insert footprint in
            # turn, and measure
            measChildCat = measCat.getChildren(measParentRecord.getId())
            # TODO: skip this loop if there are no plugins configured for
            # single-object mode
            for childIdx, measChildRecord in enumerate(measChildCat):
                self.log.info(f"    Measuring child {childIdx}")
                noiseReplacer.insertSource(measChildRecord.getId())
                self.log.info(f"    Calling measure functions")
                self.callMeasure(measChildRecord, exposure, beginOrder=beginOrder, endOrder=endOrder)

And rapidly discovered that this was happening with parentIdx==0, childIdx==248. Having got that far, I dropped into pdb and stepped through callMeasure one measurement plugin at a time. Eventually, I hit on the culprit. However, despite a bunch of staring at the code, it’s not obvious what the problem is.

At this point, I used writeFits methods on both foot and maskedImage to store them to disk, in the hope that I could experiment with them without the need to keep re-running processCcd.py. And I could! First, in C++:

#include <cstdint>
#include <cassert>
#include <string>
#include <typeinfo>
#include <algorithm>
#include "boost/format.hpp"
#include "lsst/pex/exceptions.h"
#include "lsst/afw/detection/Peak.h"
#include "lsst/afw/image/MaskedImage.h"
#include "lsst/afw/image/LsstImageTypes.h"
#include "lsst/afw/detection/HeavyFootprint.h"
#include "lsst/afw/detection/Footprint.h"
#include "lsst/afw/detection/FootprintCtrl.h"
#include "lsst/afw/table/io/CatalogVector.h"
#include "lsst/afw/table/io/OutputArchive.h"
#include "lsst/afw/table/io/InputArchive.h"
#include "lsst/afw/table/io/Persistable.cc"

namespace afwImage = lsst::afw::image;
namespace afwDet = lsst::afw::detection;

typedef afwImage::MaskedImage<float> MaskedImageF;

int main(void) {

  MaskedImageF mim("dodgyMaskedImage.fits");
  std::shared_ptr<afwDet::Footprint> foot = afwDet::Footprint::readFits("dodgyFoot.fits");
  ndarray::Array<float, 1, 1> _image(ndarray::allocate(ndarray::makeVector(foot->getArea())));
  foot->getSpans()->flatten(_image, mim.getImage()->getArray(), mim.getXY0());

}

(excuse the gratuitous includes). That fails, but switching the call to makeVector to use foot->getArea()+2 works. Hmm.

Then in Python:

>>> import lsst.afw.detection as afwDet
>>> afwDet.Footprint.readFits("dodgyFoot.fits") 
>>> fp.getSpans()
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

(... many lines elided ...)
>>> fp.getArea()
47

At that point, it’s just a matter of counting — there are 49 1s printed in the output of getSpans, but the computed area is 47… difference of 2 sounds familiar. Obviously, that area is wrong.

From there, it’s just a matter of reading the code to figure out how the area of a SpanSet is computed and looking for the appropriate constructor which points you right at PixelRegion, then playing about a little more in the REPL to convince yourself it’s misbehaving.

Apologies for the length of this post, and please excuse the naïvety displayed above!

ktl · July 20, 2019, 5:19am

There’s a semi-hidden bypass to get compiler/linker flags into sconsUtils: the ARCHFLAGS environment variable. Unfortunately, it seems that the devtoolset-6 installation on lsst-dev.ncsa.illinois.edu doesn’t have an appropriate libasan.so library, so ARCHFLAGS=-fsanitize=address doesn’t fully work there, but I can confirm that the argument is passed through to c++ and ld for building lib, python, and tests.

ktl · July 20, 2019, 6:23am

It appears that yum install devtoolset-6-libasan-devel might be all that is needed to fix the issue on lsst-dev.

kfindeisen · July 25, 2019, 6:47pm

I confirm that, at least on Unix, adding the following to the various SConscript files is sufficient:

from lsst.sconsUtils import env
env.Append(ARCHFLAGS=["-g", "-fsanitize=address", "-fno-omit-frame-pointer", ])

You must use ARCHFLAGS rather than CCFLAGS, or you’ll get linker errors when you try to build test.

Also, be careful about which sanitizers you run. I tried combining -fsanitize=address and -fsanitize=undefined and the compiler’s (virtual) memory usage blew up.

ktl · July 25, 2019, 6:54pm

You don’t even need to modify the SConscript files; just doing ARCHFLAGS="-g -fsanitize=address -fno-omit-frame-pointer" scons opt=3 should work.