Two failed quantities running demo for 12_0 at NERSC

I just installed 12_0 at NERSC (on Edison, since Cori is unavailable) and at SLAC, both via the conda install. The 12_0 version of the demo ran without a hitch at SLAC, but at NERSC ./bin/compare reported 2 failed quantities:

heatherk@edison09:/project/projectdirs/lsst/lsstDM/Edison/lsst_dm_stack_demo-12.0> ./bin/compare detected-sources.txt
Failed (absolute difference 1.00002e-10, relative difference 4.0431e-12 over tolerance 0) in column base_PsfFlux_fluxSigma.
Failed (absolute difference 1.00002e-10, relative difference 4.51474e-12 over tolerance 0) in column base_PsfFlux_fluxSigma.

Also of interest was a reported UserWarning:

Setting up: astrometry_net_data             Flavor: Linux64    Version: LOCAL:/global/project/projectdirs/lsst/lsstDM/Edison/lsst_dm_stack_demo-12.0/astrometry_net_data                                                                              
/project/projectdirs/lsst/lsstDM/Edison/v12_0/opt/lsst/stsci_distutils/lib/python/stsci.distutils-0.3.7-py2.7.egg/stsci/ UserWarning: Module pyfits was already imported from /project/projectdirs/lsst/lsstDM/Edison/v12_0/opt/lsst/pyfits/lib/python/pyfits-3.4-py2.7-linux-x86_64.egg/pyfits/__init__.pyc, but /global/project/projectdirs/lsst/lsstDM/Edison/v12_0/opt/lsst/pyfits/lib/python/pyfits-3.4-py2.7-linux-x86_64.egg is being added to sys.path                                

We also have w.2016.20 built from source on Edison. In that case the demo ran just fine, where the demo was the “HEAD” at the time w.2016.20 was first made available.

My first inclination is to try to redo the conda install of 12_0, but I’m open to any suggestions.
Take care,

I think these errors are spurious, and that they’re due to a combination of tight tolerances and a new architecture. We already know that demo values can vary by approximately this amount between Linux and OS X, and we have different comparison files for those two architectures to deal with that. It looks like NERSC is different enough from our usual Linux environment (maybe due to different fast math libraries or hardware?) that we just can’t use the Linux comparison values directly.
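To make the tolerance point concrete, here is a minimal Python sketch of a column comparison with absolute and relative tolerances. It is not the demo's actual ./bin/compare implementation; the function name `compare_value` and the sample flux value are hypothetical, chosen only so that the differences are of the same order as those in the report above.

```python
def compare_value(expected, actual, abs_tol=0.0, rel_tol=0.0):
    """Return (passed, abs_diff, rel_diff) for one pair of values.

    Hypothetical helper: a value passes if it is within EITHER the
    absolute or the relative tolerance.
    """
    abs_diff = abs(expected - actual)
    scale = max(abs(expected), abs(actual))
    rel_diff = abs_diff / scale if scale else 0.0
    passed = abs_diff <= abs_tol or rel_diff <= rel_tol
    return passed, abs_diff, rel_diff


# Hypothetical reference value and a result that differs by about 1e-10.
reference = 24.7333
measured = reference + 1e-10

# With a tolerance of 0, any difference at all is a failure.
strict = compare_value(reference, measured)

# With a small relative tolerance, the same pair passes.
relaxed = compare_value(reference, measured, rel_tol=1e-9)

print("strict: ", strict)
print("relaxed:", relaxed)
```

With a tolerance of 0, only bit-identical output can pass, so any change in math libraries or compiler flags that perturbs the last few digits shows up as a failure.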

I’m afraid I don’t know anything about the warning.

While I’m sure this is right in broad terms, it would be interesting to understand exactly where this discrepancy is coming from – not least so that we can ensure it won’t be a problem for other users.

Edison appears to be ordinary Intel CPUs, so it seems unlikely that this is due to the hardware.

At a guess, on Edison /project and /global/project are symlinked to the same place. Different places in the codebase refer to it directly or via calls to os.path.normpath, get slightly different answers, and that triggers the warning. It’s unlikely to be a problem.
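A quick way to check a guess like this is to compare the two paths after resolving symlinks. The sketch below builds a throwaway directory layout (the names are hypothetical stand-ins for /project and /global/project) and shows that two path strings can differ while pointing at the same location. Note that os.path.normpath only tidies the string, whereas os.path.realpath follows symlinks.

```python
import os
import tempfile


def same_location(path_a, path_b):
    """True if both paths resolve to the same place once symlinks are followed."""
    return os.path.realpath(path_a) == os.path.realpath(path_b)


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: <tmp>/project is a symlink to <tmp>/global/project.
    real_dir = os.path.join(tmp, "global", "project")
    os.makedirs(real_dir)
    link = os.path.join(tmp, "project")
    os.symlink(real_dir, link)

    # normpath does not follow symlinks, so the strings still differ.
    strings_differ = os.path.normpath(link) != os.path.normpath(real_dir)

    # realpath resolves the link, so the locations match.
    resolved_match = same_location(link, real_dir)

print("path strings differ:   ", strings_differ)
print("resolved locations match:", resolved_match)
```

On Edison, the same check could be run on the two pyfits egg paths from the warning to see whether they resolve to one directory.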

@jbosch @swinbank Thank you both for your responses. In truth we are just waiting for Cori to come back up, and our production work will be on Cori. My inclination is to install both the conda distribution and another built from source. I’m wondering if I can coax the Twinkles folks into doing some runs using both and checking for any discrepancies, though it’s not likely to uncover anything significant and of course time and resources are in short supply. I’ll inquire on this end.