PSFex fails in singleFrameDriver.py for some CCDs

Hi,
When I run singleFrameDriver.py, some of the CCDs are not processed. This is the error it gives me:

9583 FATAL 2018-02-19T12:15:37.795+1100 singleFrameDriver: Failed on dataId={'taiObs': '2014-09-24', 'pointing': 997, 'visit': 8000, 'dateObs': '2014-09-24', 'filter': 'HSC-I', 'field': 'ABELL0119', 'ccd': 101, 'expTime': 240.0}: 
  File "src/PsfexPsf.cc", line 216, in virtual std::shared_ptr<lsst::afw::image::Image<double> > lsst::meas::extensions::psfex::PsfexPsf::_doComputeImage(const Point2D&, const lsst::afw::image::Color&, const Point2D&) const
    Only spatial variation (ndim == 2) is supported; saw 0 {0}
lsst::pex::exceptions::InvalidParameterError: 'Only spatial variation (ndim == 2) is supported; saw 0'

Is there a way to still get the processed image even if PSFex fails?

Thanks!

It’s rare for i-band PSF estimation to fail on a CCD that doesn’t have dead amplifiers. Have you looked at the image to see if it has some obvious problems, e.g., lots of extinction?

If all you want is a “processed image” rather than the “calibrated image” that is the usual product of singleFrameDriver.py, then the following configuration overrides might help: processCcd.doCalibrate=False processCcd.charImage.doMeasurePsf=False processCcd.charImage.doWriteExposure=True.
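For instance, something along these lines (untested; the rerun and job names, data id, and batch sizes are just placeholders to adjust for your setup):

$ singleFrameDriver.py DATA --calib CALIB --rerun myRerun --id visit=8000 --batch-type=slurm --nodes 1 --procs 12 --time 600 --job psfTest --config processCcd.doCalibrate=False processCcd.charImage.doMeasurePsf=False processCcd.charImage.doWriteExposure=True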

I thought I had fixed that issue. What version of meas_extensions_psfex are you running? It’s also possible that those fixes weren’t merged; I made them while dealing with undersampling problems, and we may have left some of the other improvements behind.

These kinds of errors regularly appear for ccd=9 (even for the latest meas_extensions_psfex), which has two bad amps, and sometimes on other CCDs with bad amps. But it’s rare for them to appear on good CCDs unless the images are bad (e.g., not many stars present at all).

I’m getting the same error for all my runs of HSC data. Is it possible to configure things so that the offending CCDs are simply bypassed, instead of killing the entire process?

BTW, here’s my version: meas_extensions_psfex 14.0-2-ge3897b5+5 from the current w_2018_09 setup.

Whether you’re running singleFrameDriver.py or processCcd.py, the failures shouldn’t be fatal unless you have --doraise on your command line, in which case you’re asking for them to be fatal. What are you seeing?
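To illustrate (the rerun name and ccd range below are just placeholders):

$ processCcd.py DATA --rerun test --id visit=8000 ccd=0..103             # per-CCD failures are logged as FATAL; the remaining CCDs are still processed
$ processCcd.py DATA --rerun test --id visit=8000 ccd=0..103 --doraise   # the first exception propagates and aborts the run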

@sophiereed pinged me the other day, saying she was seeing the same error several times while processing HSC Cosmos data. We worked out that most of the cases were for ccd=9, and the rest were confined to one or two visits that appear to have been taken through many magnitudes of cloud.

Here’s the last 70 lines of the log (on lsst-dev at /scratch/jcarlin/madcash/IC1613-sfd.o115989):

     line 405, in selectStars
    raise RuntimeError("No objects passed our cuts for consideration as psf stars")
RuntimeError: No objects passed our cuts for consideration as psf stars
Traceback (most recent call last):
  File "/software/lsstsw/stack3_20171023/stack/miniconda3-4.3.21-10a4fa6/Linux64/pipe_base/14.0-6-ge2c9487+48/python/lsst/pipe/base/cmdLineTask.py", line 408, in __call__
    result = task.run(dataRef, **kwargs)
  File "/software/lsstsw/stack3_20171023/stack/miniconda3-4.3.21-10a4fa6/Linux64/pipe_drivers/6.0b0-hsc-20-g3029749+4/python/lsst/pipe/drivers/singleFrameDriver.py", line 72, in run
    return self.processCcd.run(sensorRef)
  File "/software/lsstsw/stack3_20171023/stack/miniconda3-4.3.21-10a4fa6/Linux64/pipe_base/14.0-6-ge2c9487+48/python/lsst/pipe/base/timer.py", line 150, in wrapper
    res = func(self, *args, **keyArgs)
  File "/software/lsstsw/stack3_20171023/stack/miniconda3-4.3.21-10a4fa6/Linux64/pipe_tasks/14.0-52-g00a5fa4c/python/lsst/pipe/tasks/processCcd.py", line 189, in run
    doUnpersist=False,
  File "/software/lsstsw/stack3_20171023/stack/miniconda3-4.3.21-10a4fa6/Linux64/pipe_base/14.0-6-ge2c9487+48/python/lsst/pipe/base/timer.py", line 150, in wrapper
    res = func(self, *args, **keyArgs)
  File "/software/lsstsw/stack3_20171023/stack/miniconda3-4.3.21-10a4fa6/Linux64/pipe_tasks/14.0-52-g00a5fa4c/python/lsst/pipe/tasks/characterizeImage.py", line 351, in run
    background = background,
  File "/software/lsstsw/stack3_20171023/stack/miniconda3-4.3.21-10a4fa6/Linux64/pipe_base/14.0-6-ge2c9487+48/python/lsst/pipe/base/timer.py", line 150, in wrapper
    res = func(self, *args, **keyArgs)
  File "/software/lsstsw/stack3_20171023/stack/miniconda3-4.3.21-10a4fa6/Linux64/pipe_tasks/14.0-52-g00a5fa4c/python/lsst/pipe/tasks/characterizeImage.py", line 430, in characterize
    apCorrMap = self.measureApCorr.run(exposure=dmeRes.exposure, catalog=dmeRes.sourceCat).apCorrMap
  File "/software/lsstsw/stack3_20171023/stack/miniconda3-4.3.21-10a4fa6/Linux64/meas_algorithms/14.0-18-gf7dca964+3/python/lsst/meas/algorithms/measureApCorr.py", line 252, in run
    (name, len(subset2), self.config.minDegreesOfFreedom+1))
RuntimeError: Unable to measure aperture correction for required algorithm 'modelfit_CModel': only 0 sources, but require at least 2.
Traceback (most recent call last):
  File "/software/lsstsw/stack3_20171023/stack/miniconda3-4.3.21-10a4fa6/Linux64/pipe_base/14.0-6-ge2c9487+48/python/lsst/pipe/base/cmdLineTask.py", line 408, in __call__
    result = task.run(dataRef, **kwargs)
  File "/software/lsstsw/stack3_20171023/stack/miniconda3-4.3.21-10a4fa6/Linux64/pipe_drivers/6.0b0-hsc-20-g3029749+4/python/lsst/pipe/drivers/singleFrameDriver.py", line 72, in run
    return self.processCcd.run(sensorRef)
  File "/software/lsstsw/stack3_20171023/stack/miniconda3-4.3.21-10a4fa6/Linux64/pipe_base/14.0-6-ge2c9487+48/python/lsst/pipe/base/timer.py", line 150, in wrapper
    res = func(self, *args, **keyArgs)
  File "/software/lsstsw/stack3_20171023/stack/miniconda3-4.3.21-10a4fa6/Linux64/pipe_tasks/14.0-52-g00a5fa4c/python/lsst/pipe/tasks/processCcd.py", line 189, in run
    doUnpersist=False,
  File "/software/lsstsw/stack3_20171023/stack/miniconda3-4.3.21-10a4fa6/Linux64/pipe_base/14.0-6-ge2c9487+48/python/lsst/pipe/base/timer.py", line 150, in wrapper
    res = func(self, *args, **keyArgs)
  File "/software/lsstsw/stack3_20171023/stack/miniconda3-4.3.21-10a4fa6/Linux64/pipe_tasks/14.0-52-g00a5fa4c/python/lsst/pipe/tasks/characterizeImage.py", line 351, in run
    background = background,
  File "/software/lsstsw/stack3_20171023/stack/miniconda3-4.3.21-10a4fa6/Linux64/pipe_base/14.0-6-ge2c9487+48/python/lsst/pipe/base/timer.py", line 150, in wrapper
    res = func(self, *args, **keyArgs)
  File "/software/lsstsw/stack3_20171023/stack/miniconda3-4.3.21-10a4fa6/Linux64/pipe_tasks/14.0-52-g00a5fa4c/python/lsst/pipe/tasks/characterizeImage.py", line 413, in characterize
    psfSigma = psf.computeShape().getDeterminantRadius()
lsst.pex.exceptions.wrappers.InvalidParameterError: 
  File "src/PsfexPsf.cc", line 221, in virtual std::shared_ptr<lsst::afw::image::Image<double> > lsst::meas::extensions::psfex::PsfexPsf::_doComputeImage(const Point2D&, const lsst::afw::image::Color&, const Point2D&) const
    Only spatial variation (ndim == 2) is supported; saw 0 {0}
lsst::pex::exceptions::InvalidParameterError: 'Only spatial variation (ndim == 2) is supported; saw 0'

Traceback (most recent call last):
  File "/software/lsstsw/stack3_20171023/stack/miniconda3-4.3.21-10a4fa6/Linux64/pipe_base/14.0-6-ge2c9487+48/python/lsst/pipe/base/cmdLineTask.py", line 408, in __call__
    result = task.run(dataRef, **kwargs)
  File "/software/lsstsw/stack3_20171023/stack/miniconda3-4.3.21-10a4fa6/Linux64/pipe_drivers/6.0b0-hsc-20-g3029749+4/python/lsst/pipe/drivers/singleFrameDriver.py", line 72, in run
    return self.processCcd.run(sensorRef)
  File "/software/lsstsw/stack3_20171023/stack/miniconda3-4.3.21-10a4fa6/Linux64/pipe_base/14.0-6-ge2c9487+48/python/lsst/pipe/base/timer.py", line 150, in wrapper
    res = func(self, *args, **keyArgs)
  File "/software/lsstsw/stack3_20171023/stack/miniconda3-4.3.21-10a4fa6/Linux64/pipe_tasks/14.0-52-g00a5fa4c/python/lsst/pipe/tasks/processCcd.py", line 189, in run
    doUnpersist=False,
  File "/software/lsstsw/stack3_20171023/stack/miniconda3-4.3.21-10a4fa6/Linux64/pipe_base/14.0-6-ge2c9487+48/python/lsst/pipe/base/timer.py", line 150, in wrapper
    res = func(self, *args, **keyArgs)
  File "/software/lsstsw/stack3_20171023/stack/miniconda3-4.3.21-10a4fa6/Linux64/pipe_tasks/14.0-52-g00a5fa4c/python/lsst/pipe/tasks/characterizeImage.py", line 351, in run
    background = background,
  File "/software/lsstsw/stack3_20171023/stack/miniconda3-4.3.21-10a4fa6/Linux64/pipe_base/14.0-6-ge2c9487+48/python/lsst/pipe/base/timer.py", line 150, in wrapper
    res = func(self, *args, **keyArgs)
  File "/software/lsstsw/stack3_20171023/stack/miniconda3-4.3.21-10a4fa6/Linux64/pipe_tasks/14.0-52-g00a5fa4c/python/lsst/pipe/tasks/characterizeImage.py", line 413, in characterize
    psfSigma = psf.computeShape().getDeterminantRadius()
lsst.pex.exceptions.wrappers.InvalidParameterError: 
  File "src/PsfexPsf.cc", line 221, in virtual std::shared_ptr<lsst::afw::image::Image<double> > lsst::meas::extensions::psfex::PsfexPsf::_doComputeImage(const Point2D&, const lsst::afw::image::Color&, const Point2D&) const
    Only spatial variation (ndim == 2) is supported; saw 0 {0}
lsst::pex::exceptions::InvalidParameterError: 'Only spatial variation (ndim == 2) is supported; saw 0'

I ran it using:

$ singleFrameDriver.py DATA --calib /datasets/hsc/calib/20180117 --rerun 20180305/IC1613 --id field=IC1613_CEN^IC1613_NE^IC1613_NW^IC1613_W --batch-type=slurm --mpiexec='-bind-to socket' --nodes 4 --procs 48 --time 600 --job IC1613-sfd

Most of the failures are for ccd=9, but there are some failures for others. Do you think you could eyeball the raw images for these to see if there’s anything strange, like if they’re “contaminated” by some nasty nearby galaxy?

pprice@lsst-dev01:~ $ grep -e "Failed.* 'ccd': 9," /scratch/jcarlin/madcash/IC1613-sfd.o115989 | wc -l
32
pprice@lsst-dev01:~ $ grep "Failed" /scratch/jcarlin/madcash/IC1613-sfd.o115989 | grep -v "'ccd': 9,"
79662 FATAL 2018-03-06T14:05:16.584-0600 singleFrameDriver: Failed on dataId={'field': 'IC1613_NW', 'dateObs': '2016-11-28', 'pointing': 1793, 'filter': 'HSC-G', 'visit': 91368, 'ccd': 101, 'taiObs': '2016-11-28', 'expTime': 30.0}: InvalidParameterError: 
67435 FATAL 2018-03-06T14:10:16.778-0600 singleFrameDriver: Failed on dataId={'field': 'IC1613_NW', 'dateObs': '2016-11-28', 'pointing': 1793, 'filter': 'HSC-I2', 'visit': 91404, 'ccd': 101, 'taiObs': '2016-11-28', 'expTime': 300.0}: InvalidParameterError: 
67438 FATAL 2018-03-06T14:43:03.014-0600 singleFrameDriver: Failed on dataId={'field': 'IC1613_W', 'dateObs': '2016-11-28', 'pointing': 1793, 'filter': 'HSC-G', 'visit': 91326, 'ccd': 25, 'taiObs': '2016-11-28', 'expTime': 30.0}: InvalidParameterError: 
67423 FATAL 2018-03-06T14:45:57.734-0600 singleFrameDriver: Failed on dataId={'field': 'IC1613_W', 'dateObs': '2016-11-28', 'pointing': 1793, 'filter': 'HSC-G', 'visit': 91330, 'ccd': 94, 'taiObs': '2016-11-28', 'expTime': 300.0}: InvalidParameterError: 
67430 FATAL 2018-03-06T14:56:09.809-0600 singleFrameDriver: Failed on dataId={'field': 'IC1613_W', 'dateObs': '2016-11-28', 'pointing': 1793, 'filter': 'HSC-I2', 'visit': 91390, 'ccd': 94, 'taiObs': '2016-11-28', 'expTime': 300.0}: InvalidParameterError: 
183841 FATAL 2018-03-06T14:57:22.400-0600 singleFrameDriver: Failed on dataId={'field': 'IC1613_W', 'dateObs': '2016-11-28', 'pointing': 1793, 'filter': 'HSC-I2', 'visit': 91382, 'ccd': 54, 'taiObs': '2016-11-28', 'expTime': 300.0}: RuntimeError: No matches to use for photocal
51125 FATAL 2018-03-06T15:10:20.578-0600 singleFrameDriver: Failed on dataId={'field': 'IC1613_W', 'dateObs': '2016-11-28', 'pointing': 1793, 'filter': 'HSC-G', 'visit': 91340, 'ccd': 4, 'taiObs': '2016-11-28', 'expTime': 300.0}: InvalidParameterError: 
51117 FATAL 2018-03-06T15:10:47.217-0600 singleFrameDriver: Failed on dataId={'field': 'IC1613_W', 'dateObs': '2016-11-28', 'pointing': 1793, 'filter': 'HSC-I2', 'visit': 91376, 'ccd': 94, 'taiObs': '2016-11-28', 'expTime': 300.0}: RuntimeError: No objects passed our cuts for consideration as psf stars
183865 FATAL 2018-03-06T15:19:16.504-0600 singleFrameDriver: Failed on dataId={'field': 'IC1613_W', 'dateObs': '2016-11-28', 'pointing': 1793, 'filter': 'HSC-G', 'visit': 91344, 'ccd': 94, 'taiObs': '2016-11-28', 'expTime': 300.0}: InvalidParameterError: 
183853 FATAL 2018-03-06T15:21:30.577-0600 singleFrameDriver: Failed on dataId={'field': 'IC1613_W', 'dateObs': '2016-11-28', 'pointing': 1793, 'filter': 'HSC-I2', 'visit': 91390, 'ccd': 91, 'taiObs': '2016-11-28', 'expTime': 300.0}: RuntimeError: No matches to use for photocal

Sure - I’ll take a look at a few images.

Am I correct in surmising that the process died before completing all the IC1613 data? Or am I misinterpreting the logs?

No, there’s no sign that the process itself died. If it had died, there would be some MPI stuff somewhere.
In fact, there’s a clear sign that it finished:

root ERROR: 42 dataRefs failed; not exiting as --noExit was set

The pileup of failure messages at the end of the log is because python 3’s stderr isn’t line-buffered when it’s redirected to a file, so the tracebacks only get flushed at the end. I’ll put in a workaround.
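(In the meantime, if the interleaving bothers you, one untested thing to try is forcing unbuffered python output when you submit; this assumes the environment propagates through Slurm:)

$ export PYTHONUNBUFFERED=1    # make python flush stdout/stderr immediately
$ singleFrameDriver.py DATA --calib /datasets/hsc/calib/20180117 --rerun 20180305/IC1613 --id field=IC1613_CEN^IC1613_NE^IC1613_NW^IC1613_W --batch-type=slurm --mpiexec='-bind-to socket' --nodes 4 --procs 48 --time 600 --job IC1613-sfd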

I checked out a few images. Some look “fine,” one had an extremely bright star in it, and one looked rather shallow with a fuzzy PSF (perhaps there were clouds? I can’t find the logs to check). So in some cases the “fails” make sense, and in others not so much…

If it’s processing all but this handful of failures, that should be OK in the end. The coadds won’t suffer much…

So for N2403-sfd.o114909, does the following imply that it did die without completing?

128966 INFO  2018-03-06T06:35:58.001-0600 singleFrameDriver.processCcd.charImage.applyApCorr: Applying aperture corrections to 23 flux fields
128993 INFO  2018-03-06T06:35:58.053-0600 singleFrameDriver.processCcd.calibrate.astrometry.matcher: Matched 260 sources
106481 INFO  2018-03-06T06:35:58.090-0600 singleFrameDriver.processCcd.calibrate.astrometry.matcher: Matched 162 sources

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 47780 RUNNING AT lsst-verify-worker09
=   EXIT CODE: 9
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:1@lsst-verify-worker30] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:887): assert (!closed) failed
[proxy:0:1@lsst-verify-worker30] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:1@lsst-verify-worker30] main (pm/pmiserv/pmip.c:202): demux engine error waiting for event
[proxy:0:3@lsst-verify-worker43] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:887): assert (!closed) failed
[proxy:0:3@lsst-verify-worker43] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:3@lsst-verify-worker43] main (pm/pmiserv/pmip.c:202): demux engine error waiting for event
[proxy:0:2@lsst-verify-worker41] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:887): assert (!closed) failed
[proxy:0:2@lsst-verify-worker41] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:2@lsst-verify-worker41] main (pm/pmiserv/pmip.c:202): demux engine error waiting for event
srun: error: lsst-verify-worker43: task 3: Exited with exit code 7
srun: error: lsst-verify-worker41: task 2: Exited with exit code 7
srun: error: lsst-verify-worker30: task 1: Exited with exit code 7

Yeah, that died hard. Looks like it was killed (EXIT CODE: 9). One possibility is it used too much memory and got killed by the Linux kernel, but that would be very unusual for singleFrameDriver.py.
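If you want to check the memory theory, Slurm accounting should show it (assuming accounting is enabled on this cluster; the job id below is read off the N2403-sfd.o114909 file name):

$ sacct -j 114909 --format=JobID,State,ExitCode,MaxRSS,Elapsed
# a MaxRSS close to the node's memory is the telltale sign; recent Slurm versions also report OUT_OF_MEMORY in State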

I was processing 7 HSC fields in 2 filters simultaneously. Perhaps I should separate them to avoid the (possible) memory crash. I’ll try it…

Thanks for your help, Paul!

No, I don’t think it’s necessary to separate them: they should run one after the other using the available CPUs. I don’t understand why it failed, but I think you might want to just run your job again to see if it fares better.

Cool - I’ll give it a shot. Thanks again!

Hmmm. I ran it again, and it died a similar death (though on a different node, and while processing a different frame). Does anything look amiss in this invocation of SFD?

$ singleFrameDriver.py DATA --calib /datasets/hsc/calib/20180117 --rerun 20180305/NGC2403 --id field=NGC2403_CEN^NGC2403_EAST^NGC2403_NE^NGC2403_NW^NGC2403_SE^NGC2403_SW^NGC2403_WEST --batch-type=slurm --nodes 4 --procs 48 --time 600 --job N2403-sfd --clobber-versions --mpiexec='-bind-to socket'

Looks reasonable to me.

If it’s memory, you could try backing off the number of processors per node (e.g., --procs 24).
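That is, the same invocation as before, just with --procs 24 (untested sketch):

$ singleFrameDriver.py DATA --calib /datasets/hsc/calib/20180117 --rerun 20180305/NGC2403 --id field=NGC2403_CEN^NGC2403_EAST^NGC2403_NE^NGC2403_NW^NGC2403_SE^NGC2403_SW^NGC2403_WEST --batch-type=slurm --nodes 4 --procs 24 --time 600 --job N2403-sfd --mpiexec='-bind-to socket'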

In general, you should leave --clobber-versions (and --clobber-config) off unless you need them.