Help with NaN in jointcal fitting of DECam data (with obs_decam?)

We’re hoping for some pointers on how to debug a problem running jointcal on some DECam data. For a specific set of (at least) 6 dates in June 2013, jointcal fails in every filter, right as the astrometric fitting starts. Here’s the relevant part of the logfile:

jointcal.Associations INFO: Associated 33456 reference stars among 87194

jointcal.Associations INFO: Fitted stars before measurement # cut: 72458

jointcal.Associations INFO: Fitted stars after measurement # cut: 56878

jointcal.Associations INFO: Total, valid number of Measured stars: 873540, 857960

jointcal INFO: === Starting astrometric fitting...

jointcal.ConstrainedAstrometryModel INFO: Got 58 chip mappings and 79 visit mappings; holding chip 28 fixed (3660 total parameters).

jointcal.AstrometryFit INFO: Reference Color: 0 sig 0

jointcal INFO: Initial chi2/ndof : nan/1782832=nan

jointcal FATAL: Failed processing tract 0, FloatingPointError: Initial chi2 is invalid: chi2/ndof : nan/1782832=nan

My guess is that the fitting fails because the optimizer doesn’t know what to do with NaN. However, we don’t understand where this error comes from.

We’ve done some detective work but haven’t been able to figure it out:

  1. It’s not in the pixel data of the raw images: processCcd runs fine on the exposures, and the calexps look fine.

  2. Both older and newer images than those dates run through jointcal fine.

  3. It doesn’t appear to be related to the reference catalog: we tried both PS1 and SDSS (in addition to our default Gaia astrometry catalog) and got the same result. Furthermore, if we omit those specific dates, jointcal runs fine.

  4. It’s not related to a specific point of the sky—we’ve checked 4-5 different pointing positions separated by many tens of degrees and the behavior is the same.

  5. It doesn’t appear to be in the source positions—we haven’t yet looked at all chips on all exposures, but we’ve displayed the ra,dec of the source catalogs for many, and there’s nothing weird-looking in the range of ra,dec. However, when we include those chips in the jointcal, we get the NaN error above.

  6. It doesn’t depend on filter: u, g, r, i, z images all fail if they were taken on those dates.

  7. We looked through the image headers, and apart from some keywords differing in case (uppercase versus lowercase), nothing looked weird (in particular, all the WCS keywords seemed OK).

At this point, I’m not sure where to look for the source of the NaN. Has anyone else encountered this before, and does anyone have an idea of how to fix it? Is there any other metadata we should be looking at to help diagnose the problem?
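Displaying ra,dec by eye (check 5 above) can miss isolated bad values, and the fit presumably also depends on the position uncertainties, not just the positions. A minimal sketch of an exhaustive finiteness check is below. It assumes the catalog columns have already been read into plain Python sequences; the column names and values are hypothetical placeholders, not real catalog fields.

```python
import math


def count_nonfinite(columns):
    """Return {column_name: number of NaN or inf values} for each column."""
    return {
        name: sum(1 for value in values if not math.isfinite(value))
        for name, values in columns.items()
    }


# Hypothetical values standing in for one CCD's source catalog.
example = {
    "ra": [227.71, 227.73, 227.74],
    "dec": [5.74, 5.76, 5.77],
    "x_err": [0.02, float("nan"), 0.03],
}

print(count_nonfinite(example))
```

Running this over every CCD of a failing visit and a passing visit would show whether any column differs between the two.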

I recently had some problems with jointcal failing. I scanned the (very verbose) log and managed to identify a visit for which the small number of input CCDs all had 0 measuredStars. It turned out that the visit had a tracking failure.

Yes, we’ve also occasionally had problems with exposures where the pointing was at the edge of our field and too few chips matched. But these are centered and have >~200 measuredStars per CCD. And it’s every exposure in the run, which makes me suspect there’s something in the image headers or metadata that’s tripping jointcal up. I just can’t figure out what…

I’m not sure if the log lines above were originally corrupted or if I corrupted them in the process of fixing the formatting, but either way it’s a tiny fraction of what jointcal outputs. Could you please post the full log output?

Sure–sorry, I had just posted the end because the log is pretty big…

jointcal_A2029_z-15453115.out (1.3 MB)

I see several exposures where it Matched 0 objects for many of the CCDs. I suggest looking into these, as they may be the source of your NaN.

Looking at the log, I have some additional suggestions unrelated to the problem that I hope will make your experience better:

  • Specify a tract in the --id, and the job will go much faster.
  • Don’t use --clobber-config in production.
  • Set envvar OMP_NUM_THREADS=1 to disable the warning about implicit use of threads.
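If the job is launched from a Python wrapper rather than an interactive shell, the last suggestion could be applied roughly as follows. This is only a sketch: the helper names are hypothetical, and the command to run is left to the caller. Only the OMP_NUM_THREADS variable comes from the advice above.

```python
import os
import subprocess


def single_threaded_env(base=None):
    """Copy an environment mapping and set OMP_NUM_THREADS=1 in the copy."""
    env = dict(os.environ if base is None else base)
    env["OMP_NUM_THREADS"] = "1"
    return env


def run_single_threaded(command):
    """Run `command` (a list of arguments) with OMP_NUM_THREADS=1 set."""
    return subprocess.run(command, env=single_threaded_env(), check=True)
```

Copying the environment rather than modifying os.environ keeps the setting local to the child process.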

Thanks! I’ll try a test omitting those 4 exposures. I’m surprised, though, as those aren’t the ones that were in the dangerous date range. Will report back…

Just to check: what version of jointcal and/or the Science Pipelines are you running? Note that jointcal doesn’t look at the images, just the catalogs.

You can try turning on the writeChi2FilesInitialFinal config option. This will result in a .csv file in your current working directory containing the contributions to the chi2 matrix from every source at the time of initialization. You should be able to find the NaNs in there, which might help you track down which detectors/visits are causing the problem. You can also try running with log level DEBUG (--loglevel jointcal=DEBUG in this case), though I’m not sure there will be much useful information during initialization.
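Once that file exists, something like the following could tally the rows containing NaN by visit or detector. It is a rough sketch using only the standard library. The delimiter and the column names in the sample (visit, ccd, rx, ry, chi2) are assumptions for illustration; check them against the header of the actual file.

```python
import csv
import io
import math
from collections import Counter


def is_nan(text):
    """True if a field parses as a float NaN; False for anything else."""
    try:
        return math.isnan(float(text))
    except (TypeError, ValueError):
        return False


def nan_rows_by_key(lines, key, delimiter=","):
    """Count rows with at least one NaN field, grouped by the `key` column."""
    counts = Counter()
    for row in csv.DictReader(lines, delimiter=delimiter):
        if any(is_nan(value) for value in row.values()):
            counts[row[key]] += 1
    return counts


# Made-up sample standing in for the chi2 contributions file.
sample = io.StringIO(
    "visit,ccd,rx,ry,chi2\n"
    "343378,43,nan,nan,nan\n"
    "343378,43,0.1,-0.2,0.05\n"
    "343379,12,nan,0.3,nan\n"
)
print(nan_rows_by_key(sample, key="visit"))
```

For a real file, pass an open file object instead of the StringIO, and group by whichever column identifies the detector to see if the NaNs cluster on particular chips.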

The “Matched 0 objects” in the first block of the log just reflects the fact that those are the first detectors+visits in the list, so there is nothing yet for them to match against. Any detector on or past the edge of a tract may have zero cross matches, but should have refcat matches.

I wonder if this is related to this error reported by @sfu : DM-22548? We weren’t able to find anything conclusive in that case, and other things took priority after I’d initially looked into it.

Thanks! We’re still using v19.0 (we were hesitant to move to v20). I’ll add the writeChi2FilesInitialFinal option and let you know what we find. Last night I tried a couple of tests omitting or including files and can confirm that the “Matched 0 objects” entries don’t affect the completion of the jointcal processing, whereas it is the exposures from that one run that cause the NaN. Yes, this is the same problem @sfu pointed out. We’re coming back to it, and it’s increased in prominence because it seems to affect every exposure of that run, no matter the pointing…

So, the output of writeChi2FilesInitialFinal is interesting: the _meas version is where all of the NaNs are. It looks like columns rx, ry, rxi, ryi (and chi2) are all nan for many objects. I am not sure I understand the “visit” entry, though. What’s listed is almost always the first visit in the sequence and not the offending images… I’m including a link to the file (it’s too big to attach). I notice that even the objects that aren’t nan seem to have weirdly large values for this column, unless I don’t understand the units?