Help with NaN in jointcal fitting of DECam data (with obs_decam?)

We’re hoping for a little help pointing out how to debug a problem running jointcal on some DECam data. For a specific set of (at least) 6 dates in June, 2013, jointcal (on all filters) fails the astrometric fitting just as fitting starts—here’s the relevant part of the logfile:

jointcal.Associations INFO: Associated 33456 reference stars among 87194

jointcal.Associations INFO: Fitted stars before measurement # cut: 72458 dataId=jointcal.Associations INFO: Fitted stars after measurement # cut: 56878

jointcal.Associations INFO: Total, valid number of Measured stars: 873540, 857960'visit': 343378, 'ccdnum': 43}

jointcal INFO: === Starting astrometric fitting...acts for components of dataId=jointcal.ConstrainedAstrometryModel INFO: Got 58 chip mappings and 79 visit mappings; holding chip 28 fixed (3660 total parameters).ts for components of dataId=jointcal.AstrometryFit INFO: Reference Color: 0 sig 0

jointcal INFO: Initial chi2/ndof : nan/1782832=nanacts for components of dataId=jointcal FATAL: Failed processing tract 0, FloatingPointError: Initial chi2 is invalid: chi2/ndof : nan/1782832=nan   “  

My guess is that the fitting fails because the optimizer doesn’t know what to do with NaN. However, we don’t understand where this error comes from.

We’ve done some detective work but haven’t been able to figure it out:

  1. It’s not in the pixel values data of the raw images—processCcd runs fine on the exposures, and the calexps look fine.

  2. Both older and newer images than those dates run through jointcal fine.

  3. it doesn’t appear to be related to the reference catalog—we tried both PS1 and SDSS (in addition to our default GAIA astrometry catalog) and got the same result. Furthermore, if we omit those specific dates, jointcal runs fine.

  4. It’s not related to a specific point of the sky—we’ve checked 4-5 different pointing positions separated by many tens of degrees and the behavior is the same.

  5. It doesn’t appear to be in the source positions—we haven’t yet looked at all chips on all exposures, but we’ve displayed the ra,dec of the source catalogs for many, and there’s nothing weird-looking in the range of ra,dec. However, when we include those chips in the jointcal, we get the NaN error above.

  6. It doesn’t depend on filter—u,g,r,i,z images all fail if they were taken on those dates

  7. We looked through the image headers, and apart from some differences in keywords being uppercase or lowercase, there didn’t seem to be anything weird (in particular all the WCS keywords seemed o.k…).

At this point, I’m not sure where to look for the source of NaN. Has anyone else encountered this before and have an idea of how to fix this? Are there any other metadata we should be looking at to help diagnose the problem?

I recently had some problems with jointcal failing. I scanned the (very verbose) log, and managed to identify a visit for which the small number of input CCDs had only 0 measuredStars. It turned out that that visit had a tracking failure.

Yes, we’ve also occasionally had problems with exposures where the pointing was at the edge of our field and too few chips matched. But these are centered and have >~200 measuredStars per CCD. And it’s every exposure in the run, which makes me suspect there’s something in the image headers or metadata that’s tripping jointcal up. I just can’t figure out what…

I’m not sure if the log lines above were originally corrupted or if I corrupted them in the process of fixing the formatting, but either way it’s a tiny fraction of what jointcal outputs. Could you please post the full log output?

Sure–sorry, I had just posted the end because the log is pretty big…

jointcal_A2029_z-15453115.out (1.3 MB)

I see several exposures where it Matched 0 objects for many of the CCDs. I suggest looking into these, as they may be the source of your NaN.

Looking at the log, I have some additional suggestions unrelated to the problem that I hope will make your experience better:

  • Specify a tract in the --id, and the job will go much faster.
  • Don’t use --clobber-config in production.
  • Set envvar OMP_NUM_THREADS=1 to disable the warning about implicit use of threads.

Thanks! I’ll try a test omitting those 4 exposures–I’m surprised, though, as those aren’t the ones that were in the dangerous date range. Will report back…

Just to check: what version of jointcal and/or the Science Pipelines are you running? Note that jointcal doesn’t look at the images, just the catalogs.

You can try turning on the writeChi2FilesInitialFinal config option. This will result in a .csv file in your current working directory containing the contributions to the chi2 matrix from every source at the time of initialization. You should be able to find the NaNs in there, which might help you track down which detectors/visits are causing the problem. You can also try running with log level DEBUG (--loglevel jointcal=DEBUG in this case), though I’m not sure there will be much useful information during initialization.

The “Matched 0 objects” in the first block of the log is just the fact that those are the first detectors+visits in the list, so they will have no matches. Any detector on or past the edge of a tract may have zero cross matches, but should have refcat matches.

I wonder if this is related to this error reported by @sfu : DM-22548? We weren’t able to find anything conclusive in that case, and other things took priority after I’d initially looked into it.

1 Like

Thanks! We’re using v19.0 still (we were hesitant to move to v20). I’ll add the writeChi2FilesInitialFinal and let you know what we find. (Last night I tried a couple of tests omitting or including files and can confirm that the “Matched 0 objects” entries don’t affect the completion of the jointcal processing, whereas it is the exposures from that one run that cause the NaN. Yes, this the same problem @sfu pointed out–we’re coming back to it and it’s increased in prominence because it seems to affect every exposure of that run, no matter the pointing…

So, the output of writeChi2FilesInitialFinal is interesting–the _meas version is where all of the NaN are. It looks like columns rx,ry,rxi,ryi (and chi2) for many objects are all nan. I am not sure I understand the “visit” entry, though. What’s listed is almost always the first visit in the sequence and not the offending images… I’m including the a link to the file (it’s too big to include). I notice that even the objects that aren’t nan seem to have weirdly large values for this column–unless I don’t understand the units?

We’re still puzzled by this problem. I spent much of last week going through the output of ProcessCcd on the “bad” run and the other runs, and couldn’t find anything that was out of place. That said, jointcal runs fine if the images are omitted, but crashes immediately with the NaN error when they are included. We’ve narrowed it down to the rx,ry,rxi,ryi, and chi2 entries in the astrometry_initial_chi2-0_z-meas.csv
file. Can someone give me a hint as to where to look to track down where these NaN could be calculated? (The link is included, but I reproduce a couple of lines from the file; headers and relevant entries) in case it’s useful:

#id     xccd    yccd    rx      ry      xtp     ytp     mag     mjd     xErr    yErr    xyCov   xtpi    ytpi    rxi     ryi     color   fsindex ra      dec     chi2    nm      chip    visit
#id in source catalog   coordinates in CCD (pixels)             residual on TP (degrees)                transformed coordinate in TP (degrees)          rough magnitude Modified Julian Date of the measurement transformed measurement uncertainty (degrees)                   as-read position on TP (degrees)                as-read residual on TP (degrees)                currently unused        unique index of the fittedStar  on sky position of fittedStar           contribution to Chi2 (2D dofs)  number of measurements of this fittedStar       chip id visit id
91815667665207338       967.330371      173.495251      nan     nan     -186.439451     -91.8579238     14.9774629      56461.0671      9.09943477e-08  9.17736513e-08  5.81127881e-11  -0.140123132    0.00627986687   nan     nan     0       -1      227.261502      6.55549107      nan     5       1       213775
1 Like

So it looks like there’s a problem with the residual position on the tangent plane (rx,ry,rxi,ryi; and chi2 is calculated from those). But the position on the tangent plane is fine so it’s not a projection problem. So does this mean that a NaN has gotten into the coefficients?

@idellant Are the images going into this jointcal run available for download somewhere easily accessible? (I would like to see the headers, mainly, but it might be helpful to see what the images look like.)

Is it possible for you to send me a pickle file with the list of visitInfos?

@price, I’ve put two images (one that works, visit 0202219, and one that fails, visit 0213777), plus the calexps and src catalogs of a chip (41) in case you want to see them in . I apologize, I’m not sure where the pickle file with the visitInfos would be stored–the only .pickle files that seem to be generated look to be the various packages.pickle and the SkyMap.pickle file. Sorry, Ian

1 Like

I think the problem is that the airmass isn’t set in the header for visits that fail.

The good calexp contains the headers:

BORE-RA =     227.701523365432                                                  
BORE-DEC=     5.71966372367164                                                  
BORE-AZ =                23.98                                                  
BORE-ALT=                51.49                                                  
HIERARCH BORE-AIRMASS =   1.28                                                  
HIERARCH BORE-ROTANG =     90.                                                  
ROTTYPE = 'SKY     '           / Type of rotation angle                         

But BORE-AIRMASS is missing in the bad calexp:

BORE-RA =     227.725990026014                                                  
BORE-DEC=     5.65894149906265                                                  
BORE-AZ =                12.52                                                  
BORE-ALT=                53.52                                                  
HIERARCH BORE-ROTANG =     90.                                                  
ROTTYPE = 'SKY     '           / Type of rotation angle                         

I think this means that the airmass is NaN, which means CcdImage._tanZ is NaN, which means the refraction vector is NaN, which causes the rx,ry values to be NaN.

Can you check to see if the AIRMASS header is present and defined in the raw images?

The Proper Way to fix this is via a set of header correction files in obs_decam, but that scheme doesn’t currently exist, and I don’t remember what needs to be done to support that (@timj?). A quick and dirty fix would be to hack the headers (very naughty, I know!). And there should be a check for bad VisitInfo values in jointcal.


Indeed! The “bad” raw image has no AIRMASS keyword, whereas the good one does.
So, the practical question is–is it better to hack the raw image’s header and add the keyword, then rerun processCcd (I think so?)
or is it better/easier to just add the keyword to the calexps headers?
Thank you! This has been bugging us for so long…

1 Like

I believe you can create a simple one-line header correction YAML file as described in

It might go in corrections/DECam-0213777.yaml (or whatever the appropriate obs_id is).

1 Like

What behavior is desired for a DECam header where the AIRMASS keyword is missing? If this is something that should always be fixed for all DECam headers, it could be added here

The metadata translator is doing the right thing. It already has a facility for hacking headers on read that should be used here (but I don’t believe it’s set up yet for DECam).

If the airmass isn’t set in the VisitInfo, jointcal should throw an exception.

Thank you. Can I ask whether we should start from rerunning processCcd, or just rerun jointcal? Do we also need to adjust the config files? Thanks.