A few issues with jointcal

I’m using the DRP step3 processing to create deep coadds and good-seeing templates from a set of calibrated DECam images, and I am running into a few issues with the jointcal task.

Slow convergence

I am finding that processing gets stuck on the jointcal stage, and I’ve identified the outlier-rejection step as the part that is very slow to converge. I’ve found a config option, astrometryOutlierRelativeTolerance, which I’ve set to 0.002; this allows the astrometry solution to converge in a reasonable amount of time. However, there is no corresponding option for photometry (e.g. a photometryOutlierRelativeTolerance). As a workaround, I’ve manually edited jointcal.py to set sigmaRelativeTolerance=0.002 as the default in _iterate_fit so that the photometry fit converges.
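For reference, this is the override I’m using for the astrometry side (0.002 is just the value that happened to work for me):

# jointcal config override: loosen the convergence criterion for the
# astrometry outlier-rejection iterations so the fit finishes in reasonable time.
config.astrometryOutlierRelativeTolerance = 0.002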

I’ve also noticed from the debug logging that the mean/median/standard deviation of chi^2/dof indicate that the chi^2 distribution is highly skewed by the outliers, so both the mean and the standard deviation are biased high, making mean + 3 sigma cuts ineffective at removing outliers. I have deleted the logs since making this post (I can try to regenerate them if needed), but from memory the chi^2/dof statistics were roughly median ~4, mean ~10, stddev ~20. So I was wondering whether a different statistic, e.g. median + standard deviation, or percentiles of the distribution (or statistics derived from them), would be better for outlier determination.
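For concreteness, this is the sort of thing I have in mind; a standalone numpy sketch of a percentile-based cut, not actual jointcal code:

import numpy as np

def robust_chi2_cut(chi2_per_dof, nsigma=3.0):
    # Build the threshold from the median and a percentile-based width, so a
    # long tail of outliers does not inflate the cut the way mean/stddev do.
    median = np.median(chi2_per_dof)
    sigma = np.percentile(chi2_per_dof, 84.1) - median  # ~1 sigma for the core
    return median + nsigma * sigma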

Memory usage issues

I am finding that running jointcal has very high memory requirements when calibrating across ~250 DECam visits (59 usable CCDs per visit out of 62) of dense fields (the Galactic bulge). Memory usage grows over time up to ~250 GB, and I’ve then had processes fail with out-of-memory errors. Here are a few log lines showing the number of sources used in the fit:

INFO 2022-04-30T01:09:08.630-07:00 lsst.jointcal (jointcal:{instrument: 'DECam', skymap: 'discrete_decaps_east', physical_filter: 'r DECam SDSS c0002 6415.0 1480.0', tract: 0, ...})(jointcal.py:1260) - ====== Now processing astrometry...
INFO 2022-04-30T01:22:33.088-07:00 lsst.jointcal (jointcal:{instrument: 'DECam', skymap: 'discrete_decaps_east', physical_filter: 'r DECam SDSS c0002 6415.0 1480.0', tract: 0, ...})(loadReferenceObjects.py:904) - Loading reference objects from gaia_dr2_20200414 in region bounded by [269.18883014, 272.59111395], [-31.00847184, -28.04821100] RA Dec
INFO 2022-04-30T01:23:16.169-07:00 lsst.jointcal (jointcal:{instrument: 'DECam', skymap: 'discrete_decaps_east', physical_filter: 'r DECam SDSS c0002 6415.0 1480.0', tract: 0, ...})(loadReferenceObjects.py:945) - Loaded 5172187 reference objects
INFO 2022-04-30T01:23:20.465-07:00 lsst.jointcal (jointcal:{instrument: 'DECam', skymap: 'discrete_decaps_east', physical_filter: 'r DECam SDSS c0002 6415.0 1480.0', tract: 0, ...})(loadReferenceObjects.py:1321) - Correcting reference catalog for proper motion to <Time object: scale='tai' format='jyear' value=2021.4277170694447>
INFO 2022-04-30T01:27:36.187-07:00 lsst.jointcal.astrometryReferenceSelector (jointcal:{instrument: 'DECam', skymap: 'discrete_decaps_east', physical_filter: 'r DECam SDSS c0002 6415.0 1480.0', tract: 0, ...})(sourceSelector.py:618) - Selected 5172187/5172187 references
WARNING 2022-04-30T01:27:36.785-07:00 lsst.jointcal.Associations (jointcal:{instrument: 'DECam', skymap: 'discrete_decaps_east', physical_filter: 'r DECam SDSS c0002 6415.0 1480.0', tract: 0, ...})(Associations.cc:255) - No ra/dec proper motion covariances in refcat: Field with name 'pm_ra_Dec_Cov' not found with type 'F'.
INFO 2022-04-30T01:27:42.829-07:00 lsst.jointcal.Associations (jointcal:{instrument: 'DECam', skymap: 'discrete_decaps_east', physical_filter: 'r DECam SDSS c0002 6415.0 1480.0', tract: 0, ...})(Associations.cc:333) - Associated 605638 reference stars among 5172187
INFO 2022-04-30T01:27:43.016-07:00 lsst.jointcal.Associations (jointcal:{instrument: 'DECam', skymap: 'discrete_decaps_east', physical_filter: 'r DECam SDSS c0002 6415.0 1480.0', tract: 0, ...})(Associations.cc:343) - Fitted stars before measurement # cut: 613829
INFO 2022-04-30T01:27:45.372-07:00 lsst.jointcal.Associations (jointcal:{instrument: 'DECam', skymap: 'discrete_decaps_east', physical_filter: 'r DECam SDSS c0002 6415.0 1480.0', tract: 0, ...})(Associations.cc:383) - Fitted stars after measurement # cut: 606747
INFO 2022-04-30T01:27:45.373-07:00 lsst.jointcal.Associations (jointcal:{instrument: 'DECam', skymap: 'discrete_decaps_east', physical_filter: 'r DECam SDSS c0002 6415.0 1480.0', tract: 0, ...})(Associations.cc:384) - Total, valid number of Measured stars: 25030947, 25023865

Should I play around with the sourceSelector, astrometryReferenceSelector, and photometryReferenceSelector to select fewer sources? (If this is the cause of the memory usage – I know that jointcal is manipulating a very sparse matrix, so maybe the number of sources is not a problem.)

Determining whether a “potentially bad” fit is actually bad

I am finding, however, that a tolerance of 0.002 leads to the fit being flagged as “potentially bad”, e.g.:

INFO 2022-04-27T17:24:21.558-07:00 lsst.jointcal (jointcal:{instrument: 'DECam', skymap: 'discrete_decaps_east', physical_filter: 'i DECam SDSS c0003 7835.0 1470.0', tract: 0, ...})(jointcal.py:1410) - Fit completed chi2/ndof : 3.49659e+07/533408=65.5518
ERROR 2022-04-27T17:24:21.602-07:00 lsst.jointcal (jointcal:{instrument: 'DECam', skymap: 'discrete_decaps_east', physical_filter: 'i DECam SDSS c0003 7835.0 1470.0', tract: 0, ...})(jointcal.py:1715) - Potentially bad fit: High chi-squared/ndof.

So I have more experimenting to do, but I am also looking for a way to verify whether this fit is bad enough to produce WCS/photometry results that are worse than a per-detector WCS solution. I am wondering if anyone has suggestions.
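To frame what I mean by “worse”: given the per-detector (single-frame) WCS and the jointcal WCS for the same detector as SkyWcs objects (retrieved from my repo however is appropriate), I was planning to map a pixel grid through both and look at the on-sky offsets. A sketch of that, not pipeline code:

import numpy as np

def wcs_offsets_arcsec(wcs_single, wcs_jointcal, width=2048, height=4096, nx=9, ny=17):
    # On-sky separation (arcsec) between the two solutions over a grid of pixels.
    xs, ys = np.meshgrid(np.linspace(0, width, nx), np.linspace(0, height, ny))
    return np.array([
        wcs_single.pixelToSky(x, y).separation(wcs_jointcal.pixelToSky(x, y)).asArcseconds()
        for x, y in zip(xs.ravel(), ys.ravel())
    ])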

Model is not valid

Finally, I’ve had photometric fitting fail after astrometric fitting succeeded:

ERROR 2022-04-30T12:27:33.735-07:00 lsst.ctrl.mpexec.singleQuantumExecutor (jointcal:{instrument: 'DECam', skymap: 'discrete_decaps_east', physical_filter: 'r DECam SDSS c0002 6415.0 1480.0', tract: 0, ...})(singleQuantumExecutor.py:196) - Execution of task 'jointcal' on quantum {instrument: 'DECam', skymap: 'discrete_decaps_east', physical_filter: 'r DECam SDSS c0002 6415.0 1480.0', tract: 0, ...} failed. Exception ValueError: Model is not valid: check log messages for warnings.

I am wondering what I should look for in the logs to indicate why this is happening. I find that there are warnings saying there are too few measured stars (which is curious since these are dense fields):

WARNING 2022-04-30T12:27:32.717-07:00 lsst.jointcal (jointcal:{instrument: 'DECam', skymap: 'discrete_decaps_east', physical_filter: 'r DECam SDSS c0002 6415.0 1480.0', tract: 0, ...})(jointcal.py:1631) - ccdImage 995274_62 has only 81 measuredStars (desired 100)

And warnings that the photometry model is negative at certain pixel positions:

WARNING 2022-04-30T12:27:33.716-07:00 lsst.jointcal.ConstrainedMagnitudeModel (jointcal:{instrument: 'DECam', skymap: 'discrete_decaps_east', physical_filter: 'r DECam SDSS c0002 6415.0 1480.0', tract: 0, ...})(PhotometryModel.cc:60) - CcdImage 995251_25 is negative at (0,2047.5): -12676.5

I am thinking I’ll inspect some of the CCDs that are flagged as having few sources and/or extreme negative values, to see if something has gone wrong in step{0-2} processing.
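Roughly what I had in mind (the collection name is a placeholder for my own step1/step2 output run, and I’m assuming the 995251_25 in the log means visit 995251, detector 25):

import numpy as np
from lsst.daf.butler import Butler

butler = Butler("repo", collections="DECam/runs/my_step2")  # placeholder repo/collection
dataId = {"instrument": "DECam", "visit": 995251, "detector": 25}

calexp = butler.get("calexp", dataId)   # calibrated exposure from single-frame processing
src = butler.get("src", dataId)         # its source catalog
arr = calexp.image.array
print(f"nSrc={len(src)}  min={np.nanmin(arr):.1f}  median={np.nanmedian(arr):.1f}")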

Thanks for any help with this.

Jointcal is very slow when operating on many visits because it wants to iterate until there are zero outliers, but it finds new outliers with each iteration. When I hit this problem, I recompile jointcal with the following patch, which results in much faster solutions:

diff --git a/src/FitterBase.cc b/src/FitterBase.cc
index a7f3feb..e02191f 100644
--- a/src/FitterBase.cc
+++ b/src/FitterBase.cc
@@ -213,7 +213,7 @@ MinimizeResult FitterBase::minimize(std::string const &whatToFit, double nSigmaC
     std::size_t totalRefOutliers = 0;
     double oldChi2 = computeChi2().chi2;
 
-    while (true) {
+    for (std::size_t iteration = 0; iteration < 5; ++iteration) {
         Eigen::VectorXd delta = chol.solve(grad);
         if (doLineSearch) {
             scale = _lineSearch(delta);

Regarding your memory issues, your tract looks rather large (something on the order of 9 square degrees?), and all that is getting loaded into memory at once. It looks like you’re using a discrete skymap constructed with butler make-discrete-skymap. Perhaps it would be better to split things into smaller tracts by using a different skymap?
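For example, registering a rings skymap would give you many smaller tracts instead of one large discrete one. Something along these lines (field names are from memory, so check them against your pipelines version; the numbers are purely illustrative):

# skymap_config.py, used as: butler register-skymap REPO -C skymap_config.py
config.name = "decaps_rings"                    # name for the new skymap (your choice)
config.skyMap = "rings"
config.skyMap["rings"].numRings = 120           # more rings -> smaller tracts
config.skyMap["rings"].projection = "TAN"
config.skyMap["rings"].pixelScale = 0.263       # DECam plate scale, arcsec/pixel
config.skyMap["rings"].tractOverlap = 1.0 / 60  # overlap between tracts, degrees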

On convergence and the tolerance value for photometry: we’ve basically deprecated jointcal photometry in favor of fgcmcal, but I don’t believe fgcmcal works on DECam yet (@erykoff could tell you if there’s a way to get it to work). We’ve also basically paused jointcal development, as @cmsaunders is working on gbdes to replace it for astrometry, so improvements in outlier rejection (which we’ve long known was the slowest part) will be a long time coming.

The reference catalogs shouldn’t have much of anything to do with jointcal’s memory usage: they’re a pretty small component of the overall number of sources; that’s dominated by the measuredStars (every star on every visit and detector). You could try upping sourceSelector["science"].signalToNoise.minimum to 20 or 50 or 100 to use fewer sources per detector. I don’t know at what level the fit might get bad due to not having enough sources, but it should reduce your runtime and memory requirements.
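i.e. something like this in your jointcal config override (50 is just one of the values above; tune to taste):

# Raise the S/N floor so each detector contributes fewer, better-measured stars.
config.sourceSelector["science"].signalToNoise.minimum = 50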

Chi2 of 65 is quite high. Is this for photometry or astrometry? Do you have logs from before you set sigmaRelativeTolerance that you could look at, to see how quickly it was converging, or whether it had basically flat-lined at that point? I recall seeing high chi2 for photometric fits long ago, due either to poorly-matched reference catalogs (are you using DECam colorterms appropriate to your reference catalog?) or to a model that is not well matched to the instrument (the current jointcal defaults should work well for DECam, but I haven’t verified that on very large datasets).

“Model is not valid” errors mean that the photometric model becomes negative somewhere on the detector bbox (or very near it). I’ve seen that with bad data (e.g. a nearly fully-vignetted HSC detector), so that’s the first thing I’d check.

The warnings about the number of stars may or may not be useful: we initially picked the numbers for those warning levels based on our expectations for LSST, but they probably need tuning (and they may well be set too high for other datasets). You can set config.minMeasuredStarsPerCcd (try 40?) and config.minRefStarsPerCcd (try 20?) to smaller values to cut down on the number of warning messages and help you identify detectors that have very few sources used.
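For example:

# Lower the per-CCD warning thresholds so the remaining warnings point at
# genuinely sparse detectors (values are just the suggestions above).
config.minMeasuredStarsPerCcd = 40
config.minRefStarsPerCcd = 20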