Jointcal processing: compiling list of tracts and visits

I am trying to reduce some open-use HSC data that spans hundreds of square degrees on the sky, using hscPipe 8.4. Given that the sky area is large, I am using the same skyMap as the HSC SSP. I have a number of questions about this.

  1. How should I compile the list of visits to be sent to jointcal? Is it correct that I should go through the list of (visit, ccd) combinations to figure out which tracts they overlap, and then pass a single tract and its corresponding visits to jointcal?

  2. I have tried to compile such a list (there must be a simpler pipeline way of figuring this out): I am just running jointcal.py with --show data and parsing the output. Is that the correct approach?

  3. It also seems that jointcal can only be run in SMP mode, not via PBS. Can someone please confirm?


I don’t think there is an official tool that generates lists of visits by tract. One option would be to simply pass everything to jointcal and let it iterate over tracts, but this restricts parallelism and would almost certainly be a recipe for frustration. If you have a non-official solution (e.g., your suggestion in point 2; or perhaps @furuswhs has something from the SSP work), that would be the way to go for now. I expect the coming Gen 3 middleware will deal with this, which makes it unlikely that an official tool will be constructed for the Gen 2 middleware.
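
As a rough illustration only (untested, not an official tool), a non-official script could group visits by the tracts they overlap using the Gen2 butler and the skymap directly. The repo path and the calexp_wcs/calexp_bbox composite dataset names below are assumptions for the sketch:

from collections import defaultdict
import lsst.daf.persistence as dafPersist
import lsst.geom

# Placeholder path: point this at the rerun containing your calexps and skymap.
butler = dafPersist.Butler("Feb_2021/rerun/skymap")
skyMap = butler.get("deepCoadd_skyMap")

tractToVisits = defaultdict(set)
for dataRef in butler.subset("calexp"):
    if not dataRef.datasetExists("calexp"):
        continue
    wcs = dataRef.get("calexp_wcs")
    bbox = dataRef.get("calexp_bbox")
    # Sky coordinates of the CCD corners, used to find overlapping tracts.
    corners = [wcs.pixelToSky(lsst.geom.Point2D(c)) for c in bbox.getCorners()]
    for tractInfo, _ in skyMap.findTractPatchList(corners):
        tractToVisits[tractInfo.getId()].add(dataRef.dataId["visit"])

for tract, visits in sorted(tractToVisits.items()):
    print(tract, "^".join(str(v) for v in sorted(visits)))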

jointcal doesn’t have an internal option to run it on the cluster (cf. singleFrameDriver and multibandDriver). You would need to generate and submit the batches yourself.
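
For example, one could generate and submit one PBS job per tract along these lines (a sketch only; the queue settings, walltime, and ccd selection are placeholders, not recommendations):

import subprocess

def submit_tract(tract, visits, repo="Feb_2021", rerun="skymap:jcal"):
    # Build the per-tract jointcal command; one jointcal instance per node.
    visit_str = "^".join(str(v) for v in sorted(visits))
    cmd = (f"jointcal.py {repo} --calib {repo}/CALIB --rerun {rerun} "
           f"--id visit={visit_str} tract={tract} ccd=0..8^10..103 "
           f"--config doPhotometry=True")
    script = (
        "#!/bin/bash\n"
        f"#PBS -N jointcal_{tract}\n"
        "#PBS -l nodes=1:ppn=1,walltime=24:00:00\n"
        "cd $PBS_O_WORKDIR\n"
        f"{cmd}\n"
    )
    filename = f"jointcal_{tract}.pbs"
    with open(filename, "w") as f:
        f.write(script)
    subprocess.run(["qsub", filename], check=True)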

Note also that jointcal is currently purely single-core, and has significant memory requirements, so you should run one jointcal instance per machine, not per CPU core.

Thanks to both of you.

I am currently using method 2: finding out which visits contributed to a given tract and passing them along to jointcal, running a single jointcal instance per node. But I have been seeing problems. For some tracts, the code simply does not finish and gives no warning whatsoever (it just continues to run) until walltime termination. Below is an assortment of places where the code either gets stuck or ends:

jointcal.AstrometryFit INFO: assignIndices: Now fitting Distortions Positions

or

jointcal ERROR: Potentially bad fit: High chi-squared/ndof.
jointcal INFO: Updating WCS for visit: 249176, ccd: 0

or

jointcal.PhotometryFit INFO: assignIndices: now fitting: Fluxes

or (the next one looks more like a bug, or possibly a memory error).

jointcal.PhotometryFit INFO: assignIndices: now fitting: Model Fluxes
Caught signal 11, backtrace follows:
/opt/local/hscPipe8/stack/miniconda3-4.7.10-4d7b902/Linux64/utils/8.0-hsc/lib/libutils.so(+0x15df4) [0x7fffea883df4]
/lib64/libc.so.6(+0x32354329a0) [0x7ffff78039a0]
/opt/local/hscPipe8/stack/miniconda3-4.7.10-4d7b902/Linux64/jointcal/8.0-hsc+1/lib/libjointcal.so(lsst::jointcal::FitterBase::findOutliers(double, lsst::jointcal::MeasuredStarList&, lsst::jointcal::FittedStarList&) const+0x78) [0x7ffff0767e48]
/opt/local/hscPipe8/stack/miniconda3-4.7.10-4d7b902/Linux64/jointcal/8.0-hsc+1/lib/libjointcal.so(lsst::jointcal::FitterBase::minimize(std::string const&, double, bool, bool, std::string const&)+0x113b) [0x7ffff076a67b]
/opt/local/hscPipe8/stack/miniconda3-4.7.10-4d7b902/Linux64/jointcal/8.0-hsc+1/python/lsst/jointcal/fitter.so(+0x20168c) [0x7fff99d0a68c]
/opt/local/hscPipe8/stack/miniconda3-4.7.10-4d7b902/Linux64/jointcal/8.0-hsc+1/python/lsst/jointcal/fitter.so(+0x1fc0ad) [0x7fff99d050ad]
python((null)+0x305) [0x7ffff811dec5]

We’ll need some more information: what is the exact command you are running, what wall time, how much data, etc.

Sure, here is the information:

First case:

jointcal.py Feb_2021 --calib Feb_2021/CALIB --rerun skymap:jcal --id visit=249176^249178^249180 tract=14599 ccd=0..8^10..103 -j 20 --config doPhotometry=True

The corresponding place where it stops (job killed after more than 8 hours):

jointcal.AstrometryFit INFO: Number of outliers (Measured + Reference = Total): 1359 + 5810 = 7169
jointcal INFO: Model chi2/ndof : 1.13472e+06/130426=8.70009
jointcal.AstrometryFit INFO: assignIndices: Now fitting Distortions Positions
jointcal.AstrometryFit INFO: findOutliers: found 0 meas outliers and 1 ref outliers 
jointcal.AstrometryFit INFO: findOutliers: found 0 meas outliers and 0 ref outliers 
jointcal.AstrometryFit INFO: Number of outliers (Measured + Reference = Total): 0 + 1 = 1
jointcal INFO: Fit completed chi2/ndof : 1.13458e+06/130424=8.6992
jointcal ERROR: Potentially bad fit: High chi-squared/ndof.
jointcal INFO: Updating WCS for visit: 249176, ccd: 0

Second case:

jointcal.py Feb_2021 --calib Feb_2021/CALIB --rerun skymap:jcal --id visit=249194 tract=14188 ccd=0..8^10..103 -j 20 --config doPhotometry=True

Output stalls at (job killed after more than 8 hours):

jointcal WARN: ccdImage 249194_101 has only 47 measuredStars (desired 100)
jointcal WARN: ccdImage 249194_102 has only 86 measuredStars (desired 100)
jointcal WARN: ccdImage 249194_103 has only 64 measuredStars (desired 100)
jointcal.PhotometryFit INFO: assignIndices: now fitting: Fluxes

Third case:

jointcal.py Feb_2021 --calib Feb_2021/CALIB --rerun skymap:jcal --id visit=249536 tract=13994 ccd=0..8^10..103 -j 20 --config doPhotometry=True

Stalls at (job killed after more than 8 hours):

jointcal WARN: ccdImage 249536_101 has only 9 measuredStars (desired 100)
jointcal WARN: ccdImage 249536_101 has only 9 RefStars (desired 30)
jointcal WARN: ccdImage 249536_102 has only 26 measuredStars (desired 100)
jointcal WARN: ccdImage 249536_102 has only 26 RefStars (desired 30)
jointcal WARN: ccdImage 249536_103 has only 36 measuredStars (desired 100)
jointcal.AstrometryFit INFO: assignIndices: Now fitting Distortions Positions

What Science Pipelines version does hscPipe 8.4 correspond to? We should have a sticky post about that relationship, for those of us not involved in HSC releases.

It looks like the run on tract=14599 completed (though the final chi2 is rather high); it just ran out of time before it finished writing the output. Try giving it more time before the timeout? I’m a bit surprised that it stopped just as it was writing the first output, though. I wonder if that’s related to the bad chi2. For example, something pathological in the final WCS that causes the construction of the final SkyWcs to fail catastrophically?

For the other two tracts you only specified one visit each: I haven’t tested jointcal with only a single visit, so I’m not that surprised at odd behavior, but I would expect more obvious errors, not for it to just hang.

Can you please upload the full logs somewhere, running with --loglevel jointcal=DEBUG?
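
For example, adapting your first command (paths and data IDs as in your post):

jointcal.py Feb_2021 --calib Feb_2021/CALIB --rerun skymap:jcal --id visit=249176^249178^249180 tract=14599 ccd=0..8^10..103 --loglevel jointcal=DEBUG --config doPhotometry=True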

You’re specifying -j 20: jointcal itself doesn’t care, but I don’t know whether the Gen2 butler does anything with it; could it be creating per-core memory limits based on that?

@price should be able to answer the LSST pipeline version corresponding to hscPipe 8.4.

The run on tract=14599 reached that point and stayed there for a long time, which is what puzzled me. I waited a long time (more than 6 hours for sure) for it to write something, but nothing happened. I could try to see whether something bad happens by putting in some debugging statements.

I will rerun with --loglevel jointcal=DEBUG and upload the logs soon.

The -j flag is mentioned here in the hscPipe manual: https://hsc.mtk.nao.ac.jp/pipedoc/pipedoc_8_e/tutorial_e/mosaic.html

hscPipe 8 was forked off LSST at around w.2020.05.

I don’t know where that document comes from (@price?) but it should be updated to remove the -j for jointcal.

So it seems the main reason jointcal was hanging was the -j flag. This should definitely be removed from the hscPipe tutorial. I remember mosaic used to have numCoresForReadSource, and I thought -j played a similar role, which is why I did not question it initially.

I will report back with more of the failures I am seeing, but I will try to sort and group them into categories first. Some are related to stale file handles, which may be an issue with the mounted filesystem; some are related to std::bad_alloc, which should be memory allocation failures.

Just wanted to report back that, after removing the -j flag, I was able to run jointcal without any failures on our cluster machines, which have more memory per node.
