processCcd blank second extension

I am running into a very strange issue around processCcd.py. I am trying to launch a series of batch jobs to the canfar cloud (uses condor) where each job would process a single HSC frame.

During setup, I created a VM that I interact with through SSH, and installed lsst_distrib on it. I also wrote a script that copies an image onto the VM, ingests that image into the butler, runs processCcd, and copies the image back. When I am logged into the VM directly, that script is fully successful, and the resulting processed CCDs are beautiful. Everything looks great.

So that machine is duplicated as a VM for use in batch mode.

When exactly the same job is launched through condor, things trip up. Oddly, the terminal output of processCcd.py is exactly the same as it is when run interactively, and the data extension is processed correctly. But the second extension, which contains the pixel flags for sources, background, bad pixels, etc., is blank, containing only zeros.

The batch job raises no flags, logs no errors, and the processCcd script finishes as normal.

So there is something strange about the process that fills the second extension: it doesn’t work correctly, but it doesn’t take the script down or throw any errors (or even any warnings).

Anyone have any ideas what to look for?

A few details:
-Ubuntu 18.04 OS
-running on Intel Broadwell hardware
-the source and setup calls for the lsst distribution are all good
-PATH, PYTHONPATH, and LD_LIBRARY_PATH variables are identical between the interactive VM and the one run in batch mode.

That is wild, and intriguing.

Some wildly speculative questions:
Does the second extension (the mask) exist and look normal except that it contains zeros, or could it be truncated somehow? What about the third extension (the variance plane)? Are you sure that the pipeline writes it in the form you’re seeing, or could it have been corrupted in transit? Could there have been something wrong in the cfitsio build, since you’re running on a non-standard OS?

You could perhaps try something simpler to narrow things down: write a simple script to read an ExposureF (containing data in all extensions) and write it out again.

Thanks @ktl

Over a number of days, we dug into a huge number of things, including much of what you mentioned. Turns out the issue was not with the pipeline, but with ds9. The images were fine from the start. Sigh.

Pro-tip: if you are using ds9 version 8.1, don’t.

Aha! It turns out that this is a sort-of-known but not well-advertised problem. Some Slack messages from back in January:

If you’re reading with something other than afw, it’s quite possible our unusual (but legal) compression for masks is not working. Definitely all but the very latest versions of DS9 have that problem, and the symptom is all zeros.

and

[Robert Lupton]: I have to play games to make our masks work with ds9. When Eric added them for me he didn’t add support for different colours for each bitplane, so the afwDisplay code does the slicing.

I’ve added some text to pipelines.lsst.io about this; see https://jira.lsstcorp.org/browse/DM-24678 for more details.
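
For anyone who hits the same symptom and wants to see how each extension is stored before blaming the pipeline, here is a rough sketch that lists the HDUs in a FITS file and reports whether each one is flagged as a tile-compressed image. It uses only the Python standard library, so it does not depend on afw or ds9. The function names (parse_value, read_header, data_bytes, scan_hdus) are made up for this illustration; ZIMAGE and ZCMPTYPE are the keywords from the FITS tiled-image compression convention, and I am assuming the file follows that convention. The card parser is deliberately minimal (no doubled quotes inside strings, no CONTINUE cards), so treat it as a diagnostic aid, not a general FITS reader.

```python
BLOCK = 2880  # FITS files are organised in 2880-byte blocks
CARD = 80     # each header card is 80 ASCII characters


def parse_value(raw):
    """Convert the value field of a header card to str, bool, int, or float."""
    raw = raw.strip()
    if raw.startswith("'"):
        # String value: take the text up to the closing quote.
        end = raw.find("'", 1)
        if end < 0:
            end = len(raw)
        return raw[1:end].rstrip()
    raw = raw.split("/", 1)[0].strip()  # drop any trailing comment
    if raw in ("T", "F"):
        return raw == "T"
    for convert in (int, float):
        try:
            return convert(raw)
        except ValueError:
            pass
    return raw


def read_header(handle):
    """Return the next header as a dict, or None at end of file."""
    header = {}
    while True:
        block = handle.read(BLOCK)
        if len(block) < BLOCK:
            return None  # end of file, or a truncated header
        text = block.decode("ascii", errors="replace")
        for start in range(0, BLOCK, CARD):
            card = text[start:start + CARD]
            keyword = card[:8].strip()
            if keyword == "END":
                return header
            if card[8:10] == "= ":
                header[keyword] = parse_value(card[10:])


def data_bytes(header):
    """Size in bytes of the data unit that follows this header, before padding."""
    naxis = header.get("NAXIS", 0)
    if naxis == 0:
        return 0
    elements = 1
    for axis in range(1, naxis + 1):
        elements *= header["NAXIS%d" % axis]
    bytes_per_element = abs(header["BITPIX"]) // 8
    groups = header.get("GCOUNT", 1)
    return bytes_per_element * groups * (header.get("PCOUNT", 0) + elements)


def scan_hdus(path):
    """List every HDU in the file with its type and compression keywords."""
    summaries = []
    with open(path, "rb") as handle:
        while True:
            header = read_header(handle)
            if header is None:
                break
            summaries.append({
                "index": len(summaries),
                "type": header.get("XTENSION", "PRIMARY"),
                "extname": header.get("EXTNAME"),
                "compressed": header.get("ZIMAGE") is True,
                "algorithm": header.get("ZCMPTYPE"),
            })
            # Skip the data unit, which is padded to a whole number of blocks.
            padded = -(-data_bytes(header) // BLOCK) * BLOCK
            handle.seek(padded, 1)
    return summaries


# Example use (the file name is hypothetical):
#     for hdu in scan_hdus("calexp.fits"):
#         print(hdu)
```

If the mask extension shows up as a compressed binary table while your viewer displays zeros, that points at the reader and not at the file, which is what happened in this thread.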