Testing threshold concerns raised in python 3 porting

mrawls · October 7, 2016, 11:30pm

In the course of working on DM-7292, which should be a straightforward port of pipe_tasks from py2 to py3, we encountered an interesting problem with how thresholds set with assertAlmostEqual are used in tests. In this specific situation, testProcessCcd returns slightly different values for image statistics (background mean, numGoodPix, image mean, etc.) when it is run in py2 and py3.

I believe the problem relates to how image noise is simulated with pseudorandom numbers. For instance, python’s built-in random returns different values for the same seed in py2 vs. py3, while numpy.random returns identical values in py2 and py3. Unfortunately, this doesn’t seem to be the cause of this specific issue because the problem persisted when I switched to using numpy.random in the few places where random was used throughout pipe_tasks.

Two main points. One, please be aware that tests may pass while the calculated values in fact differ subtly from what is expected, particularly if random is involved. Two, has anyone encountered an issue like this before, and does anyone have any clues about where the differences in the test image statistics are being introduced?

price · October 8, 2016, 12:50am

Can you list which tests are failing?

mrawls · October 8, 2016, 12:56am

No tests are currently failing, but this concerns testProcessCcd. The tolerance thresholds for several assertAlmostEqual statements in this test were previously increased specifically so it will pass.
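For readers unfamiliar with how these thresholds behave, here is a minimal sketch of what loosening an assertAlmostEqual tolerance does. The numbers and the almost_equal helper are hypothetical, chosen only for illustration; they are not taken from testProcessCcd.

```python
import unittest

# A bare TestCase instance gives access to the assert* methods outside a test run.
_tc = unittest.TestCase()


def almost_equal(measured, expected, **kwargs):
    """Return True if assertAlmostEqual would pass, False if it would fail.

    Hypothetical helper for illustration only.
    """
    try:
        _tc.assertAlmostEqual(measured, expected, **kwargs)
    except AssertionError:
        return False
    return True


# Made-up values: the two "runs" differ by about 1.2e-5.
expected = 123.456789
measured = 123.456801

# Default is places=7, i.e. round(measured - expected, 7) must equal 0.
strict = almost_equal(measured, expected)

# Loosening to places=4 rounds the difference away, so the check passes.
loose_places = almost_equal(measured, expected, places=4)

# delta= is an absolute tolerance and cannot be combined with places=.
loose_delta = almost_equal(measured, expected, delta=1e-4)

print(strict, loose_places, loose_delta)
```

The loosened checks still pass, but they no longer distinguish the two values, which is why a passing test can hide a real difference.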

price · October 8, 2016, 1:08am

I’ve never liked that test very much, as it feels needlessly restrictive. I don’t think testProcessCcd should be checking whether we can do background subtraction exactly as we did before; that should be left to the background subtraction test. Testing inconsequential things, and with such a tight tolerance, breaks encapsulation and makes legitimate development more difficult.

RHL · October 8, 2016, 3:09pm

Do we get different answers when the random image is identical? Meredith implies that this might be true:

[quote="mrawls, post:1, topic:1220"]
Unfortunately, this doesn’t seem to be the cause of this specific issue because the problem persisted when I switched to using numpy.random in the few places where random was used throughout pipe_tasks.
[/quote]

While I agree that things might change as we adjust algorithms, the py3 port shouldn’t be introducing anything like this.

rowen · October 10, 2016, 5:59pm

I agree with @rhl. In this case I think the fact that the test failed (before the tolerances were increased) is telling us something potentially important and worth tracking down.

One possibility is that some existing py2 code uses / for integer division. If so, this should be fixed: all Python 2 compatible code should have from __future__ import absolute_import, division, print_function, and should use // wherever integer division is wanted.
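A small sketch of the kind of difference I mean. The function names are made up for illustration and this is not code from pipe_tasks; I have not identified any such line there.

```python
# On Python 2, a module would need this as its first statement to get the
# Python 3 behavior of "/" shown below:
#     from __future__ import absolute_import, division, print_function


def mean_true(total, count):
    """True division: int / int gives a float on Python 3."""
    return total / count


def mean_floor(total, count):
    """Floor division: same result on Python 2 and Python 3."""
    return total // count


# With integer inputs, Python 2 without the __future__ import evaluates
# 7 / 2 as 3, while Python 3 evaluates it as 3.5.
print(mean_true(7, 2))
print(mean_floor(7, 2))
```

Any statistic computed from integer counts with / would therefore change value between the two interpreters unless the __future__ import is present.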

The issue of random numbers came up in a different context: I heard a report that running processCcd.py repeatedly on a given set of data produced some results that were not bit-identical. It seemed to me that a random number with the seed not set was likely to be the cause. Where that might be is a good question. We explicitly use a random number generator to replace sources with noise during measurement, and random numbers are probably used in certain solvers. In any case, the symptoms of a systematic difference between py2 and py3 don’t seem to fit a non-seeded random number generator (though they could fit a seeded generator that was known to give a different sequence of numbers in py2 and py3).
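To illustrate the seeded versus non-seeded distinction, here is a minimal sketch using only the standard library. The noise_samples function is hypothetical and stands in for any code that draws simulated noise; it is not how the stack generates noise.

```python
import random


def noise_samples(seed=None, n=5):
    """Draw n Gaussian noise values from a private generator.

    With seed=None the generator is seeded from the OS or the clock,
    so repeated calls give different values.
    """
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


# Same seed, same interpreter: the sequences are identical.
a = noise_samples(seed=42)
b = noise_samples(seed=42)
print(a == b)

# No seed: each call is expected to give a different sequence, which would
# show up as run-to-run differences rather than a systematic py2/py3 offset.
c = noise_samples()
d = noise_samples()
print(c == d)
```

Using a private generator object rather than the module-level functions also keeps the sequence independent of whatever other code has drawn from the shared global generator.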

price · October 10, 2016, 11:47pm

Another possibility is something relying on global state, similar to DM-7040.

timj · October 11, 2016, 10:57pm

I wouldn’t necessarily expect that to cause a behavioral change between python3 and python2 though. Do you have a theory?

Adding the __future__ division import to the files has the advantage that things should then break on Python 2 in the same way they do on Python 3.

price · October 11, 2016, 11:19pm

No. It would cause a context-dependent difference, but I can’t think of anything immediately that would cause differences across python versions. Sorry for the noise.

mrawls · October 15, 2016, 2:27am

Quick update: after a closer look, it is clear that the background image is causing the differences between py2 and py3. Since the background model is generated in meas_algorithms and not in pipe_tasks, I will be finishing this py3 port ticket shortly and opening new ones to address the underlying issue. I will post a more thorough explanation of what we discovered in another Community thread. (UPDATE: that thread is now up.)