Change of timeout behavior of CommandLineTask (and its children)

Recently, several people (myself included) have had Python multiprocessing timeout errors when processing large amounts of data (multiple visits, CCDs, etc.) with some -j option. I have tracked this behavior down to how we set the timeout value within CommandLineTask. Essentially, if a timeout is not specified to the argument parser, a default value of 9999 seconds is used. The processing pool then goes about its job of processing data, and a method is started that will eventually fetch the pool's results. This method is given the timeout value and begins counting down as soon as it starts. If the timeout is reached before the results of the pool's processing are available, a timeout error is raised (this all happens within the multiprocessing module).
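
To make the mechanism concrete, here is a minimal sketch of that pattern; it is not the actual TaskRunner code, and process_one, run_all, and the argument names are invented for illustration:

```python
import multiprocessing

DEFAULT_TIMEOUT = 9999  # seconds; the default the argument parser currently sets


def process_one(data_id):
    """Stand-in for running the task on a single data reference."""
    return data_id


def run_all(target_list, num_processes, timeout=DEFAULT_TIMEOUT):
    pool = multiprocessing.Pool(processes=num_processes)
    # get() blocks for at most `timeout` seconds in total, not per item, and
    # raises multiprocessing.TimeoutError if the results are not all back in
    # time -- a long target list can exceed it even when nothing is wrong.
    results = pool.map_async(process_one, target_list).get(timeout)
    pool.close()
    pool.join()
    return results
```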

The problem in our case arises from the fact that some processing jobs may take longer than the default 9999 s. We have two options to address this issue. One option is to scale the input timeout by the number of elements to process divided by the number of processors available. This would ensure that there is still a timeout, but one that scales with the workload. If we go this route, the documentation should be updated to indicate that the user-supplied timeout applies to a single unit of work and will be scaled by the workload. The other option is to not set a default timeout value at all: when None is passed as the timeout, the process is allowed to run forever. This has the benefit of never underestimating the time a workload will take to complete, but the drawback that something could potentially go wrong and the process would never indicate to the user that it should be killed. If we go with unlimited runtime, we should still change the user-supplied timeout to be scaled by the amount of work the user has requested.
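
For concreteness, the first option could look something like the sketch below; scaled_timeout and its arguments are hypothetical names, not anything that exists in pipe_base today:

```python
import math


def scaled_timeout(per_item_timeout, num_targets, num_processes):
    """Scale a per-item timeout by the number of passes each process must
    make over its share of the targets, so the total allowance grows with
    the workload. Hypothetical helper, not an existing pipe_base function."""
    passes = int(math.ceil(float(num_targets) / max(num_processes, 1)))
    return per_item_timeout * passes
```

The fetch in the earlier sketch would then become something like `.get(scaled_timeout(timeout, len(target_list), num_processes))`.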

Are there any preferences on which route we take? Or are there any alternative ideas?

The timeout is required due to a bug in Python (if there's no timeout, then you can't Ctrl-C interrupt properly, as I recall).
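
For anyone who hasn't run into this, here is a small self-contained illustration of the workaround as I understand it (the task function and the numbers are made up): on Python 2, result.get() with no timeout can block in a way that never delivers the KeyboardInterrupt, whereas passing a timeout, even an enormous one, keeps Ctrl-C working.

```python
import multiprocessing


def work(x):
    return x * x


if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=2)
    result = pool.map_async(work, range(100))
    try:
        # Passing a timeout (even a huge one) keeps Ctrl-C deliverable on
        # Python 2; result.get() with no timeout is the call that can hang
        # uninterruptibly.
        values = result.get(9999999)
    except KeyboardInterrupt:
        pool.terminate()
        pool.join()
        raise
    pool.close()
    pool.join()
    print(len(values))
```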

The timeout is currently set to 9999 sec because there were (unsubstantiated) hints on Stack Overflow that setting it too high caused some performance problems. I haven't seen any problems, and we've chunked everything up sufficiently that small inefficiencies shouldn't be noticed. I suggest we increase the timeout substantially.

Losing the ability to do Ctrl-C is definitely something we want to avoid. Unless we can verify that’s not a problem, it sounds like we may not want to set the timeout to None.

I do think we should set it to scale with the amount of work to be done rather than just set it to a large number. It’s large jobs dying unexpectedly that’s particularly annoying for people, since they’re the biggest pain to restart.

I think any clever scaling algorithm will still catch someone eventually. Better to make it None (if and only if we can preserve Ctrl-C) or something ridiculous like a year.

The relevant bug in Python is 8296 (there are others, linked from there, which are relevant). This is fixed in Python 3.3+, but not in 2.7 and there’s no sign that the fix will be back-ported.

From a quick squint at the code, there’s no obvious reason that a bigger timeout value should make things significantly slower than a small one. I agree with @ctslater that simplicity is preferable to cleverness unless the latter is really required. Therefore, the idea of just adding a few 9s to the timeout seems preferable.