Intermittent "database is locked" error when running AlardLuptonSubtractTask in v29_2_0

Dear LSST Science Pipelines Team,

I’m a member of the Wide Field Survey Telescope (WFST) data reduction pipeline software team. We are currently upgrading our obs_wfst package from v24_0_0 to v29_2_0. So far, we have successfully passed tests for calibration construction and single-frame processing.

However, we are encountering intermittent failures during the image subtraction step using lsst.ip.diffim.subtractImages.AlardLuptonSubtractTask. Some tasks fail with the following error:

{
	"name":"lsst.ctrl.mpexec.singleQuantumExecutor",
	"asctime":"2025-09-25T02:29:46.123526Z",
	"message":"Execution of task 'subtractImages' on quantum {instrument: 'WFC', detector: 18, visit: 118115, band: 'u', day_obs: 20250222, physical_filter: 'WFC-u'} failed. Exception OperationalError: (sqlite3.OperationalError) database is locked\n[SQL: BEGIN IMMEDIATE]\n(Background on this error at: https://sqlalche.me/e/20/e3q8)",
	"levelno":40,
	"levelname":"ERROR",
	"filename":"singleQuantumExecutor.py",
	"pathname":"/data/public/lsst_stack_v29_2_0/conda/envs/lsst-scipipe-10.1.0/share/eups/Linux64/ctrl_mpexec/ge10c2aeecd+d8b3cefe0c/python/lsst/ctrl/mpexec/singleQuantumExecutor.py",
	"lineno":290,
	"funcName":"_execute",
	"process":2345353,
	"processName":"task-{instrument: 'WFC', detector: 18, visit: 118115, band: 'u', day_obs: 20250222, physical_filter: 'WFC-u'}",
	"MDC": {
		"LABEL":"subtractImages:{instrument: 'WFC', detector: 18, visit: 118115, band: 'u', day_obs: 20250222, physical_filter: 'WFC-u'}",
		"RUN":"WFC/runs/ap/20250925T021706Z"
	}
}

The issue appears to be intermittent — we ran the subtraction step across 36 parallel processes three times:

  • First run: all succeeded
  • Second run: 3 out of 36 failed with the “database is locked” error
  • Third run: all succeeded again

This suggests the failure is not deterministic and may be related to concurrent access or file locking.

We would like to ask:

  1. What is the recommended way to completely avoid this SQLite locking issue (i.e., achieve 100% success rate) in v29_2_0?
  2. Could this type of error potentially occur in earlier pipeline stages such as calibration construction, single-frame processing, or coaddition?

Thank you for your support and guidance.

Best regards,
Minxuan Cai
WFST Data Reduction Pipeline Team

It sounds like you are using either pipetask run -j N with a large N, or many pipetask run commands in parallel, against an SQLite registry. SQLite does not scale to that kind of concurrency, and a large N runs into exactly the locking problems you have found. What we do is use the quantum-backed butler approach, where the quanta are executed without any interaction with the registry (using only the quantum graph itself), and then, when all the processing is complete, we register the results with butler transfer-from-graph. To simplify this work we use the BPS infrastructure with PanDA/HTCondor/Parsl for massive parallelization. Additionally, we are currently rewriting part of this to support provenance gathering in BPS.

Postgres might help in the short term, but for large-scale batch processing you cannot have each quantum hammering the database.

You may want to consider either running pipetask run-qbb yourself (and the init step first) or else looking into BPS.
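
In case it is useful, here is a minimal sketch of what driving the quantum-backed butler by hand can look like. The repository path, collections, pipeline file and data query below are placeholders for your own values, and the exact options (in particular the spelling of the init subcommand, and whether the graph needs datastore records included) should be checked against pipetask --help for your stack version:

  # Build the quantum graph once, up front, and save it to a file.
  # Depending on the version you may need an extra option so the graph carries
  # the datastore records that run-qbb needs; see pipetask qgraph --help.
  pipetask qgraph -b /path/to/repo -i WFC/defaults -o WFC/runs/ap \
      -p my_ap_pipeline.yaml -d "visit = 118115" \
      --save-qgraph subtraction.qgraph

  # One-time init step: set up the output run and dataset types in the registry.
  pipetask pre-exec-init-qbb /path/to/repo subtraction.qgraph

  # Execute the quanta against the quantum-backed butler; no registry access here.
  pipetask run-qbb -j 36 /path/to/repo subtraction.qgraph

  # Once everything has finished, register the outputs back into the registry.
  butler transfer-from-graph subtraction.qgraph /path/to/repo \
      --register-dataset-types --update-output-chain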


Thank you very much for your guidance!

Following your suggestion, we have changed our Butler registry from SQLite to PostgreSQL. After making this change, we re-ran the apPipe task (across 36 parallel processes) five times consecutively, and no “database is locked” errors occurred in any run.
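
For anyone else who hits this: the switch essentially amounts to pointing the repository's registry at a PostgreSQL connection string instead of the default SQLite file. A rough sketch of one way to set this up for a new repository (the host, database name and user below are placeholders, and the exact config keys are worth double-checking against the daf_butler documentation):

  # seed.yaml: registry section pointing at PostgreSQL instead of SQLite
  cat > seed.yaml <<'EOF'
  registry:
    db: postgresql+psycopg2://butler_user@dbhost.example.org:5432/wfst_butler
  EOF

  # Create the repository from that seed configuration
  butler create --seed-config seed.yaml /path/to/new/repo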

This confirms that PostgreSQL effectively resolves the immediate concurrency issue in our current processing setup.

Thanks again for your support!

Great to hear. Once you get into hundreds of quanta being processed at the same time across multiple nodes, you will definitely need to use BPS, since Postgres won’t handle the load (and BPS is designed for that scale).
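
When you do make that move, the entry point is a small submission YAML handed to bps submit. A very rough, purely illustrative sketch (the pipeline path, collections and data query are placeholders, and the exact field names and required WMS plugin settings should be taken from the ctrl_bps documentation for your version):

  # submit.yaml (illustrative): describe the payload and pipeline for BPS
  cat > submit.yaml <<'EOF'
  pipelineYaml: my_ap_pipeline.yaml
  payload:
    payloadName: wfst_ap
    butlerConfig: /path/to/repo
    inCollection: WFC/defaults
    dataQuery: "visit > 118000"
  EOF

  # Submit the workflow; BPS builds the quantum graph, runs the quanta with the
  # quantum-backed butler, and transfers the results back when processing is done.
  bps submit submit.yaml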


Thank you for the important reminder — we’re now planning our transition to BPS for future large-scale processing.