Hi Folks!
I’ve been working to pass some archival data through ap_pipe and have completed all of the steps without issue… except for diaPipe. The problem I’ve been running into is that, if I run the diaPipe task on multiple cores via pipetask run -j N, or split it across some number of cores via bps submit, several of the jobs fail and throw sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) database is locked.
Running through the traceback (which I’ve included below), what I think is happening is that multiple jobs are trying to write to the apdb at once and, for some jobs, their wait for the database to unlock exceeds the default sqlalchemy timeout. I’ve confirmed that this is the case by running diaPipe with a single job, which completes without throwing any exceptions…
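For reference, here is a minimal standalone sketch of what I think is going on, with nothing pipeline-specific in it (the file name, table schema, and timeouts below are just placeholders, not the real apdb):

```python
import os
import sqlite3
import tempfile

# Toy reproduction of the lock behaviour I think the parallel diaPipe jobs are
# hitting: one connection holds the write lock while a second tries to write
# with a short busy-timeout, and the second fails with "database is locked".
path = os.path.join(tempfile.mkdtemp(), "toy_apdb.db")

setup = sqlite3.connect(path)
setup.execute("CREATE TABLE DiaInsertId (insert_id TEXT, insert_time TEXT)")
setup.commit()
setup.close()

# "Job A" opens a write transaction and keeps it open (as if mid-store).
writer = sqlite3.connect(path, isolation_level=None)
writer.execute("BEGIN IMMEDIATE")
writer.execute("INSERT INTO DiaInsertId VALUES ('job-a', '2017-08-25 00:03:42')")

# "Job B" only waits ~0.1 s for the lock before giving up.
blocked = sqlite3.connect(path, timeout=0.1)
try:
    blocked.execute("INSERT INTO DiaInsertId VALUES ('job-b', '2017-08-25 00:03:43')")
except sqlite3.OperationalError as err:
    print(err)  # -> database is locked

writer.execute("ROLLBACK")  # release the lock
```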
But it gets a little stranger: when I increase the timeout (e.g. passing -c diaPipe.connection_timeout=30 with pipetask/bps), a similar number of jobs still fail due to being locked out of the database. I’ve also tried passing this config option to the call that creates the apdb (e.g. make_apdb.py -c connection_timeout=30 -c db_url="sqlite:///apdb.db"), which also does not resolve the issue. I’ve also tried explicitly setting the isolation level to different values, which has not helped.
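For context, my (possibly wrong) mental model of what connection_timeout should do for an sqlite apdb is roughly the following, i.e. it ends up as the sqlite3 driver’s busy-timeout. This is just an illustration of that assumption, not the actual dax_apdb code:

```python
import sqlalchemy

# Assumption on my part: for an sqlite apdb, connection_timeout should end up
# as the sqlite3 driver's "timeout" connect argument, i.e. the number of
# seconds a connection keeps retrying the lock before raising
# "database is locked".
engine = sqlalchemy.create_engine(
    "sqlite:///apdb.db",
    connect_args={"timeout": 30},
)

with engine.connect() as conn:
    conn.execute(sqlalchemy.text("SELECT 1"))
```

Which is why I expected raising it to 30 seconds to make the lock errors go away.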
One potential fix that I’ve found, which I don’t think is a very good solution, is to update my bps_config.yaml to include retries: N, which forces jobs to retry N times before being considered a failure. This works, and in principle I think there is an option to force jobs to retry until a specific error code is thrown (buried here in generic_workflow.py)… but I think it should be possible to get this to run without these issues. At this point I’m wondering whether there is some hidden config option I can tweak, or whether something about my institution’s sqlite install is causing this issue.
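In case the problem really is on the install side, this is the sort of quick check I can run against the local SQLite build and the apdb file itself (the pragmas below are just the ones I know to look at; the file name is the one from my make_apdb.py call above):

```python
import sqlite3

# Quick look at the local SQLite build and at the apdb file, in case the
# locking behaviour comes from the install/filesystem rather than the pipeline.
print("sqlite library version:", sqlite3.sqlite_version)

con = sqlite3.connect("apdb.db")
print("journal_mode:", con.execute("PRAGMA journal_mode").fetchone()[0])
print("busy_timeout (ms):", con.execute("PRAGMA busy_timeout").fetchone()[0])
print("synchronous:", con.execute("PRAGMA synchronous").fetchone()[0])
con.close()
```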
Anyway, here is the full traceback. Thanks for the help!
INFO 2024-07-30T11:12:28.506-04:00 lsst.ctrl.mpexec.cmdLineFwk ()(cmdLineFwk.py:854) - QuantumGraph contains 1 quantum for 1 task, graph ID: '1722350728.215042-644873'
Quanta Tasks
------ -------
1 diaPipe
INFO 2024-07-30T11:12:28.574-04:00 lsst.ctrl.mpexec.singleQuantumExecutor (diaPipe:{instrument: 'DECam', detector: 21, visit: 671494, ...})(singleQuantumExecutor.py:198) - Preparing execution of quantum for label=diaPipe dataId={instrument: 'DECam', detector: 21, visit: 671494, ...}.
INFO 2024-07-30T11:12:28.606-04:00 lsst.ctrl.mpexec.singleQuantumExecutor (diaPipe:{instrument: 'DECam', detector: 21, visit: 671494, ...})(singleQuantumExecutor.py:250) - Constructing task and executing quantum for label=diaPipe dataId={instrument: 'DECam', detector: 21, visit: 671494, ...}.
INFO 2024-07-30T11:12:39.853-04:00 lsst.diaPipe.diaCalculation (diaPipe:{instrument: 'DECam', detector: 21, visit: 671494, ...})(diaCalculation.py:389) - Calculating summary stats for 138 DiaObjects
INFO 2024-07-30T11:12:41.611-04:00 lsst.diaPipe.diaForcedSource.forcedMeasurement (diaPipe:{instrument: 'DECam', detector: 21, visit: 671494, ...})(forcedMeasurement.py:336) - Performing forced measurement on 253 sources
INFO 2024-07-30T11:12:41.641-04:00 lsst.diaPipe.diaForcedSource.forcedMeasurement (diaPipe:{instrument: 'DECam', detector: 21, visit: 671494, ...})(forcedMeasurement.py:336) - Performing forced measurement on 253 sources
ERROR 2024-07-30T11:12:47.538-04:00 lsst.ctrl.mpexec.singleQuantumExecutor (diaPipe:{instrument: 'DECam', detector: 21, visit: 671494, ...})(singleQuantumExecutor.py:260) - Execution of task 'diaPipe' on quantum {instrument: 'DECam', detector: 21, visit: 671494, ...} failed. Exception OperationalError: (sqlite3.OperationalError) database is locked
[SQL: INSERT INTO "DiaInsertId" (insert_id, insert_time) VALUES (?, ?)]
[parameters: ('bb5b2df9546a4958b1197bfb66d7c38b', '2017-08-25 00:03:42.331991')]
(Background on this error at: https://sqlalche.me/e/20/e3q8)
ERROR 2024-07-30T11:12:47.558-04:00 lsst.ctrl.mpexec.mpGraphExecutor ()(mpGraphExecutor.py:503) - Task <TaskDef(lsst.ap.association.diaPipe.DiaPipelineTask, label=diaPipe) dataId={instrument: 'DECam', detector: 21, visit: 671494, ...}> failed; processing will continue for remaining tasks.
Traceback (most recent call last):
File "/oscar/data/idellant/englert_newPipelineDev/lsst_installations/lsst_stack_v26_0_0/conda/envs/lsst-scipipe-7.0.1/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1969, in _exec_single_context
self.dialect.do_execute(
File "/oscar/data/idellant/englert_newPipelineDev/lsst_installations/lsst_stack_v26_0_0/conda/envs/lsst-scipipe-7.0.1/lib/python3.11/site-packages/sqlalchemy/engine/default.py", line 922, in do_execute
cursor.execute(statement, parameters)
sqlite3.OperationalError: database is locked
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/oscar/data/idellant/englert_newPipelineDev/lsst_installations/lsst_stack_v26_0_0/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/ctrl_mpexec/g218a3a8f53+ca4789321c/python/lsst/ctrl/mpexec/mpGraphExecutor.py", line 473, in _executeQuantaInProcess
self.quantumExecutor.execute(qnode.taskDef, qnode.quantum)
File "/oscar/data/idellant/englert_newPipelineDev/lsst_installations/lsst_stack_v26_0_0/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/ctrl_mpexec/g218a3a8f53+ca4789321c/python/lsst/ctrl/mpexec/singleQuantumExecutor.py", line 161, in execute
result = self._execute(taskDef, quantum)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/oscar/data/idellant/englert_newPipelineDev/lsst_installations/lsst_stack_v26_0_0/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/ctrl_mpexec/g218a3a8f53+ca4789321c/python/lsst/ctrl/mpexec/singleQuantumExecutor.py", line 258, in _execute
self.runQuantum(task, quantum, taskDef, limited_butler)
File "/oscar/data/idellant/englert_newPipelineDev/lsst_installations/lsst_stack_v26_0_0/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/ctrl_mpexec/g218a3a8f53+ca4789321c/python/lsst/ctrl/mpexec/singleQuantumExecutor.py", line 460, in runQuantum
task.runQuantum(butlerQC, inputRefs, outputRefs)
File "/oscar/data/idellant/englert_newPipelineDev/lsst_installations/lsst_stack_v26_0_0/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/ap_association/g6758148a95+45f81d9659/python/lsst/ap/association/diaPipe.py", line 310, in runQuantum
outputs = self.run(**inputs)
^^^^^^^^^^^^^^^^^^
File "/oscar/data/idellant/englert_newPipelineDev/lsst_installations/lsst_stack_v26_0_0/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/utils/g34fdc7a0c9+81bc2a20b4/python/lsst/utils/timer.py", line 295, in timeMethod_wrapper
res = func(self, *args, **keyArgs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/oscar/data/idellant/englert_newPipelineDev/lsst_installations/lsst_stack_v26_0_0/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/ap_association/g6758148a95+45f81d9659/python/lsst/ap/association/diaPipe.py", line 470, in run
self.apdb.store(
File "/oscar/data/idellant/englert_newPipelineDev/lsst_installations/lsst_stack_v26_0_0/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/dax_apdb/g4c90b7bd52+5145c320d2/python/lsst/dax/apdb/apdbSql.py", line 499, in store
self._storeInsertId(insert_id, visit_time, connection)
File "/oscar/data/idellant/englert_newPipelineDev/lsst_installations/lsst_stack_v26_0_0/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/dax_apdb/g4c90b7bd52+5145c320d2/python/lsst/dax/apdb/apdbSql.py", line 716, in _storeInsertId
connection.execute(stmt)
File "/oscar/data/idellant/englert_newPipelineDev/lsst_installations/lsst_stack_v26_0_0/conda/envs/lsst-scipipe-7.0.1/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1416, in execute
return meth(
^^^^^
File "/oscar/data/idellant/englert_newPipelineDev/lsst_installations/lsst_stack_v26_0_0/conda/envs/lsst-scipipe-7.0.1/lib/python3.11/site-packages/sqlalchemy/sql/elements.py", line 517, in _execute_on_connection
return connection._execute_clauseelement(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/oscar/data/idellant/englert_newPipelineDev/lsst_installations/lsst_stack_v26_0_0/conda/envs/lsst-scipipe-7.0.1/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1639, in _execute_clauseelement
ret = self._execute_context(
^^^^^^^^^^^^^^^^^^^^^^
File "/oscar/data/idellant/englert_newPipelineDev/lsst_installations/lsst_stack_v26_0_0/conda/envs/lsst-scipipe-7.0.1/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1848, in _execute_context
return self._exec_single_context(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/oscar/data/idellant/englert_newPipelineDev/lsst_installations/lsst_stack_v26_0_0/conda/envs/lsst-scipipe-7.0.1/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1988, in _exec_single_context
self._handle_dbapi_exception(
File "/oscar/data/idellant/englert_newPipelineDev/lsst_installations/lsst_stack_v26_0_0/conda/envs/lsst-scipipe-7.0.1/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 2344, in _handle_dbapi_exception
raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
File "/oscar/data/idellant/englert_newPipelineDev/lsst_installations/lsst_stack_v26_0_0/conda/envs/lsst-scipipe-7.0.1/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1969, in _exec_single_context
self.dialect.do_execute(
File "/oscar/data/idellant/englert_newPipelineDev/lsst_installations/lsst_stack_v26_0_0/conda/envs/lsst-scipipe-7.0.1/lib/python3.11/site-packages/sqlalchemy/engine/default.py", line 922, in do_execute
cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) database is locked
[SQL: INSERT INTO "DiaInsertId" (insert_id, insert_time) VALUES (?, ?)]
[parameters: ('bb5b2df9546a4958b1197bfb66d7c38b', '2017-08-25 00:03:42.331991')]
(Background on this error at: https://sqlalche.me/e/20/e3q8)
INFO 2024-07-30T11:12:48.239-04:00 lsst.ctrl.mpexec.mpGraphExecutor ()(mpGraphExecutor.py:519) - Executed 0 quanta successfully, 1 failed and 0 remain out of total 1 quanta.