diaPipe Database Locking with bps/pipetask

Hi Folks!

I’ve been working to pass some archival data through ap_pipe and have completed all of the steps without issue… except for diaPipe. The problem I’ve been running into is that, if I run the diaPipe task via pipetask run -j N or split it across some number of cores via bps submit, several of the jobs fail and throw sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) database is locked.

Running through the traceback (which I’ve included below), what I think is happening is that multiple jobs are trying to write to the apdb at once and, for some jobs, the wait for the database to unlock exceeds the SQLAlchemy default timeout. I’ve confirmed that this is the case by running diaPipe with a single job, which completes without throwing any exceptions…

But it gets a little stranger: when I increase the queue time (e.g. passing -c diaPipe.connection_timeout=30 with pipetask/bps), a similar number of jobs still fail due to being locked out of the database. I’ve also tried passing this config option to the line that creates the apdb (e.g. make_apdb.py -c connection_timeout=30 -c db_url="sqlite:///apdb.db"), which does not resolve the issue either. I’ve also tried explicitly setting the isolation level to different values, which has not helped.
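For what it’s worth, the failure mode itself can be reproduced outside the stack with plain sqlite3: SQLite allows only one writer at a time, and any other connection whose busy timeout expires while the write lock is held raises exactly this error. A minimal sketch (file and table names are made up, not from the apdb schema):

```python
import os
import sqlite3
import tempfile

# Throwaway database file standing in for apdb.db.
path = os.path.join(tempfile.mkdtemp(), "apdb_demo.db")

# "Job 1": open a connection in autocommit mode and hold the write lock.
writer = sqlite3.connect(path, isolation_level=None)
writer.execute("CREATE TABLE dia (x INTEGER)")
writer.execute("BEGIN IMMEDIATE")  # acquire SQLite's single write lock
writer.execute("INSERT INTO dia VALUES (1)")

# "Job 2": a second connection whose busy timeout (0.1 s) expires while
# the lock is still held, producing the same OperationalError as above.
other = sqlite3.connect(path, timeout=0.1)
err = None
try:
    other.execute("INSERT INTO dia VALUES (2)")
except sqlite3.OperationalError as exc:
    err = exc

print(err)  # database is locked

writer.execute("ROLLBACK")
other.close()
writer.close()
```

This is also why raising the timeout only helps up to a point: with enough concurrent writers, some job will always be unlucky enough to wait past whatever limit is set.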

One potential fix that I’ve found, which I don’t think is a very good solution, is to update my bps_config.yaml to include retries: N, which makes jobs retry N times before being considered failures. This works, and in principle I think there is an option to force jobs to retry until a specific error code is thrown (buried here in generic_workflow.py)… but I think it should be possible to get this to run without these issues. At this point I’m wondering whether there is some hidden config option I can tweak, or whether something about my institution’s sqlite install is causing this issue.
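For reference, the retry workaround is just a one-line addition to bps_config.yaml (N = 3 here is an arbitrary choice on my part):

```yaml
# Resubmit each failed job up to 3 times before bps counts it as failed.
retries: 3
```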

Anyway, here is the full traceback. Thanks for the help!

INFO 2024-07-30T11:12:28.506-04:00 lsst.ctrl.mpexec.cmdLineFwk ()(cmdLineFwk.py:854) - QuantumGraph contains 1 quantum for 1 task, graph ID: '1722350728.215042-644873'
Quanta  Tasks 
------ -------
     1 diaPipe
INFO 2024-07-30T11:12:28.574-04:00 lsst.ctrl.mpexec.singleQuantumExecutor (diaPipe:{instrument: 'DECam', detector: 21, visit: 671494, ...})(singleQuantumExecutor.py:198) - Preparing execution of quantum for label=diaPipe dataId={instrument: 'DECam', detector: 21, visit: 671494, ...}.
INFO 2024-07-30T11:12:28.606-04:00 lsst.ctrl.mpexec.singleQuantumExecutor (diaPipe:{instrument: 'DECam', detector: 21, visit: 671494, ...})(singleQuantumExecutor.py:250) - Constructing task and executing quantum for label=diaPipe dataId={instrument: 'DECam', detector: 21, visit: 671494, ...}.
INFO 2024-07-30T11:12:39.853-04:00 lsst.diaPipe.diaCalculation (diaPipe:{instrument: 'DECam', detector: 21, visit: 671494, ...})(diaCalculation.py:389) - Calculating summary stats for 138 DiaObjects
INFO 2024-07-30T11:12:41.611-04:00 lsst.diaPipe.diaForcedSource.forcedMeasurement (diaPipe:{instrument: 'DECam', detector: 21, visit: 671494, ...})(forcedMeasurement.py:336) - Performing forced measurement on 253 sources
INFO 2024-07-30T11:12:41.641-04:00 lsst.diaPipe.diaForcedSource.forcedMeasurement (diaPipe:{instrument: 'DECam', detector: 21, visit: 671494, ...})(forcedMeasurement.py:336) - Performing forced measurement on 253 sources
ERROR 2024-07-30T11:12:47.538-04:00 lsst.ctrl.mpexec.singleQuantumExecutor (diaPipe:{instrument: 'DECam', detector: 21, visit: 671494, ...})(singleQuantumExecutor.py:260) - Execution of task 'diaPipe' on quantum {instrument: 'DECam', detector: 21, visit: 671494, ...} failed. Exception OperationalError: (sqlite3.OperationalError) database is locked
[SQL: INSERT INTO "DiaInsertId" (insert_id, insert_time) VALUES (?, ?)]
[parameters: ('bb5b2df9546a4958b1197bfb66d7c38b', '2017-08-25 00:03:42.331991')]
(Background on this error at: https://sqlalche.me/e/20/e3q8)
ERROR 2024-07-30T11:12:47.558-04:00 lsst.ctrl.mpexec.mpGraphExecutor ()(mpGraphExecutor.py:503) - Task <TaskDef(lsst.ap.association.diaPipe.DiaPipelineTask, label=diaPipe) dataId={instrument: 'DECam', detector: 21, visit: 671494, ...}> failed; processing will continue for remaining tasks.
Traceback (most recent call last):
  File "/oscar/data/idellant/englert_newPipelineDev/lsst_installations/lsst_stack_v26_0_0/conda/envs/lsst-scipipe-7.0.1/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1969, in _exec_single_context
    self.dialect.do_execute(
  File "/oscar/data/idellant/englert_newPipelineDev/lsst_installations/lsst_stack_v26_0_0/conda/envs/lsst-scipipe-7.0.1/lib/python3.11/site-packages/sqlalchemy/engine/default.py", line 922, in do_execute
    cursor.execute(statement, parameters)
sqlite3.OperationalError: database is locked

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/oscar/data/idellant/englert_newPipelineDev/lsst_installations/lsst_stack_v26_0_0/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/ctrl_mpexec/g218a3a8f53+ca4789321c/python/lsst/ctrl/mpexec/mpGraphExecutor.py", line 473, in _executeQuantaInProcess
    self.quantumExecutor.execute(qnode.taskDef, qnode.quantum)
  File "/oscar/data/idellant/englert_newPipelineDev/lsst_installations/lsst_stack_v26_0_0/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/ctrl_mpexec/g218a3a8f53+ca4789321c/python/lsst/ctrl/mpexec/singleQuantumExecutor.py", line 161, in execute
    result = self._execute(taskDef, quantum)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/oscar/data/idellant/englert_newPipelineDev/lsst_installations/lsst_stack_v26_0_0/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/ctrl_mpexec/g218a3a8f53+ca4789321c/python/lsst/ctrl/mpexec/singleQuantumExecutor.py", line 258, in _execute
    self.runQuantum(task, quantum, taskDef, limited_butler)
  File "/oscar/data/idellant/englert_newPipelineDev/lsst_installations/lsst_stack_v26_0_0/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/ctrl_mpexec/g218a3a8f53+ca4789321c/python/lsst/ctrl/mpexec/singleQuantumExecutor.py", line 460, in runQuantum
    task.runQuantum(butlerQC, inputRefs, outputRefs)
  File "/oscar/data/idellant/englert_newPipelineDev/lsst_installations/lsst_stack_v26_0_0/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/ap_association/g6758148a95+45f81d9659/python/lsst/ap/association/diaPipe.py", line 310, in runQuantum
    outputs = self.run(**inputs)
              ^^^^^^^^^^^^^^^^^^
  File "/oscar/data/idellant/englert_newPipelineDev/lsst_installations/lsst_stack_v26_0_0/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/utils/g34fdc7a0c9+81bc2a20b4/python/lsst/utils/timer.py", line 295, in timeMethod_wrapper
    res = func(self, *args, **keyArgs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/oscar/data/idellant/englert_newPipelineDev/lsst_installations/lsst_stack_v26_0_0/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/ap_association/g6758148a95+45f81d9659/python/lsst/ap/association/diaPipe.py", line 470, in run
    self.apdb.store(
  File "/oscar/data/idellant/englert_newPipelineDev/lsst_installations/lsst_stack_v26_0_0/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/dax_apdb/g4c90b7bd52+5145c320d2/python/lsst/dax/apdb/apdbSql.py", line 499, in store
    self._storeInsertId(insert_id, visit_time, connection)
  File "/oscar/data/idellant/englert_newPipelineDev/lsst_installations/lsst_stack_v26_0_0/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/dax_apdb/g4c90b7bd52+5145c320d2/python/lsst/dax/apdb/apdbSql.py", line 716, in _storeInsertId
    connection.execute(stmt)
  File "/oscar/data/idellant/englert_newPipelineDev/lsst_installations/lsst_stack_v26_0_0/conda/envs/lsst-scipipe-7.0.1/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1416, in execute
    return meth(
           ^^^^^
  File "/oscar/data/idellant/englert_newPipelineDev/lsst_installations/lsst_stack_v26_0_0/conda/envs/lsst-scipipe-7.0.1/lib/python3.11/site-packages/sqlalchemy/sql/elements.py", line 517, in _execute_on_connection
    return connection._execute_clauseelement(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/oscar/data/idellant/englert_newPipelineDev/lsst_installations/lsst_stack_v26_0_0/conda/envs/lsst-scipipe-7.0.1/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1639, in _execute_clauseelement
    ret = self._execute_context(
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/oscar/data/idellant/englert_newPipelineDev/lsst_installations/lsst_stack_v26_0_0/conda/envs/lsst-scipipe-7.0.1/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1848, in _execute_context
    return self._exec_single_context(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/oscar/data/idellant/englert_newPipelineDev/lsst_installations/lsst_stack_v26_0_0/conda/envs/lsst-scipipe-7.0.1/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1988, in _exec_single_context
    self._handle_dbapi_exception(
  File "/oscar/data/idellant/englert_newPipelineDev/lsst_installations/lsst_stack_v26_0_0/conda/envs/lsst-scipipe-7.0.1/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 2344, in _handle_dbapi_exception
    raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
  File "/oscar/data/idellant/englert_newPipelineDev/lsst_installations/lsst_stack_v26_0_0/conda/envs/lsst-scipipe-7.0.1/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1969, in _exec_single_context
    self.dialect.do_execute(
  File "/oscar/data/idellant/englert_newPipelineDev/lsst_installations/lsst_stack_v26_0_0/conda/envs/lsst-scipipe-7.0.1/lib/python3.11/site-packages/sqlalchemy/engine/default.py", line 922, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) database is locked
[SQL: INSERT INTO "DiaInsertId" (insert_id, insert_time) VALUES (?, ?)]
[parameters: ('bb5b2df9546a4958b1197bfb66d7c38b', '2017-08-25 00:03:42.331991')]
(Background on this error at: https://sqlalche.me/e/20/e3q8)
INFO 2024-07-30T11:12:48.239-04:00 lsst.ctrl.mpexec.mpGraphExecutor ()(mpGraphExecutor.py:519) - Executed 0 quanta successfully, 1 failed and 0 remain out of total 1 quanta.

Yes, a SQLite database will not be able to handle running diaPipe with a large number of visits or a large number of concurrent processes. If you have a few hundred ccd-visits or fewer you can probably continue to use SQLite with -j 1 for diaPipe, but if you have more data or want to use more processes you will need to use Postgres. We don’t have documentation on how to set up your own Postgres server, though if you are doing development work for Rubin and have USDF credentials you can use ours: Accessing the APDB in the USDF - Data Management - Confluence
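(Following up on the Postgres suggestion: assuming the same db_url config option accepts a Postgres connection URL, repointing the APDB at a server should in principle just be a matter of swapping the URL when creating it. The hostname, username, and database name below are placeholders, not real USDF values:)

```shell
# Hypothetical: create the APDB on a Postgres server instead of a local
# SQLite file; host/user/database names here are placeholders.
make_apdb.py -c db_url="postgresql://username@pg-host.example.edu/apdb"
```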


Got it… I was afraid of that. I’m part of a group running a survey, and we have lots of archival data that we can search through for transients… unfortunately, I’ve run into some administrative issues when it comes to setting up Postgres on my institution’s HPC.

But, since this is the second time Postgres has been the recommended solution, I’ll sit down and figure out how to get it running properly (I know there is a way to do it; it’ll just take a bit of time to get running).

Thanks for the help!