Hello,
I am running into a strange set of errors when using an object store as my Butler datastore. See, for example, these error messages that occur when running a pipeline with parallelism (pipetask run -j 24 ...) on a single node:
ERROR 2023-11-14T19:16:26.460-08:00 lsst.ctrl.mpexec.singleQuantumExecutor (selectGoodSeeingVisits:{band: 'VR', instrument: 'DECam', skymap: 'discrete', tract: 6, patch: 76149})(singleQuantumExecutor.py:266) - Execution of task 'selectGoodSeeingVisits' on quantum {band: 'VR', instrument: 'DECam', skymap: 'discrete', tract: 6, patch: 76149} failed. Exception RuntimeError: Integrity failure in Datastore. Size of file s3://repo/skymaps/skyMap/skyMap_discrete_skymaps.pickle (54904320) does not match size recorded in registry of 1569
Process task-{band: 'VR', instrument: 'DECam', skymap: 'discrete', tract: 6, patch: 76149}:
Traceback (most recent call last):
File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/ctrl_mpexec/g76ae3ab134+f0199d472f/python/lsst/ctrl/mpexec/mpGraphExecutor.py", line 168, in _executeJob
quantumExecutor.execute(taskDef, quantum)
File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/ctrl_mpexec/g76ae3ab134+f0199d472f/python/lsst/ctrl/mpexec/singleQuantumExecutor.py", line 167, in execute
result = self._execute(taskDef, quantum)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/ctrl_mpexec/g76ae3ab134+f0199d472f/python/lsst/ctrl/mpexec/singleQuantumExecutor.py", line 264, in _execute
self.runQuantum(task, quantum, taskDef, limited_butler)
File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/ctrl_mpexec/g76ae3ab134+f0199d472f/python/lsst/ctrl/mpexec/singleQuantumExecutor.py", line 466, in runQuantum
task.runQuantum(butlerQC, inputRefs, outputRefs)
File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/pipe_tasks/ge37a0ae47b+91d6b12347/python/lsst/pipe/tasks/selectImages.py", line 458, in runQuantum
inputs = butlerQC.get(inputRefs)
^^^^^^^^^^^^^^^^^^^^^^^
File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/pipe_base/g655761b648+94da4844e8/python/lsst/pipe/base/_quantumContext.py", line 295, in get
val = self._get(ref)
^^^^^^^^^^^^^^
File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/pipe_base/g655761b648+94da4844e8/python/lsst/pipe/base/_quantumContext.py", line 221, in _get
return self.__butler.get(ref)
^^^^^^^^^^^^^^^^^^^^^^
File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/daf_butler/ga1d28be6d8+a9e5a04819/python/lsst/daf/butler/_butler.py", line 1428, in get
return self._datastore.get(ref, parameters=parameters, storageClass=storageClass)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/daf_butler/ga1d28be6d8+a9e5a04819/python/lsst/daf/butler/datastores/fileDatastore.py", line 2292, in get
return self._read_artifact_into_memory(getInfo, ref, isComponent=isComponent, cache_ref=cache_ref)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/daf_butler/ga1d28be6d8+a9e5a04819/python/lsst/daf/butler/datastores/fileDatastore.py", line 1279, in _read_artifact_into_memory
raise RuntimeError(
RuntimeError: Integrity failure in Datastore. Size of file s3://repo/skymaps/skyMap/skyMap_discrete_skymaps.pickle (54904320) does not match size recorded in registry of 1569
and
ERROR 2023-11-14T19:16:26.514-08:00 lsst.ctrl.mpexec.singleQuantumExecutor (makeWarp:{instrument: 'DECam', skymap: 'discrete', tract: 6, patch: 76147, visit: 1028879, ...})(singleQuantumExecutor.py:266) - Execution of task 'makeWarp' on quantum {instrument: 'DECam', skymap: 'discrete', tract: 6, patch: 76147, visit: 1028879, ...} failed. Exception RuntimeError: Integrity failure in Datastore. Size of file s3://repo/DEEP/template_testing_3/20231114T185110/step1/20231115T025252Z/visitSummary/20210905/VR/VR_DECam_c0007_6300.0_2600.0/1028879/visitSummary_DECam_VR_VR_DECam_c0007_6300_0_2600_0_1028879_DEEP_template_testing_3_20231114T185110_step1_20231115T025252Z.fits (1569) does not match size recorded in registry of 80640
Process task-{instrument: 'DECam', skymap: 'discrete', tract: 6, patch: 75799, visit: 1028879, ...}:
Traceback (most recent call last):
File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/ctrl_mpexec/g76ae3ab134+f0199d472f/python/lsst/ctrl/mpexec/mpGraphExecutor.py", line 168, in _executeJob
quantumExecutor.execute(taskDef, quantum)
File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/ctrl_mpexec/g76ae3ab134+f0199d472f/python/lsst/ctrl/mpexec/singleQuantumExecutor.py", line 167, in execute
result = self._execute(taskDef, quantum)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/ctrl_mpexec/g76ae3ab134+f0199d472f/python/lsst/ctrl/mpexec/singleQuantumExecutor.py", line 264, in _execute
self.runQuantum(task, quantum, taskDef, limited_butler)
File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/ctrl_mpexec/g76ae3ab134+f0199d472f/python/lsst/ctrl/mpexec/singleQuantumExecutor.py", line 466, in runQuantum
task.runQuantum(butlerQC, inputRefs, outputRefs)
File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/pipe_tasks/ge37a0ae47b+91d6b12347/python/lsst/pipe/tasks/makeWarp.py", line 332, in runQuantum
inputs = butlerQC.get(inputRefs)
^^^^^^^^^^^^^^^^^^^^^^^
File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/pipe_base/g655761b648+94da4844e8/python/lsst/pipe/base/_quantumContext.py", line 295, in get
val = self._get(ref)
^^^^^^^^^^^^^^
File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/pipe_base/g655761b648+94da4844e8/python/lsst/pipe/base/_quantumContext.py", line 221, in _get
return self.__butler.get(ref)
^^^^^^^^^^^^^^^^^^^^^^
File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/daf_butler/ga1d28be6d8+a9e5a04819/python/lsst/daf/butler/_butler.py", line 1428, in get
return self._datastore.get(ref, parameters=parameters, storageClass=storageClass)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/daf_butler/ga1d28be6d8+a9e5a04819/python/lsst/daf/butler/datastores/fileDatastore.py", line 2292, in get
return self._read_artifact_into_memory(getInfo, ref, isComponent=isComponent, cache_ref=cache_ref)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/daf_butler/ga1d28be6d8+a9e5a04819/python/lsst/daf/butler/datastores/fileDatastore.py", line 1279, in _read_artifact_into_memory
raise RuntimeError(
RuntimeError: Integrity failure in Datastore. Size of file s3://repo/DEEP/template_testing_3/20231114T185110/step1/20231115T025252Z/visitSummary/20210905/VR/VR_DECam_c0007_6300.0_2600.0/1028879/visitSummary_DECam_VR_VR_DECam_c0007_6300_0_2600_0_1028879_DEEP_template_testing_3_20231114T185110_step1_20231115T025252Z.fits (1569) does not match size recorded in registry of 80640
It looks like the process that wanted the visitSummary (size 80640) instead got something of size 1569, while another process looking for a skyMap (size 1569) instead got something of size 54904320 (on the order of the size of a calexp). So the process looking for the visitSummary got the skyMap instead, and the process looking for the skyMap got an image instead.
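As a sanity check on the objects at rest, the actual sizes in the bucket can be compared against the registry-recorded sizes directly. A minimal sketch using boto3 (the bucket name "repo" and the two keys are copied from the error messages above, and I'm assuming the endpoint is picked up from S3_ENDPOINT_URL the same way the Butler S3 backend does):

import os
import boto3

# Assumes the same object-store endpoint the datastore uses; credentials come
# from the usual AWS environment variables.
s3 = boto3.client("s3", endpoint_url=os.environ.get("S3_ENDPOINT_URL"))

# Registry-recorded sizes quoted in the exceptions above.
expected = {
    "skymaps/skyMap/skyMap_discrete_skymaps.pickle": 1569,
    "DEEP/template_testing_3/20231114T185110/step1/20231115T025252Z/visitSummary/"
    "20210905/VR/VR_DECam_c0007_6300.0_2600.0/1028879/"
    "visitSummary_DECam_VR_VR_DECam_c0007_6300_0_2600_0_1028879_DEEP_"
    "template_testing_3_20231114T185110_step1_20231115T025252Z.fits": 80640,
}

for key, registry_size in expected.items():
    at_rest = s3.head_object(Bucket="repo", Key=key)["ContentLength"]
    print(f"{key}: at rest {at_rest}, registry {registry_size}")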
I don’t understand what is happening here, as each process should be making a separate TCP connection to the object store – those streams shouldn’t get crossed, right? And each process should have its own instance of the butler, and if the butler is caching datasets to disk, it should be using different temporary directories/filenames, right?
Or is it not the processes getting mixed up, but rather the mapping from refs to cached files getting mixed up inside butlerQC.get(inputRefs)?
Any idea what could be happening here? I haven’t noticed these errors when running without parallelism (pipetask run -j 1), nor when using bps submit. And the errors eventually go away if I re-run the same pipeline several times with pipetask run -j 24.
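For concreteness, the equivalent single-process read looks roughly like this (a minimal sketch; the repo root "s3://repo" and the collection name "skymaps" are placeholders inferred from the paths in the errors, not necessarily the real repo configuration):

from lsst.daf.butler import Butler

# Placeholders: point these at the real repo config and collections.
butler = Butler("s3://repo", collections="skymaps")

# Read the same skyMap dataset that one of the failing quanta was fetching.
sky_map = butler.get("skyMap", skymap="discrete")
print(type(sky_map))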
I am using version w_2023_38 of the pipelines here.
Thanks,
Steven