Mixing up of Butler datasets when requested from an object store

Hello,

I am running into a strange set of errors when using an object store as my Butler datastore. See for example, these error messages that occur when running a pipeline with parallelism (pipetask run -j 24 ...) on a single node:

ERROR 2023-11-14T19:16:26.460-08:00 lsst.ctrl.mpexec.singleQuantumExecutor (selectGoodSeeingVisits:{band: 'VR', instrument: 'DECam', skymap: 'discrete', tract: 6, patch: 76149})(singleQuantumExecutor.py:266) - Execution of task 'selectGoodSeeingVisits' on quantum {band: 'VR', instrument: 'DECam', skymap: 'discrete', tract: 6, patch: 76149} failed. Exception RuntimeError: Integrity failure in Datastore. Size of file s3://repo/skymaps/skyMap/skyMap_discrete_skymaps.pickle (54904320) does not match size recorded in registry of 1569
Process task-{band: 'VR', instrument: 'DECam', skymap: 'discrete', tract: 6, patch: 76149}:
Traceback (most recent call last):
  File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/ctrl_mpexec/g76ae3ab134+f0199d472f/python/lsst/ctrl/mpexec/mpGraphExecutor.py", line 168, in _executeJob
    quantumExecutor.execute(taskDef, quantum)
  File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/ctrl_mpexec/g76ae3ab134+f0199d472f/python/lsst/ctrl/mpexec/singleQuantumExecutor.py", line 167, in execute
    result = self._execute(taskDef, quantum)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/ctrl_mpexec/g76ae3ab134+f0199d472f/python/lsst/ctrl/mpexec/singleQuantumExecutor.py", line 264, in _execute
    self.runQuantum(task, quantum, taskDef, limited_butler)
  File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/ctrl_mpexec/g76ae3ab134+f0199d472f/python/lsst/ctrl/mpexec/singleQuantumExecutor.py", line 466, in runQuantum
    task.runQuantum(butlerQC, inputRefs, outputRefs)
  File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/pipe_tasks/ge37a0ae47b+91d6b12347/python/lsst/pipe/tasks/selectImages.py", line 458, in runQuantum
    inputs = butlerQC.get(inputRefs)
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/pipe_base/g655761b648+94da4844e8/python/lsst/pipe/base/_quantumContext.py", line 295, in get
    val = self._get(ref)
          ^^^^^^^^^^^^^^
  File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/pipe_base/g655761b648+94da4844e8/python/lsst/pipe/base/_quantumContext.py", line 221, in _get
    return self.__butler.get(ref)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/daf_butler/ga1d28be6d8+a9e5a04819/python/lsst/daf/butler/_butler.py", line 1428, in get
    return self._datastore.get(ref, parameters=parameters, storageClass=storageClass)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/daf_butler/ga1d28be6d8+a9e5a04819/python/lsst/daf/butler/datastores/fileDatastore.py", line 2292, in get
    return self._read_artifact_into_memory(getInfo, ref, isComponent=isComponent, cache_ref=cache_ref)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/daf_butler/ga1d28be6d8+a9e5a04819/python/lsst/daf/butler/datastores/fileDatastore.py", line 1279, in _read_artifact_into_memory
    raise RuntimeError(
RuntimeError: Integrity failure in Datastore. Size of file s3://repo/skymaps/skyMap/skyMap_discrete_skymaps.pickle (54904320) does not match size recorded in registry of 1569

and

ERROR 2023-11-14T19:16:26.514-08:00 lsst.ctrl.mpexec.singleQuantumExecutor (makeWarp:{instrument: 'DECam', skymap: 'discrete', tract: 6, patch: 76147, visit: 1028879, ...})(singleQuantumExecutor.py:266) - Execution of task 'makeWarp' on quantum {instrument: 'DECam', skymap: 'discrete', tract: 6, patch: 76147, visit: 1028879, ...} failed. Exception RuntimeError: Integrity failure in Datastore. Size of file s3://repo/DEEP/template_testing_3/20231114T185110/step1/20231115T025252Z/visitSummary/20210905/VR/VR_DECam_c0007_6300.0_2600.0/1028879/visitSummary_DECam_VR_VR_DECam_c0007_6300_0_2600_0_1028879_DEEP_template_testing_3_20231114T185110_step1_20231115T025252Z.fits (1569) does not match size recorded in registry of 80640
Process task-{instrument: 'DECam', skymap: 'discrete', tract: 6, patch: 75799, visit: 1028879, ...}:
Traceback (most recent call last):
  File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/ctrl_mpexec/g76ae3ab134+f0199d472f/python/lsst/ctrl/mpexec/mpGraphExecutor.py", line 168, in _executeJob
    quantumExecutor.execute(taskDef, quantum)
  File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/ctrl_mpexec/g76ae3ab134+f0199d472f/python/lsst/ctrl/mpexec/singleQuantumExecutor.py", line 167, in execute
    result = self._execute(taskDef, quantum)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/ctrl_mpexec/g76ae3ab134+f0199d472f/python/lsst/ctrl/mpexec/singleQuantumExecutor.py", line 264, in _execute
    self.runQuantum(task, quantum, taskDef, limited_butler)
  File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/ctrl_mpexec/g76ae3ab134+f0199d472f/python/lsst/ctrl/mpexec/singleQuantumExecutor.py", line 466, in runQuantum
    task.runQuantum(butlerQC, inputRefs, outputRefs)
  File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/pipe_tasks/ge37a0ae47b+91d6b12347/python/lsst/pipe/tasks/makeWarp.py", line 332, in runQuantum
    inputs = butlerQC.get(inputRefs)
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/pipe_base/g655761b648+94da4844e8/python/lsst/pipe/base/_quantumContext.py", line 295, in get
    val = self._get(ref)
          ^^^^^^^^^^^^^^
  File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/pipe_base/g655761b648+94da4844e8/python/lsst/pipe/base/_quantumContext.py", line 221, in _get
    return self.__butler.get(ref)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/daf_butler/ga1d28be6d8+a9e5a04819/python/lsst/daf/butler/_butler.py", line 1428, in get
    return self._datastore.get(ref, parameters=parameters, storageClass=storageClass)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/daf_butler/ga1d28be6d8+a9e5a04819/python/lsst/daf/butler/datastores/fileDatastore.py", line 2292, in get
    return self._read_artifact_into_memory(getInfo, ref, isComponent=isComponent, cache_ref=cache_ref)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mmfs1/gscratch/dirac/shared/opt/conda/envs/lsst-scipipe-7.0.1/share/eups/Linux64/daf_butler/ga1d28be6d8+a9e5a04819/python/lsst/daf/butler/datastores/fileDatastore.py", line 1279, in _read_artifact_into_memory
    raise RuntimeError(
RuntimeError: Integrity failure in Datastore. Size of file s3://repo/DEEP/template_testing_3/20231114T185110/step1/20231115T025252Z/visitSummary/20210905/VR/VR_DECam_c0007_6300.0_2600.0/1028879/visitSummary_DECam_VR_VR_DECam_c0007_6300_0_2600_0_1028879_DEEP_template_testing_3_20231114T185110_step1_20231115T025252Z.fits (1569) does not match size recorded in registry of 80640

It looks like the process that wanted the visitSummary (size 80640) instead got something of size 1569, while another process looking for a skyMap (size 1569) instead got something of size 54904320 (on the order of the size of a calexp). So the process looking for the visitSummary got the skyMap instead, and the process looking for the skyMap got an image instead.

I don’t understand what is happening here, as each process should be making a separate TCP connection to the object store – those streams shouldn’t get crossed, right? And each process should have its own instance of the butler, and if the butler is caching datasets to disk, it should be using different temporary directories/filenames, right?

Or it’s not the processes getting mixed up, but instead the refs to cached file mapping getting mixed up in butlerQC.get(inputRefs)?

Any idea what could be happening here? I haven’t noticed these errors when running without parallelism (pipetask run -j 1) and even when using bps submit. And the error goes away if I re-run the same pipeline several times with pipetask run -j 24.

I am using version w_2023_38 of the pipelines here.

Thanks,
Steven

This is pretty baffling, and I don’t have any theories yet.

Did you save the QuantumGraph before executing it in a separate process? If so, there’s some chance we’d learn something by inspecting it, if it’s small enough for you to upload here. And that would open the door to something going wrong in QG I/O, which has a lot of complex caching where it’s wouldn’t be hard to hide a bug.

Is there enough information in the logs to identify those two datasets, such that you could query for them and print their full DatasetRefs?

Can you try running with pipetask run --start-method spawn ? I think the linux default is fork but there have been some issues in the past with fork’s reusing of connections.

For the file:

s3://repo/DEEP/template_testing_3/20231114T185110/step1/20231115T025252Z/visitSummary/20210905/VR/VR_DECam_c0007_6300.0_2600.0/1028879/visitSummary_DECam_VR_VR_DECam_c0007_6300_0_2600_0_1028879_DEEP_template_testing_3_20231114T185110_step1_20231115T025252Z.fits

What is the actual size of this object? You can find that easily enough with:

from lsst.resources import ResourcePath
file = ResourcePath("s3://repo/DEEP/template_testing_3/20231114T185110/step1/20231115T025252Z/visitSummary/20210905/VR/VR_DECam_c0007_6300.0_2600.0/1028879/visitSummary_DECam_VR_VR_DECam_c0007_6300_0_2600_0_1028879_DEEP_template_testing_3_20231114T185110_step1_20231115T025252Z.fits")
print(file.size())

That would tell us which number is correct from that error message.

I moved on from that specific set of processing, but I’ve been able to re-create it with another set.

I created the quantum graph with:

pipetask qgraph -b $REPO -p $CP_PIPE_DIR/pipelines/DarkEnergyCamera/RunIsrForCrosstalkSources.yaml -i DEEP/raw/flat/20230709/VR,DECam/calib --output-run testing/1 -d "instrument='DECam' and detector=1" -q ./isr.qgraph

and execute it (over again until an error occurs) with:

butler remove-runs $REPO testing/1 --no-confirm && pipetask --long-log --log-level DEBUG run -b $REPO -g ./isr.qgraph -j 24 -i DEEP/raw/flat/20230709/VR,DECam/calib --output-run testing/1 2>&1 | tee isr1.log

I found a good example of the error:

$ grep ERROR ./isr1.log | grep Integrity
ERROR 2023-11-20T18:46:38.435-08:00 lsst.ctrl.mpexec.singleQuantumExecutor (overscan:{instrument: 'DECam', detector: 1, exposure: 1220156, ...})(singleQuantumExecutor.py:266) - Execution of task 'overscan' on quantum {instrument: 'DECam', detector: 1, exposure: 1220156, ...} failed. Exception RuntimeError: Integrity failure in Datastore. Size of file s3://repo/DECam/raw/all/raw/20230709/ct4m20230709t194827/raw_DECam_VR_DECam_c0007_6300_0_2600_0_ct4m20230709t194827_S1_DECam_raw_all.fits.fz (709260480) does not match size recorded in registry of 709583040
ERROR 2023-11-20T18:46:38.436-08:00 lsst.ctrl.mpexec.singleQuantumExecutor (overscan:{instrument: 'DECam', detector: 1, exposure: 1220161, ...})(singleQuantumExecutor.py:266) - Execution of task 'overscan' on quantum {instrument: 'DECam', detector: 1, exposure: 1220161, ...} failed. Exception RuntimeError: Integrity failure in Datastore. Size of file s3://repo/DECam/raw/all/raw/20230709/ct4m20230709t195124/raw_DECam_VR_DECam_c0007_6300_0_2600_0_ct4m20230709t195124_S1_DECam_raw_all.fits.fz (709583040) does not match size recorded in registry of 709260480

Checking the file sizes:

>>> from lsst.resources import ResourcePath
>>> from lsst.resources import ResourcePath
>>> print(ResourcePath("s3://repo/DECam/raw/all/raw/20230709/ct4m20230709t194827/raw_DECam_VR_DECam_c0007_6300_0_2600_0_ct4m20230709t194827_S1_DECam_raw_all.fits.fz").size())
709583040
>>> print(ResourcePath("s3://repo/DECam/raw/all/raw/20230709/ct4m20230709t195124/raw_DECam_VR_DECam_c0007_6300_0_2600_0_ct4m20230709t195124_S1_DECam_raw_all.fits.fz").size())
709260480

And then here are the logs matching those data ids:

$  grep "instrument: 'DECam', detector: 1, exposure: 1220156" isr1.log
DEBUG 2023-11-20T18:46:38.188-08:00 lsst.ctrl.mpexec.mpGraphExecutor ()(mpGraphExecutor.py:641) - Submitting <TaskDef(lsst.ip.isr.isrTask.IsrTask, label=overscan) dataId={instrument: 'DECam', detector: 1, exposure: 1220156, ...}>
INFO 2023-11-20T18:46:38.365-08:00 lsst.ctrl.mpexec.singleQuantumExecutor (overscan:{instrument: 'DECam', detector: 1, exposure: 1220156, ...})(singleQuantumExecutor.py:204) - Preparing execution of quantum for label=overscan dataId={instrument: 'DECam', detector: 1, exposure: 1220156, ...}.
DEBUG 2023-11-20T18:46:38.366-08:00 lsst.ctrl.mpexec.singleQuantumExecutor (overscan:{instrument: 'DECam', detector: 1, exposure: 1220156, ...})(singleQuantumExecutor.py:331) - Looking for existing outputs in the way for label=overscan dataId={instrument: 'DECam', detector: 1, exposure: 1220156, ...}.
DEBUG 2023-11-20T18:46:38.367-08:00 lsst.daf.butler.datastores.fileDatastore (overscan:{instrument: 'DECam', detector: 1, exposure: 1220156, ...})(fileDatastore.py:1553) - Checking for the existence of multiple artifacts in datastore in chunks of 10000
VERBOSE 2023-11-20T18:46:38.373-08:00 lsst.daf.butler.datastores.fileDatastore (overscan:{instrument: 'DECam', detector: 1, exposure: 1220156, ...})(fileDatastore.py:1594) - Number of datasets found in datastore: 0 out of 3 datasets checked.
DEBUG 2023-11-20T18:46:38.374-08:00 lsst.ctrl.mpexec.singleQuantumExecutor (overscan:{instrument: 'DECam', detector: 1, exposure: 1220156, ...})(singleQuantumExecutor.py:401) - Checking existence of input 'camera' for label=overscan dataId={instrument: 'DECam', detector: 1, exposure: 1220156, ...}.
DEBUG 2023-11-20T18:46:38.374-08:00 lsst.daf.butler.datastores.fileDatastore (overscan:{instrument: 'DECam', detector: 1, exposure: 1220156, ...})(fileDatastore.py:1553) - Checking for the existence of multiple artifacts in datastore in chunks of 10000
DEBUG 2023-11-20T18:46:38.378-08:00 lsst.daf.butler.datastores.fileDatastore (overscan:{instrument: 'DECam', detector: 1, exposure: 1220156, ...})(fileDatastore.py:1587) - Calling mexists() with single ref that does exist (camera@{instrument: 'DECam'} [sc=Camera] (run=DECam/calib/unbounded id=3eeb6b2b-17bb-46bd-b3dc-710d599fa87d)).
DEBUG 2023-11-20T18:46:38.379-08:00 lsst.ctrl.mpexec.singleQuantumExecutor (overscan:{instrument: 'DECam', detector: 1, exposure: 1220156, ...})(singleQuantumExecutor.py:401) - Checking existence of input 'raw' for label=overscan dataId={instrument: 'DECam', detector: 1, exposure: 1220156, ...}.
DEBUG 2023-11-20T18:46:38.379-08:00 lsst.daf.butler.datastores.fileDatastore (overscan:{instrument: 'DECam', detector: 1, exposure: 1220156, ...})(fileDatastore.py:1553) - Checking for the existence of multiple artifacts in datastore in chunks of 10000
DEBUG 2023-11-20T18:46:38.403-08:00 lsst.daf.butler.datastores.fileDatastore (overscan:{instrument: 'DECam', detector: 1, exposure: 1220156, ...})(fileDatastore.py:1587) - Calling mexists() with single ref that does exist (raw@{instrument: 'DECam', detector: 1, exposure: 1220156, ...} [sc=Exposure] (run=DECam/raw/all id=ee8d3d45-3c19-555f-869c-2324ade69ee8)).
INFO 2023-11-20T18:46:38.417-08:00 lsst.ctrl.mpexec.singleQuantumExecutor (overscan:{instrument: 'DECam', detector: 1, exposure: 1220156, ...})(singleQuantumExecutor.py:256) - Constructing task and executing quantum for label=overscan dataId={instrument: 'DECam', detector: 1, exposure: 1220156, ...}.
DEBUG 2023-11-20T18:46:38.421-08:00 lsst.daf.butler._butler (overscan:{instrument: 'DECam', detector: 1, exposure: 1220156, ...})(_butler.py:1426) - Butler get: raw@{instrument: 'DECam', detector: 1, exposure: 1220156, ...} [sc=Exposure] (run=DECam/raw/all id=ee8d3d45-3c19-555f-869c-2324ade69ee8), dataId=None, parameters=None
DEBUG 2023-11-20T18:46:38.422-08:00 lsst.daf.butler.datastores.fileDatastore (overscan:{instrument: 'DECam', detector: 1, exposure: 1220156, ...})(fileDatastore.py:608) - Retrieve raw@{instrument: 'DECam', detector: 1, exposure: 1220156, ...} [sc=Exposure] (run=DECam/raw/all id=ee8d3d45-3c19-555f-869c-2324ade69ee8) from FileDatastore@<butlerRoot> with parameters None
DEBUG 2023-11-20T18:46:38.426-08:00 lsst.daf.butler.datastores.fileDatastore (overscan:{instrument: 'DECam', detector: 1, exposure: 1220156, ...})(fileDatastore.py:1263) - Accessing data from s3://repo/DECam/raw/all/raw/20230709/ct4m20230709t194827/raw_DECam_VR_DECam_c0007_6300_0_2600_0_ct4m20230709t194827_S1_DECam_raw_all.fits.fz
ERROR 2023-11-20T18:46:38.435-08:00 lsst.ctrl.mpexec.singleQuantumExecutor (overscan:{instrument: 'DECam', detector: 1, exposure: 1220156, ...})(singleQuantumExecutor.py:266) - Execution of task 'overscan' on quantum {instrument: 'DECam', detector: 1, exposure: 1220156, ...} failed. Exception RuntimeError: Integrity failure in Datastore. Size of file s3://repo/DECam/raw/all/raw/20230709/ct4m20230709t194827/raw_DECam_VR_DECam_c0007_6300_0_2600_0_ct4m20230709t194827_S1_DECam_raw_all.fits.fz (709260480) does not match size recorded in registry of 709583040
DEBUG 2023-11-20T18:46:39.081-08:00 lsst.daf.butler.core.formatter ()(formatter.py:678) - Retrieved formatter lsst.daf.butler.formatters.logs.ButlerLogRecordsFormatter from key 'ButlerLogRecords' for entity 'overscan_log@{instrument: 'DECam', detector: 1, exposure: 1220156, ...} [sc=ButlerLogRecords] (run=testing/1 id=412f50ef-b46f-4c7c-84fa-e06a38c755e2)'
DEBUG 2023-11-20T18:46:39.083-08:00 lsst.daf.butler.core.fileTemplates ()(fileTemplates.py:272) - Got file {run:/}/{datasetType}.{component:?}/{exposure.day_obs}/{exposure.obs_id}/{datasetType}_{component:?}_{instrument:?}_{physical_filter}_{exposure.obs_id}_{detector.full_name}_{run} from overscan_log@{instrument: 'DECam', detector: 1, exposure: 1220156, ...} [sc=ButlerLogRecords] (run=testing/1 id=412f50ef-b46f-4c7c-84fa-e06a38c755e2) via band+instrument+detector+physical_filter+exposure
Process task-{instrument: 'DECam', detector: 1, exposure: 1220156, ...}:
DEBUG 2023-11-20T18:46:39.398-08:00 lsst.ctrl.mpexec.mpGraphExecutor ()(mpGraphExecutor.py:571) - finished: <TaskDef(lsst.ip.isr.isrTask.IsrTask, label=overscan) dataId={instrument: 'DECam', detector: 1, exposure: 1220156, ...}>
DEBUG 2023-11-20T18:46:39.399-08:00 lsst.ctrl.mpexec.mpGraphExecutor ()(mpGraphExecutor.py:595) - failed: <TaskDef(lsst.ip.isr.isrTask.IsrTask, label=overscan) dataId={instrument: 'DECam', detector: 1, exposure: 1220156, ...}>
ERROR 2023-11-20T18:46:39.400-08:00 lsst.ctrl.mpexec.mpGraphExecutor ()(mpGraphExecutor.py:606) - Task <TaskDef(lsst.ip.isr.isrTask.IsrTask, label=overscan) dataId={instrument: 'DECam', detector: 1, exposure: 1220156, ...}> failed, exit code=1; processing will continue for remaining tasks.
ERROR 2023-11-20T18:47:00.793-08:00 lsst.ctrl.mpexec.mpGraphExecutor ()(mpGraphExecutor.py:675) -   - FAILED: <TaskDef(lsst.ip.isr.isrTask.IsrTask, label=overscan) dataId={instrument: 'DECam', detector: 1, exposure: 1220156, ...}>

and

$ grep "instrument: 'DECam', detector: 1, exposure: 1220161" isr1.log
DEBUG 2023-11-20T18:46:38.167-08:00 lsst.ctrl.mpexec.mpGraphExecutor ()(mpGraphExecutor.py:641) - Submitting <TaskDef(lsst.ip.isr.isrTask.IsrTask, label=overscan) dataId={instrument: 'DECam', detector: 1, exposure: 1220161, ...}>
INFO 2023-11-20T18:46:38.356-08:00 lsst.ctrl.mpexec.singleQuantumExecutor (overscan:{instrument: 'DECam', detector: 1, exposure: 1220161, ...})(singleQuantumExecutor.py:204) - Preparing execution of quantum for label=overscan dataId={instrument: 'DECam', detector: 1, exposure: 1220161, ...}.
DEBUG 2023-11-20T18:46:38.359-08:00 lsst.ctrl.mpexec.singleQuantumExecutor (overscan:{instrument: 'DECam', detector: 1, exposure: 1220161, ...})(singleQuantumExecutor.py:331) - Looking for existing outputs in the way for label=overscan dataId={instrument: 'DECam', detector: 1, exposure: 1220161, ...}.
DEBUG 2023-11-20T18:46:38.360-08:00 lsst.daf.butler.datastores.fileDatastore (overscan:{instrument: 'DECam', detector: 1, exposure: 1220161, ...})(fileDatastore.py:1553) - Checking for the existence of multiple artifacts in datastore in chunks of 10000
VERBOSE 2023-11-20T18:46:38.367-08:00 lsst.daf.butler.datastores.fileDatastore (overscan:{instrument: 'DECam', detector: 1, exposure: 1220161, ...})(fileDatastore.py:1594) - Number of datasets found in datastore: 0 out of 3 datasets checked.
DEBUG 2023-11-20T18:46:38.368-08:00 lsst.ctrl.mpexec.singleQuantumExecutor (overscan:{instrument: 'DECam', detector: 1, exposure: 1220161, ...})(singleQuantumExecutor.py:401) - Checking existence of input 'camera' for label=overscan dataId={instrument: 'DECam', detector: 1, exposure: 1220161, ...}.
DEBUG 2023-11-20T18:46:38.369-08:00 lsst.daf.butler.datastores.fileDatastore (overscan:{instrument: 'DECam', detector: 1, exposure: 1220161, ...})(fileDatastore.py:1553) - Checking for the existence of multiple artifacts in datastore in chunks of 10000
DEBUG 2023-11-20T18:46:38.373-08:00 lsst.daf.butler.datastores.fileDatastore (overscan:{instrument: 'DECam', detector: 1, exposure: 1220161, ...})(fileDatastore.py:1587) - Calling mexists() with single ref that does exist (camera@{instrument: 'DECam'} [sc=Camera] (run=DECam/calib/unbounded id=3eeb6b2b-17bb-46bd-b3dc-710d599fa87d)).
DEBUG 2023-11-20T18:46:38.374-08:00 lsst.ctrl.mpexec.singleQuantumExecutor (overscan:{instrument: 'DECam', detector: 1, exposure: 1220161, ...})(singleQuantumExecutor.py:401) - Checking existence of input 'raw' for label=overscan dataId={instrument: 'DECam', detector: 1, exposure: 1220161, ...}.
DEBUG 2023-11-20T18:46:38.374-08:00 lsst.daf.butler.datastores.fileDatastore (overscan:{instrument: 'DECam', detector: 1, exposure: 1220161, ...})(fileDatastore.py:1553) - Checking for the existence of multiple artifacts in datastore in chunks of 10000
DEBUG 2023-11-20T18:46:38.390-08:00 lsst.daf.butler.datastores.fileDatastore (overscan:{instrument: 'DECam', detector: 1, exposure: 1220161, ...})(fileDatastore.py:1587) - Calling mexists() with single ref that does exist (raw@{instrument: 'DECam', detector: 1, exposure: 1220161, ...} [sc=Exposure] (run=DECam/raw/all id=b315b7d7-6ecf-5781-9ecc-3f3f88f6fb9a)).
INFO 2023-11-20T18:46:38.417-08:00 lsst.ctrl.mpexec.singleQuantumExecutor (overscan:{instrument: 'DECam', detector: 1, exposure: 1220161, ...})(singleQuantumExecutor.py:256) - Constructing task and executing quantum for label=overscan dataId={instrument: 'DECam', detector: 1, exposure: 1220161, ...}.
DEBUG 2023-11-20T18:46:38.422-08:00 lsst.daf.butler._butler (overscan:{instrument: 'DECam', detector: 1, exposure: 1220161, ...})(_butler.py:1426) - Butler get: raw@{instrument: 'DECam', detector: 1, exposure: 1220161, ...} [sc=Exposure] (run=DECam/raw/all id=b315b7d7-6ecf-5781-9ecc-3f3f88f6fb9a), dataId=None, parameters=None
DEBUG 2023-11-20T18:46:38.422-08:00 lsst.daf.butler.datastores.fileDatastore (overscan:{instrument: 'DECam', detector: 1, exposure: 1220161, ...})(fileDatastore.py:608) - Retrieve raw@{instrument: 'DECam', detector: 1, exposure: 1220161, ...} [sc=Exposure] (run=DECam/raw/all id=b315b7d7-6ecf-5781-9ecc-3f3f88f6fb9a) from FileDatastore@<butlerRoot> with parameters None
DEBUG 2023-11-20T18:46:38.426-08:00 lsst.daf.butler.datastores.fileDatastore (overscan:{instrument: 'DECam', detector: 1, exposure: 1220161, ...})(fileDatastore.py:1263) - Accessing data from s3://repo/DECam/raw/all/raw/20230709/ct4m20230709t195124/raw_DECam_VR_DECam_c0007_6300_0_2600_0_ct4m20230709t195124_S1_DECam_raw_all.fits.fz
ERROR 2023-11-20T18:46:38.436-08:00 lsst.ctrl.mpexec.singleQuantumExecutor (overscan:{instrument: 'DECam', detector: 1, exposure: 1220161, ...})(singleQuantumExecutor.py:266) - Execution of task 'overscan' on quantum {instrument: 'DECam', detector: 1, exposure: 1220161, ...} failed. Exception RuntimeError: Integrity failure in Datastore. Size of file s3://repo/DECam/raw/all/raw/20230709/ct4m20230709t195124/raw_DECam_VR_DECam_c0007_6300_0_2600_0_ct4m20230709t195124_S1_DECam_raw_all.fits.fz (709583040) does not match size recorded in registry of 709260480
DEBUG 2023-11-20T18:46:38.977-08:00 lsst.daf.butler.core.formatter ()(formatter.py:678) - Retrieved formatter lsst.daf.butler.formatters.logs.ButlerLogRecordsFormatter from key 'ButlerLogRecords' for entity 'overscan_log@{instrument: 'DECam', detector: 1, exposure: 1220161, ...} [sc=ButlerLogRecords] (run=testing/1 id=5ea9346f-db95-4aaa-9b80-3272a1faf950)'
DEBUG 2023-11-20T18:46:38.977-08:00 lsst.daf.butler.core.fileTemplates ()(fileTemplates.py:272) - Got file {run:/}/{datasetType}.{component:?}/{exposure.day_obs}/{exposure.obs_id}/{datasetType}_{component:?}_{instrument:?}_{physical_filter}_{exposure.obs_id}_{detector.full_name}_{run} from overscan_log@{instrument: 'DECam', detector: 1, exposure: 1220161, ...} [sc=ButlerLogRecords] (run=testing/1 id=5ea9346f-db95-4aaa-9b80-3272a1faf950) via band+instrument+detector+physical_filter+exposure
Process task-{instrument: 'DECam', detector: 1, exposure: 1220161, ...}:
DEBUG 2023-11-20T18:46:39.192-08:00 lsst.ctrl.mpexec.mpGraphExecutor ()(mpGraphExecutor.py:571) - finished: <TaskDef(lsst.ip.isr.isrTask.IsrTask, label=overscan) dataId={instrument: 'DECam', detector: 1, exposure: 1220161, ...}>
DEBUG 2023-11-20T18:46:39.194-08:00 lsst.ctrl.mpexec.mpGraphExecutor ()(mpGraphExecutor.py:595) - failed: <TaskDef(lsst.ip.isr.isrTask.IsrTask, label=overscan) dataId={instrument: 'DECam', detector: 1, exposure: 1220161, ...}>
ERROR 2023-11-20T18:46:39.195-08:00 lsst.ctrl.mpexec.mpGraphExecutor ()(mpGraphExecutor.py:606) - Task <TaskDef(lsst.ip.isr.isrTask.IsrTask, label=overscan) dataId={instrument: 'DECam', detector: 1, exposure: 1220161, ...}> failed, exit code=1; processing will continue for remaining tasks.
ERROR 2023-11-20T18:47:00.793-08:00 lsst.ctrl.mpexec.mpGraphExecutor ()(mpGraphExecutor.py:675) -   - FAILED: <TaskDef(lsst.ip.isr.isrTask.IsrTask, label=overscan) dataId={instrument: 'DECam', detector: 1, exposure: 1220161, ...}>

I’ve tried running with --start-method spawn and I haven’t been able to recreate the issue, so perhaps this is the issue.

I’ve uploaded the quantum graph here. The log file is too big (8MB), so I’ve uploaded it without the debug messages and put the full one here.

isr1_no_debug.log (63.5 KB)
isr.qgraph (29.5 KB)

Thank you for the detailed response. We think the problem is that the connection pool is set up before we start processing quanta and the connections end up in the forks. The current plan is to change the default start method to spawn on all platforms as the only safe solution. This is not the first time that fork has bitten us and the small performance savings are not worth all this pain.

We would be interested to know if --start-method forkserver works fine as well.

I’ve run the pipeline 6 times with --start-method forkserver and haven’t seen the issue. It comes up immediately with fork, so I think forkserver is okay.

1 Like

Last week’s weekly should have the fix in it. We have stopped using fork completely and now default to spawn. forkserver is still an option.