"OSError: [Errno 524] Unknown error 524" on Cori compute node

I am using v19_0_0 of the LSST science pipelines. I can run processCcd.py successfully at NERSC on a Cori login node using the following command:

processCcd.py DATA --calib DATA/CALIB --rerun processCcdOutputs --id --longlog &> processCcd.log &

However, when I try to run this exact same command (with the exact same input Butler repo / file / directory setup) on a Cori compute node (specifically in the interactive queue), I get an error pretty much right at the start, before any processing happens:

root INFO: Loading config overrride file '/opt/lsst/software/stack/stack/miniconda3-4.7.10-4d7b902/Linux64/obs_decam/19.0.0+2/config/processCcd.py'
CameraMapper INFO: Loading exposure registry from /global/cfs/cdirs/cosmo/work/wise/rubin/DECaLS_r/DATA/registry.sqlite3
CameraMapper INFO: Loading calib registry from /global/cfs/cdirs/cosmo/work/wise/rubin/DECaLS_r/DATA/CALIB/calibRegistry.sqlite3
/opt/lsst/software/stack/stack/miniconda3-4.7.10-4d7b902/Linux64/pex_config/19.0.0/python/lsst/pex/config/config.py:1289: FutureWarning: Config field isr.doAddDistortionModel is deprecated: Camera geometry is incorporated when reading the raw files. This option no longer is used, and will be removed after v19.
  FutureWarning)
Traceback (most recent call last):
  File "/opt/lsst/software/stack/stack/miniconda3-4.7.10-4d7b902/Linux64/pipe_tasks/19.0.0+2/bin/processCcd.py", line 25, in <module>
    ProcessCcdTask.parseAndRun()
  File "/opt/lsst/software/stack/stack/miniconda3-4.7.10-4d7b902/Linux64/pipe_base/19.0.0/python/lsst/pipe/base/cmdLineTask.py", line 605, in parseAndRun
    parsedCmd = argumentParser.parse_args(config=config, args=args, log=log, override=cls.applyOverrides)
  File "/opt/lsst/software/stack/stack/miniconda3-4.7.10-4d7b902/Linux64/pipe_base/19.0.0/python/lsst/pipe/base/argumentParser.py", line 684, in parse_args
    namespace.butler = dafPersist.Butler(inputs=inputs, outputs=outputs)
  File "/opt/lsst/software/stack/stack/miniconda3-4.7.10-4d7b902/Linux64/daf_persistence/19.0.0/python/lsst/daf/persistence/butler.py", line 536, in __init__
    self._initRepo(repoData)
  File "/opt/lsst/software/stack/stack/miniconda3-4.7.10-4d7b902/Linux64/daf_persistence/19.0.0/python/lsst/daf/persistence/butler.py", line 552, in _initRepo
    repoData.repo = Repository(repoData)
  File "/opt/lsst/software/stack/stack/miniconda3-4.7.10-4d7b902/Linux64/daf_persistence/19.0.0/python/lsst/daf/persistence/repository.py", line 141, in __init__
    self._storage.putRepositoryCfg(repoData.cfg, repoData.cfgRoot)
  File "/opt/lsst/software/stack/stack/miniconda3-4.7.10-4d7b902/Linux64/daf_persistence/19.0.0/python/lsst/daf/persistence/posixStorage.py", line 162, in putRepositoryCfg
    storage.write(location, cfg)
  File "/opt/lsst/software/stack/stack/miniconda3-4.7.10-4d7b902/Linux64/daf_persistence/19.0.0/python/lsst/daf/persistence/posixStorage.py", line 258, in write
    writeFormatter(butlerLocation, obj)
  File "/opt/lsst/software/stack/stack/miniconda3-4.7.10-4d7b902/Linux64/daf_persistence/19.0.0/python/lsst/daf/persistence/fmtPosixRepositoryCfg.py", line 77, in _write
    with safeFileIo.SafeLockedFileForWrite(loc) as f:
  File "/opt/lsst/software/stack/stack/miniconda3-4.7.10-4d7b902/Linux64/daf_persistence/19.0.0/python/lsst/daf/persistence/safeFileIo.py", line 191, in __enter__
    self.open()
  File "/opt/lsst/software/stack/stack/miniconda3-4.7.10-4d7b902/Linux64/daf_persistence/19.0.0/python/lsst/daf/persistence/safeFileIo.py", line 200, in open
    fcntl.flock(self._fileHandle, fcntl.LOCK_EX)
OSError: [Errno 524] Unknown error 524

Any insight about this would be appreciated, as it is preventing me from performing any large-scale processing runs at NERSC. Is there perhaps some config flag I can specify to circumvent whatever file lock checking may be blocking me here? Thanks very much.

https://pipelines.lsst.io/install/prereqs.html#filesystem-prerequisites

There is no way to circumvent this with the Gen2 Butler; you must place your files on a filesystem that supports flock. (Errno 524 is the kernel-internal ENOTSUPP code: the filesystem is reporting that it does not support flock(2) at all.) You might try using a filesystem local to each worker node and then copying the results back.
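If it helps to confirm which filesystems are affected, something like the following rough sketch exercises the same flock(2) call from the shell via the util-linux flock(1) utility (the path is just an example):

# Test whether a directory's filesystem supports flock(2); a failure here
# (e.g. Errno 524 on some network filesystems) means the Gen2 Butler's
# repository-config locking will fail the same way.
testfile=/global/cfs/cdirs/cosmo/work/wise/rubin/DECaLS_r/.flock_test
touch "$testfile"
flock "$testfile" -c true && echo "flock supported" || echo "flock not supported"
rm -f "$testfile"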

With Gen3, you could possibly use a PostgreSQL database for all Butler interactions, but the inability to lock would likely prevent use of the SQLite-based "execution butler" and BPS until the new graph-based execution is ready.
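Very roughly, and only as a sketch (the connection string and repo path are hypothetical, and this assumes a stack new enough to ship the Gen3 butler command-line tool), the Gen3 registry can be seeded to live in PostgreSQL rather than in a SQLite file that would need locking:

# Hypothetical Gen3 sketch: point the registry at PostgreSQL so no
# flock-dependent SQLite registry file is created. Names are examples.
cat > seed.yaml <<EOF
registry:
  db: postgresql://username@dbhost.nersc.gov/butler_db
EOF
butler create --seed-config seed.yaml GEN3_REPO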


Thanks a lot for the quick response, K-T! I do now see that this topic/requirement is discussed in the pipeline documentation that you linked.

I just tried running my same test processCcd.py command on a Cori compute node, but with all of the Butler I/O isolated to Cori scratch, and it worked! Previously I had been doing the Butler I/O on a different NERSC file system (CFS).
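Concretely, the working setup looks roughly like this (paths simplified and illustrative):

# Keep the Butler repo on Cori $SCRATCH (Lustre, which supports flock)
# instead of CFS, and run exactly as before.
cd $SCRATCH/DECaLS_r
processCcd.py DATA --calib DATA/CALIB --rerun processCcdOutputs --id --longlog &> processCcd.log &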

Another potential option for me might be to use the /tmp space on each Cori compute node, though I haven’t tried that yet; only ~64 GB of /tmp appears to be available per node, which wouldn’t hold the reduced outputs for very many DECam exposures. Thanks again!

In case it’s of interest to anyone else: I just verified that isolating all the Butler I/O to a Cori compute node’s /tmp space also works.
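For anyone else trying the /tmp route, the workflow is roughly the following (paths illustrative, and assuming the rerun outputs land under DATA/rerun/processCcdOutputs); since /tmp is node-local and cleared when the job ends, outputs need to be copied back to persistent storage:

# Stage the repo into node-local /tmp, run, then copy the results back out;
# /tmp holds only ~64 GB per node and is wiped at job end.
cp -r $SCRATCH/DECaLS_r/DATA /tmp/DATA
cd /tmp
processCcd.py DATA --calib DATA/CALIB --rerun processCcdOutputs --id --longlog &> processCcd.log
cp -r /tmp/DATA/rerun/processCcdOutputs $SCRATCH/DECaLS_r/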