Disk contention when running many DM jobs (advice needed)

As part of the DESC Twinkles project we have been running thousands of batch jobs both at SLAC and NERSC. Because of the way the DM Butler works (or my limited understanding of it), all the jobs write to the same output directory, since the next processing stage runs coadd jobs on the combined output.

When running ~100 jobs writing to the same output directory, we see disk contention problems, typically resulting in spurious crashes at SLAC or extreme slowness (a poor ratio of CPU time to wall-clock time) at NERSC. In the past (for other experiments) we have overcome such problems by writing output to local scratch space for each job, then copying the output to its final location at the end of the job.
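For illustration, the pattern we have used elsewhere is roughly the following; the paths, the SLURM_JOB_ID variable, and the commented-out task invocation are placeholders rather than our actual configuration:

    # Sketch of the stage-then-copy pattern (placeholder paths and task).
    JOB_SCRATCH=/scratch/$USER/$SLURM_JOB_ID   # per-job local scratch area
    FINAL_OUTPUT=/project/shared/output        # shared final location

    mkdir -p "$JOB_SCRATCH"
    # ... run the processing step, writing into the scratch area ...
    # some_task.py <input_repo> --output "$JOB_SCRATCH" ...

    # Touch the shared filesystem only once, at the end of the job.
    cp -r "$JOB_SCRATCH"/. "$FINAL_OUTPUT"/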

So several questions:

a) Are we correct in assuming that all of the output must be written to a single directory so that the next DM stage can read all of it in a single job? If not, what should we be doing instead?
b) Can we write the output to a temporary (scratch) location, then at the end of the job do a cp -r to copy it to a single output location? If not, is there some other way to achieve the same result?
c) Anything else we should know about running DM jobs at scale?

If useful, you can find a description of the actual jobs we are running here:

The current typical use of the Butler does write to a shared output repository. Staging and other more sophisticated features that could be built into the Butler have been deferred for now. In the meantime, the on-disk repository structure is such that independent local scratch repositories can indeed be combined by merely copying files; note that a few files will occur in every scratch repository and should ideally be checked for identical contents.
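For example, a per-job merge step along these lines should work. The file names checked below (_mapper, registry.sqlite3) are assumptions about what a scratch repository contains and should be adapted to whatever actually appears in yours:

    # Sketch: merge a per-job scratch repo into the shared output repo,
    # first checking that files common to every repo are byte-identical.
    SCRATCH_REPO=/scratch/$USER/$SLURM_JOB_ID/output   # placeholder
    SHARED_REPO=/project/shared/output                 # placeholder

    for f in _mapper registry.sqlite3; do              # assumed common files
        if [ -e "$SHARED_REPO/$f" ] && [ -e "$SCRATCH_REPO/$f" ]; then
            cmp -s "$SHARED_REPO/$f" "$SCRATCH_REPO/$f" ||
                echo "WARNING: $f differs between repositories" >&2
        fi
    done

    # Copy everything else over; -n avoids overwriting existing files.
    cp -rn "$SCRATCH_REPO"/. "$SHARED_REPO"/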

What kind of filesystem and interconnect are you writing to? Avoid NFS like the plague! Make sure you’re using a proper cluster filesystem.

We regularly run on the order of hundreds of processCcd jobs in parallel writing to the same output directory on GPFS over InfiniBand, and have had no problems like those you describe. We do have colleagues using NFS and have heard reports of strange I/O errors. I’m sure several of us can tell you horror stories about NFS at scale…

BTW, are you sure the “extreme slowness” reported at NERSC is not related to thread contention (DM-4714)? What pipeline version are you running? Do you have OMP_NUM_THREADS=1 set?
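If it is not set, the usual workaround is a line like this in the batch script before launching the task; the commented-out processCcd.py invocation is only illustrative:

    # Limit implicit library threading so each process uses a single core;
    # use task-level parallelism instead if more cores are available.
    export OMP_NUM_THREADS=1
    # processCcd.py /path/to/repo --output /path/to/output --id visit=123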