Disk contention when running many DM jobs (advice needed)

As part of the DESC Twinkles project we have been running thousands of batch jobs both at SLAC and NERSC. Because of the way the DM Butler works (or my limited understanding of it), all the jobs write to the same output directory, since the next processing stage runs coadd jobs on the combined output.

When running ~100 jobs writing to the same output directory, we see disk contention problems, typically resulting in spurious crashes at SLAC or extreme slowness (a poor ratio of CPU time to wall-clock time) at NERSC. In the past (for other experiments) we have overcome such problems by writing output to local scratch space for each job, then copying the output to its final location at the end of the job.
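For illustration, the pattern we have used elsewhere is roughly the following; the paths, the SLURM_JOB_ID variable, and the commented-out task invocation are placeholders rather than our actual configuration:

    # Sketch of the stage-then-copy pattern (placeholder paths and task).
    JOB_SCRATCH=/scratch/$USER/$SLURM_JOB_ID   # per-job local scratch area
    FINAL_OUTPUT=/project/shared/output        # shared final location

    mkdir -p "$JOB_SCRATCH"
    # ... run the processing step, writing into the scratch area ...
    # some_task.py <input_repo> --output "$JOB_SCRATCH" ...

    # Touch the shared filesystem only once, at the end of the job.
    cp -r "$JOB_SCRATCH"/. "$FINAL_OUTPUT"/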

So several questions:

a) Are we correct in assuming that all of the output must be written to a single directory so that the next DM stage can read all of it in a single job? If not, what should we be doing instead?
b) Can we write the output to a temporary (scratch) location, then at the end of the job do a cp -r to copy it to a single output location? If not, is there some other way to achieve the same result?
c) Anything else we should know about running DM jobs at scale?

If useful, you can find a description of the actual jobs we are running here:

The current typical use of the Butler does write to a shared output repository. Staging and other more sophisticated features that could be built into the Butler have been deferred for now. In the meantime, the on-disk repository structure is such that independent local scratch repositories can indeed be combined by merely copying files; note that a few files will occur in every scratch repository and should ideally be checked for identical contents.
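For example, a per-job merge step along these lines should work. The file names checked below (_mapper, registry.sqlite3) are assumptions about what a scratch repository contains and should be adapted to whatever actually appears in yours:

    # Sketch: merge a per-job scratch repo into the shared output repo,
    # first checking that files common to every repo are byte-identical.
    SCRATCH_REPO=/scratch/$USER/$SLURM_JOB_ID/output   # placeholder
    SHARED_REPO=/project/shared/output                 # placeholder

    for f in _mapper registry.sqlite3; do              # assumed common files
        if [ -e "$SHARED_REPO/$f" ] && [ -e "$SCRATCH_REPO/$f" ]; then
            cmp -s "$SHARED_REPO/$f" "$SCRATCH_REPO/$f" ||
                echo "WARNING: $f differs between repositories" >&2
        fi
    done

    # Copy everything else over; -n avoids overwriting existing files.
    cp -rn "$SCRATCH_REPO"/. "$SHARED_REPO"/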

What kind of filesystem and interconnect are you writing to? Avoid NFS like the plague! Make sure you’re using a proper cluster filesystem.

We regularly run on the order of hundreds of processCcd jobs in parallel writing to the same output directory on GPFS over InfiniBand, and have had no problems like those you describe. We do have colleagues using NFS and have heard reports of strange I/O errors. I’m sure several of us can tell you horror stories about NFS at scale…

BTW, are you sure the “extreme slowness” reported at NERSC is not related to thread contention (DM-4714)? What pipeline version are you running? Do you have OMP_NUM_THREADS=1 set?
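If it is not set, the usual workaround is a line like this in the batch script before launching the task; the commented-out processCcd.py invocation is only illustrative:

    # Limit implicit library threading so each process uses a single core;
    # use task-level parallelism instead if more cores are available.
    export OMP_NUM_THREADS=1
    # processCcd.py /path/to/repo --output /path/to/output --id visit=123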