I’ve been bugging @price and promised to send along some details about this here on Community.
I’ve been trying to run ingestDriver.py in both Shifter (at NERSC) and Singularity (anywhere else) containers, based on the w_2018_30 Docker image from lsstsqre. Everything runs fine when I run interactively on a login node at NERSC or at ANL. However, when I submit a batch job and use the default --batch-type=smp, I get this error:
HYDU_create_process (utils/launch/launch.c:75): execvp error on file srun (No such file or directory)
I can revert to using --batch-type=none on the compute nodes, but that defeats the purpose of using the xxxxDriver, and the ingest runs painfully slowly. Things work as expected if I use a stack installed from source and avoid using a container in my batch jobs.
I was directed to some documentation about building MPICH into Singularity containers here:
I ended up sending a message to an MPI mailing list to see if I could get some guidance, as I think this is specific to using a container. I received a response, which I’ll just drop here:
this is a nicely complex problem that I can’t say I know a solution to,
but, let me say what I know and perhaps it’ll shed some light on the
To answer your question on how mpirun interacts with srun (or SLURM in
general), most MPIs (or better to say, PMIs that MPI uses for process
launch) these days have SLURM support so when built they can leverage
SLURM. Or the SLURM is set up to facilitate the remote node connection
(e.g. by hijacking ssh through its own PMI - I don’t know this, I’m just
guessing). So, for the MPI distros that I tried (MPICH and derivatives -
Intel MPI, MVAPICH2; and OpenMPI), mpirun at some point calls srun, no
matter if it was built with SLURM support explicitly or not. Which would
explain the srun error you are getting.
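A quick way to check this from inside the container is to see which launcher binaries are actually on PATH. The following is only a minimal sketch: the helper name `find_launchers` and the list of launcher names are illustrative assumptions, not part of the LSST stack or of any MPI distribution.

```python
import shutil


def find_launchers(names=("srun", "mpirun", "mpiexec")):
    """Map each launcher name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # Run this once on the host and once inside the container, then compare.
    for name, path in find_launchers().items():
        print(f"{name}: {path if path else 'not found on PATH'}")
```

If `srun` resolves on the host but not inside the container, that would be consistent with the execvp error above.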
Now, what I think is happening in your case is that you are calling the
mpirun (or its equivalent inside mpi4py) from INSIDE of the container,
where there’s no srun. Notice that most MPI container examples, including
the very well written ANL page, instruct you to use mpirun (or aprun in
Cray’s case) OUTSIDE of the container (the host), and launch N instances
of the container through the mpirun.
I reproduced your problem on our system in the following way:
- Build a Singularity container with local MPI installation, e.g.
- shell into the container and build some mpi program (e.g. I have the
cpi.c example from mpich - mpicc cpi.c -o cpi).
- This then runs OK in the container on an interactive node (that is, outside
a SLURM job, so mpirun does not use SLURM’s PMI and does not call srun).
- Launch the job, then shell into the container, and try to run mpirun
-np 2 ./cpi - I get the same error you get, since
$ which srun
Now, I can try to set the path to the SLURM binaries
$ export PATH="/uufs/notchpeak.peaks/sys/installdir/slurm/std/bin:$PATH"
$ which srun
but then get another error:
$ mpirun -np 2 ./cpi
srun: error: Invalid user for SlurmUser slurm, ignored
srun: fatal: Unable to process configuration file
so the environment needs some more changes to get the srun to work
correctly from inside the container. Though I think this would still only
be hackable for an intra-node MPI launch, as inter-node you’ll rely on
SLURM that would have to get accessed from outside of the container.
So, bottom line, launching mpirun from the host is preferable.
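As a rough illustration of that host-side pattern, the sketch below only builds the command line and does not run anything. The function name `host_launch_command`, the image name `stack.sif`, and the choice of `srun -n` as the host launcher are assumptions made for the example. Whether the resulting job works also depends on the MPI inside the container being compatible with the host's process manager.

```python
import shlex


def host_launch_command(image, program_args, ntasks, runtime="singularity"):
    """Build an argv list that starts `ntasks` container instances via srun.

    `image` and `program_args` are placeholders supplied by the caller.
    """
    if ntasks < 1:
        raise ValueError("ntasks must be at least 1")
    if runtime == "singularity":
        container = ["singularity", "exec", image]
    elif runtime == "shifter":
        container = ["shifter", f"--image={image}"]
    else:
        raise ValueError(f"unknown runtime: {runtime}")
    # srun runs on the host; each task it starts enters its own container instance.
    return ["srun", "-n", str(ntasks)] + container + list(program_args)


if __name__ == "__main__":
    cmd = host_launch_command("stack.sif", ["./cpi"], ntasks=2)
    print(shlex.join(cmd))  # srun -n 2 singularity exec stack.sif ./cpi
```

The point of the design is that the launcher never has to exist inside the image: only the program itself runs in the container.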
I am not sure how you can launch the mpi4py from the host, since I don’t
use mpi4py, but, in theory it should not be any different than launching
MPI binaries. Though I figure modifying your launch scripts around this
may be complicated.
BTW, I have had reasonable success with mixing ABI-compatible MPIs
(MPICH, MVAPICH2, Intel MPI) in and out of the container. It often works
but sometimes it does not.
Any thoughts would be appreciated… being able to use containers in production would be very beneficial both at NERSC and ANL - and until this bit is sorted, we really can’t do that.