When your job starts to execute, the batch system will execute the script file you submitted on the first node assigned to your job. If your job is to run on multiple cores and/or multiple nodes, it is your script's responsibility to deliver the various tasks to the different cores and/or nodes. How to do this varies with the application, but some common techniques are discussed here.
The scheduler assigns the variable $PBS_NODEFILE, which contains the name of a file that lists all of the nodes that you've been assigned. If you are assigned multiple cores on the same node, the name of that node appears multiple times (once per core assigned) in that file. (Under Slurm, the corresponding information is provided in the $SLURM_NODELIST environmental variable, in a compact range format.)
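As a quick sketch (the node names below are made up for illustration), you can collapse that per-core listing into a per-node core count with a sort | uniq -c pipeline:

```shell
# Hypothetical stand-in for the scheduler-provided $PBS_NODEFILE:
# one line per assigned core.
PBS_NODEFILE=$(mktemp)
printf 'compute-0-1\ncompute-0-1\ncompute-0-2\n' > "$PBS_NODEFILE"

# One output line per node, prefixed with how many cores you hold there.
sort "$PBS_NODEFILE" | uniq -c

rm -f "$PBS_NODEFILE"
```

In a real job you would read the file named by $PBS_NODEFILE directly instead of creating one.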
Serial or single core jobs are the simplest case. Indeed, the batch system starts processing your job script on a core of the first node assigned to your job, and in the single core case this is the only core/node assigned to your job. So there is really nothing special that you need to do; just enter the command for your job and it should run.
You still need to submit jobs to the scheduler to have them run on one of the compute nodes, so you will probably want to submit a batch job script via the sbatch command. Assuming your login name is joeuser and you are running a serial executable called my-serial-code in the bin subdirectory of your home directory, with inputs and outputs in the /lustre/joeuser/data directory, you could use something like this for your job script:
#!/bin/bash
#SBATCH -n 1
#SBATCH -t 10:00:00
#SBATCH --mem=1024
. ~/.bashrc
cd /lustre/joeuser/data
~/bin/my-serial-code
The above script requests a single CPU for at most 10 hours, with 1 GiB (1024 MiB) of RAM. If you saved the above to a file named myjob.sh, you could submit it to the scheduler with the command sbatch myjob.sh.
The next simplest case is if your job is running on a single node but is multithreaded; e.g., OpenMP codes that are not also using MPI will typically fall into this category. Again, usually there is nothing special that you need to do.
One exception is if you are not using all the cores on the node for this job. In this case, you might need to tell the code to limit the number of cores being used. This is true for OpenMP codes, as OpenMP will by default try to use all the cores it can find.
For OpenMP, you can set the environmental variable
OMP_NUM_THREADS
in your job script to match the number of
cores per node requested by the job. Since slurm provides this to you
in $SLURM_CPUS_PER_TASK
, you can just do:
setenv OMP_NUM_THREADS $SLURM_CPUS_PER_TASK
for csh type shells, or
OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_NUM_THREADS
for bourne type shells.
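As a minimal sketch of the bourne-shell variant, runnable outside a job (the :-1 fallback for when $SLURM_CPUS_PER_TASK is unset is our own addition, not something Slurm provides):

```shell
# Use Slurm's per-task core count when present; fall back to 1 thread
# when testing outside a job (the fallback is this sketch's assumption).
OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
export OMP_NUM_THREADS
echo "OpenMP will use $OMP_NUM_THREADS thread(s)"
```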
The above only works if you are using the -c or --cpus-per-task flag to specify the number of cores, which you really should be doing to ensure the cores are allocated on the same node. If however you are using a combination of specifying the number of tasks and the number of nodes (which is not advisable), you should replace the variable SLURM_CPUS_PER_TASK with SLURM_NTASKS, as in the example below.
Putting it all together, you could create a file my-openmp-job.sh with something like:
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 16
#SBATCH -t 10:00:00
#SBATCH --mem-per-cpu=4096
. ~/.bashrc
cd /lustre/joeuser/data
OMP_NUM_THREADS=$SLURM_NTASKS
export OMP_NUM_THREADS
~/bin/my-openmp-code
and then use the command sbatch my-openmp-job.sh to submit it to the scheduler for execution. The above job script requests 16 cores (-n) on a single node (-N) for at most 10 hours (-t) with 4 GiB (4096 MiB, --mem-per-cpu) of RAM per core, or 64 GiB of RAM total. The OpenMP code is in ~/bin/my-openmp-code and the run will start in the /lustre/joeuser/data directory.
Running MPI jobs
The Message Passing Interface (MPI) is a standardized and portable system for communication between the various tasks of parallelized jobs in HPC environments. A number of different implementations of MPI libraries are available on the Division of Information Technology maintained HPC clusters, and usually several versions of each. Although the MPI interface itself is standardized, the different implementations and versions are not binary compatible. It is important that you use the same MPI implementation (and version, and to be safe the same compiler and compiler version) with which your code was compiled.
The best supported MPI library on the Deepthought clusters is OpenMPI, and the later versions are better supported than the earlier ones. OpenMPI is supported for both GNU and Intel compilers, and is compiled to support all of the various interconnect hardware available on our clusters; i.e., it has the best support for InfiniBand on the clusters.
If you are using the Intel compilers, you can also use the native Intel MPI libraries. Again, the later versions are better supported than earlier ones. We do not support using Intel MPI with GNU compilers.
Slurm works well with both the OpenMPI and Intel MPI libraries, making these the easiest to use as well.
You cannot tap or module load multiple MPI implementations or versions at the same time (including the same version for different compilers). You can only have one loaded at any given time. At best, if you tap or module load two versions, only the last will work, if even that. The module command should complain if you attempt to do so.
You will generally want to add the appropriate tap or module load command to your shell initialization scripts (e.g. ~/.cshrc.mine). For the newer OpenMPI and Intel libraries, with slurm, you can also just include the appropriate tap or module load commands in your job script.
Running OpenMPI jobs
OpenMPI is the preferred MPI unless your application specifically requires one of the alternate MPI variants. OpenMPI knows about slurm, which makes it easier to invoke: there is no need to specify the number of tasks to the mpirun command because it can get that from slurm environmental variables. OpenMPI is also compiled to support all of the various interconnect hardware, so for nodes with fast transport (e.g. InfiniBand), the fastest interface will be selected automatically.
Since slurm and OpenMPI get along well, invoking your OpenMPI based code in a slurm job script is fairly simple; for the case of a GNU based OpenMPI code, you can do something like:
#!/bin/bash
#SBATCH --ntasks=8
#SBATCH -t 00:00:30
. ~/.bashrc
cd /homes/payerle/slurm-tests/mpi-tests/helloworld/c
echo "openmpi-gnu-hello.sh"
echo "Nodelist is $SLURM_NODELIST"
module unload intel
#It is recommended that you add the exact version of the
#compiler and MPI library used when you compiled the code
#to improve long-term reproducibility
module load gcc
module load openmpi
module list
which mpirun
mpirun openmpi-gnu-hello
The above is for a bash shell, and most of the lines are just
informational, printing status. The main components are the lines
to load the GNU openmpi libraries (latest) and the mpirun line which
actually invokes our code (openmpi-gnu-hello
in this case).
The unloading of the intel module is optional but a good idea; if you somehow got it loaded, it would conflict with the loading of the gcc and openmpi modules, causing your job to fail.
|
Performance Alert
We have seen significant performance issues when using OpenMPI version 1.6 or 1.6.5 with more than about 15 cores/node and when the setting -bind-to-core is NOT used.
Deepthought2 users on these versions are encouraged to add the -bind-to-core flag to their mpirun command.
|
OpenMPI versions after 1.6.5 have -bind-to-core set as the default. This setting is not appropriate in all cases, but seems to give significantly better performance in the most common cases. See e.g. http://blogs.cisco.com/performance/open-mpi-binding-to-core-by-default/ for a discussion of this change. One case in particular where -bind-to-core is problematic is if you have multiple MPI jobs running on the same node; each MPI job will start binding tasks to cores, but will go through the cores in the same order --- so the first task of each job will be bound to the first core, the second task to the second core, and so on. So if you are running small MPI jobs with the --share flag, you are probably best off NOT using -bind-to-core (for OpenMPI 1.6.5 or earlier), or using --bind-to none (for OpenMPI 1.8.1 or higher).
A similar example, for the Intel compiler version of OpenMPI, using
the C-shell, and including the -bind-to-core
option would be:
#!/bin/csh
#SBATCH --ntasks=8
#SBATCH -t 00:00:30
cd /homes/payerle/slurm-tests/mpi-tests/helloworld/c
echo "openmpi-intel-hello.sh"
echo "Nodelist is $SLURM_NODELIST"
module load intel/2013.1.039
module load openmpi/1.6.5
module list
which mpirun
mpirun -bind-to-core openmpi-intel-hello
|
Because the tap or module load commands for the Intel compilers set up both the compilers and the Intel implementation of the MPI libraries, you MUST NOT tap the Intel compiler AFTER you tap the OpenMPI libraries. At least for older versions of the Intel compilers, doing so will cause the Intel MPI libraries setup to override the OpenMPI libraries. Even re-tap-ing the OpenMPI libraries again will not fix that; you will need to log out and back in. This has been fixed in the tap/module load scripts for the 2012 and later versions of the Intel compilers.
|
Note that when submitting jobs that do not involve GPUs, you are likely to get warnings about CUDA libraries, as described in this FAQ. These are harmless and can be ignored: since your job is not using GPUs, it was likely assigned to non-GPU enabled nodes, which do not have the hardware specific CUDA libraries and therefore generate the warning; as the job does not need the CUDA libraries, the warning is of no consequence.
For more information, see the examples.
The Intel compiler suite includes its own implementation of MPI libraries,
which is available for use if you compile your code with the Intel compilers.
The Intel MPI libraries are made available when you tap
or
module load
the Intel compilers. They are not available with
the GNU compilers.
You are advised to always use the latest Intel compiler suite when compiling your code.
Since slurm and Intel MPI get along well, invoking your Intel MPI based code in a slurm job script is fairly simple;
#!/bin/tcsh
#SBATCH --ntasks=8
#SBATCH -t 00:00:30
cd /homes/payerle/slurm-tests/mpi-tests/helloworld/c
echo "intel-intel-hello.sh"
echo "Nodelist is $SLURM_NODELIST"
module unload openmpi
module load intel/2013.1.039
which mpirun
mpirun intel-intel-hello
The above is for a C-style shell, and most of the lines are just informational, printing status. The main components are the lines to load the Intel MPI libraries and the mpirun line which actually invokes our code (intel-intel-hello in this case).
The unloading of the openmpi module is optional but a good idea; if you
somehow got it loaded, it would conflict with the loading of the Intel MPI
module, probably causing your job to fail.
For more information, see the examples.
|
Use of the LAM MPI libraries is no longer supported. The LAM libraries
do not work properly with Slurm.
Use either the latest OpenMPI or Intel MPI libraries instead.
|
|
The LAM MPI library function which parses the host string from Slurm
appears to be broken. As the LAM MPI libraries are no longer maintained
by the authors, it cannot be fixed by upgrading. The following provides
a workaround, but it is STRONGLY recommended that you move to another
MPI library.
|
The LAM MPI library requires you to explicitly setup the MPI daemons on all the nodes before you start using MPI, and tear them down after your code exits. So to run an MPI code you would typically have the following lines:
#!/bin/tcsh
#SBATCH -t 0:30
#SBATCH --ntasks=8
#Generate a PBS_NODEFILE format nodefile
set PBS_NODEFILE=`/usr/local/slurm/bin/generate_pbs_nodefile`
#and convert it to LAM's desired format
set MPI_NODEFILE=$WORKDIR/mpd_nodes.${SLURM_JOBID}
sort $PBS_NODEFILE | uniq -c | awk '{ printf("%s cpu=%s\n", $2, $1); }' > $MPI_NODEFILE
tap lam-gnu
lamboot $MPI_NODEFILE
mpirun -np $SLURM_NTASKS C YOUR_APPLICATION
lamclean
lamhalt
The lines from the "Generate a PBS_NODEFILE" to the sort | uniq | awk
pipeline are a hack. LAM is supposed to be able to obtain the list of
nodes assigned to your job from environmental variables passed by Slurm,
but there appears to be an error in handling node lists that are simply
comma separated lists of nodes (no range lists). As a result, the
lamboot
can sometimes work without providing a nodes file,
but often it cannot. The "hack" provides a nodes list file in the
format LAM wants. It starts by calling a script to generate a PBS
format node list file, and then massages that into the format that
LAM wants.
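The conversion pipeline can be tried out on made-up node names (hypothetical here; in a real job the input comes from generate_pbs_nodefile) to see the format lamboot expects:

```shell
# Hypothetical PBS-style nodefile: one line per assigned core.
PBS_NODEFILE=$(mktemp)
printf 'node21\nnode21\nnode22\nnode22\nnode22\n' > "$PBS_NODEFILE"

# Collapse to LAM's "host cpu=N" format.
sort "$PBS_NODEFILE" | uniq -c | awk '{ printf("%s cpu=%s\n", $2, $1); }'
# prints:
#   node21 cpu=2
#   node22 cpu=3

rm -f "$PBS_NODEFILE"
```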
The first line after the tap command sets up the MPI pool between
the nodes assigned to your job. Since lamboot
cannot
reliably get this information directly from slurm, we provide the
host list file we generated above.
The second line starts up a copy of YOUR_APPLICATION on each core (hence the 'C') assigned to your job.
The last line cleans up the MPI pool.
For more information, see the examples.
|
Use of the MPICH or LAM MPI libraries is no longer supported on the
Deepthought HPC clusters.
Use either the latest OpenMPI or Intel MPI libraries instead.
|
Note also that if you've never run MPICH before, you'll need to create the file .mpd.conf in your home directory. This file should contain at least a line of the form MPD_SECRETWORD=we23jfn82933. (DO NOT use the example provided, make up your own secret word.)
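A sketch of creating such a file with a randomly generated secret word (writing to a temporary path here rather than the real ~/.mpd.conf, so the sketch is safe to run):

```shell
# Stand-in path for this sketch; in practice this would be ~/.mpd.conf.
MPD_CONF=$(mktemp)

# Generate a random 32-character hex secret rather than a hand-typed word.
echo "MPD_SECRETWORD=$(head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n')" > "$MPD_CONF"

# mpd expects the file to be readable only by you.
chmod 600 "$MPD_CONF"

rm -f "$MPD_CONF"
```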
The MPICH implementation of MPI also requires the MPI pool to be explicitly set up and torn down. The set up step involves starting mpd daemon processes on each of the nodes assigned to your job.
A typical MPICH job will look something like:
#!/bin/tcsh
#SBATCH -t 1:00
#SBATCH --ntasks=8
set WORKDIR=/export/lustre_1/my_workdir
cd $WORKDIR
tap mpich-gnu
#Generate a PBS_NODEFILE format nodefile
set PBS_NODEFILE=`/usr/local/slurm/bin/generate_pbs_nodefile`
#and convert it to MPICH's desired format
set MPI_NODEFILE=$WORKDIR/mpd_nodes.${SLURM_JOBID}
sort $PBS_NODEFILE | uniq -c | awk '{ printf("%s:%s\n", $2, $1); }' > $MPI_NODEFILE
mpdboot -n $SLURM_JOB_NUM_NODES -f $MPI_NODEFILE
mpiexec -n $SLURM_NTASKS YOUR_PROGRAM
mpdallexit
After specifying our job requirements, and going to our work directory,
we tap the MPI libraries for our code. We then proceed
to generate a file containing the list of nodes for our job. Slurm returns
this in an environmental variable, in an abbreviated format. To avoid
having to convert that directly to MPICH format, we call Slurm's
Moab/Maui/PBS/Torque compatibility tool, generate_pbs_nodefile
which generates a nodefile in the PBS format. This is not quite what we
want, as MPICH wants the name of each node to appear only once, followed
by a :
and the number of tasks to run on the node. The
sort
, uniq
, and awk
pipeline
converts it for us.
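Again, the pipeline can be tried on made-up node names (hypothetical, as before) to see MPICH's "host:N" format:

```shell
# Hypothetical PBS-style nodefile: one line per assigned core.
PBS_NODEFILE=$(mktemp)
printf 'node21\nnode21\nnode22\nnode22\nnode22\n' > "$PBS_NODEFILE"

# Collapse to MPICH's "host:N" format.
sort "$PBS_NODEFILE" | uniq -c | awk '{ printf("%s:%s\n", $2, $1); }'
# prints:
#   node21:2
#   node22:3

rm -f "$PBS_NODEFILE"
```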
The mpdboot
line starts an mpd daemon on each of your
nodes (using the Slurm SLURM_JOB_NUM_NODES
variable to
specify how many nodes we have). The nodes are passed via the nodefile
we just created.
The mpiexec
line starts SLURM_NTASKS
instances
of your code, one for each core assigned to your job.
And the mpdallexit
line cleans up the MPICH daemons, etc.
For more information, see the examples.
The above will work as long as you do not run more than one MPI job on
the same node at the same time; since most MPI jobs use all the cores on a
node anyway, it is fine for most people. If you do run into the situation
where multiple MPI jobs are sharing nodes, when the first job calls mpdallexit,
all the mpds for all jobs will be killed, which will make the second and later
jobs unhappy. In these cases, you will want to set the environmental
variable MPD_CON_EXT
to something unique (e.g. the job id) before
calling mpdboot
, and add the --remcons
option to
mpdboot, e.g.
mpdboot -n $SLURM_JOB_NUM_NODES -f $MPI_NODEFILE --remcons
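A minimal sketch of setting MPD_CON_EXT, with a fallback to the shell's PID when run outside a job (the fallback is our own addition, for testability):

```shell
# Use the job id when Slurm provides one; otherwise fall back to our PID
# so the sketch still produces a unique-ish value outside a job.
MPD_CON_EXT=${SLURM_JOBID:-$$}
export MPD_CON_EXT
echo "mpd console extension: $MPD_CON_EXT"
```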
If you are doing hybrid OpenMP/OpenMPI parallelization, the simple
examples above will not suffice. In particular, you will need to provide
more input to the scheduler about how you wish to split the cores assigned
to you between OpenMP and MPI. The sbatch command has additional
parameters that can be used for this; you can tell it how many MPI
tasks you wish to run, and how many cores each task should get.
Typically, in addition to --ntasks, you would want to specify:
--ntasks-per-node: How many MPI tasks to run on each node; in a hybrid OpenMP/MPI case this would typically be set to 1.
--cpus-per-task: How many cores should get assigned to each task.
So if your job wants 12 nodes, with each node running an 8 core OpenMP process that talks to the other nodes over MPI, you would do something like:
#!/bin/bash
#SBATCH --nodes=12
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH -t 72:00:00
module load openmpi/1.6.5
OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_NUM_THREADS
mpirun my-hybrid-code
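As a sanity check on the geometry above, the core counts multiply out as follows (values hard-coded to match the example request, since the SLURM_* variables are only set inside a running job):

```shell
# Geometry from the example request above (hard-coded for this sketch).
nodes=12
tasks_per_node=1
cpus_per_task=8

total_cores=$(( nodes * tasks_per_node * cpus_per_task ))
mpi_tasks=$(( nodes * tasks_per_node ))
echo "$mpi_tasks MPI tasks with $cpus_per_task threads each: $total_cores cores total"
```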
MPI is currently the most standard way of launching, controlling, synchronizing, and communicating across multi-node jobs, but it is not the only way. Some applications have their own process for running across multiple nodes, and in such cases you should follow their instructions.
The examples page shows an example of using the basic ssh command to start a process on each of the nodes assigned to your job. Something like this could be used to break a problem into N chunks that can be processed independently, sending each chunk to a different core/node. However, most real parallel jobs require much more than just launching the code: the passing of data back and forth, synchronization, etc. And for a simple job as described, it is often better to submit separate jobs to the batch system for each chunk.
Note: To successfully use ssh to start processes on nodes assigned to your job, you will need to set up passwordless ssh among the compute nodes.
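A sketch of such an ssh-based launcher. The node names are hard-coded stand-ins so the loop itself can be run anywhere; inside a job, something like scontrol show hostnames "$SLURM_NODELIST" would produce the real list:

```shell
# Hard-coded stand-in for the job's node list (hypothetical names).
NODES="node1 node2"

chunk=0
for node in $NODES; do
    # In a real job this line would be: ssh "$node" my-chunk-worker $chunk &
    echo "would run chunk $chunk on $node"
    chunk=$((chunk + 1))
done
# wait   # then wait for all the background ssh processes to finish
```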