When your job starts to execute, the batch system will execute the script file you submitted on the first node assigned to your job. If your job is to run on multiple cores and/or multiple nodes, it is your script's responsibility to deliver the various tasks to the different cores and/or nodes. How to do this varies with the application, but some common techniques are discussed here.
The scheduler assigns the variable $PBS_NODEFILE, which contains the name of a file that lists all of the nodes that you've been assigned. If you are assigned multiple cores on the same node, the name of that node appears multiple times (once per core assigned) in that file. (Under Slurm, the corresponding information is provided in the $SLURM_NODELIST environmental variable, in a compact range format.)
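As a quick sketch (the node names below are made up for illustration), you can collapse that per-core listing into a per-node core count with a sort | uniq -c pipeline:

```shell
# Hypothetical stand-in for the scheduler-provided $PBS_NODEFILE:
# one line per assigned core.
PBS_NODEFILE=$(mktemp)
printf 'compute-0-1\ncompute-0-1\ncompute-0-2\n' > "$PBS_NODEFILE"

# One output line per node, prefixed with how many cores you hold there.
sort "$PBS_NODEFILE" | uniq -c

rm -f "$PBS_NODEFILE"
```

In a real job you would read the file named by $PBS_NODEFILE directly instead of creating one.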
Serial or single core jobs are the simplest case. Indeed, the batch system starts processing your job script on a core of the first node assigned to your job, and in the single core case this is the only core/node assigned to your job. So there is really nothing special that you need to do; just enter the command for your job and it should run.
You still need to submit jobs to the scheduler to have them run on one of the compute nodes, so you will probably want to submit a batch job script via the sbatch command. Assuming your login name is joeuser and you are running a serial executable called my-serial-code in the bin subdirectory of your home directory, with inputs and outputs in the /lustre/joeuser/data directory, you could use something like this for your job script:
#!/bin/bash
#SBATCH -n 1
#SBATCH -t 10:00:00
#SBATCH --mem=1024
. ~/.bashrc
cd /lustre/joeuser/data
~/bin/my-serial-code
The above script requests a single CPU for at most 10 hours, with 1 GiB (1024 MiB) of RAM. If you saved the above to a file named myjob.sh, you could submit it to the scheduler with the command sbatch myjob.sh.
The next simplest case is if your job is running on a single node but is multithreaded; e.g., OpenMP codes that are not also using MPI will typically fall into this category. Again, usually there is nothing special that you need to do.
One exception is if you are not using all the cores on the node for this job. In this case, you might need to tell the code to limit the number of cores being used. This is true for OpenMP codes, as OpenMP will by default try to use all the cores it can find.
For OpenMP, you can set the environmental variable
OMP_NUM_THREADS
in your job script to match the number of
cores per node requested by the job. Since slurm provides this to you
in $SLURM_CPUS_PER_TASK
, you can just do:
setenv OMP_NUM_THREADS $SLURM_CPUS_PER_TASK
for csh type shells, or
OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_NUM_THREADS
for bourne type shells.
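As a minimal sketch of the bourne-shell variant, runnable outside a job (the :-1 fallback for when $SLURM_CPUS_PER_TASK is unset is our own addition, not something Slurm provides):

```shell
# Use Slurm's per-task core count when present; fall back to 1 thread
# when testing outside a job (the fallback is this sketch's assumption).
OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
export OMP_NUM_THREADS
echo "OpenMP will use $OMP_NUM_THREADS thread(s)"
```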
The above only works if you are using the -c or --cpus-per-task flag to specify the number of cores, which you really should be doing to ensure the cores are allocated on the same node. If however you are using a combination of specifying the number of tasks and the number of nodes (which is not advisable), you should replace the variable SLURM_CPUS_PER_TASK with SLURM_NTASKS, as in the example below.
Putting it all together, you could create a file my-openmp-job.sh with something like:
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 16
#SBATCH -t 10:00:00
#SBATCH --mem-per-cpu=4096
. ~/.bashrc
cd /lustre/joeuser/data
OMP_NUM_THREADS=$SLURM_NTASKS
export OMP_NUM_THREADS
~/bin/my-openmp-code
and then use the command sbatch my-openmp-job.sh to submit it to the scheduler for execution. The above job script requests 16 cores (-n) on a single node (-N) for at most 10 hours (-t) with 4 GiB (4096 MiB, --mem-per-cpu) of RAM per core, or 64 GiB of RAM total. The OpenMP code is in ~/bin/my-openmp-code and the run will start in the /lustre/joeuser/data directory.
Running MPI jobs
The Message Passing Interface (MPI) is a standardized and portable system for communication between the various tasks of parallelized jobs in HPC environments. A number of different implementations of MPI libraries are available on the Division of Information Technology maintained HPC clusters, and usually several versions of each. Although the MPI interface itself is standardized, the different implementations and versions are not binary compatible. It is important that you use the same MPI implementation (and version, and to be safe the same compiler and compiler version) with which your code was compiled.
The best supported MPI library on the Deepthought clusters is OpenMPI, and the later versions are better supported than the earlier ones. OpenMPI is supported for both GNU and Intel compilers, and is compiled to support all of the various interconnect hardware available on our clusters; i.e., it has the best support for InfiniBand on the clusters.
If you are using the Intel compilers, you can also use the native Intel MPI libraries. Again, the later versions are better supported than earlier ones. We do not support using Intel MPI with GNU compilers.
Slurm works well with both the OpenMPI and Intel MPI libraries, making these the easiest to use as well.
You cannot tap or module load multiple MPI implementations or versions at the same time (including the same version for different compilers). You can only have one loaded at any given time. At best, if you tap or module load two versions, only the last will work, if even that. The module command should complain if you attempt to do so.
You will generally want to add the appropriate tap or module load command to your shell initialization scripts (e.g. ~/.cshrc.mine). For the newer OpenMPI and Intel libraries, with slurm, you can also just include the appropriate tap or module load commands in your job script.
Running OpenMPI jobs
OpenMPI is the preferred MPI unless your application specifically requires one of the alternate MPI variants. OpenMPI knows about slurm, which makes it easier to invoke: there is no need to specify the number of tasks to the mpirun command because it can get that from slurm environmental variables. OpenMPI is also compiled to support all of the various interconnect hardware, so for nodes with fast transport (e.g. InfiniBand), the fastest interface will be selected automatically.
Since slurm and OpenMPI get along well, invoking your OpenMPI based code in a slurm job script is fairly simple; for the case of a GNU based OpenMPI code, you can do something like:
#!/bin/bash
#SBATCH --ntasks=8
#SBATCH -t 00:00:30
. ~/.bashrc
cd /homes/payerle/slurm-tests/mpi-tests/helloworld/c
echo "openmpi-gnu-hello.sh"
echo "Nodelist is $SLURM_NODELIST"
module unload intel
#It is recommended that you add the exact version of the
#compiler and MPI library used when you compiled the code
#to improve long-term reproducibility
module load gcc
module load openmpi
module list
which mpirun
mpirun openmpi-gnu-hello
The above is for a bash shell, and most of the lines are just
informational, printing status. The main components are the lines
to load the GNU openmpi libraries (latest) and the mpirun line which
actually invokes our code (openmpi-gnu-hello
in this case).
The unloading of the intel module is optional but a good idea; if you somehow got it loaded, it would conflict with the loading of the gcc and openmpi modules, causing your job to fail.
|
Performance Alert
We have seen significant performance issues when using OpenMPI version 1.6 or 1.6.5 with more than about 15 cores/node and when the setting -bind-to-core is NOT used.
Deepthought2 users on these versions are encouraged to add the -bind-to-core flag to their mpirun command.
|
OpenMPI versions after 1.6.5 have -bind-to-core set as the default. This setting is not appropriate in all cases, but seems to give significantly better performance in the most common cases. See e.g. http://blogs.cisco.com/performance/open-mpi-binding-to-core-by-default/ for a discussion of this change. One case in particular where -bind-to-core is problematic is if you have multiple MPI jobs running on the same node; each MPI job will start binding tasks to cores, but will go through the cores in the same order --- so the first task of each job will be bound to the first core, the second task to the second core, and so on. So if you are running small MPI jobs with the --share flag, you are probably best off NOT using -bind-to-core (for OpenMPI 1.6.5 or earlier), or using --bind-to none (for OpenMPI 1.8.1 or higher).
A similar example, for the Intel compiler version of OpenMPI, using
the C-shell, and including the -bind-to-core
option would be:
#!/bin/csh
#SBATCH --ntasks=8
#SBATCH -t 00:00:30
cd /homes/payerle/slurm-tests/mpi-tests/helloworld/c
echo "openmpi-intel-hello.sh"
echo "Nodelist is $SLURM_NODELIST"
module load intel/2013.1.039
module load openmpi/1.6.5
module list
which mpirun
mpirun -bind-to-core openmpi-intel-hello
|
Because the tap or module load commands for the Intel compilers set up both the compilers and the Intel implementation of the MPI libraries, you MUST NOT tap the Intel compiler AFTER you tap the OpenMPI libraries. At least for older versions of the Intel compilers, doing so will cause the Intel MPI libraries setup to override the OpenMPI libraries. Even re-tap-ing the OpenMPI libraries again will not fix that; you will need to log out and back in. This has been fixed in the tap/module load scripts for the 2012 and later versions of the Intel compilers.
|
Note that when submitting jobs that do not involve GPUs, you are likely to get warnings about CUDA libraries, as described in this FAQ. These are harmless and can be ignored: since your job is not using GPUs, it was likely assigned to non-GPU enabled nodes, which do not have the hardware specific CUDA libraries and therefore generate the warning; as the job does not need the CUDA libraries, the warning is of no consequence.
For more information, see the examples.
The Intel compiler suite includes its own implementation of MPI libraries,
which is available for use if you compile your code with the Intel compilers.
The Intel MPI libraries are made available when you tap
or
module load
the Intel compilers. They are not available with
the GNU compilers.
You are advised to always use the latest Intel compiler suite when compiling your code.
Since slurm and Intel MPI get along well, invoking your Intel MPI based code in a slurm job script is fairly simple;
#!/bin/tcsh
#SBATCH --ntasks=8
#SBATCH -t 00:00:30
cd /homes/payerle/slurm-tests/mpi-tests/helloworld/c
echo "intel-intel-hello.sh"
echo "Nodelist is $SLURM_NODELIST"
module unload openmpi
module load intel/2013.1.039
which mpirun
mpirun intel-intel-hello
The above is for a C-style shell, and most of the lines are just informational, printing status. The main components are the lines to load the Intel MPI libraries and the mpirun line which actually invokes our code (intel-intel-hello in this case).
The unloading of the openmpi module is optional but a good idea; if you
somehow got it loaded, it would conflict with the loading of the Intel MPI
module, probably causing your job to fail.
For more information, see the examples.
|
Use of the LAM MPI libraries is no longer supported. The LAM libraries
do not work properly with Slurm.
Use either the latest OpenMPI or Intel MPI libraries instead.
|
|
The LAM MPI library function which parses the host string from Slurm
appears to be broken. As the LAM MPI libraries are no longer maintained
by the authors, it cannot be fixed by upgrading. The following provides
a workaround, but it is STRONGLY recommended that you move to another
MPI library.
|
The LAM MPI library requires you to explicitly setup the MPI daemons on all the nodes before you start using MPI, and tear them down after your code exits. So to run an MPI code you would typically have the following lines:
#!/bin/tcsh
#SBATCH -t 0:30
#SBATCH --ntasks=8
#Generate a PBS_NODEFILE format nodefile
set PBS_NODEFILE=`/usr/local/slurm/bin/generate_pbs_nodefile`
#and convert it to LAM's desired format
set MPI_NODEFILE=$WORKDIR/mpd_nodes.${SLURM_JOBID}
sort $PBS_NODEFILE | uniq -c | awk '{ printf("%s cpu=%s\n", $2, $1); }' > $MPI_NODEFILE
tap lam-gnu
lamboot $MPI_NODEFILE
mpirun -np $SLURM_NTASKS C YOUR_APPLICATION
lamclean
lamhalt
The lines from the "Generate a PBS_NODEFILE" to the sort | uniq | awk
pipeline are a hack. LAM is supposed to be able to obtain the list of
nodes assigned to your job from environmental variables passed by Slurm,
but there appears to be an error in handling node lists that are simply
comma separated lists of nodes (no range lists). As a result, the
lamboot
can sometimes work without providing a nodes file,
but often it cannot. The "hack" provides a nodes list file in the
format LAM wants. It starts by calling a script to generate a PBS
format node list file, and then massages that into the format that
LAM wants.
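The conversion pipeline can be tried out on made-up node names (hypothetical here; in a real job the input comes from generate_pbs_nodefile) to see the format lamboot expects:

```shell
# Hypothetical PBS-style nodefile: one line per assigned core.
PBS_NODEFILE=$(mktemp)
printf 'node21\nnode21\nnode22\nnode22\nnode22\n' > "$PBS_NODEFILE"

# Collapse to LAM's "host cpu=N" format.
sort "$PBS_NODEFILE" | uniq -c | awk '{ printf("%s cpu=%s\n", $2, $1); }'
# prints:
#   node21 cpu=2
#   node22 cpu=3

rm -f "$PBS_NODEFILE"
```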
The first line after the tap command sets up the MPI pool between
the nodes assigned to your job. Since lamboot
cannot
reliably get this information directly from slurm, we provide the
host list file we generated above.
The second line starts up a copy of YOUR_APPLICATION on each core (hence the 'C') assigned to your job.
The last line cleans up the MPI pool.
For more information, see the examples.
|
Use of the MPICH or LAM MPI libraries is no longer supported on the
Deepthought HPC clusters.
Use either the latest OpenMPI or Intel MPI libraries instead.
|
Note also that if you've never run MPICH before, you'll need to create the file .mpd.conf in your home directory. This file should contain at least a line of the form MPD_SECRETWORD=we23jfn82933. (DO NOT use the example provided, make up your own secret word.)
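A sketch of creating such a file with a randomly generated secret word (writing to a temporary path here rather than the real ~/.mpd.conf, so the sketch is safe to run):

```shell
# Stand-in path for this sketch; in practice this would be ~/.mpd.conf.
MPD_CONF=$(mktemp)

# Generate a random 32-character hex secret rather than a hand-typed word.
echo "MPD_SECRETWORD=$(head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n')" > "$MPD_CONF"

# mpd expects the file to be readable only by you.
chmod 600 "$MPD_CONF"

rm -f "$MPD_CONF"
```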
The MPICH implementation of MPI also requires the MPI pool to be explicitly set up and torn down. The set up step involves starting mpd daemon processes on each of the nodes assigned to your job.
A typical MPICH job will look something like:
#!/bin/tcsh
#SBATCH -t 1:00
#SBATCH --ntasks=8
set WORKDIR=/export/lustre_1/my_workdir
cd $WORKDIR
tap mpich-gnu
#Generate a PBS_NODEFILE format nodefile
set PBS_NODEFILE=`/usr/local/slurm/bin/generate_pbs_nodefile`
#and convert it to MPICH's desired format
set MPI_NODEFILE=$WORKDIR/mpd_nodes.${SLURM_JOBID}
sort $PBS_NODEFILE | uniq -c | awk '{ printf("%s:%s\n", $2, $1); }' > $MPI_NODEFILE
mpdboot -n $SLURM_JOB_NUM_NODES -f $MPI_NODEFILE
mpiexec -n $SLURM_NTASKS YOUR_PROGRAM
mpdallexit
After specifying our job requirements, and going to our work directory,
we tap the MPI libraries for our code. We then proceed
to generate a file containing the list of nodes for our job. Slurm returns
this in an environmental variable, in an abbreviated format. To avoid
having to convert that directly to MPICH format, we call Slurm's
Moab/Maui/PBS/Torque compatibility tool, generate_pbs_nodefile
which generates a nodefile in the PBS format. This is not quite what we
want, as MPICH wants the name of each node to appear only once, followed
by a :
and the number of tasks to run on the node. The
sort
, uniq
, and awk
pipeline
converts it for us.
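Again, the pipeline can be tried on made-up node names (hypothetical, as before) to see MPICH's "host:N" format:

```shell
# Hypothetical PBS-style nodefile: one line per assigned core.
PBS_NODEFILE=$(mktemp)
printf 'node21\nnode21\nnode22\nnode22\nnode22\n' > "$PBS_NODEFILE"

# Collapse to MPICH's "host:N" format.
sort "$PBS_NODEFILE" | uniq -c | awk '{ printf("%s:%s\n", $2, $1); }'
# prints:
#   node21:2
#   node22:3

rm -f "$PBS_NODEFILE"
```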
The mpdboot
line starts an mpd daemon on each of your
nodes (using the Slurm SLURM_JOB_NUM_NODES
variable to
specify how many nodes we have). The nodes are passed via the nodefile
we just created.
The mpiexec
line starts SLURM_NTASKS
instances
of your code, one for each core assigned to your job.
And the mpdallexit
line cleans up the MPICH daemons, etc.
For more information, see the examples.
The above will work as long as you do not run more than one MPI job on
the same node at the same time; since most MPI jobs use all the cores on a
node anyway, it is fine for most people. If you do run into the situation
where multiple MPI jobs are sharing nodes, when the first job calls mpdallexit,
all the mpds for all jobs will be killed, which will make the second and later
jobs unhappy. In these cases, you will want to set the environmental
variable MPD_CON_EXT
to something unique (e.g. the job id) before
calling mpdboot
, and add the --remcons
option to
mpdboot, e.g.
mpdboot -n $SLURM_JOB_NUM_NODES -f $MPI_NODEFILE --remcons
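A minimal sketch of setting MPD_CON_EXT, with a fallback to the shell's PID when run outside a job (the fallback is our own addition, for testability):

```shell
# Use the job id when Slurm provides one; otherwise fall back to our PID
# so the sketch still produces a unique-ish value outside a job.
MPD_CON_EXT=${SLURM_JOBID:-$$}
export MPD_CON_EXT
echo "mpd console extension: $MPD_CON_EXT"
```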
If you are doing hybrid OpenMP/OpenMPI parallelization, the simple
examples above will not suffice. In particular, you will need to provide
more input to the scheduler about how you wish to split the cores assigned
to you between OpenMP and MPI. The sbatch command has additional
parameters that can be used for this; you can tell it how many MPI
tasks you wish to run, and how many cores each task should get.
Typically, in addition to --ntasks, you would want to specify:
--ntasks-per-node: How many MPI tasks to run on each node; in a hybrid OpenMP/MPI case this would typically be set to 1.
--cpus-per-task: How many cores should get assigned to each task.
So if your job wants 12 nodes, with each node running an 8 core OpenMP process that talks to the other nodes over MPI, you would do something like:
#!/bin/bash
#SBATCH --nodes=12
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH -t 72:00:00
module load openmpi/1.6.5
OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_NUM_THREADS
mpirun my-hybrid-code
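As a sanity check on the geometry above, the core counts multiply out as follows (values hard-coded to match the example request, since the SLURM_* variables are only set inside a running job):

```shell
# Geometry from the example request above (hard-coded for this sketch).
nodes=12
tasks_per_node=1
cpus_per_task=8

total_cores=$(( nodes * tasks_per_node * cpus_per_task ))
mpi_tasks=$(( nodes * tasks_per_node ))
echo "$mpi_tasks MPI tasks with $cpus_per_task threads each: $total_cores cores total"
```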
MPI is currently the most standard way of launching, controlling, synchronizing, and communicating across multi-node jobs, but it is not the only way. Some applications have their own process for running across multiple nodes, and in such cases you should follow their instructions.
The examples page shows an example of using the basic ssh command to start a process on each of the nodes assigned to your job. Something like this could be used to break a problem into N chunks that can be processed independently, sending each chunk to a different core/node. However, most real parallel jobs require much more than just launching the code: the passing of data back and forth, synchronization, etc. And for a simple job as described, it is often better to submit separate jobs to the batch system for each chunk.
Note: To successfully use ssh to start processes on nodes assigned to your job, you will need to set up passwordless ssh among the compute nodes.
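A sketch of such an ssh-based launcher. The node names are hard-coded stand-ins so the loop itself can be run anywhere; inside a job, something like scontrol show hostnames "$SLURM_NODELIST" would produce the real list:

```shell
# Hard-coded stand-in for the job's node list (hypothetical names).
NODES="node1 node2"

chunk=0
for node in $NODES; do
    # In a real job this line would be: ssh "$node" my-chunk-worker $chunk &
    echo "would run chunk $chunk on $node"
    chunk=$((chunk + 1))
done
# wait   # then wait for all the background ssh processes to finish
```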