MPI Job Submission Examples

This page provides some examples of submitting a simple MPI job using both the OpenMPI and Intel MPI libraries. It is based on the Basic_MPI_Job and similar job templates in the OnDemand portal.

This job makes use of a simple Hello World! program called hello-umd, available in the UMD HPC cluster software library, which supports sequential, multithreaded, and MPI modes of operation. The code simply prints an identifying message from each thread of each task --- in this pure MPI case each task consists of a single thread, so it prints one message per MPI task.

Overview

These examples are treated together because they are nearly identical. Each example consists of a single file, the job script submit.sh (see below for a listing and line-by-line explanation), which gets submitted to the cluster via the sbatch command.

The scripts are designed to show many good practices, including:

  • setting standard sbatch options within the script
  • loading the needed modules within the script
  • printing some useful diagnostic information at start of the script
  • creating a job specific work directory
  • running the code and saving the exit code
  • exiting with the exit code from the main application

Many of the practices above are overkill for such a simple job --- indeed, the vast majority of the lines implement these "good practices" rather than run the intended code --- but they are included for educational purposes.

This code runs hello-umd in MPI mode, saving the output to a file in a job-specific work directory (which the script creates). A symbolic link to this work directory is placed in the submission directory so that users of the OnDemand portal can easily access the work directory. We could have forgone all that and simply let the output of hello-umd go to standard output, which would be available in the slurm-JOBNUMBER.out file (or whatever file you instructed Slurm to use instead). Doing so is acceptable as long as the code does not produce an excessive amount (many MBs) of output --- if the code produces a lot of output, sending it all to the Slurm output file can cause problems, and it is better to redirect it to a file.

The submission scripts

We present two cases, one using the OpenMPI library with the GNU compiler suite, and the other using the Intel MPI library with the Intel compiler suite. Afterwards, we give a detailed, line-by-line commentary on the scripts.

As can be seen from the similarities in the two scripts, there is little difference between the two cases. For the most part, the choice of MPI library depends on the code being run --- the mpirun or similar command must come from the same MPI library (down to the version of the MPI library, and even the version of the compiler used to compile it) that the MPI application was built with. That is, if your application was built against the OpenMPI libraries, use the same build of OpenMPI at runtime. Likewise, if your application was built against the Intel MPI libraries, use the same build of the Intel MPI libraries (i.e. the same Intel compiler) at runtime.

For codes built by system staff and made available via the module command, the module command will generally enforce this. (You might be able to get around this, but it will take some effort. Do not circumvent this.) These sample jobs use hello-umd, which has builds with both Intel MPI and OpenMPI, and the module command automatically selects the appropriate version for you based on the previously loaded compiler and MPI library.

For codes you are building yourself, the choice of compiler and MPI library to use in the build stage is generally up to you. You should look at the recommendations of the authors of the software for guidance. Very broadly speaking, the Intel compilers and Intel MPI generally have the best optimizations on Intel processors, but these "bleeding edge" optimizations can sometimes cause problems. The GNU compilers and OpenMPI are likely to be better supported by most open-source packages, but are usually not quite as highly optimized.

OpenMPI case

We first look at the submission script for the case using the GCC compiler suite and OpenMPI. You can download the source code as plain text. We also present a copy with line numbers here for discussion:

HelloUMD-MPI_gcc_openmpi job submission script
Line# Code
  1  #!/bin/bash
  2  # The line above this is the "shebang" line.  It must be first line in script
  3  #-----------------------------------------------------
  4  #	OnDemand Job Template for Hello-UMD, MPI version
  5  #	Runs a simple MPI enabled hello-world code
  6  #-----------------------------------------------------
  7  #
  8  # Slurm sbatch parameters section:
  9  #	Request 60 MPI tasks with 1 CPU core each
 10  #SBATCH -n 60
 11  #SBATCH -c 1
 12  #	Request 5 minutes of walltime
 13  #SBATCH -t 5
 14  #	Request 1 GB of memory per CPU core
 15  #SBATCH --mem-per-cpu=1024
 16  #	Do not allow other jobs to run on same node
 17  #SBATCH --exclusive
 18  #	Run on debug partition for rapid turnaround.  You will need
 19  #	to change this (remove the line) if walltime > 15 minutes
 20  #SBATCH --partition=debug
 21  #	Do not inherit the environment of the process running the
 22  #	sbatch command.  This requires you to explicitly set up the
 23  #	environment for the job in this script, improving reproducibility
 24  #SBATCH --export=NONE
 25  #
 26
 27  # This job will run the MPI enabled version of hello-umd
 28  # We create a directory on a parallel filesystem from where we actually
 29  # will run the job.
 30
 31  # Section to ensure we have the "module" command defined
 32  unalias tap >& /dev/null
 33  if [ -f ~/.bash_profile ]; then
 34  	source ~/.bash_profile
 35  elif [ -f ~/.profile ]; then
 36  	source ~/.profile
 37  fi
 38
 39  # Set SLURM_EXPORT_ENV to ALL.  This prevents the --export=NONE flag
 40  # from being passed to mpirun/srun/etc, which can cause issues.
 41  # We want the environment of the job script to be passed to all
 42  # tasks/processes of the job
 43  export SLURM_EXPORT_ENV=ALL
 44
 45  # Module load section
 46  # First clear our module list
 47  module purge
 48  # and reload the standard modules
 49  module load hpcc/deepthought2
 50  # Load the desired compiler, MPI, and package modules
 51  # NOTE: You need to use the same compiler and MPI module used
 52  # when compiling the MPI-enabled code you wish to run (in this
 53  # case hello-umd).  The values listed below are correct for the
 54  # version of hello-umd we will be using, but you may need to
 55  # change them if you wish to run a different package.
 56  module load gcc/8.4.0
 57  module load openmpi/3.1.5
 58  module load hello-umd/1.5
 59
 60  # Section to make a scratch directory for this job
 61  # Because different MPI tasks, which might be on different nodes, will
 62  # need access to it, we put it in a parallel filesystem.
 63  # We include the Slurm job id in the directory name to avoid interference
 64  # if multiple jobs are running at the same time.
 65  TMPWORKDIR="/lustre/$USER/ood-job.${SLURM_JOBID}"
 66  mkdir $TMPWORKDIR
 67  cd $TMPWORKDIR
 68
 69  # Section to output information identifying the job, etc.
 70  echo "Slurm job ${SLURM_JOBID} running on"
 71  hostname
 72  echo "To run on ${SLURM_NTASKS} CPU cores across ${SLURM_JOB_NUM_NODES} nodes"
 73  echo "All nodes: ${SLURM_JOB_NODELIST}"
 74  date
 75  pwd
 76  echo "Loaded modules are:"
 77  module list
 78  echo "Job will be started out of $TMPWORKDIR"
 79
 80
 81  # Setting this variable will suppress the warnings
 82  # about lack of CUDA support on non-GPU enabled nodes.  We
 83  # are not using CUDA, so the warning is harmless.
 84  export OMPI_MCA_mpi_cuda_support=0
 85
 86  # Get the full path to our hello-umd executable.  It is best
 87  # to provide the full path of our executable to mpirun, etc.
 88  MYEXE=`which hello-umd`
 89  echo "Using executable $MYEXE"
 90
 91  # Run our code using mpirun
 92  # We do not specify the number of tasks here, and instead rely on
 93  # it defaulting to the number of tasks requested of Slurm
 94  mpirun  ${MYEXE}  > hello.out 2>&1
 95  # Save the exit code from the previous command
 96  ECODE=$?
 97
 98  # Output from the above command was placed in a work directory in a parallel
 99  # filesystem.  That parallel filesystem does _not_ get cleaned up automatically.
100  # And it is not normally visible from the Job Composer.
101  # To deal with this, we make a symlink from the job submit directory to
102  # the work directory for the job.
103  #
104  # NOTE: The work directory will continue to exist until you delete it.  It will
105  # not get deleted when you delete the job in Job Composer.
106
107  ln -s ${TMPWORKDIR} ${SLURM_SUBMIT_DIR}/work-dir
108
109  echo "Job finished with exit code $ECODE.  Work dir is $TMPWORKDIR"
110  date
111
112  # Exit with the cached exit code
113  exit $ECODE

Intel MPI case

We next look at the submission script for the case using the Intel compiler suite and Intel MPI. You can download the source code as plain text. We also present a copy with line numbers here for discussion:

HelloUMD-MPI_intel_intelmpi job submission script
Line# Code
  1  #!/bin/bash
  2  # The line above this is the "shebang" line.  It must be first line in script
  3  #-----------------------------------------------------
  4  #	OnDemand Job Template for Hello-UMD, MPI version
  5  #	Runs a simple MPI enabled hello-world code
  6  #-----------------------------------------------------
  7  #
  8  # Slurm sbatch parameters section:
  9  #	Request 60 MPI tasks with 1 CPU core each
 10  #SBATCH -n 60
 11  #SBATCH -c 1
 12  #	Request 5 minutes of walltime
 13  #SBATCH -t 5
 14  #	Request 1 GB of memory per CPU core
 15  #SBATCH --mem-per-cpu=1024
 16  #	Do not allow other jobs to run on same node
 17  #SBATCH --exclusive
 18  #	Run on debug partition for rapid turnaround.  You will need
 19  #	to change this (remove the line) if walltime > 15 minutes
 20  #SBATCH --partition=debug
 21  #	Do not inherit the environment of the process running the
 22  #	sbatch command.  This requires you to explicitly set up the
 23  #	environment for the job in this script, improving reproducibility
 24  #SBATCH --export=NONE
 25  #
 26
 27  # This job will run the MPI enabled version of hello-umd
 28  # We create a directory on a parallel filesystem from where we actually
 29  # will run the job.
 30
 31  # Section to ensure we have the "module" command defined
 32  unalias tap >& /dev/null
 33  if [ -f ~/.bash_profile ]; then
 34  	source ~/.bash_profile
 35  elif [ -f ~/.profile ]; then
 36  	source ~/.profile
 37  fi
 38
 39  # Set SLURM_EXPORT_ENV to ALL.  This prevents the --export=NONE flag
 40  # from being passed to mpirun/srun/etc, which can cause issues.
 41  # We want the environment of the job script to be passed to all
 42  # tasks/processes of the job
 43  export SLURM_EXPORT_ENV=ALL
 44
 45  # Module load section
 46  # First clear our module list
 47  module purge
 48  # and reload the standard modules
 49  module load hpcc/deepthought2
 50  # Load the desired compiler, MPI, and package modules
 51  # NOTE: You need to use the same compiler and MPI module used
 52  # when compiling the MPI-enabled code you wish to run (in this
 53  # case hello-umd).  The values listed below are correct for the
 54  # version of hello-umd we will be using, but you may need to
 55  # change them if you wish to run a different package.
 56  module load intel/2020.1
 57  # When using Intel MPI with the Intel compiler, the MPI libraries are already loaded.
 58  module load hello-umd/1.5
 59
 60  # Section to make a scratch directory for this job
 61  # Because different MPI tasks, which might be on different nodes, will
 62  # need access to it, we put it in a parallel filesystem.
 63  # We include the Slurm job id in the directory name to avoid interference
 64  # if multiple jobs are running at the same time.
 65  TMPWORKDIR="/lustre/$USER/ood-job.${SLURM_JOBID}"
 66  mkdir $TMPWORKDIR
 67  cd $TMPWORKDIR
 68
 69  # Section to output information identifying the job, etc.
 70  echo "Slurm job ${SLURM_JOBID} running on"
 71  hostname
 72  echo "To run on ${SLURM_NTASKS} CPU cores across ${SLURM_JOB_NUM_NODES} nodes"
 73  echo "All nodes: ${SLURM_JOB_NODELIST}"
 74  date
 75  pwd
 76  echo "Loaded modules are:"
 77  module list
 78  echo "Job will be started out of $TMPWORKDIR"
 79
 80
 81
 82
 83  # Get the full path to our hello-umd executable.  It is best
 84  # to provide the full path of our executable to mpirun, etc.
 85  MYEXE=`which hello-umd`
 86  echo "Using executable $MYEXE"
 87
 88  # Run our code using mpirun
 89  # We do not specify the number of tasks here, and instead rely on
 90  # it defaulting to the number of tasks requested of Slurm
 91  mpirun  ${MYEXE}  > hello.out 2>&1
 92  # Save the exit code from the previous command
 93  ECODE=$?
 94
 95  # Output from the above command was placed in a work directory in a parallel
 96  # filesystem.  That parallel filesystem does _not_ get cleaned up automatically.
 97  # And it is not normally visible from the Job Composer.
 98  # To deal with this, we make a symlink from the job submit directory to
 99  # the work directory for the job.
100  #
101  # NOTE: The work directory will continue to exist until you delete it.  It will
102  # not get deleted when you delete the job in Job Composer.
103
104  ln -s ${TMPWORKDIR} ${SLURM_SUBMIT_DIR}/work-dir
105
106  echo "Job finished with exit code $ECODE.  Work dir is $TMPWORKDIR"
107  date
108
109  # Exit with the cached exit code
110  exit $ECODE

Discussion of submit.sh

Line 1: The Unix shebang
This is the standard Unix shebang line which defines which program should be used to interpret the script. This "shebang" MUST be the first line of the script --- it is not recognized if any lines, even comment or blank lines, precede it. The Slurm scheduler requires that your job script start with a shebang line.

Like most of our examples, this shebang uses the /bin/bash interpreter, which is the bash (Bourne-again) shell, a compatible replacement for and enhancement of the original Unix Bourne shell. You can opt to specify another shell or interpreter if you so desire; common choices are:

  • the Bourne shell (/bin/sh) in your shebang (note that this basically just uses bash in a restricted mode)
  • or one of the C shell variants (/bin/csh or /bin/tcsh)

However, we recommend the bash shell, as it has better support for scripting; this might not matter for most job submission scripts because of their simplicity, but it will if you start to need more advanced features. The examples generally use the bash shell for this reason.

Lines 3-6: Comments
These are comment lines describing the script. Note that the bash (as well as sh, csh, and tcsh) shells treat any line starting with an octothorpe/pound/number sign (#) as a comment. This includes some special lines which are significant and affect the Slurm scheduler:
  • The "shebang" line is a comment to the shell, but is not ignored by the system or the Slurm commands, and controls which shell is used to interpret the rest of the script file.
  • The various lines starting with #SBATCH are used to control the Slurm scheduler and will be discussed elsewhere.

But other than the cases above, feel free to use comment lines to remind yourself (and maybe others reading your script) of what the script is doing.

Lines 10-24: Sbatch options
The various lines starting with #SBATCH control how the Slurm sbatch command submits the job. Basically, any command-line flag to sbatch can instead be provided on a #SBATCH line in the script, and you can mix and match command-line options and #SBATCH options.

NOTE: any #SBATCH lines must precede any "executable" lines in the script. It is recommended that you have nothing but the shebang line, comments, and blank lines before your #SBATCH lines.

Lines 10-11: Set task/core requirements
These lines request 60 MPI tasks (--ntasks=60 or -n 60) with one CPU core for each MPI task (--cpus-per-task=1 or -c 1).

Note that we do not specify a number of nodes, and we recommend that you do not for MPI jobs --- by default Slurm will allocate enough nodes to satisfy this job's needs, and if you specify a value which is incorrect it will only cause problems.

We choose 60 MPI tasks because this requires multiple nodes on both Deepthought2 and Juggernaut, making a better demonstration, while still fitting in the debug partition.

Line 13: Specify walltime limit
This line requests a walltime of 5 minutes. The #SBATCH -t TIME line sets the time limit for the job. The requested TIME value can take a number of formats, including:
  • MINUTES
  • MINUTES:SECONDS
  • HOURS:MINUTES:SECONDS
  • DAYS-HOURS
  • DAYS-HOURS:MINUTES
  • DAYS-HOURS:MINUTES:SECONDS

It is important to set the time limit appropriately. It must be longer than you expect the job to run, preferably with a modest cushion for error --- when the time limit is up, the job will be canceled.

You do not want to make the requested time excessive, either. Although you are only charged for the actual time used (i.e. if you requested 12 hours and the job finished in 11 hours, your job is only charged for 11, not 12, hours), there are other downsides to requesting too much walltime. Among them, the job may spend more time in the queue, and might not run at all if your account is low on funds (the scheduler uses the requested walltime to estimate the number of SUs the job will consume, and will not start a job unless it and all currently running jobs are projected to have sufficient SUs to complete). And once started, a job with an excessive walltime might block other jobs from running for a similar reason.

In general, you should estimate the maximum run time, and pad it by 10% or so.

In this case, hello-umd will run very quickly, in much less than 5 minutes.
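If you want to double-check what a given TIME spec works out to before submitting, a small bash helper can do the conversion. This is our own sketch, not part of the job template, and it only handles the formats listed above (it also keeps things simple by not handling leading-zero fields in the days/hours positions):

```shell
# Helper sketch (ours, not from the template): convert a Slurm TIME spec
# into total minutes, rounding any leftover seconds up to a whole minute.
slurm_time_to_minutes() {
    local spec="$1" days=0 hms h=0 m=0 s=0 parts
    if [[ "$spec" == *-* ]]; then
        days=${spec%%-*}                 # DAYS-HOURS[:MINUTES[:SECONDS]] forms
        hms=${spec#*-}
        IFS=':' read -r -a parts <<< "$hms"
        h=${parts[0]:-0}; m=${parts[1]:-0}; s=${parts[2]:-0}
    else
        IFS=':' read -r -a parts <<< "$spec"
        case ${#parts[@]} in
            1) m=${parts[0]} ;;                                # MINUTES
            2) m=${parts[0]}; s=${parts[1]} ;;                 # MINUTES:SECONDS
            3) h=${parts[0]}; m=${parts[1]}; s=${parts[2]} ;;  # HOURS:MINUTES:SECONDS
        esac
    fi
    echo $(( days*24*60 + h*60 + m + (s > 0 ? 1 : 0) ))
}

slurm_time_to_minutes 5          # -> 5
slurm_time_to_minutes 1:30:00    # -> 90
slurm_time_to_minutes 2-12       # -> 3600 (2 days + 12 hours)
```

So the `#SBATCH -t 5` in this script and `#SBATCH -t 0-0:5:0` request the same 5-minute limit.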

Line 14: Specify memory requirements
This sets the amount of memory to be requested for the job.

There are several parameters you can give to Slurm/sbatch to specify the memory to be allocated for the job. It is recommended that you always include a memory request for your job --- if omitted it will default to 6GB per CPU core. The recommended way to request memory is with the --mem-per-cpu=N flag. Here N is in MB. This will request N MB of RAM for each CPU core allocated to the job. Since you often wish to ensure each process in the job has sufficient memory, this is usually the best way to do so.

An alternative is the --mem=N flag. This sets the maximum memory used per node; again, N is in MB. This can be useful for single-node jobs, especially multithreaded jobs, since there is only a single node and threads generally share significant amounts of memory. But for MPI jobs the --mem-per-cpu flag is usually more appropriate and convenient.

For MPI codes, we recommend using --mem-per-cpu instead of --mem since you generally wish to ensure each MPI task has sufficient memory.

hello-umd does not use much memory, so 1 GB per core is plenty.
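As a sanity check before submitting, you can work out what a per-core memory request adds up to across the whole job. A quick sketch (ours, using the request from this script):

```shell
# Back-of-the-envelope check (ours): total memory this job's request
# adds up to.  Values mirror the sbatch options in the script:
# -n 60, -c 1, --mem-per-cpu=1024
NTASKS=60
CPUS_PER_TASK=1
MEM_PER_CPU_MB=1024
TOTAL_MB=$(( NTASKS * CPUS_PER_TASK * MEM_PER_CPU_MB ))
echo "Total job memory: ${TOTAL_MB} MB (~$(( TOTAL_MB / 1024 )) GB)"   # 61440 MB, ~60 GB
```

Spread across several nodes this is modest, but the same arithmetic will warn you when a large --mem-per-cpu combined with many tasks exceeds what the nodes can supply.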

Line 17: Specify exclusive mode
This line tells Slurm that you do not wish to allow any other jobs to run on the nodes allocated to your job while it is running.

The lines #SBATCH --share, #SBATCH --oversubscribe, or #SBATCH --exclusive decide whether or not other jobs are able to run on the same node(s) as your job.

NOTE: The Slurm scheduler changed the name of the flag for "shared" mode. The proper flag is now #SBATCH --oversubscribe. You must use the "oversubscribe" flag on Juggernaut. You can currently use either form on Deepthought2, but the #SBATCH --share form is deprecated and at some point will no longer be supported. Both forms effectively do the same thing.

In exclusive mode, no other jobs are able to run on a node allocated to your job while your job is running. This greatly reduces the possibility of another job interfering with the running of your job. But if you are not using all of the resources of the node(s) your job is running on, it is also wasteful of resources. In exclusive mode, we charge your job for all of the cores on the nodes allocated to your job, regardless of whether you are using them or not.

In share/oversubscribe mode, other jobs (including those of other users) may run on the same node as your job as long as there are sufficient resources for both. We make efforts to try to prevent jobs from interfering with each other, but such methods are not perfect, so while the risk of interference is small, it is much greater risk in share mode than in exclusive mode. However, in share mode you are only charged for the requested number of cores (not all cores on the node unless you requested such), and your job might spend less time in the queue (since it can avail itself of nodes which are in use by other jobs but have some unallocated resources).

Our recommendation is that large (many-core/many-node) and/or long running jobs use exclusive mode, as the potential cost of adverse interference is greatest there. Plus, large jobs tend to use most if not all cores of the nodes they run on, so the cost of exclusive mode tends to be less. Smaller jobs, and single core jobs in particular, generally benefit from share/oversubscribe mode, as they tend to utilize the nodes they run on less fully (indeed, on a standard Deepthought2 node, a single core job will only use 5% of the CPU cores).

The cluster default, unless you specify otherwise, is share mode for single core jobs and exclusive mode for multicore/multinode jobs. This is not an ideal choice, and might change in the future. We recommend that you always explicitly request either share/oversubscribe or exclusive mode as appropriate.

Again, since this is a multi-core job, #SBATCH --exclusive is the default, but we recommend stating it explicitly.

Line 20: Specify partition
This line specifies what partition we wish to use.

For a simple job like this, the debug partition is a good choice, and we use it on the Deepthought2 cluster. However, on the Juggernaut cluster the debug partition does not have access to the high performance (lustre) filesystem, which this job script uses for the working directory, so on Juggernaut you should use a partition that can reach lustre --- simply omitting this line and taking the default partition works.

For real production work, the restrictions of the debug partition (limited compute resources and a maximum 15 minute walltime) will likely make it unsuitable. In that case, on both clusters, it is probably best to just omit this line and let the scheduler choose the partition appropriately for you.

Line 24: Specify export behavior
This line instructs sbatch not to let the job process inherit the environment of the process which invoked the sbatch command. This requires the job script to explicitly set up its required environment, as it can no longer depend on the environment settings you had when you ran the sbatch command. While this may require a few more lines in your script, it is good practice and improves the reproducibility of the job script --- without it, it is possible the job would only run correctly if you had a certain module loaded or variable set when you submitted the job.

Lines 31-37: Reading the bash profile
These lines make sure that the module command is available in your script. They are generally only required if the shell specified in the shebang line does not match your default login shell, in which case the proper startup files likely did not get invoked.

The unalias line ensures that there is no vestigial tap command. It is sometimes needed on RHEL6 systems; it should not be needed on newer platforms, but is harmless when not needed. The remaining lines read in the appropriate dot files for the bash shell --- the if/elif construct enables this script to work on both the Deepthought2 and Juggernaut clusters, which use slightly different names for the bash startup file.

Line 43: Setting SLURM_EXPORT_ENV
This line sets an environmental variable that affects how various Slurm commands operate. It sets the variable SLURM_EXPORT_ENV to the value ALL, which causes the environment to be shared with other processes spawned by Slurm commands (which also include mpirun and similar).

At first this might seem to contradict our recommendation to use #SBATCH --export=NONE, but it really does not. The #SBATCH --export=NONE setting causes the job script not to inherit the environment of the shell in which you ran the sbatch command. But we are now in the job script, which, because of the --export=NONE flag, has its own environment which was set up in the script. We want this environment to be shared with the other MPI tasks and processes spawned by this job. These MPI tasks and processes will inherit the environment set up in this script, not the environment from which the sbatch command ran.

This is important for MPI jobs like this one, because otherwise the mpirun command might not spawn the tasks properly.
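The inheritance behavior itself is ordinary Unix environment semantics, which you can see in isolation (a standalone illustration of ours, not part of the job script):

```shell
# Standalone illustration (not from the job script): exported variables
# are inherited by child processes.  SLURM_EXPORT_ENV=ALL arranges the
# same for the tasks that mpirun/srun spawn from the job script.
export GREETING="hello from the job script"
child_sees=$(bash -c 'echo "$GREETING"')   # a child bash process
echo "child saw: $child_sees"
```

An un-exported shell variable would not survive into the child; export is what puts it into the environment the child inherits.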

Lines 47-58: Module loads
These lines ensure that the proper modules are loaded.

We recommend that you always load the compiler module first, then the MPI library if needed, and then any higher level modules for applications, etc. Many packages have different builds for different compilers, MPI libraries, etc., and the module command is smart enough to load the correct versions of these if you load the modules in the aforementioned order.

For codes using the OpenMPI library, you should module load the compiler, then the appropriate OpenMPI library, and then your application (hello-umd in this case).

For codes using the Intel MPI library, the environment for this is set up automatically when you load the Intel compiler suite. Thus in these cases you do not need to explicitly module load an MPI library.

We recommend that you always specify the specific version you want in your job scripts --- this makes your job more reproducible. Systems staff may add newer versions of existing packages without notification, and if you do not specify a version, the default version may change without your expecting it. In particular, a job that runs fine today using today's default version might crash unexpectedly when you try running it again in six months because the packages it uses were updated and your inputs are not compatible with the new version of the code.

Lines 65-67: Creating a working directory
These lines create a working directory on the high-performance lustre filesystem. Generally for MPI jobs you want a working directory which is accessible to all tasks running as part of the job, regardless of which node each is running on. The /tmp filesystem is local to a single node, so it is usually not suitable for MPI jobs. The lustre filesystem is accessible by all of the compute nodes of the cluster, so it is a good choice for MPI jobs.

The TMPWORKDIR="/lustre/$USER/ood-job.${SLURM_JOBID}" line defines an environmental variable containing the name of our chosen work directory. The ${SLURM_JOBID} references another environmental variable which is automatically set by Slurm (when the job starts to run) to the job number for this job --- using this in the work directory name helps ensure it will not conflict with any other job. The mkdir command creates this work directory, and the cd changes our working directory to it --- note in those last two commands the use of the environmental variable we just created to hold the directory name.
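If you want the script to fail fast when the work directory cannot be created or entered, you can check the results of mkdir and cd. This hardening is our suggestion, not something the template does; the path below is a hypothetical demo under $TMPDIR, whereas the real script uses /lustre/$USER/ood-job.${SLURM_JOBID}:

```shell
# Hardened variant (our suggestion) of the work-directory section:
# abort rather than run in the wrong place if setup fails.
# NOTE: demo path; the real script would use /lustre/$USER/ood-job.${SLURM_JOBID}
TMPWORKDIR="${TMPDIR:-/tmp}/ood-job.demo.$$"
mkdir "$TMPWORKDIR" || { echo "Cannot create $TMPWORKDIR" >&2; exit 1; }
cd "$TMPWORKDIR"    || { echo "Cannot cd to $TMPWORKDIR" >&2; exit 1; }
pwd   # now safely inside the job-specific work directory
```

Quoting "$TMPWORKDIR" also protects against unexpected characters in the path, which the unquoted mkdir/cd in the template would not.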

Lines 70-78: Identifying ourselves
These lines print some information about the job into the Slurm output file. They use the environmental variables SLURM_JOBID, SLURM_NTASKS, SLURM_JOB_NUM_NODES, and SLURM_JOB_NODELIST, which are set by Slurm at the start of the job, to list the job number, the number of MPI tasks, the number of nodes, and the names of the nodes allocated to the job. They also print the time and date the job started (the date command), the working directory (the pwd command), and the list of loaded modules (the module list command). Although you are probably not interested in any of this information when the job runs as expected, it can often be helpful in diagnosing why things did not work as expected.

Line 84 (OpenMPI jobs only): OpenMPI CUDA fix
This line sets a parameter which controls OpenMPI behavior. This particular setting suppresses a warning message from OpenMPI when it is unable to find the CUDA libraries --- our OpenMPI is built with CUDA support, but the CUDA runtime libraries are only present on the GPU-enabled nodes. This job does not use CUDA, so we do not want the warning message.

Because this sets an OpenMPI parameter, it is only relevant for job scripts using the OpenMPI libraries.

Lines 85-86 (IntelMPI jobs), 88-89 (OpenMPI jobs): Find the executable
These lines determine the full path to the hello-umd command, store it in an environmental variable named MYEXE, and then output the path for added diagnostics. We find that MPI jobs run better when you provide the absolute path of the executable to the mpirun or similar command.
Line 91 (IntelMPI jobs), 94 (OpenMPI jobs): Actually run the command
Finally! We actually run the command for this job script. Since this is an MPI job, we run the mpirun command with the absolute path to our hello-umd executable as the argument. Each MPI task runs hello-umd single-threaded, so no arguments to the hello-umd command are needed. (If you needed to pass arguments, they would go after the path to your executable on the mpirun line.)

We run the code so as to save the output in the file hello.out in the current working directory. The > operator performs output redirection, meaning that all of the standard output goes to the specified file (hello.out in this case). The 2>&1 operator sends standard error (stream 2) to the same place as standard output (stream 1), and since standard output was redirected to the file, so is standard error.

For this simple case, we could have omitted the redirection of standard output and standard error, in which case any such output would end up in the Slurm output file (usually named slurm-JOBNUMBER.out). However, if your job produces a lot (many MBs) of output to standard output/standard error, this can be problematic. It is good practice to redirect output if you know your code will produce more than 1000 or so lines of output.
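The redirection behavior is easy to see in isolation. A standalone illustration (ours, not part of the job script):

```shell
# Standalone illustration of "> file 2>&1": both streams land in one file.
outfile=$(mktemp)
{ echo "to stdout"; echo "to stderr" >&2; } > "$outfile" 2>&1
captured=$(cat "$outfile")   # contains both lines, in order
echo "$captured"
rm -f "$outfile"
```

Note that the order matters on the command line: `> "$outfile" 2>&1` sends both streams to the file, while `2>&1 > "$outfile"` would redirect standard error to wherever standard output pointed *before* the file redirection took effect.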

Line 93 (IntelMPI jobs), 96 (OpenMPI jobs): Store the error code
This line stores the shell exit code from the previous command (which actually ran the code we are interested in). This would not be needed if that command were the last line in the job script, but it is not in this case (we still have to create a symlink, etc.). Slurm looks at the exit code of the last command run in the script when determining whether the job succeeded or failed, and we do not want it to incorrectly report the job as succeeding if the application we wanted to run failed but a command in our clean-up section was successful.

The special shell variable $? stores the exit code from the last command. Normally it will be 0 if the command was successful, and non-zero otherwise. But it only reflects the most recent command, so we save it in the variable ECODE.
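The need to save $? immediately can be shown in a few lines (a standalone sketch of ours; false stands in for the real mpirun command):

```shell
# Standalone sketch: $? must be captured immediately, because every
# subsequent command overwrites it.  'false' exits with code 1 here,
# standing in for a failed mpirun.
false
ECODE=$?              # capture right away; ECODE is now 1
echo "cleanup work"   # this succeeds, so $? becomes 0 ...
echo "saved exit code: $ECODE"   # ... but ECODE still holds 1
```

If we had consulted $? only at the end of the script, the successful cleanup commands would have masked the failure.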

Line 104 (IntelMPI jobs), 107 (OpenMPI jobs): Symlink work dir
This line creates a symlink from the submission directory to the working directory. It is really only present for the sake of users running the code from within the OnDemand portal, since the portal's Job Composer only shows the submission directory and without this line would not show the output in the working directory.
Lines 106-107 (IntelMPI jobs), 109-110 (OpenMPI jobs): Say goodbye
These lines print some useful information at the end of the job. Basically, they say that the job finished, print the exit code we stored in ECODE, and then print the date/time of completion using the date command.
Line 110 (IntelMPI jobs), 113 (OpenMPI jobs): Exit
This line exits the script, setting the script's exit code to the exit code of our application, which we saved in the variable ECODE. This means the script will have the same exit code as the application, allowing Slurm to better determine whether the job was successful. (If we omitted this, the exit code of the script would be that of the last command run, in this case the date command, which should never fail. So even if your application aborted, the script would return a successful (0) exit code and Slurm would think the job succeeded.)
Line 111 (IntelMPI jobs), 114 (OpenMPI jobs): Trailing blank line
We recommend that you get into the habit of leaving one or more blank lines at the end of your script. This is especially true if you write the scripts in Windows and then transfer to the cluster.

The reason for this is that if the last line does not have the proper line termination character, it will be ignored by the shell. Over the years, we have had many users confused as to why their job ended as soon as it started without error: it turned out the last line of their script was the line which actually ran their code, and it was missing the correct line termination character. The job therefore ran, did some initialization and module loads, and exited without running the command they were most interested in, all because of a missing line termination character (which is easily overlooked).

This problem most frequently occurs when transferring files between Unix/Linux and Windows operating systems. While there are utilities that can add the correct line termination characters, the easy solution in our opinion is to just add one or more blank lines at the end of your script; if the shell ignores the blank lines, no harm is done.
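One way to spot and fix Windows-style line endings from the login node (a sketch; the dos2unix utility, where installed, performs the same conversion):

```shell
#!/bin/sh
# create a file with Windows CRLF line endings for illustration
printf 'echo hello\r\n' > crlf.sh

# cat -A (GNU coreutils) makes the carriage returns visible as ^M
cat -A crlf.sh

# strip the carriage returns to get Unix line endings
tr -d '\r' < crlf.sh > unix.sh
sh unix.sh
```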

Running the examples

The easiest way to run this example is with the Job Composer of the OnDemand portal, using the HelloUMD-MPI_gcc_openmpi template for the GNU compiler suite and OpenMPI library, or the HelloUMD-MPI_intel_intelmpi template for the Intel compiler suite and Intel MPI library.

To submit from the command line:

  1. Download the submit script as plain text to the HPC login node.
  2. Run the command sbatch submit.sh. This will submit the job to the scheduler, and should return a message like Submitted batch job 23767 --- the number will vary (and is the job number for this job). The job number can be used to reference the job in Slurm, etc. (Please always give the job number(s) when requesting help about a job you submitted).

Whichever method you used for submission, the job will be queued for the debug partition and should run within 15 minutes or so. When it finishes running, the slurm-JOBNUMBER.out file should contain the output from our diagnostic commands (time the job started and finished, module list, etc.). The output of hello-umd will be in the file hello.out in the job-specific work directory created in your lustre directory. For the convenience of users of the OnDemand portal, a symlink to this directory is created in the submission directory. So if you used OnDemand, a symlink to the work directory will appear in the Folder contents section on the right.

The slurm-JOBNUMBER.out file will resemble (from an Intel MPI example):

Slurm job 23868 running on
compute-10-0.juggernaut.umd.edu
To run on 60 CPU cores across 2 nodes
All nodes: compute-10-0
Thu Mar 11 13:29:20 EST 2021
/lustre/jn10/payerle/ood-job.23868
Loaded modules are:
Currently Loaded Modulefiles:
 1) hpcc/juggernaut                                         
 2) intel/2020.1                                            
 3) hello-umd/1.5/intel/2020.1/intelmpi/broadwell(default)  
Job will be started out of /lustre/jn10/payerle/ood-job.23868

Most of the details in your file will differ from the example above, but you should get the drift.

The output in the hello.out file will resemble (from an OpenMPI example):

Hello UMD from thread 0 of 1, task 3 of 60 (pid=87441 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 9 of 60 (pid=87447 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 1 of 60 (pid=87439 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 10 of 60 (pid=87448 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 15 of 60 (pid=87454 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 24 of 60 (pid=87464 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 26 of 60 (pid=87466 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 27 of 60 (pid=87467 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 29 of 60 (pid=87470 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 7 of 60 (pid=87445 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 8 of 60 (pid=87446 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 2 of 60 (pid=87440 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 12 of 60 (pid=87450 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 13 of 60 (pid=87451 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 19 of 60 (pid=87458 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 20 of 60 (pid=87459 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 21 of 60 (pid=87461 on host compute-10-0.juggernaut.umd.edu
hello-umd: Version 1.5
Built for compiler: intel/20.0.1
with MPI support( usgin MPI library intel-parallel-studio/cluster.2020.1)
Hello UMD from thread 0 of 1, task 17 of 60 (pid=87456 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 11 of 60 (pid=87449 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 25 of 60 (pid=87465 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 0 of 60 (pid=87438 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 23 of 60 (pid=87463 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 28 of 60 (pid=87468 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 5 of 60 (pid=87443 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 16 of 60 (pid=87455 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 18 of 60 (pid=87457 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 22 of 60 (pid=87462 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 4 of 60 (pid=87442 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 6 of 60 (pid=87444 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 14 of 60 (pid=87453 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 35 of 60 (pid=269472 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 42 of 60 (pid=269479 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 44 of 60 (pid=269481 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 40 of 60 (pid=269477 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 31 of 60 (pid=269468 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 56 of 60 (pid=269493 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 57 of 60 (pid=269494 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 59 of 60 (pid=269496 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 37 of 60 (pid=269474 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 49 of 60 (pid=269486 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 36 of 60 (pid=269473 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 53 of 60 (pid=269490 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 58 of 60 (pid=269495 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 43 of 60 (pid=269480 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 48 of 60 (pid=269485 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 55 of 60 (pid=269492 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 50 of 60 (pid=269487 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 46 of 60 (pid=269483 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 52 of 60 (pid=269489 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 47 of 60 (pid=269484 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 41 of 60 (pid=269478 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 34 of 60 (pid=269471 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 39 of 60 (pid=269476 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 33 of 60 (pid=269470 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 38 of 60 (pid=269475 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 45 of 60 (pid=269482 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 51 of 60 (pid=269488 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 54 of 60 (pid=269491 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 32 of 60 (pid=269469 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 0 of 1, task 30 of 60 (pid=269467 on host compute-10-1.juggernaut.umd.edu

Basically, you should see a message from each task 0 to 59, all from thread 0 of 1 (since this is a pure MPI code), in some random order. The identifying comments (version number, compiler, and MPI library it was built with) will appear somewhere in the mix. Because everything runs in parallel, the order will not be constant. Note that the tasks are divided across multiple nodes (in this case compute-10-0 and compute-10-1): on Juggernaut the 60 cores require two nodes, and on Deepthought2 they would require three nodes.