Multithreaded Job Submission Example

Many codes support multithreading as a means of parallelism: standards like OpenMP generally make it easier to program than other parallel paradigms, and even standard workstations and laptops can see performance improvements from multithreading.

Multithreading is a form of shared-memory parallelism, so all of the threads need to run on the same node in order for them to access the same memory for communication.

This page provides an example of submitting such a multithreaded job. It is based on the HelloUMD-Multithreaded job template in the OnDemand portal.

This job makes use of a simple Hello World! program called hello-umd, available in the UMD HPC cluster software library, which supports sequential, multithreaded, and MPI modes of operation. The code simply prints an identifying message from each thread of each task --- for this simple multithreaded case only a single task will be used, but the task will have 10 threads. The scheduler will always allocate all of the CPU cores for a specific task on the same node, which satisfies the shared-memory requirement.

Overview

This example basically consists of a single file, the job script submit.sh (see below for a listing and explanation of the script), which gets submitted to the cluster via the sbatch command.

The script is designed to show many good practices, including:

  • setting standard sbatch options within the script
  • loading the needed modules within the script
  • printing some useful diagnostic information at start of the script
  • creating a job specific temporary work directory
  • running the code and saving the exit code
  • copying back any files that should be retained
  • exiting with the exit code from the main application

Many of the practices above are rather overkill for such a simple job --- indeed, the vast majority of lines exist for these "good practices" rather than for running the intended code --- but they are included for educational purposes.

This script runs hello-umd in multithreaded mode, saving the output to a file in the temporary work directory and then copying it back to the submission directory. We could have forgone all that and simply let the output of hello-umd go to standard output, which would be available in the slurm-JOBNUMBER.out file (or whatever file you instructed Slurm to use instead). Doing so is acceptable as long as the code is not producing an excessive amount (many MB) of output --- if the code produces a lot of output, sending it all to the Slurm output file can cause problems, and it is better to redirect to a file.

The submission script

The submission script submit.sh can be downloaded as plain text. We present a copy with line numbers below; these line numbers are referenced in the discussion that follows:

Source of submit.sh

HelloUMD-Multithreaded job submission script
Line# Code
  1  #!/bin/bash
  2  # The line above this is the "shebang" line.  It must be first line in script
  3  #-----------------------------------------------------
  4  #	Default OnDemand Job Template
  5  #	For a basic Hello World multi-threaded job
  6  #-----------------------------------------------------
  7  #
  8  # Slurm sbatch parameters section:
  9  #	Request a single task using 10 CPU cores
 10  #SBATCH --ntasks=1
 11  #SBATCH --cpus-per-task=10
 12  #	Request 5 minutes of walltime
 13  #SBATCH -t 5
 14  #	Request 10 GB of memory for the job
 15  #SBATCH --mem=10240
 16  #	Allow other jobs to run on same node
 17  #SBATCH --oversubscribe
 18  #	Run on debug partition for rapid turnaround.  You will need
 19  #	to change this (remove the line) if walltime > 15 minutes
 20  #SBATCH --partition=debug
 21  #	Do not inherit the environment of the process running the
 22  #	sbatch command.  This requires you to explicitly set up the
 23  #	environment for the job in this script, improving reproducibility
 24  #SBATCH --export=NONE
 25  #
 26  
 27  # This job will run our hello-umd demo binary in multithreaded mode.
 28  # Output will go to local /tmp scratch space on the node we are running
 29  # on, and then will be copied back to our work directory.
 30  
 31  # Section to ensure we have the "module" command defined
 32  unalias tap >& /dev/null
 33  if [ -f ~/.bash_profile ]; then
 34  	source ~/.bash_profile
 35  elif [ -f ~/.profile ]; then
 36  	source ~/.profile
 37  fi
 38  
 39  # Set SLURM_EXPORT_ENV to ALL.  This prevents the --export=NONE flag
 40  # from being passed to mpirun/srun/etc, which can cause issues.
 41  # We want the environment of the job script to be passed to all
 42  # tasks/processes of the job
 43  export SLURM_EXPORT_ENV=ALL
 44  
 45  # Module load section
 46  # First clear our module list
 47  module purge
 48  # and reload the standard modules
 49  module load hpcc/deepthought2
 50  # Set up environment for hello-umd
 51  module load hello-umd/1.5
 52  
 53  # Section to make a scratch directory for this job.
 54  # For single-node jobs, the local /tmp filesystem is a good choice.
 55  # We include the Slurm job id in the directory name to avoid interference
 56  # if multiple jobs are running at the same time.
 57  TMPWORKDIR="/tmp/ood-job.${SLURM_JOBID}"
 58  mkdir $TMPWORKDIR
 59  cd $TMPWORKDIR
 60  
 61  # Section to output information identifying the job, etc.
 62  echo "Slurm job ${SLURM_JOBID} running on"
 63  hostname
 64  echo "Using ${SLURM_CPUS_PER_TASK} CPU cores in a single task on a single node"
 65  echo "All nodes: ${SLURM_JOB_NODELIST}"
 66  date
 67  pwd
 68  echo "Loaded modules are:"
 69  module list
 70  
 71  
 72  # Run our code, giving -t 0 to use all available CPUs (10 in this case)
 73  hello-umd -t 0 > hello.out 2>&1
 74  # Save the exit code from the previous command
 75  ECODE=$?
 76  
 77  # Copy results back to submit dir
 78  cp hello.out ${SLURM_SUBMIT_DIR}
 79  
 80  echo "Job finished with exit code $ECODE"
 81  date
 82  
 83  # Exit with the cached exit code
 84  exit $ECODE

Discussion of submit.sh

Line 1: The Unix shebang
This is the standard Unix shebang line, which defines which program should be used to interpret the script. This "shebang" MUST be the first line of the script --- it is not recognized if there are any lines, even comment lines and/or blank lines, before it. The Slurm scheduler requires that your job script start with a shebang line.

Like most of our examples, this shebang uses the /bin/bash interpreter, which is the bash (Bourne-again) shell. This is a compatible replacement for, and enhancement of, the original Unix Bourne shell. You can opt to specify another shell or interpreter if you so desire; common choices are:

  • the Bourne shell (/bin/sh) in your shebang (note that this basically just uses bash in a restricted mode)
  • or one of the C shell variants (/bin/csh or /bin/tcsh)

However, we recommend the use of the bash shell, as it has better support for scripting than the alternatives; this might not matter for most job submission scripts because of their simplicity, but it might if you start to need more advanced features. The examples generally use the bash shell for this reason.

Lines 3-6: Comments
These are comment lines describing the script. Note that the bash (as well as sh, csh, and tcsh) shells will treat any line starting with an octothorpe/pound/number sign (#) as a comment. This includes some special lines which are significant and affect the Slurm scheduler:
  • The "shebang" line is a comment to the shell, but is not ignored by the system or the Slurm commands, and controls which shell is used to interpret the rest of the script file.
  • The various lines starting with #SBATCH are used to control the Slurm scheduler and will be discussed elsewhere.

But other than the cases above, feel free to use comment lines to remind yourself (and maybe others reading your script) of what the script is doing.

Lines 10-24: Sbatch options
The various lines starting with #SBATCH can be used to control how the Slurm sbatch command submits the job. Basically, any command-line flag can be provided with a #SBATCH line in the script, and you can mix and match command-line options and #SBATCH lines.

NOTE: any #SBATCH lines must precede any "executable lines" in the script. It is recommended that you have nothing but the shebang line, comments, and blank lines before your #SBATCH lines.

Lines 10-11
These lines request CPU cores for the job. More precisely, they request a single task (--ntasks=1 or -n 1) with 10 CPU cores for threads (--cpus-per-task=10 or -c 10).

The scheduler will place all of the cores for a single task on the same node, which is what we need for shared-memory parallelism techniques like multithreading.

Although multithreaded processes can in theory run on fewer cores than they have threads, in such cases you do not get the full parallelism benefit (some threads will wait until a CPU core becomes available after another thread finishes). In general, for high-performance computing, you want a separate CPU core for each thread to ensure maximal performance.
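
Equivalently, the same request can be made on the sbatch command line rather than inside the script; command-line flags override any matching #SBATCH lines. A sketch using this example's 10-core request:

```shell
# Same request as the #SBATCH lines: one task with 10 CPU cores.
# Command-line flags override matching #SBATCH lines in submit.sh.
sbatch --ntasks=1 --cpus-per-task=10 submit.sh

# Short forms of the same flags:
sbatch -n 1 -c 10 submit.sh
```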

Line 13
This line requests a walltime of 5 minutes. The #SBATCH -t TIME line sets the time limit for the job. The requested TIME value can take any of a number of formats, including:
  • MINUTES
  • MINUTES:SECONDS
  • HOURS:MINUTES:SECONDS
  • DAYS-HOURS
  • DAYS-HOURS:MINUTES
  • DAYS-HOURS:MINUTES:SECONDS
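
For illustration (the script itself uses -t 5), the following #SBATCH lines all request the same 90-minute limit in different formats:

```shell
#SBATCH -t 90          # MINUTES
#SBATCH -t 90:00       # MINUTES:SECONDS
#SBATCH -t 1:30:00     # HOURS:MINUTES:SECONDS
#SBATCH -t 0-1:30      # DAYS-HOURS:MINUTES
```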

It is important to set the time limit appropriately. It must be set longer than you expect the job to run, preferably with a modest cushion for error --- when the time limit is up, the job will be canceled.

You do not want to make the requested time excessive, either. Although you are only charged for the actual time used (i.e. if you requested 12 hours and the job finished in 11 hours, your job is only charged for 11 not 12 hours), there are other downsides to requesting too much walltime. Among them, the job may spend more time in the queue, and might not run at all if your account is low on funds (the scheduler uses the requested walltime to estimate the number of SUs the job will consume, and will not start a job unless it and all currently running jobs are projected to have sufficient SUs to complete). And even once it starts, an excessive walltime might block other jobs from running for a similar reason.

In general, you should estimate the maximum run time, and pad it by 10% or so.

In this case, the hello-umd will run very quickly; much less than 5 minutes.

Line 15
This sets the amount of memory to be requested for the job.

There are several parameters you can give to Slurm/sbatch to specify the memory to be allocated for the job. It is recommended that you always include a memory request for your job --- if omitted it will default to 6GB per CPU core. The recommended way to request memory is with the --mem-per-cpu=N flag. Here N is in MB. This will request N MB of RAM for each CPU core allocated to the job. Since you often wish to ensure each process in the job has sufficient memory, this is usually the best way to do so.

An alternative is with the --mem=N flag. This sets the maximum memory use by node. Again, N is in MB. This could be useful for single node jobs, especially multithreaded jobs, as there is only a single node and threads generally share significant amounts of memory. But for MPI jobs the --mem-per-cpu flag is usually more appropriate and convenient.

We request 10 GB of memory for the job, which is really well more than this simple hello world code needs. We could have instead used something like #SBATCH --mem-per-cpu=1024 to request 1 GB per CPU core. Since this is a multithreaded job using 10 threads, that would also have resulted in requesting 10 GB of RAM. However, for multithreaded jobs, the memory use generally tends to be independent of the number of threads, so specifying the total memory needed is usually more convenient.
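
A sketch of the two forms side by side (a real script should use only one of them, as the two flags conflict):

```shell
# Form used by this script: total memory for the job, N in MB
#SBATCH --mem=10240

# Per-core alternative: 1 GB per core x 10 cores = 10 GB total
#SBATCH --mem-per-cpu=1024
```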

Line 17
This line tells Slurm that the job is willing to allow other jobs to be running on the node allocated to it while it is running.

The lines #SBATCH --share, #SBATCH --oversubscribe, or #SBATCH --exclusive decide whether or not other jobs are able to run on the same node(s) as your job.

NOTE: The Slurm scheduler changed the name of the flag for "shared" mode. The proper flag is now #SBATCH --oversubscribe. You must use the "oversubscribe" flag on Juggernaut. You can currently use either form on Deepthought2, but the #SBATCH --share form is deprecated and at some point will no longer be supported. Both forms effectively do the same thing.

In exclusive mode, no other jobs are able to run on a node allocated to your job while your job is running. This greatly reduces the possibility of another job interfering with the running of your job. But if you are not using all of the resources of the node(s) your job is running on, it is also wasteful of resources. In exclusive mode, we charge your job for all of the cores on the nodes allocated to your job, regardless of whether you are using them or not.

In share/oversubscribe mode, other jobs (including those of other users) may run on the same node as your job as long as there are sufficient resources for both. We make efforts to try to prevent jobs from interfering with each other, but such methods are not perfect, so while the risk of interference is small, it is a much greater risk in share mode than in exclusive mode. However, in share mode you are only charged for the requested number of cores (not all cores on the node unless you requested such), and your job might spend less time in the queue (since it can avail itself of nodes which are in use by other jobs but have some unallocated resources).

Our recommendation is that large (many-core/many-node) and/or long running jobs use exclusive mode, as the potential cost of adverse interference is greatest there. Plus large jobs tend to use most if not all cores of most of the nodes they run on, so the cost of exclusive mode tends to be less. Smaller jobs, and single core jobs in particular, generally benefit from share/oversubscribe mode, as they tend to less fully utilize the nodes they run on (indeed, on a standard Deepthought2 node, a single core job will only use 5% of the CPU cores).

The default for the cluster is, unless you specify otherwise, to default single core jobs to share mode, and multicore/multinode jobs to exclusive mode. This is not an ideal choice, and might change in the future. We recommend that you always explicitly request either share/oversubscribe or exclusive as appropriate.

Since this is a multicore job, it would default to exclusive mode if we said nothing; we explicitly request #SBATCH --oversubscribe because this small, short job has no need of a whole node to itself.

Line 20
This line states that we wish to submit this job to the debug partition. The debug partition has limited resources, and a maximum 15 minute walltime, but this is a very short and small job, so the debug partition suits it well.

For real production work, the debug queue is probably not adequate, in which case it is recommended that you just omit this line and let the scheduler select an appropriate partition for you.

Line 24
This line instructs sbatch not to let the job process inherit the environment of the process which invoked the sbatch command. This requires the job script to explicitly set up its required environment, as it can no longer depend on environmental settings you had when you run the sbatch command. While this may require a few more lines in your script, it is a good practice and improves the reproducibility of the job script --- without this it is possible the job would only run correctly if you had a certain module loaded or variable set when you submit the job.
Lines 32-37: Reading the bash profile

These lines make sure that the module command is available in your script. They are generally only required if the shell specified in the shebang line does not match your default login shell, in which case the proper startup files likely did not get invoked.

The unalias line ensures that there is no vestigial tap command. It is sometimes needed on RHEL6 systems and should not be needed on newer platforms, but it is harmless when not needed. The remaining lines read in the appropriate dot files for the bash shell --- the if, then, elif construct enables this script to work on both the Deepthought2 and Juggernaut clusters, which have slightly different names for the bash startup file.

Line 43: Setting SLURM_EXPORT_ENV
This line changes an environmental variable that affects how various Slurm commands operate. It sets the variable SLURM_EXPORT_ENV to the value ALL, which causes the environment to be shared with other processes spawned by Slurm commands (which also includes mpirun and similar).

At first this might seem to contradict our recommendation to use #SBATCH --export=NONE, but it really does not. The #SBATCH --export=NONE setting will cause the job script not to inherit the environment of the shell in which you ran the sbatch command. But we are now in the job script, which because of the --export=NONE flag, has its own environment which was set up in the script. We want this environment to be shared with other MPI tasks and processes spawned by this job. These MPI tasks and processes will inherit the environment set up in this script, not the environment from which the sbatch command ran.

This really is not needed for a simple single-task job like this, since no additional MPI tasks, etc. are being spawned. But it is a good habit.

Lines 47-51: Module loads
These lines ensure that the proper modules are loaded.

To begin with, we do a module purge to clear out any previously loaded modules. This prevents them from interfering with subsequent module loads. Then we load the default module for the cluster with module load hpcc/deepthought2; this line should be adjusted for the cluster being used (e.g. module load hpcc/juggernaut for the Juggernaut cluster).

Finally, the line module load hello-umd/1.5 loads the correct version of the hello-umd application. Note that we specify the version; if that is omitted the module command will usually try to load the most recent version installed. We recommend that you always specify the specific version you want in your job scripts --- this makes your job more reproducible. Systems staff may add newer versions of existing packages without notification, and if you do not specify a version, the default version may change without your expecting it. In particular, a job that runs fine today using today's default version might crash unexpectedly when you try running it again in six months because the packages it uses were updated and your inputs are not compatible with the new version of the code.
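
Before pinning a version, you can ask the module system which versions of a package are installed (the output will vary by cluster):

```shell
# List all installed versions of hello-umd on this cluster
module avail hello-umd

# Then pin the specific version you tested with, rather than the default
module load hello-umd/1.5
```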

Lines 57-59: Creating a working directory
These lines generate a scratch/working directory for your job. Because this is a single node job, we can use a local filesystem, in which case /tmp is a good choice.

/tmp is a directory on Unix systems where all users can write temporary files. On the compute nodes, /tmp will be cleaned after every job runs, so it is a temporary filesystem and we need to remember to copy any files we wish to retain someplace where they will not be automatically deleted.

The TMPWORKDIR="/tmp/ood-job.${SLURM_JOBID}" line defines an environmental variable containing the name of our chosen work directory. The ${SLURM_JOBID} references another environmental variable which is automatically set by Slurm (when the job starts to run) to the job number for this job --- using this in our work directory name helps ensure it will not conflict with any other job. The mkdir command creates this work directory, and the cd changes our working directory to it --- note in those last commands the use of the environmental variable we just created to hold the directory name.
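
A slightly more defensive version of these lines fails loudly if the directory cannot be created or entered; the ${SLURM_JOBID:-$$} fallback to the shell's PID is our addition so the sketch can also be tried outside a Slurm job:

```shell
# Use the Slurm job id when present; fall back to the shell PID ($$) so
# the snippet can also be tried outside a Slurm job (this fallback is our
# addition, not part of the original template).
TMPWORKDIR="/tmp/ood-job.${SLURM_JOBID:-$$}"

# Abort rather than run in the wrong place if mkdir or cd fails
mkdir "${TMPWORKDIR}" || exit 1
cd "${TMPWORKDIR}" || exit 1

echo "Working directory is $(pwd)"
```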

Lines 62-69: Identifying ourselves
These lines print some information about the job into the Slurm output file. They use the environmental variables SLURM_JOBID, SLURM_CPUS_PER_TASK, and SLURM_JOB_NODELIST, which are set by Slurm at the start of the job, to list the job number, the number of CPU cores for the task, and the names of the nodes allocated to the job. They also print the time and date that the job started (the date command), the working directory (the pwd command), and the list of loaded modules (the module list command). Although you are probably not interested in any of that information if the job runs as expected, it can often be helpful in diagnosing why things did not work as expected.

Line 73: Actually run the command
Finally! We actually run the command for this job script. In this case, we run the hello-umd command with the -t 0 flag. As per the man page for hello-umd, this causes it to use as many threads as CPUs are available, which when run in a job like this will result in 10 threads being used. We could have also done this using the argument -t $SLURM_CPUS_PER_TASK, where the environmental variable $SLURM_CPUS_PER_TASK is set by Slurm at the start of the job to be equal to the value we gave for --cpus-per-task.

In general, we recommend that you use such shortcuts (either setting the value for -t to 0 or to the value of the variable Slurm sets) to avoid inconsistencies in your scripts. E.g., if you explicitly gave -t 10 in this script, and later experimented with a different number of threads, it would be easy to forget to change a value in some place, resulting in a discrepancy between the number of cores requested from Slurm and the number of threads being run. If you request more cores than the number of threads being used, you waste CPU resources. If you request fewer cores than threads being used, the code will likely still run but performance will be significantly degraded. So for best efficiency, we recommend avoiding having to specify any setting more than once wherever possible, to avoid potential discrepancies.
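
For OpenMP codes, the same principle is usually applied by deriving the thread count from the Slurm variable instead of hard-coding it. A sketch (the :-1 default is our addition, for safety when the variable is unset):

```shell
# Set the OpenMP thread count to match what Slurm actually allocated.
# The ":-1" default (our addition) makes the line safe to try outside a
# Slurm job, where SLURM_CPUS_PER_TASK is unset.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "Will run with ${OMP_NUM_THREADS} OpenMP threads"
```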

Line 75: Store the error code
This line stores the shell error code from the previous command (which actually ran the code we are interested in). This would not be needed if that code were the last line in your job script, but it is not in this case (we have to copy some files, etc.). Slurm looks at the exit code of the last command run in the script file when trying to determine if the job succeeded or failed, and we do not want it to incorrectly report the job as succeeding if the application we wanted to run failed but a copy command in our clean-up section was successful.

The special shell variable $? stores the exit code from the last command. Normally it will be 0 if the command was successful, and non-zero otherwise. But it only works for the last command, so we save it in the variable ECODE.
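
The pattern can be seen in isolation; in this sketch a subshell that exits with status 3 stands in for the real application:

```shell
# Stand-in for the real application: a subshell that exits with status 3
( exit 3 )
# Capture immediately: $? is overwritten by the very next command
ECODE=$?
echo "command exited with code ${ECODE}"
# From here on, $? refers to the echo above, not to our "application" --
# which is exactly why the value had to be saved in ECODE.
```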

Line 78: Copy files from temporary work directory
As stated previously, the /tmp directory will be erased after your job completes. So we need to copy any important files somewhere safe before the job ends. In this case, the only important file is hello.out, which we copy back to the directory from which the sbatch command was run (which is stored in the environmental variable SLURM_SUBMIT_DIR by Slurm when the job starts).
Lines 80-81: Say goodbye
These lines print some useful information at the end of the job. Basically they say that the job finished, print the exit code we stored in ECODE, and then print the date/time of completion using the date command.
Line 84: Exit
This line exits the script, setting the exit code for the script to the exit code of our application that we saved in the environment variable ECODE. This means that the script will have the same exit code as the application, which allows Slurm to better determine if the job was successful or not. If we omitted this line, the error code of the script would be the error code of the last command that ran --- in this case the date command, which should never fail --- so even if your application aborted, the script would return a successful (0) error code and Slurm would think the job succeeded.
Line 85: Trailing blank line
We recommend that you get into the habit of leaving one or more blank lines at the end of your script. This is especially true if you write the scripts in Windows and then transfer to the cluster.

The reason for this is that if the last line does not have the proper line termination character, it will be ignored by the shell. Over the years, we have had many users confused as to why their job ended as soon as it started without error, etc. --- it turned out the last line of their script was the line which actually ran their code, and it was missing the correct line termination character. Therefore, the job ran, did some initialization and module loads, and exited without running the command they were most interested in, because of a missing line termination character (which can be easily overlooked).

This problem most frequently occurs when transferring files between Unix/Linux and Windows operating systems. While there are utilities that can add the correct line termination characters, the easy solution in our opinion is to just add one or more blank lines at the end of your script --- if the shell ignores the blank lines, you do not care.
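
If you suspect Windows line endings, you can also strip the carriage returns yourself. This sketch creates a small CRLF file and cleans it with GNU sed (the dos2unix utility, where installed, does the same job):

```shell
# Create a small demo file with Windows (CRLF) line endings
printf 'echo hello\r\n' > crlf-demo.sh

# Strip the trailing carriage return from every line, in place
# (GNU sed syntax; dos2unix does the same job where installed)
sed -i 's/\r$//' crlf-demo.sh

# Verify: grep -q exits nonzero when no carriage returns remain
if grep -q "$(printf '\r')" crlf-demo.sh; then
    echo "still has CRLF line endings"
else
    echo "clean Unix line endings"
fi
```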

Running the example

The easiest way to run this example is with the Job Composer of the OnDemand portal, using the HelloUMD-Multithreaded template.

To submit from the command line, just

  1. Download the submit.sh script to the HPC login node.
  2. Run the command sbatch submit.sh. This will submit the job to the scheduler, and should return a message like Submitted batch job 23767 --- the number will vary (and is the job number for this job). The job number can be used to reference the job in Slurm, etc. (Please always give the job number(s) when requesting help about a job you submitted).
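
If you want the job number in a shell variable (for example, to script follow-up commands), sbatch's --parsable flag prints just the job id. The sketch below also shows parsing it out of the standard message, demonstrated on the sample number from above:

```shell
# Preferred on a real login node: --parsable makes sbatch print only the
# job id (uncomment to use for real):
#jobid=$(sbatch --parsable submit.sh)

# The same id can be parsed out of the normal sbatch message; here we
# demonstrate on the sample message quoted above:
msg="Submitted batch job 23767"
jobid=$(echo "${msg}" | awk '{print $4}')
echo "Job id is ${jobid}"
```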

Whichever method you used for submission, the job will be queued for the debug partition and should run within 15 minutes or so. When it finishes running, the slurm-JOBNUMBER.out file should contain the output from our diagnostic commands (the time the job started and finished, the module list, etc.). The output of hello-umd will be in the file hello.out in the directory from which you submitted the job. If you used OnDemand, these files will appear in the Folder contents section on the right.

The hello.out file should look something like:

hello-umd: Version 1.5
Built for compiler: gcc/8.4.0
Hello UMD from thread 7 of 10, task 0 of 1 (pid=55936 on host compute-0-0.juggernaut.umd.edu)
Hello UMD from thread 0 of 10, task 0 of 1 (pid=55936 on host compute-0-0.juggernaut.umd.edu)
Hello UMD from thread 9 of 10, task 0 of 1 (pid=55936 on host compute-0-0.juggernaut.umd.edu)
Hello UMD from thread 2 of 10, task 0 of 1 (pid=55936 on host compute-0-0.juggernaut.umd.edu)
Hello UMD from thread 3 of 10, task 0 of 1 (pid=55936 on host compute-0-0.juggernaut.umd.edu)
Hello UMD from thread 4 of 10, task 0 of 1 (pid=55936 on host compute-0-0.juggernaut.umd.edu)
Hello UMD from thread 8 of 10, task 0 of 1 (pid=55936 on host compute-0-0.juggernaut.umd.edu)
Hello UMD from thread 6 of 10, task 0 of 1 (pid=55936 on host compute-0-0.juggernaut.umd.edu)
Hello UMD from thread 1 of 10, task 0 of 1 (pid=55936 on host compute-0-0.juggernaut.umd.edu)
Hello UMD from thread 5 of 10, task 0 of 1 (pid=55936 on host compute-0-0.juggernaut.umd.edu)

There should be two lines (one with the version, one with the compiler) identifying the hello-umd command, followed by 10 messages, one from each of threads 0 to 9 of task 0. They should all show the same pid and hostname, although the pid and hostname for your job will likely differ from those above. The messages may appear in any order, since the threads run concurrently.