Submitting Jobs

Basic Job Submission
Your Job Script
Choosing a Queue
Specifying how long the job will run
Specifying node and core requirements
Specifying memory requirements
Requesting nodes with specific features
Requesting nodes with specific CPUs
Using Infiniband Nodes
Using GPUs
Specifying the amount/type of scratch space needed
Specifying the account to be charged
Specifying email options
Specifying output options
Specifying which shell to run the job in
Specifying the directory to run the job in
Specifying whether or not other jobs can be on the same node
Specifying a reservation

Basic Job Submission

The Zaratan HPC cluster uses a batch scheduling system called Slurm to handle the queuing, scheduling, and execution of jobs. This scheduler is used in many recent HPC clusters throughout the world. This page will attempt to discuss the Slurm commands for submitting jobs, and how to specify the job requirements. For users familiar with PBS/Torque or Maui/Torque or Moab/Torque based clusters, we have a document which translates commonly used commands from those scheduler systems into their Slurm equivalents..

Users generally submit jobs by writing a job script file and submitting the job to Slurm with the sbatch command. The sbatch command takes a number of options (some of which can be omitted or defaulted). These options define various requirements of the job, which are used by the scheduler to figure out what is needed to run your job, and to schedule it to run as soon as possible, subject to the constraints on the system, usage policies, and considering the other users of the cluster. It is also possible to submit an interactive job, but that is usually most useful for debugging purposes.

The options to sbatch can be given on the command line, or in most cases inside the job script. When given inside the job script, the option is placed alone on a line starting with #SBATCH (you must include a space after the SBATCH). These #SBATCH lines SHOULD come before any non-comment/non-blank line in the script --- any #SBATCH lines AFTER a non-comment/non-blank line in the script might get ignored by the scheduler. See the examples page for examples. The # at the start of these lines means they will be ignored by the shell; i.e. only the sbatch command will read them.

Your Job Script

The most basic parameter given to the sbatch command is the script to run. This obviously must be given on command line, not inside the script file. This job script must start with a shebang line which specifies the shell under which the script is to run. I.e., the very first line of your script should generally be either

#!/bin/tcsh

#!/bin/bash

for the tcsh or bash shell, respectively. This must be the first line, and no spaces before the name of the shell. This line is typically followed by a bunch of #SBATCH lines specifying the job requirements (these will be discussed below), and then the actual commands that you wish to have executed when the job is started on the compute nodes.

There are many options you can give to sbatch either on the command line or using #SBATCH lines within your script file. Other parts of this page discuss the more common ones. It is strongly recommended that you include at least the following directives to specify:

how long the job will run
the node and core requirements
the memory requirements
whether or not other jobs can be on the same node
the partition to run in

NOTE: The #SBATCH lines should come BEFORE any non-blank/non-comment lines in your script file. Any #SBATCH lines which come after non-blank/non-comment lines might get ignored by the scheduler.

If your login shell is tcsh (on the Zaratan cluster, your will be your login shell will be bash unless you explicitly changed it), and you are submitting a bash job script (i.e., the first line of your job script is #!/bin/bash), on the Zaratan cluster it is strongly recommended that the first command after the #SBATCH lines is

. ~/.bashrc

This will properly set up your environment on the cluster, including defining the module command. It is also recommended that after that line, you include module load commands to set up your environment for any software packages you wish to use.

The remainder of the file should be the commands to run to do the calculations you want. After you submit the job, the job will wait for a while in the queue until resources are available. Once resources are available, the scheduler will run this script on the first node assigned to your job. If your job involves multiple nodes (or even multiple cores on the same node), it is this script's responsibility to launch all the tasks for the job. When the script exits, the job is considered to have finished. More information can be found in the section on Running parallel codes and in the examples section.

Do NOT run jobs from your home directory; the home directories are not optimized for intensive I/O. You have a scratch directory at /scratch/zt1/PROJECT/USERNAME on Zaratn. Use that directory instead.

Note: If your script does not end with a proper Unix end-of-line (EOL) character, the last line of your script will usually be ignored by the shell when it is run. This can often happen if you transfer files from Windows (which uses different EOL characters) to Unix, and can sometimes be quite confusing, as you submit you job, it runs, and finishes almost immediately without errors, and there is seemingly no output because the last line of your job script, which got ignored, is the command to actually do the calculation. Although you can use a command like dos2unix to fix the script file, it is usually easiest to just remember to add a couple blank lines to the end of your file. This doesn't actually fix the problem, as the script does not end with a proper Unix EOL, but the last line, which gets ignored, is now blank, and you won't care that it got ignored.

Sometimes your job script might need specific information about job parameters specified to sbatch. To facilitate this, the Slurm scheduler will make a number of these values available in environmental variables. These are detailed in the man page for sbatch, and also in this page listing the more commonly used variables. These can be useful in many cases, e.g. if you are running a program on a single node which needs to be passed an argument with the number of threads to run, you probably want to give it $SLURM_NTASKS for this value to ensure that the number you pass it always equals the number of cores requested from Slurm. This can help avoid issues if you change the number of cores requested in a later run, as you only change things in one place now.

Choosing a Queue/Partition

On the Zaratan cluster, you have a choice of several partitions in which to run your job. Partitions are used to separate resources or job types. For example, a partition for jobs that need GPUs, or one for debug jobs. By default, if you do not specify a partition, your job will be placed in the standard partition.

For quick test jobs you may want to use the debug partition. This partition is intended for quick turn around for small debugging jobs, but has a 15 minute maximum walltime so is not suitable for production.

A complete list of partitions and their function can be found here.

To specify a partition, you just give the sbatch command the --partition=PART argument, or equivalently the -p PART argument, replacing PART with the name of the partition. E.g., to submit a job to the debug partition, you could add to your sbatch command the arguments -p debug. Similarly, to submit a job to the scavenger partition, you could add -p scavenger to the sbatch. In either case, you can either append the partition flag to the end of the command line, or add it near the top of your job script with a #SBATCH prefix, e.g. for the debug partition

#SBATCH -p debug

Specifying the Amount of Time Your Job Will Run

When submitting a job, it is very important to specify the amount of time you expect your job to take. If you specify a time that is too short, your job will be terminated by the scheduler before it completes. So you should always add a buffer to account for variability in run times; you do not want your job to be killed when it is 99.9% complete. However, if you specify a time that is too long, you may run the risk of having your job sit in the queue for longer than it should, as the scheduler attempts to find available resources on which to run your job. See the section on job scheduling for more information on the scheduling process and advice regarding the setting of walltime limits. See the section on Quality of Service levels for more information on the walltime limits on the Zaratan cluster.

In general, on the Zaratan cluster, all users can run submit jobs to the standard partition, which currently allows for jobs of up to one week.

To specify your estimated runtime, use the --time=TIME or -t TIME parameter to sbatch. This value TIME can in any of the following formats:

M (M minutes)
M:S (M minutes, S seconds)
H:M:S (H hours, M minutes, S seconds)
D-H (D days, H hours)
D-H:M (D days, H hours, M minutes)
D-H:M:S (D days, H hours, M minutes, S seconds)

NOTE: If you do not specify a walltime, the default walltime on the Zaratan HPC cluster is 15 minutes. I.e., your job will be killed after 15 minutes. Since that is not likely to be sufficient for it to complete, specify a reasonable walltime. This greatly aids the scheduler in making the best utilization of resources.

The following example specifies a walltime of 60 seconds, which should be more than enough for the job to complete.

#SBATCH -n 1
#SBATCH -t 0:60

hostname

Specifying Node and Core Requirements

Slurm provides many options for specifying your node and core requirements, and we only cover the basics here. More details can be found at the official Slurm site. Also see the man pages for sbatch (i.e. man sbatch).

From your job's perspective, of most concern is the number of CPU cores and how they are distributed over the nodes Jobs generally use a combination of MPI tasks and/or multithreading for their parallelism. We let N represent the number of MPI tasks, and M represent the number of threads needed by the job. Most jobs then fall into one of these categories:

sequential jobs: these jobs will run on a single CPU core (and therefore a single node). In this case N = M = 1
shared memory parallel jobs : These jobs use some sort of multithreading (e.g. OpenMP ), and so require M CPU cores on a single node. Here M depends on the code and/or the problem being solved, and N=1.
Pure MPI jobs : jobs require a certain number of CPU cores (one for each MPI task ) but they can be spread out over multiple nodes (and the job generally does not care how they are spread over the nodes). In this case, M=1 and N depends on the code/problem being solved.
Hybrid MPI jobs : these jobs use MPI, but each MPI task uses multithreading . For these jobs, if N is the number of MPI tasks, and M is the number of threads for each task, you want N x M CPU cores, but each set of M cores must reside on the same node. Here both N and M depend on the code and the problem, and will be greater than 1.

The sbatch and related commands provide three options for controlling this behavior:

-n N or --ntasks=N: This sets the number of MPI tasks for the job, which should be one for the sequential and pure multithreaded cases, and for the MPI cases it should be set to the number of MPI tasks desired.
-c M or --cpus-per-task=M: This sets the number of CPU cores to use for each task. All the CPU cores for a given task will be allocated on the same node (although cores from more than one tasks might also be allocated on the same node, as long as all the cores from both tasks fit on the same node). This defaults to one. For sequential jobs and pure MPI jobs, this should be set to one. For pure multithreaded jobs, this should be set to the number of threads desired. Similarly, for hybrid MPI jobs, this should be set to the number of threads desired per MPI task.
-N MINNODES[-MAXNODES] or --nodes=MINNODES[-MAXNODES]: This specifies the number of nodes to use. This flag is generally not needed, and we recommend that you do not use this option and instead use the --ntasks and --cpus-per-task options, and let the schedule work out how many nodes are needed. Using it properly requires a good knowledge of the hardware available on the cluster, and using it improperly can result in your job wasting resources, being overcharged, and/or spending excessive time waiting in the queue. If you use it, you can give a range on the number of nodes to use. If MAXNODES is omitted, it will default to the value of MINNODES. But in general, it is best to omit this entirely and just give --ntasks and --cpus-per-task, and the scheduler will allocate just enough nodes to satisfy those specifications.

So for a sequential job you could use the arguments

#SBATCH -n 1
#SBATCH -c 1

We also provide a sequential examplewith a complete job submission script and a line-by-line explanation.

Similarly, for multithreaded case, e.g. where you require 12 cores on a single node, you could use the arguments

#SBATCH -n 1
#SBATCH -c 12

If you know you'll need 12 cores, but don't care how they're distributed, try the following:

#SBATCH --ntasks=12

myjob

The above might allocate 12 cores on a single node for your job or distribute the tasks over several nodes.

If you are concerned about how your cores are allocated, you can also give the --nodes=NUMNODESDESC or -N NUMNODESDESC. NUMNODESDESC can be of the form MINNODES or MINNODES-MAXNODES. In the former case, MAXNODES is set to the same as MINNODES. The scheduler will attempt to allocate between MINNODES and MAXNODES (inclusive) nodes to your job. So for the above example (--ntasks=12), we might have

all cores assigned on the same node if -N 1 is given.
the cores split among two nodes if -N 2 is given. You might get an even split, 6 cores each node, or an assymetric split, e.g. 4 on one node and 8 on the other. But you will get two distinct nodes.
either of the two above cases if -N 1-2 is given.

In general, for distributed memory (e.g. MPI) jobs, we recommend that most users just specify the --ntasks or -n parameter and let Slurm figure out how to best divide the cores among the nodes unless you have specific requirements. Of course, for shared memory (e.g. OpenMP or multithreaded) jobs, you need to give --nodes=1 to ensure that all of the cores assigned to you are on the same node.

Slurm's sbatch command has a large number of other options allowing you to specify node and CPU requirements for a wide variety of cases; the above is just the basics. More detail can be found reading the man page, e.g. man sbatch

Specifying Memory Requirements

If you want to request a specific amount of memory for your job, try something like the following:

#!/bin/sh
#SBATCH -N 2
#SBATCH --mem=1024

myjob

This example requests a two nodes with at least 1 GB (1024 MB) of memory total each. Note that the --mem parameter specifies the memory on a per node basis.

If you want to request a specific amount of memory on a per-core basis, use the following:

#!/bin/sh
#SBATCH --ntasks=8
#SBATCH --mem-per-cpu=1024

myjob

This requests 8 cores, with at least 1 GB (1024 MB) per core.

NOTE: for both --mem and --mem-per-cpu, the specified memory size must be in MB.

You should also note that node selection does not count memory used by the operating system, etc. So a node which nominally has 8 GB of RAM might only show 7995 MB available. So if your job specified a requirement of 8192 MB, it would not be able to use that node. So a bit of care should be used in choosing the memory requirements; going a little bit under multiples of GBs may be advisible.

Requesting Nodes with Specific features or resources

Sometimes your job requires nodes with specific features or resources. I.e., some jobs will make use of GPUs for processing, or some jobs may require specific processor models. Such requirements need to be told to the scheduler to ensure you get assigned appropriate nodes.

In slurm, we break this situations into two cases:

features: This refers to something which can be present or not on a system, and if it is present, it is available to all processes on the system. (Obviously, if it is not present, it is not available to any processes on the system). A simply boolean present or not present. E.g., the presence of an infiniband adapter, or whether the processors on the system support the SSSE3 instruction set.
resource: This refers to something which not only is present (or not), but has an amount attached to it. Unlike features, resources have a quantity, both in terms of what is present on the node, but also in terms of what is being consumed by jobs running on the node. I.e, a system can have 0, 1, or 2 GPUs. In addition, a job running on a 2 GPU system might consume 0, 1, or 2 of the GPUs.

You can see which nodes support which features and resources with sinfo. By default, this information is not shown. Features can be shown by using the sinfo --Node --long options; resources require additional fields be specified in the --format. To see both, once can use: (a line is printed for each node/partition combination, so we give "-p scavenger" to only see the nodes in the scavenger partition; without that most nodes would appear in triplicate because they belong to the scavenger, standard, and high-priority partitions):

login-1> sinfo -N -p scavenger --format="%N %5T %.4c %.8z %.8m %.8d %25f %10G"
NODELIST      STATE CPUS    S:C:T   MEMORY TMP_DISK AVAIL_FEATURES            GRES
compute-a20-0 alloc   20   2:10:1   128000   750000 rhel6,intel,xeon_e5-2680v (null)
compute-a20-1 alloc   20   2:10:1   128000   750000 rhel6,intel,xeon_e5-2680v (null)
compute-a20-2 alloc   20   2:10:1   128000   750000 rhel6,intel,xeon_e5-2680v (null)
compute-a20-3 idle    20   2:10:1   128000   750000 rhel8,intel,xeon_e5-2680v (null)
compute-a20-4 alloc   20   2:10:1   128000   750000 rhel6,intel,xeon_e5-2680v (null)
...
compute-a20-30 alloc   20   2:10:1   128000   750000 rhel6,intel,xeon_e5-2680v (null)
compute-a20-31 alloc   20   2:10:1   128000   750000 rhel6,intel,xeon_e5-2680v (null)
compute-a20-33 idle    16    2:8:1    64000   340000 rhel8,intel,xeon_e5-2670  (null)
compute-b17-0  mixed   20   2:10:1   128000   750000 rhel6,intel,xeon_e5-2680v gpu:2
compute-b17-1  mixed   20   2:10:1   128000   750000 rhel6,intel,xeon_e5-2680v gpu:2
...
compute-b18-14 mixed   20   2:10:1   128000   750000 rhel6,intel,xeon_e5-2680v gpu:2
compute-b18-15 mixed   20   2:10:1   128000   750000 rhel6,intel,xeon_e5-2680v gpu:2
compute-b18-16 alloc   20   2:10:1   128000   750000 rhel6,intel,xeon_e5-2680v (null)
compute-b18-17 alloc   20   2:10:1   128000   750000 rhel6,intel,xeon_e5-2680v (null)
...
compute-b23-0 idle    40   4:10:1  1020000   750000 rhel6,intel,xeon_e5-4640v (null)
compute-b23-1 mixed   40   4:10:1  1020000   750000 rhel6,intel,xeon_e5-4640v (null)
compute-b23-2 idle    40   4:10:1  1020000   750000 rhel8,intel,xeon_e5-4640v (null)
compute-b23-3 mixed   40   4:10:1  1020000   750000 rhel6,intel,xeon_e5-4640v (null)
compute-b23-4 mixed   40   4:10:1  1020000   750000 rhel6,intel,xeon_e5-4640v (null)
compute-b24-0 alloc   20   2:10:1   128000   750000 rhel6,intel,xeon_e5-2680v (null)
...

To request a specific feature, use the --constraint option to sbatch. In its simplest form, you just give --constraint=TAG, where TAG is the name of the feature you are requesting. E.g., to request a node running Red Hat Enterprise Linux 8, (the rhel8 feature) you would use something like:

#!/bin/tcsh
#SBATCH -t 15:00
#SBATCH --ntasks=8
#SBATCH --constraint="rhel8"

#It is recommended that you add the exact version of the
#compiler and MPI library used when you compiled the code
#to improve long-term reproducibility
module load gcc
module load openmpi
mpirun mycode

The --constraint option can get rather more complicated, as Slurm allows multiple constraints to be given, with the constraints either ANDed or ORed. You can request that only a subset (e.g. 2 out of 4) nodes need to have the constraint, or that either of two features are acceptable, but all nodes assigned must have the same feature. If you need that level of complexity, please see the man page for sbatch (man sbatch).

Resources are requested with the --gres option to sbatch. The usage is --gres=RESOURCE_LIST, where RESOURCE_LIST is a comma delimitted list of resource names, optionally followed by a colon and a count. The resources specified are required on each node assigned to the job. E.g., to request 3 nodes and 2 GPUs on each node (for 6 GPUs total) on the Deepthought2 cluster, one would use something like:

#!/bin/tcsh
#SBATCH -t 15:00
#SBATCH -N 3
#SBATCH -p gpu
#SBATCH --gres=gpu:2

cd /lustre/payerle
./run_my_gpu_code

To get a list of available resources as defined in a cluster, you can use the command sbatch --gres=help temp.sh. NOTE: that temp.sh must be an existing submit script; the basic validation of the script occurs before the --gres=help gets evaluated. When --gres=help is given, the script will not be submitted.

On the Division of IT maintained clusters, the following resources are available:

Resources available on UMD DIT maintained clusters
Resource	Clusters	Description	Comments
gpu:h100	Zaratan	Node has NVIDIA Hopper H100 GPUs (80GB)	This will allocate one or more full H100s
gpu:a100	Zaratan	Node has NVIDIA Ampere A100 GPUs (40GB)	This will allocate one or more full A100s
gpu:a100_1g.5gb	Zaratan	One-seventh Nvidia A100 (5GB)	This will allocate one-seventh of an A100, for those that don't need a full GPU
gpu:p100	Juggernaut	Node has NVIDIA Pascal P100 GPUs	1 node (2 GPUs per node)
gpu:v100	Juggernaut	Node has NVIDIA Volta V100 GPUs	1 node (4 GPUs per node)

NOTE: We no longer allow users to simply specify "gpu"; you must name a specific GPU type, e.g. "gpu:a100"

Similarly, the following features are available on the UMD DIT maintained clusters:

Features available on UMD DIT maintained clusters
Feature	Clusters	Description	Comments
rhel7	Juggernaut	Node is running RHEL7
rhel8	Zaratan Juggernaut	Node is running RHEL8
amd	Zaratan Juggernaut	Node has AMD CPUs
intel	Juggernaut	Node has Intel CPUs
epyc_7763	Zaratan	Node has AMD Epyc 7763 CPUs
epyc_7702	Juggernaut	Node has AMD Epyc 7702 CPUs	Zen2 architecture in "green" partition
xeon_6142	Juggernaut	Node has Intel Xeon 6142 CPUs	Skylake architecture, very limited number (p100 GPU node)
xeon_6148	Juggernaut	Node has Intel Xeon 6148 CPUs	Skylake architecture, most Intel CPUs on "green" partition
xeon_6248	Juggernaut	Node has Intel Xeon 6248 CPUs	Casecadelake architecture, very limited number
xeon_e5_2680v4	Juggernaut	Node has Intel Xeon E5-2680v4 CPUs	Broadwell architecture, most compute nodes in "blue" partition

In the above tables, the clusters column indicates which clustes the feature/resource is available on.

Requesting nodes with specific CPUs

As our clusters are growing, the clusters are becoming less homogeneous. While that might not matter for some workloads, for others you might need a specific CPU architecture.

We have provided features on all of our nodes to help facilitate the specification of the desired CPU architecture. This can be done at various levels, ranging from the most general (whether to use AMD or Intel based cpus with the features "amd" or "intel") to very specific (selecting a specific CPU model).

You can instruct the sbatch command to use a particular subset of nodes matching the desired CPU architectures by adding a --constraint flag, as discussed in the section on 'features'. A listing of the various CPUs you can select from can be found in the table listing the allowed features there.

Using Infiniband Nodes

All nodes in the standard partition of the Zaratan cluster have HDR-100 (100 Gb/s) Infiniband , as do all the compute nodes on Juggernaut. The GPU nodes on Zaratan are connected with HDR (200 Gb/s) Infiniband. So in general, you do not need to request Infiniband nodes, as all the nodes on the default standard partion have it.

On Zaratan, there is also a serial partition. The nodes in this partition do not have infiniband, only 100 Gbps ethernet. But they do have a large amount of RAM (1 TB per node, or 16 GB per core ), and these nodes also have over 10 TB of NVME flash local disk space mounted on /tmp. These nodes are intended for jobs which need to do a lot of disk I/O, as this can be done to and from the very fast NVME storage. These jobs should either fit on a single node or do only do minimal internode communication, or at least not require low latency in such communication.

Using GPUs

Although originally designed to driving high end graphics displays, it turns out the graphical processing units (GPUs) are very good at number crunching, at least for certain types of problems.

The Zaratan cluster has 20 nodes each with 4 Nvidia Ampere A100 (40GB) GPU cards. These are powerful GPUs, and while some jobs require such power, and some may even require multiple such GPUs, many tasks do not require the full power of an A100. However, it is not possible for multiple jobs to share a single GPU, so these smaller tasks still consume a full GPU even though they cannot fully utilize it. The A100 GPUs on Zaratan support multi-instance GPU and the A100 GPUs on several of these nodes have each been split into seven virtual GPUs. Different jobs can run on each of these virtual GPUs at the same time, and therefore this effectively increases the number of GPUs available for jobs, albeit less powerful GPUs, and thereby hopefully increasing the throughput for GPU jobs.

Although there are a lot of cores present in the GPU, they are not compatible with the standard Intel x86 architecture, and codes need to be written especially for these cards, using the CUDA platform . Some applications support for CUDA already, although even in those cases you need to use versions that were built to support CUDA.

See the section on CUDA for more information on using and compiling CUDA and OpenCL programs. See the section on software supporting GPUs for more information on currently installed software which supports GPU processing.

To request GPUs for your job on the Zaratan cluster, you must pass the proper parameters to the sbatch command to tell the scheduler that you want to use GPUs, and how many and what types of GPUs to use. The sbatch command has added a number of arguments for dealing with GPUs, many of which are beyond the scope of this document. The main parameter you must consider is the --gpus flag, this should be provided the type and number of GPUs being requested, in a format like name:number. The number specifies the number of GPUs of the particular type being requested. The name: part is optional; if omitted, the flag only specifies the number of GPUs to allocate, without respect to the GPU model (i.e., you will be allocated any GPUs). E.g.

#SBATCH --gpus=a100:2

to request two a100 GPUs or

#SBATCH --gpus=a100_1g.5gb:1

to request a single small (1/7 of an A100) multi-instance virtual GPU. The full list of GPUs available on Zaratan are:

GPU types and SU costs on Zaratan
Slurm/sbatch label	Description	GPU memory	Hourly SU cost per GPU	Total Number of GPUs on Zaratan
h100	A full physical H100	80 GB	144 SU/hour/GPU	32
a100	A full physical A100	40 GB	48 SU/hour/GPU	76*
a100_1g.5gb	1/7 of an A100	5 GB	7 SU/hour/GPU	28*

Note: *: The number of physical A100 GPUs that are split into smaller virtual GPUs, and potentially the sizes of these smaller virtual GPUs, is subject to fluctation without advanced notice as we gauge how to best distribute the resources to meet user demand (and as the demand changes). The numbers listed are accurate as of the time this was written, and since there are currently 80 physical A100 GPUs on Zaratan, the total of the number of a100 GPUs plus 1/7 of the number a100_1g.5gb virtual GPUs will equal 80.

On Zaratan, we have split the GPU nodes into a separate partition to avoid GPU nodes being consumed by non-GPU jobs. So when submitting jobs requiring GPUs, you also need to specify the GPU partition. You can do this with the -p gpu option to sbatch.

For example:

#!/bin/bash
#SBATCH -t 15:00
#SBATCH -n 8
#SBATCH --gpus=a100:4
#SBATCH --partition=gpu

cd ~/scratch
./run_my_gpu_code

The example above requests 8 CPU cores and 4 physical a100 GPUs for 15 minutes.

GPUs are expensive, and will use up more of your allocation accordingly. At present, a single GPU hour costs about 48 times as much as a single CPU core hour. Consequently, for each GPU hour you use, you will be consuming 48 SUs . You may want to consider the use of a fractional GPU if you do not need the power of a full GPU. At present this functionality is experimental, and Zaratan provides 28 fractional GPUs where four of the A100s have been divided into sevenths. (Why sevenths? Ask Nvidia!) To access these GPUs, you can request --gpus:a100_1g.5gb:1. Note that the fractional GPUs cannot talk to each other, so if you're experimenting with multi-GPU codes you'll need to use the full A100s for that. These fractional GPUs are still more expensive than a CPU, but less expensive than a full physical GPU; in the case of an a100_1g.5gb GPU, it costs 7 times a single CPU core hour.

Currently, the GPU nodes on Zaratan have 4 GPUs each. It is possible for four single GPU jobs to run on the same node in "shared" mode. Slurm will set the environmental variable CUDA_VISIBLE_DEVICES to the GPU(s) which it allocated to your job, e.g. to 0 if it assigned you only the first GPU, 1 if it assigned only the second, or 0,1 if it assigned both. By default, CUDA will use this variable and will only use the specified GPU(s). So two single GPU CUDA jobs should be able to coexist on the same node without interfering with each other. (However, problems might occur if one of the jobs is not CUDA based, or if the job does stuff it should not be doing.)

Specifying the Amount/Type of Temporary Space Needed

All of the compute nodes have a file system mounted at /tmp which you can use to read and write to while your job is running. This is temporary space, and your files will be deleted once your job completes. If your job requires more than a small amount (10GB) of local temporary space, it would be a good idea to specify how much you need when you submit the job so that the scheduler can assign appropriate nodes to you.

All nodes on Zaratan currently have at least 1 TB of temporary space, based on solid state disk technology . The nodes in the serial partition on Zaratan have at least 10 TB of NVMe based SSD temporary disk space, for extra performance. On Juggernaut, the Intel based nodes have at least 700GB of space on /tmp, based on traditional spinning hard drives, while the AMD based nodes have at least 300 GB of SSD based temporary space.

The following example specifies a scratch space requirement of 5GB. Note however that if you do this, the scheduler will set a filesize limit of 5GB. If you then try to create a file larger than that, your job will automatically be killed, so be sure to specify a size large enough for your needs.

#!/bin/sh
#SBATCH --ntasks=8
#SBATCH --tmp=5120

myjob

Note that the disk space size must be given in MB.

Specifying the account to be charged

All users of the cluster belong to at least one project associated with the cluster, and each project has at least one account its users can charge against.

Jobs charged to normal accounts take precedence over low-priority (e.g. scavenger queue) jobs. No job will preempted by another job (i.e., kicked off a node once it starts execution) regardless of priority, with the exception of jobs in the scavenger queue, which will be preempted by any job with a higher priority.

To submit jobs to an account other than your default account, use the -A option to sbatch.

login-1:~: sbatch -A test test.sh
Submitted batch job 4194

If no account is explicitly specified, your job will be charged against your default account. You can view and/or change your default account with the sacctmgr list user command. We recommend adding the -p flag when using that command, as otherwise the default allocation accounts will tend to be truncated. The following example shows how the user payerle would change his default allocation account from test to tptest using the sacctmgr command; you should change the user and allocation account names appropriately.

login-1:~: sacctmgr list user payerle -p
User|Def Acct|Admin|
payerle|test|None|

login-1:~/slurm-tests: sacctmgr modify user payerle set DefaultAccount=tptest
 Modified users...
  payerle
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
login-1:~/slurm-tests: 
login-1:~/slurm-tests: sacctmgr list user payerle -p
User|Def Acct|Admin|
payerle|tptest|None|

If you belong to multiple projects, you should charge your jobs against an account for the appropriate project (i.e. if your thesis advisor is Prof. Smith, and you are doing work for Prof. Jones, thesis work should be charged against one of Prof. Smith's accounts, and your work for Prof. Jones against one of his accounts).

The above recommendations assume that there are sufficient funds available in your account. If there do not appear to be sufficient funds to complete the job (and all currently running jobs that are being charged against that allocation), then the job will not start.

For more information on accounts, including monitoring usage of your account, see the section Allocations and Account Management.

Email Options

The scheduler can email you when certain events related to your job occur, e.g. on start of execution, or when it completes. By default, any such mail is sent to your @umd.edu email address, but you can specify otherwise with the --mail-user=EMAILADDR flag to sbatch.

You can control when mail is sent with the --mail-type=TYPE option. Valid options are:

BEGIN: when the job starts to execute
END: when the job completes
FAIL: if and when the job fails
REQUEUE: if and when the job is requeued.
ALL: for all of the above cases.

You can give multiple --mail-type=TYPE options to have mail sent for multiple conditions. The following job script will send mail to hpc-wizard@hpcc.umd.edu when the job starts and when the job finishes:

#!/bin/tcsh
#SBATCH --ntasks=24
#SBATCH --mail-user=hpc-wizard@hpcc.umd.edu
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END

start-long-hpc-job

NOTE: It is recommended that you use care with these options, especially if you are submitting a large number of jobs. Not only will you get a large amount of email, but it can cause issues with some email systems (e.g. GMail imposes limits on the number of emails you can receive in a given time period).

Specifying output options

By default, slurm will direct both the stdout and stderr streams for your job to a file named slurm-JOBNUMBER.out in the directory where you submitted the sbatch command. For job arrays, the file will be slurm-JOBNUMBER_ARRAYINDEX.out. In both cases, JOBNUMBER is the number for the job.

You can override this with the --output=FILESPEC (or -o FILESPEC, for short) option. FILESPEC is the name of the file to write to, but the following replacement symbols are supported:

%A: The master job allocation number for job arrays
%a: The job array index number, only meaningful for job arrays.
%j: The job allocation number.
%N: The name of the first node in the job.
%u: Your username

Multiple replacements symbols are allowed in the same FILESPEC. I.e., the default values are slurm-%j.out and slurm-%A_%a.out for simple and array jobs, respectively.

You can also use --error=FILESPEC (or -e FILESPEC) to have the stderr sent to a different file from stdout. The same replacement symbols are allowed here.

If you use the --output or --error options, be sure to specify the full path of your output file. Otherwise the output might be lost.

Specifying the shell to run in

Under Slurm, your job will be executed in whatever shell is specified by the shebang in the script file specifies.

Thus, the following job script will be processed via the C-shell:

#!/bin/csh
#SLURM --ntasks=16
#SLURM -t 00:15

setenv MYDIR /tmp/$USER
...

and the following job script will be processed with the Bourne again shell:

#!/bin/bash
#SLURM --ntasks=16
#SLURM -t 00:15

. ~/.bashrc

MYDIR="/tmp/$USER"
export MYDIR
...

NOTE: If your login shell is csh or tcsh, and you submit a job with a bourne style shell (e.g. sh or bash), your .bashrc will not be read automatically. Therefore you need to include a . ~/.bashrc near the top of your code to get full functionality. This is needed if you wish to load modules, etc. For the Juggernaut cluster, you should similarly include a . ~/.bash_profile command in the corresponding situation.

NOTE: If you wish to use a bourne style shell, we strongly recommend #!/bin/bash instead of #!/bin/sh. Under linux, both run the same bash executable, but in the latter case certain non-backwards compatible features are disabled which can causes problems.

Running Your Job in a Different Directory

The working directory in which your job runs will be the directory from which you ran the sbatch command, unless you specify otherwise. The easiest way to change this behavior is to add the appropriate cd command before any other commands in your job script.

Also note that if you are using MPI, you may also need to add either the -wd or -wdir option for mpirun to specify the working directory.

The following example switches the working directory to /scratch/zt1/project/bee-research/johndoe/my_program

#!/bin/csh
#SLURM -t 01:00
#SLURM --ntasks=24

module load openmpi

cd /scratch/zt1/project/bee-research/johndoe/my_program

mpirun -wd /scratch/zt1/project/bee-research/johndoe/my_program C alltoall

There is also a --workingdir=DIR option that you can give to sbatch (or add a #SBATCH --workingdir=DIR line in your job script), but use of that method is not recommended. It should work for the scratch file systems, but does not work well with any NFS file systems (since these get automounted, using symlinks, and sbatch appears to expand all symlinks, which breaks the automount mechanism).

Specifying whether or not other jobs can be on the same node

The sbatch command has the (mutually-exclusive) flags --exclusive and --oversubscribe (formerly called --share) which control whether the scheduler should allow multiple jobs to co-exist on the same nodes. This is only an issue when the jobs individually do not consume all of the resources on the node; e.g. consider a node with 8 cores and 8 GB of RAM. If one job requests 2 cores and 4 GB, and a second job requests 4 cores and 3 GB, they should both be able to fit comfortably on that node at the same time. If both jobs have the shared flag set, then the scheduler is free to place them on the same node at the same time. If either has the exclusive flag set, however, then the scheduler should not put them on the same node; the job(s) with the exclusive flag set will be given their own node.

If you are running jobs which contain sensitive information, you should ALWAYS submit jobs in exclusive mode to reduce exposure to security threats. By not allowing other jobs (i.e. other users) access to the nodes where your jobs are running, you reduce the exposure to security threats.

The --oversubscribe flag was formerly named --share. The two flags do the same thing, however on the Zaratan and Juggernaut clusters only the --oversubscribe form is now accepted.

There can be a couple of problems with running jobs in shared mode. First, if your job is processing sensitive information, allowing other jobs (potentially owned by other users) to run on the same node(s) as your job increases your exposure to potential security threats/exploits. It is strongly recommended that jobs processing sensitive information always run in exclusive mode.

Also, there is the possibility of potential interference between the jobs. First off, to optimize performance, we are not perfectly enforcing the core and memory usage of jobs, and it is possible for a job to "escape" its bounds. But even assuming the jobs keep within their requested CPU and memory limits, they still would be sharing IO bandwidth, particularly disk and network, and depending on the jobs this might cause significant performance degradation. On the other hand, it is wasteful to give a smaller job a node all to itself if it will not use all the resources on the node.

From the perspective of you, the user, this potential for interference between jobs means that your job might suffer from slower performance, or even worse, crash (or the node it is running on crash). While that might make submitting jobs in exclusive mode seem like the easy answer, that could significantly impact utilization of the cluster. In other words, if you submit a job in exclusive mode, we will have to charge you for all the cores on the node, not just the ones you asked to use, for the lifetime of your job (since no one else can use those cores). Thus, the funds in your allocation will be depleted faster.

The default behavior on the Zaratan cluster is that all jobs will run in oversubscribe mode. If you want your jobs to run in exclusive mode, you will need to ask for it explicitly.

Specifying a reservation

On rare occasions, a reservation might be set up for a certain allocation account. This means that certain CPU cores/nodes have been reserved for a specific period of time for that allocation. This is not done often, and when it is done it is typically reserving some nodes for a class, during class hours only, so that students can launch jobs and get the results back while the class is still in session (and the instructor is still available to assist them with issues). Again, this is only done rarely, and you should have been informed (e.g. by your instructor) if that is the case. Most users do not have access to reservations and can safely ignore this section.

If you do have access to a reservation which is active, you can submit jobs which can use the reserved resources by adding the following flag to your sbatch command: --reservation=RESERVATION where RESERVATION is the name of the reservation (which should have been provided to you, e.g. by your instructor). If you were not informed of a reservation name, your allocations probably do not have reservations and this section does not apply to you. The --reservation=RESERVATION flag can either be given as an explicit argument on the sbatch command line, or as a #SBATCH --reservation=RESERVATION line in your job script.

NOTE: to effectively use a reservation, the following conditions must hold:

You must be charging the job to an allocation account that has access to the reservation. For class reservations, this typically means that you must be submitting the job from your class temporary login account, and charging it to the class allocation account.
You must specify that the job should use the reservation, i.e. use the --reservation flag described above.
The reservation must be "active". Class reservations are typically only active during the hours the class meets, and often only on specific days that the class is meeting. If you submit a job specifying a reservation when the reservation is not active, instead of expediting things it will likely delay the job until the reservation becomes active.