This is a "quick start" introduction to using the HPC clusters at the University of Maryland. It covers the general activities most users will deal with when using the clusters.
MATLAB DCS Users: The system interaction for users of Matlab Distributed Computing Server is rather different from that of other users of the cluster, and so is not covered in this document. Please see Matlab DCS Quick Start.
- Logging into one of the login nodes
- Submitting a job
- Monitoring job status
- Monitoring your allocation
Prerequisites for the Quick Start
This quick start assumes that you already
- have a TerpConnect/Glue account (REQUIRED for the Deepthought clusters, advisable for the others)
- have access to the cluster (i.e., have an allocation or have been granted access to someone's allocation)
- know how to use ssh
- have at least a basic familiarity with Unix
Logging into one of the login nodes
All of the clusters have at least 2 nodes available for users to log into. From these nodes you can submit and monitor your jobs, look at results of the jobs, etc.
DO NOT RUN computationally intensive processes on the login nodes! Such processes violate policy, interfere with other users of the clusters, and will be killed without warning. Repeated offenses can lead to suspension of your privilege to use the clusters.
For most tasks you will wish to accomplish, you will start by logging into one of the login nodes for the appropriate cluster. These are:
- For the original Deepthought cluster, login.deepthought.umd.edu
- For the Deepthought2 cluster, login.deepthought2.umd.edu
- For the Bluecrab cluster, login.marcc.jhu.edu
From a Unix-like system, you would then use commands like:
    # To ssh to a Deepthought login node from a Glue/TerpConnect
    # system, or another system where your username is the same on
    # both systems
    ssh login.deepthought.umd.edu

    # To ssh to a DT2 login node, assuming your username on the system you
    # are ssh-ing from does NOT match your DT2 username. Here we
    # are assuming johnsmith is your DT2 username
    ssh johnsmith@login.deepthought2.umd.edu
    # or
    ssh -l johnsmith login.deepthought2.umd.edu

    # The same as the above, but to a Bluecrab login node
    ssh "firstname.lastname@example.org"@login.marcc.jhu.edu
    # or
    ssh -l email@example.com login.marcc.jhu.edu

    # To connect to DT2 with a tunnelled X11 connection for graphics as well
    # If your username is the same on both systems
    ssh -X login.deepthought2.umd.edu
    # or if they differ
    ssh -X -l johnsmith login.deepthought2.umd.edu
    # or
    ssh -X johnsmith@login.deepthought2.umd.edu
Submitting a job
Next, you'll need to create a job script. This is just a simple shell script that will specify the necessary job parameters and then run your program.
Here's an example of a simple script, which we'll call test.sh:
    #!/bin/tcsh
    #SBATCH -t 1:00
    #SBATCH -n 4
    #SBATCH --share
    module load python/2.7.8
    hostname
    date
The first line, the shebang, specifies the shell to be used to run the script. Note that you must have a shebang specifying a valid shell in order for Slurm to accept and run your job; this differs from Moab/PBS/Torque, which ignore the shebang and run the job in your default shell unless you gave qsub an option for a different shell.
The next three lines specify parameters to the scheduler.
The first of these, -t, specifies the maximum amount of time you expect your job to run. This parameter accepts the following formats for the duration of the job:
- MM:SS (minutes and seconds)
- HH:MM:SS (hours, minutes, and seconds)
- DAYS-HH (days and hours)
- DAYS-HH:MM (days, hours, and minutes)
- DAYS-HH:MM:SS (days, hours, minutes, and seconds)
You should specify a reasonable estimate for this number, with some padding. If you specify too large a wall time limit, it can negatively impact the queueing of this or other jobs of yours (see e.g. this FAQ and this FAQ). Too large a wall time limit can also cause excessive consumption of your allocation's funds by misbehaving jobs. However, make sure you specify enough time for properly behaving jobs to complete, because once the wall time limit is hit, your job WILL be terminated.
If you fail to specify a walltime limit, it defaults to 15 minutes. Since this is insufficient for most HPC jobs, you should always specify a walltime limit.
In this example, we requested 1 minute of walltime. That is a short time, but our code is quite trivial, so 1 minute is more than sufficient.
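To make the accepted time formats concrete, the following sketch converts a -t value into seconds. The function to_seconds is a hypothetical helper written for illustration and is not part of Slurm. It handles only the formats listed above, plus a bare number of minutes, which Slurm also accepts.

```shell
#!/bin/sh
# Illustrative helper (not part of Slurm): convert a -t value to seconds.
to_seconds() {
    printf '%s\n' "$1" | awk '{
        days = 0; rest = $0
        if (index(rest, "-") > 0) {
            # DAYS-HH, DAYS-HH:MM, or DAYS-HH:MM:SS
            split(rest, d, "-")
            days = d[1]; rest = d[2]
            n = split(rest, f, ":")
            h = f[1]; m = (n >= 2) ? f[2] : 0; s = (n >= 3) ? f[3] : 0
        } else {
            n = split(rest, f, ":")
            if (n == 3) {
                # HH:MM:SS
                h = f[1]; m = f[2]; s = f[3]
            } else {
                # MM:SS, or a bare number of minutes
                h = 0; m = f[1]; s = (n == 2) ? f[2] : 0
            }
        }
        print days * 86400 + h * 3600 + m * 60 + s
    }'
}

to_seconds 1:00         # the sample script's limit: prints 60
to_seconds 1:30:00      # prints 5400
to_seconds 2-12         # prints 216000
to_seconds 2-12:30:15   # prints 217815
```

Note that a two-field value such as 1:00 is read as minutes and seconds, not hours and minutes, which is an easy mistake to make when requesting a limit.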
The second line, -n, tells the scheduler how many tasks your job will have, and by default Slurm assigns a distinct core to each task. This method of specification does not care how those cores are distributed across machines or how those machines are configured, and that is sufficient for many MPI jobs. But Slurm allows for quite detailed specifications of CPU and node requirements, as briefly described here and in the examples page.
In this example, we are requesting 4 cores (which is way more than needed for this trivial example). We do not specify how Slurm should allocate them across nodes; most likely we will get all 4 cores on a single node, but that is NOT guaranteed. We could possibly get one core on each of 4 nodes, or some allocation of 4 cores on 2 or 3 nodes.
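If a job needs all of its cores on one machine (for example, a multithreaded program that cannot span nodes), the request can be made explicit. The fragment below is a sketch that pairs -n with Slurm's standard -N (--nodes) option. It is an illustrative variant of the sample script's directives, not a recommendation taken from the cluster documentation.

```shell
#!/bin/sh
# Four tasks; by default Slurm assigns one core to each task.
#SBATCH -n 4
# Require that all four tasks be placed on a single node.
#SBATCH -N 1
```

Constraining the layout this way can lengthen the time the job waits in the queue, since the scheduler must find one node with 4 free cores.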
The third line, --share, is important from the perspectives of billing and efficient use of the cluster. When scheduling jobs, you have a choice of whether other jobs (either your own or someone else's) can coexist on the same node or nodes. Although Slurm will not overcommit resources on a node, not everyone specifies all the resources they need. And even if both jobs do specify the resources they need, they can still interfere with each other if both are heavily using the disk or network. Or, in the most extreme example, if one job does something which causes the system to crash, both jobs die.
On the other hand, if jobs cannot share nodes, the cluster will not be used as efficiently. For example, if this sample job did not allow other jobs to share a node with it (i.e., requested exclusive access to the nodes) and was assigned to a node with 20 cores, 16 of those cores would be idle while this job is on the node, which is not very efficient from the perspective of cluster utilization. Nor is it efficient from the perspective of billing: we charge jobs based on the number of cores consumed, not the number actually used, so in this exclusive mode case the job would be charged for 20 cores for the lifetime of the job even though it only requested 4 (since 20 cores are made unavailable to other jobs).
In the actual sample script, we request --share for shared access. In this case, if the job is assigned to 4 cores of a 20-core node, the other 16 cores are still available to other jobs, which should improve cluster utilization. And the job is only charged for 4 cores for its lifetime.
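The billing difference between the two modes can be worked through with a little shell arithmetic. This is a simplified sketch for a job that fits on a single node. The function name charged_cores and the 2-hour run time are hypothetical, and core-hours is assumed as the unit of charge purely for illustration.

```shell
#!/bin/sh
# Cores charged to a single-node job:
#   exclusive mode -> every core on the node (none are left for other jobs)
#   shared mode    -> only the cores the job requested
charged_cores() {
    mode=$1; requested=$2; node_cores=$3
    if [ "$mode" = exclusive ]; then
        echo "$node_cores"
    else
        echo "$requested"
    fi
}

hours=2   # hypothetical run time
echo "exclusive: $(( $(charged_cores exclusive 4 20) * hours )) core-hours"
echo "shared:    $(( $(charged_cores shared 4 20) * hours )) core-hours"
```

Under these assumptions, the same 4-core job costs 40 core-hours in exclusive mode and 8 core-hours in shared mode.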
By default, jobs requesting only a single core are run in shared mode, and those requesting more than one core are run in exclusive mode. But you can override this with the --share and --exclusive flags. It is advisable that large parallel jobs run in exclusive mode, since these tend to use most if not all of the cores on a node anyway.
The remaining lines in the file are just standard commands; you will replace them with whatever your job requires. In this case, once the job runs, it will print the hostname and the time to the output file. The script will be run in whatever shell is specified by the shebang on the first line of the script. NOTE: unlike with the Moab scheduler, you MUST provide a valid shebang on the first line.
To submit your job, we just use the sbatch command:
    login-1:~: sbatch test.sh
    Submitted batch job 13222
The number that is returned to you is the identifier for the job, and you should use that anytime you want to find out more information about your job. For information on how to verify that your job is running, see the section Monitoring and Managing Your Jobs.
Once your job completes, unless you've specified otherwise, your output and any errors that occur will be written to a file in the same directory from which you submitted your job. The file will be named slurm-NNNNN.out, where the Ns are replaced by the job ID.
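Because the default output file name is built from the job ID, a script that submits jobs can derive that name from sbatch's confirmation line. The sketch below parses the sample confirmation line rather than calling sbatch, so it runs on any system. The name job_id_from_sbatch is a hypothetical helper.

```shell
#!/bin/sh
# Extract the job ID from sbatch's confirmation line.
job_id_from_sbatch() {
    # "Submitted batch job 13222" -> "13222" (the last space-separated word)
    echo "${1##* }"
}

# In a real session this would be: line=$(sbatch test.sh)
line="Submitted batch job 13222"
jobid=$(job_id_from_sbatch "$line")

echo "job id:      $jobid"
echo "output file: slurm-${jobid}.out"

# On a cluster, the job's state could then be checked with:
#   squeue -j "$jobid"
```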
Note that by default when you log in to one of the clusters, you are sitting in your home directory, and all output and submissions will be transferred to and from your home directory. For best performance, you should consider running your jobs from a space set aside for them. See Files, Storage, and Securing Your Data, the Specifying which directory to run the job in page, and the examples for more information.
Here's what you should see when your job completes:
    l:~: cat slurm-13222.out
    compute-2-39.deepthought.umd.edu
    Wed May 21 18:38:06 EDT 2014
As you can see in the output file above, the script ran and printed the hostname and date as specified by the job script.