Glossary of High Performance Computing Terminology
High Performance and related computing paradigms has its own set
of terminology which might be confusing to those new to these fields.
This page tries to define the terminology as used throughout this web site
in simple language.
Terms are listed in alphabetical order, with the following links
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
A
- (hardware) accelerator
- While traditionally in computing the computation takes place on a
general purpose CPU, many workloads can obtain great
improvements in performance when run on specialized processors like
GPUs,
field-programmable gate arrays
and other technologies. The general term for these specialized processors
is
accelerators
or hardware accelerators
.
These accelerators typically must be placed in a node
containing a generalized CPU, and programs must be
specifically coded to take advantage of these technologies. These accelerators
can often be expensive, with nodes containing the accelerators generally
costing several times the cost of a comparable node without the accelerators,
but the performance gains for suitable workloads can be very significant.
Because of this, typically only a small subset of the nodes
have accelerators, and you will need to
instruct the scheduler to use those
nodes.
- Account
- The term "account" can have two disparate usages. The first usage,
sometimes distinguished with the terms "user" or "login", refers to the
user credentials that you use to log into the cluster, and other attributes,
etc. connected with your user identity on the cluster. This sense of the
word "account" is similar to the login/user accounts on other computer
systems.
The second usage refers to allocation accounts,
sometimes simply referred to as allocaitons.
- Allocation
-
Allocations are the means by which HPC resources are made available
to users. Every allocation consists of a set of resources to which HPC members
have been granted access; this consists not only of the type of resource, but
also the amount of that resource that can be used. E.g., allocations will
typically include access to compute resources; this will include the
specification of the cluster access is granted to, as well as the number of
SUs over a time period (and what time period) the members
of the allocation have access to. Allocations typically will give access to
storage resources as well; this will include the specification of which storage
resources (e.g. high-performance or
SHELL storage, and the limits on the amount of data that
can be stored there.
Every allocation is contained within a single project
which represents the research group the allocation is for. Every member
of an allocation must also belong to the containing project. So
every user of the cluster must be a member of a project
and a member of an allocation in that project which provides resources on
that cluster.
- Auristor file system
- The Auristor file system is the
distributed
file system used to back the SHELL storage tier on
the Zaratan
cluster. It is an enhanced version of the
AFS
file system long used on the campus Glue/TerpConnect Unix clusters. It uses
Kerberos based
authentication, and provides a globally distributed file system.
See also:
B
- Backfilling of jobs
- Backfilling is a technique used by the
scheduler to increase the utilization of the cluster. When enabled, it
allows the scheduler to run smaller, shorter jobs from
further down the queue when the job at the top of the
queue is waiting for resources, provided that the smaller, shorter jobs are
expected to complete before the additional resources needed by the larger
job are expected to become available.
For example, consider a small cluster with ten nodes (node #1 to node #10),
and five of the nodes (nodes #1 to #5) are currently in use by jobs which
have another 12 hours of walltime remaining,
and the remaining five nodes (#6 to #10) are idle. At the top of the
queue is a job (job #1) that needs 8 nodes --- since only five nodes are idle,
it is currently pending and the scheduler estimates that it will remain
pending for another 12 hours (when some of
the jobs currently running finish and at least three of nodes #1 to #5 become
idle). Next in the queue are
a bunch of small jobs (jobs #2-6) that need only node for 2 hours.
Without backfilling, the scheduler would hold nodes #6 through #10 for
job #1, and wait until at least three of the
nodes #1 through #5 become idle (which will happen when the jobs running on
them finish), and then allocate nodes to and start job #1.
Since it is expected take about 12 hours for the jobs on nodes #1 through #5
to finish, nodes #6 through #10 will remain idle for about 12 hours.
With backfill, the scheduler sees that jobs #2 through #6 are small, and that
if it started them on the nodes #6 through #10, they could all run and
finish within 2 hours (since the scheduler will kill them if they run past
2 hours), and so running those jobs should not delay the start of job #1.
This allows for more efficient use of resources.
Of course, in reality things are not quite so black and white. The scheduler
knows the walltimes requested by jobs, but a job can finish before its walltime
is up, and indeed they usually will (since the scheduler will terminate jobs
if they exceed their requested walltimes, users are encouraged to provide a
bit of padding in their walltime estimates to ensure the job will finish before
then). However, if the walltimes provided by the users are somewhat reasonable
estimates of the actual runtime, even if padded by 20 or 30% for safety, this
backfill process can help keep the queue moving without introducing much delay
into the larger jobs.
- BeeGFS file system
-
BeeGFS is a high-performance, parallel file system. It
is used to provide the scratch tier storage resource
on the
Zaratan
high-performance computing cluster.
See also:
- Bit
- A bit (from
b
inary digit
) is the smallest unit
of information in a classical computer system. It can hold one of two
values, typically denoted as 0 or 1 (although other labels like true/false
or on/off are equally valid). As such, a bit can be represented by a
simple switch which can be in either an on or off position, which in the most
elementary sense forms the basis of digital electronics.
Although a single bit can only hold either
a 0 or 1, it is possible to string multiple bits together to hold arbitrarily
large numbers (limited only by the number of bits); a common example is
the byte.
See also:
- Glossary entry for byte
- Glossary entry for qubit
- Wikipedia entry for bit
- Byte
- A byte is a collection of 8 bits,
and as such it can take on any of 2^8 distinct values, typically expressed
as the integers between 0 and 255. Bytes are also commonly used to
represent a single character in many language sets, or instructions in
the machine level language for computer codes, etc. Bytes are typically the
smallest addressable unit of information, and as such are often used as the
unit by which digital information is quantified; i.e. the size of
memory used by a program or job, or the size of
files stored on disk, etc. --- since a byte is a rather small unit by today's
standards, this typically is used with a metric multiplier prefix like
k (1000x), M (1,000,000x), G(10^9x), T(10^12x), etc.
See also:
C
- Checkpointing
- Checkpointing is the process by which a program or job periodically saves
its state to disk so that if the code is terminated for some reason, the state
information can be used to resume the calculation from when the checkpoint
was made. This is commonly used to provide some resilience to jobs, especially
long jobs, from hardware and other failures --- when the job is resubmitted,
it can (with appropriate configuration) resume from the last checkpoint rather
than from the very beginning.
See also:
- compute nodes
- A compute node is a node on the cluster
which is intended for heavy computation. These are the "work horses" of
the cluster, where all the real computation gets done and most of the nodes fall
into this category. These nodes can be
further specialized into compute nodes with specific features, like nodes with
large amounts of RAM (large memory nodes) or nodes which have
GPUs or other specialized
accelerators (e.g. GPU nodes).
- Controlled Unclassified Information (CUI)
- Controlled Unclassified Information (CUI) is an
unmbrella term to describe data which is not classified
by the US government but which still has some data sensitivity
which requires it to safeguarded. This encompasses various
categories of sensitive data which is not formally classified,
including personally identifiable information (PII), HIPAA,
and similar.
- core
- A computer core is the basic unit for processing on a
computer processor or GPU. Each core
is capable of performing an independent calculation at the same time, so
in the high performance computing paradigm we typically want
to
parallelize
code with each parallel thread of execution running on a
separate core to gain maximum performance.
From a software perspective, each core looks much like a CPU
, and the cores are sometimes referred to as CPUs, but for this website
we will try to always use CPUs to refer to the chip/socket, and cores (or
CPU cores) for the basic processing units within the CPU.
See also:
- CPU
- The acronym CPU stands for the "central processing unit" of the system,
traditionally the "brain" behind the computer. It is where most of the
processing traditionally occurs.
The CPU consists of one or more cores
(modern CPUs almost always consist of
multiple cores, the CPUs on the Deepthought2 cluster have 10 cores per CPU,
some recent chips have upwards of 60 cores per CPU). The CPU is the
physical "chip"; many nodes have multiple CPUs. The CPU
is sometimes referred to as a "socket" to more clearly distinguish it from
the cores --- from a software perspective the cores look
like CPUs, and some places use the term CPU to refer to cores, so the socket
terminology is useful. On this web site, we will try to always use the term
CPU to refer to the chip/socket, and refer to cores as cores (or sometimes
CPU cores). Note that the Slurm scheduler (e.g. sbatch
and
other commands), tends to use the term CPU to refer to a core, and uses
the term socket
when refering to the physical processor.
See also:
- CUDA
- CUDA is a platform for
parallel computing, primarily used for doing
general purpose computing on GPUs. CUDA can be accessed
via specific libraries, use of compiler directives like
OpenACC, and the specialized
nvcc
compiler.
See also:
- CUDA compute capability
- Different GPUs support different features and command sets. Generally
this is a cumulative hierarchy, so later GPUs support all of the features
of earlier GPUs, and have some additional capabilities not supported by the
earlier ones. This hierarchy is encapsulated as the CUDA compute
capability, which is a simple version number which specifies the features
supported by a given GPU and CUDA driver library. E.g., double precision
floating point operations were introduced in CUDA compute capability 1.3; GPUs
supporting only 1.0, 1.1, or 1.2 do not support double precision, but GPUs
supporting 1.3 or higher do.
The following list gives the CUDA compute capability for GPUs on the various
UMD clusters:
- K20 GPUs: 3.5 (Deepthought2 cluster)
- K80 GPUs: 3.7 (Bluecrab cluster)
- P100 GPUs: 6.0 (Bluecrab and Juggernaut clusters)
- V100 GPUs: 7.0 (Juggernaut cluster)
- A100 GPUs: 8.0 (successor to Deepthought2 ???)
See also:
D
- data transfer nodes (DTNs)
- These are nodes optimized for the transferring of the large amounts of
data, both data sets needed for jobs running on the cluster, and moving results
of finished jobs off the cluster. You should
not run any computationally intensive processes on the login nodes.
- Distributed Memory Parallelism
- Distributed memory parallelism is a paradigm for
parallel computing
in which the parallel processes do not all share a global memory address space.
Because of this, communication among the different processes cannot occur only
over shared memory. This is in contrast with
shared memory parallelism
which can use the shared memory space for inter-process communication.
Distributed memory parallelism is typically implemented by using the
Message Passing Interface (MPI) for communication between the
tasks. Although programming using MPI is generally harder
than using shared memory paradigms like OpenMP, it has
the advantage that it is not restricted to a single node.
E
- Embarrassingly parallel
- A workload or problem is embarrassingly parallel (also referred
to a perfectly, delightfully, or pleasingly parallel) is a workload
such that there no or only minimal communication required between
the parallel tasks. Because of this, such workloads can be parallelized
with only mimimal effort.
Grid searches of a parameter space are often a common case of an
embarrassingly parallel problem. Consider a case wherein you have
an N dimensional grid of parameters, and you need to evaluate some
(potentially complicated) function for the parameters at each point
in the grid. Typically the evaluation at any point in the grid is
independent of any other evaluation, and these evaluations can be
perofrmed as a large number of sequential jobs. The sequential
jobs can be sequentially or simultaneously or in some combination
of the two.
See also:
F
- FLOPS
- FLOPS, short for FLoating-point
OPerations per Second is a measurement
of computer performance, and as the name suggests, basically measures the
maximum number of calculations using floating point arithmetic that a
computer can perform per second. This is the measure commonly used to
compare the performance of HPC clusters, except that we usually talk in terms
of teraFLOPS (1 TFLOPS = 10^12 FLOPS) or petaFLOPS (1 PFLOPS = 10^15 FLOPS)
or even exaFLOPS (1 EFLOPS = 10^18 FLOPS).
See also:
G
- Gibibyte (GiB)
- This is the binary unit corresponding to the
decimal unit gigabyte,
with 1 GiB = 2^10 MiB = 1024 MiB = 2^30 B = 1024^3 B = 1,073,741,824 B =
1.073 GB. This is approximately the size of half an hour of video.
See also:
- Gigabyte (GB)
- This is the decimal unit corresponding to one billion (10^9)
bytes.
This is approximately the size of half an hour of video.
See also:
- Graphics Processing Unit (GPU)
- Graphics Processing Units (GPUs) are hardware
accelerators which were initially designed to
facilitate the creation of graphics for output to a display device. However,
their highly parallel
structure makes them more efficient then general
purpose CPUs for certain types of algorithms, and so they
are quite useful for processing certain "number crunching" workloads.
High-end GPUs can be expensive, and are rapidly evolving, so only a small
subset of nodes typically have GPUs, and if you wish to use them you will
need to instruct the scheduler to
use those nodes. Although a GPU contains a large number of
cores, they are not compatible with the standard Intel
x86 architecture, and codes need to written (and compiled) especially for
these devices, typically using the
CUDA or OpenCL platforms.
Some of the
applications in the standard software
library, but even in those cases you need to use the versions which
specifically support CUDA.
See also:
H
- high-performance storage system (HPFS) (aka scratch filesystem)
-
A high-performance storage system is a disk-like storage system
designed to support sustained high level of input/output operations from many
processes across many nodes. This is generally achieved by spreading data
across many file servers. In particular, with appropriate settings, the data
for a single file may be striped across multiple file servers, thereby
potentially allowing the many different tasks of a large
parallel job to access different parts of the same
file without overwhelming a single file server. For this reason, HPFS systems
are ofter referred to as parallel file systems.
The large number of file servers and high performance makes the HPFS tier of
storage more expensive than less performant tiers. As such, HPFS tiers are
often considered scratch storage, as they are only intended for the
storage of data (either input to or output from) jobs that are currently
running or that would be running in the near future. Typically, input data
should be downloaded to the HPFS (or copied from medium term
or longer tiered storage), the job submitted, the job runs to completion,
and then the inputs are deleted (or returned to longer term storage tiers),
temporary files from the job are deleted, and precious output is moved to
longer term storage. It is the
policy on UMD HPC systems
that scratch space is only for data related to active jobs on the cluster, i.e.
jobs that are running, in the queue, that completed recently, or that you plan
to run in the near future. Please delete all unneeded files in a timely
fashion, and move and data off the scratch file system when no longer needed
for active jobs.
See also:
- hybrid parallelism
-
Hybrid parallelism is a paradigm for
parallel computing in which some of the
parallelism is done using
distributed memory parallelism
for some of the parallelism, and
shared-memory parallelism for the
rest. This is commonly done by using MPI for the
distributed-memory type, with each MPI task being
multithreaded.
This technique is not common, but typically occurs when the problem can be
decomposed for parallelization in two distinct manners, with the code
using one paradigm for one decomposition and the other for the other.
Even for codes which support hybrid parallelization, then benefits and the
performance gains might be highly dependent on the problem being solved. So
it is advisable to benchmark to find the optimal configuration.
See also
- High Performance Computing (HPC)
- High Performance Computing is a general term for computing with a high
level of performance. Generally high performance computing specifically
refers to running jobs which are very parallel, often running on hundreds or
even thousands of cores.
However, then term is often used more generally to encompass not only
traditional high performance computing but also
high-throughput computing and various
machine learning paradigms.
The Deepthought clusters are generally designed for High Performance
Computing, and are referred to as such, but they also support High Throughput
Computing as well.
I
- Infiniband
- Infiniband is a data interconnect used by many high=performance computing (HPC)
systems because of its high bandwidth and low latency. These factors are important
in HPC systems because the different tasks in an MPI code typically
need to exchange significant amounts of data with each other. While more traditional
interconnects like ethernet can sometimes be competitive on bandwidth, they typically
have significantly more latency, which can degrade performance in many MPI based
codes.
See also:
J
- Job Priority
- Every job submitted to the scheduler has an associated priority
which determines the order in which the scheduler will try to run them.
While broadly speaking jobs are run in a first-come, first-served order,
there are exceptions. Jobs submitted to the debug partition get the highest
priority (i.e. the scheduler tries to run those first) --- these jobs have
highly restrictive time limits and a small dedicated pool of nodes, so this
generally allows users to get these jobs started quickly, to promote rapid
debugging of issues.
Most remaining jobs are run at standard priority --- these comprise the bulk
of the jobs on the cluster. Basically, unless you specifically request
otherwise (by either submitting to the debug partition above, or the scavenger
partition below), the job will run at standard priority. Typically jobs
submitted at this priority run in a first-come/first-serve fashion. However,
if the next job X in line to run is waiting on resources, the scheduler may
"backfill" some smaller jobs onto resources reserved for the job X
f such jobs would finish before the remaining resources needed to run job X
are expected to be ready.
Jobs submitted to the
scavenger partition run
at the lowest priority, and are subject to preemption,
but to compensate for that such jobs do not incur charges to the allocation
account.
See also:
K
- Kerberos
- Kerberos is an authentication system which makes use of
tickets to grant access to computer resources on the network.
Typcially, you will automatically get Kerberos tickets when you enter your
password to log into a Kerberos enabled system. These tickets have expiration
dates (typically on the order of a day), although they can be renewed.
Kerberos tickets for the basis for obtaining tokens to access the
Auristor file system which is used for the
SHELL medium term storage tier.
See also:
- kibibyte (kiB)
- A kibibyte is 1024 bytes, as distinct from
a kilobyte (kB) which is 1000 bytes. Because certain
parts of computer hardware (e.g. memory) typically tend to favor numbers
which are powers of 2, there is a need for binary units like kibibytes.
In the early days of personal computers, usage was sloppy, and many used
the term kilobyte to refer to 1024 bytes (even though the metric kilo-
prefix means 1000x), but this became problematic as sizes increased, with
each successive prefix adding about 2% to the difference between purely
decimal and purely binary units. Eventually the standards committees
settled on this system of binary units. This is about the size of the
text of Jabberwocky
or a small icon image. See also:
- kilobyte (kB)
- A kilobyte is 1000 bytes. Sometimes this term
is used for 1024 (2^10) bytes, but that is more properly referred to as a
kibibyte (kiB) as the metric prefix k is meant to be
a multiplier of exactly 1000. A kilobyte is about the size of the
text of Jabberwocky
or a small icon image.
- kSU
- Please see service unit (SU).
L
- Login node
- These are the nodes you can actually log into to submit jobs, edit files,
monitor your jobs, view the results of jobs, etc. They are
*not* intended to do number crunching or real computation.
If the cluster has distinct DTNs, then most file transfers (certainly all
transfer of large amounts of data) should be done on the DTNs. You should
not run any computationally intensive processes on the login nodes.
- Lustre
-
Lustre is a high-performance, parallel file system. It
was used to provide the scratch tier storage resource
on the
Deepthought2
high-performance computing cluster, as well as for the
Juggernaut cluster.
See also:
M
- Mebibyte (MiB)
- This is the binary unit corresponding to the megabyte,
with 1 MiB = 2^10 kiB = 1024 kiB = 2^20 B = 1024^2 B = 1,048,576 B.
This is approximately the size of the content of a typical
Harry Potter novel.
See also:
- Medium-term storage (SHELL storage tier)
- The SHELL storage tier is a medium-term storage tier provided on
the Zaratan
cluster. Unlike the scratch tier,
data on the medium-term storage cluster does not have to be needed by a job
which is running or about to run. However, it is not long-term or
archival storage --- the storage it not automatically backed up by
DIT, nor are there guarantees that the storage will last beyond the lifetime
of the Zaratan cluster (5 years or so). It is intended for the storage of
large amounts of data related to active research on the cluster, even if the
data is not related to jobs which are running or will be run in the near future.
Please note that the SHELL tier is not designed to the same level
of performance as the HPFS/scratch tier. To prevent users
from overloading the small number of fileservers forming this tier by having
many tasks from large parallel jobs
do large amounts of I/O to it, the SHELL tier is not accessible from
the compute nodes. It is accessible from the
login and data transfer nodes.
Furthermore, since the underyling filesystem is the
Auristor file system, it can be securely accessed
with appropriate credentials from any system with the appropriate file
client, even systems outside cluster.
- Megabyte (MB)
- This is the decimal unit corresponding to one million (10^6)
bytes.
This is approximately the size of the content of a typical
Harry Potter novel.
See also:
- Memory
- Just like people, when talking about computers memory
is the means by which computers and computer programs keep track of things
for future use. With computer systems, the term memory
typically refers to RAM, which is a specific type/family
of memory chips which provide relatively high speed access. Although both
memory and disk storage are usually measured in bytes,
disk storage is typically not considered memory (just like one normally
does not refer to notes in a notebook as a person's memory).
Note that GPUs typically have memory which is distinct
from the main system (CPU) memory, usually an order of
magnitude or so smaller.
- Message Passing Interface (MPI)
- The Message Passing Interface (MPI) is a standardized and portable
standard for communication between tasks in a
distributed memory parallel
job. It has become a de-facto standard for programs which can run across
multiple nodes.
There are a number of implementations of this interface. UMD HPC clusters
typically provide:
- For GNU compilers, OpenMPI: an open-source MPI implementation
- For Intel compilers, Intel MPI: Intel's implementation of MPI
Both implementations include wrappers for C, C++, and Fortran compilers,
as well as bindings for many scripting languages including Python and R.
See also:
- MPI Communicator
-
MPI jobs group the various processes
which comprise the job into objects called communicators. The default
communicator is MPI_COMM_WORLD which consists of all of the processes that
were in the job when it first started. For many jobs, all of the processes
are closely coupled, and that is the only communicator you need, but MPI
provides other mechanisms which are needed in some more complicated cases.
Each communicator has a size
, which is the number of processes
in the communicator. Every process in the communicator has an unique
rank
, which is an integer between 0 and the size - 1
of the communicator.
See also:
- MSU
- Please see service unit (SU).
- Multi-instance GPU
- Modern represent a significant amount of
computational power, and sometimes jobs cannot fully utilize a full GPU.
However, a single GPU cannot be split between jobs, so in this situation
the GPU goes underutilized. Certain models of NVidia GPUs support something
called
multi-instance GPU which allows one physical GPU to be split into
multiple, less powerful, virtual GPUs. These virtual GPUs can each be assigned
to different jobs, thereby allowing multiple jobs to share one of these
powerful GPUs, increasing utilitization of the device.
The A100
GPUs support this feature, and each physical can be split into various
smaller virtual devices, up to seven smaller units. We have currently divided
the A100s on some of the nodes into seven smaller virtual units:
GPU name in Slurm |
Description | >
GPU Compute units |
GPU Memory |
Cost per hour |
gpu:a100 |
Full physical A100 |
8 |
40 GB |
48 SU |
gpu:a100_1g.5gb |
One seventh of an A100 |
1 |
5 GB |
7 SU
N
- node
- In HPC terminology, we use the term
node
to refer to a complete computer system, comprising of at least one
CPU, memory, power supply, various buses,
networking devices and generally some local disk storage. This is the direct
analog to your laptop or workstation.
On the clusters, we have differentiated the nodes into various specializations:
See also:
O
- OpenACC
-
OpenACC is a programming standard designed for
parallel computing
heterogeneous CPU/GPU systems as
well as other accelerators.
See also:
- OpenCL
-
OpenCL is a programming framework for writing code to execute across
heterogeneous platforms of CPUs, GPUs,
and other hardware accelerators. It is an
open standard.
See also:
- OpenMP
- OpenMP is an application programming interface for
shared memory parallelism using
threads. (Should not be confused with
OpenMPI.)
OpenMP is implemented in the compiler, adding some pragmas to the language with
which you can instruct the compiler which sections of the code are
parallelizable and which must be run sequentially.
See also:
- OpenMPI
- OpenMPI is an open source implementation of the MPI
library.
P
- Parallel Computing
- Parallel computing is a type of computation wherein many calculations
are carried out simultaneously. This contrasts with
sequential computing wherein only a single calculation occurs at any
given time.
Obviously, a parallel computation should be faster than a sequential
computation. The speedup of an algorithm when
parallelized is the ratio of the runtime of the sequential code to that
of the parallelized code.
Although one might naively expect the speedup (compared to the
sequential calculation) of a computation parallelized
over N CPU cores would be N, this
is rarely the case. Typically, there are parts of an algorithm which cannot
take advantage of parallelization, causing the speedup to be less than N.
Generally, the addition of CPU cores will start having diminishing returns
at some point, depending on the algorithm.
See also:
- Parallel File System
- Please see High-Performance File System.
- Partition
- In HPC terminology, a partition is a set of
compute nodes. Clusters typically divide their nodes into partitions to
reflect different hardware availabilities, or to associate priority and
quality of service settings to jobs submitted to the partition.
The term "queue" is sometimes used synonymously with "partition" (because
in some sense the scheduler maintains a distinct
queue for each partition).
See also:
- Petabyte (PB)
- This is the decimal unit corresponding to one quadrillion (10^15)
bytes.
This is about the size of 2000 years of MP3-encoded music.
See also:
- Preemption of Jobs
- Most jobs on the cluster are submitted to the scheduler, wait in the
queue until resources are available, and then once the job is started, it
runs until it either completes, crashes, or is cancelled due to exceeding its
requested walltime. Even if a higher priority job is
submitted afterwards which "wants" the resources being used by the lower
priority job which already started, the running job is allowed to run to
its successful or unsuccessful completion, and the higher priority job is
forced to wait until it finishes (or other resources become available).
An exception to this is jobs submitted to the
scavenger partition. These jobs, in addition to be
the lowest priority, are also preemptible. This means that even after
the scavenger job starts, if another, higher priority job (and every
non-scavenger job is higher priority) "wants" the resources being used by the
scavenger job, the higher priority job can kick the scavenger job off of the
node(s) it is using, and take those node(s) for its own use. This is called
preemption. Typically, the preempted scavenger job is put back in
the queue, and will eventually start again (it is advisable that such
scavenger jobs do checkpointing so that they do
not start over from the very beginning but can make some progress towards
completion in these intermittent runs).
Again, normal jobs will not get preempted, only scavenger jobs.
Scavenger jobs are subject to preemption, but your
allocation account does not get charged for scavenger jobs.
See also:
- Priority of Jobs
- Please see job priority.
- Project
-
A project is a collection of
allocations for a research group, i.e. under a
single principal investigator (PI). In theory, a single project can
have allocations across multiple clusters. Usually a faculty member
will have a single project; although exceptions exist. If the faculty member
has requested use of the cluster for a class he/she is teaching, then that
class will be in a separate project. Also, faculty members who manage
college or departmental pools of HPC resources will see that as a distinct
project.
Every user of the cluster is associated with one or more
projects, and also with at least one of the
allocation accounts for the cluster beneath the
project.
Projects usually belong to research groups, although sometimes they belong
to a department or a subset of a research group. They provide a means of
organizing users and associating them with allocation accounts. Members of
the same project also share membership in an Unix group, which can be used
to facilitate sharing files (although by default your home and data directories
are restricted to your login account only, they are generally group-owned by
the project group, and you can use chmod
to allow group members
to read).
See also:
Q
- Quarter (of a year)
- This represents one fourth of a year. Some resources are meted out on a
quarterly basis. This is generally to encourage the spreading out of the
use of the resource over time. Quarters are labelled Q1, Q2, Q3, and Q4,
starting on 1 Jan, 1 Apr, 1 Jul, and 1 Oct, respectively.
- Qubit
- A qubit (from
qu
antum bit) is the smallest unit
of information in a quantum computer system. When observed/read out
at the end of a calculation, it returns a binary value (typically denoted
as 0 or 1) just like a classical , but during processing
it can be a coherent superposition of both of those states simultaneously,
thereby containing much more information.
See also:
- Glossary entry for bit
- Wikipedia entry for qubit
- Queue
- In Britain, a queue is a line of people, etc. waiting their turn
for something. This definition is borrowed in HPC to
refer to the list of jobs that are waiting to be allocated
resources and started by the scheduler.
When you submit a job to the cluster and then check its status, unless you
are lucky enough to have submitted it when the cluster is very idle, you will
see that the job is in a "pending" state --- i.e. the job is waiting in the
queue.
The amount of time a job spends in the queue depends on how busy the cluster is,
how many other jobs are in the queue, how many resources your job is requesting
and for how long, the priority of the job, and many
other factors. A sizable production job requesting a day or more of runtime
can easily spend a couple of hours in the queue. Very short jobs (under 15
minutes) should look into the
debug partition which generally has much lesser
wait times.
In some sense, the different partitions each have
their own queue, and so sometimes the term queue is also used
to refer to a partition. Some
schedulers tend to use the term queue in that fashion more than others.
See also:
R
- RAM
-
RAM is an acronym for Random-access memory. This is particular form of
computer memory which can be both read and written,
and for which the access times are basically irrespective of the location
of the data within the memory. Most typically, this takes the form of
dynamic random access memory (DRAM), which is "volatile", meaning that
the stored information is lost if power is removed from the chip.
RAM is typically where the machine code for the program typically resides
when it is being run, along with all of the data that the program is
actiing on, although in both cases this can be supplemented by disk storage
as well. For static data like the code, the system just overwrites the
memory storing the code when needed, and when the code is needed reads it
back from the program file on disk. For general data, the program can
write out chunks of memory to a swap area on the disk, overwrite the memory
thus freed for a while, and read the memory back from disk when needed.
In either case (but especially the latter case), this can significantly
degrade performance, as disks are much slower than RAM, and so high
performance codes typically wish to avoid swapping to disk.
See also:
S
- Scavenger partition
-
- The scavenger partition is a special
partition that allows users to submit very low priority
partition is a special
partition that allows users to submit very low priority
job that is subject to preemption,
but in return does not incur any charges against the
allocation account.
Because scavenger jobs do not incur charges against the allocation account,
these are an useful mechanism to run jobs in excess of your SU
allotment. But because of the preemption, you are strongly encouraged to
make use of checkpointing in order that your job
can make progress during its piecemeal run time.
- Scheduler
- The scheduler is a critical component of an HPC
environment. It is a complex piece of code responsible for allocating
resources to the various jobs in the
queue in a timely fashion.
The scheduler is in some ways analagous to the host or hostess at a restaurant,
with the restaurant customers being like jobs, the tables being like the
compute resources, and the waiting list being like the queue. If the
restaurant is not busy, a new customer can just come in and the host/hostess
will seat them almost immediately, just like when the cluster is idle the
scheduler will allocate resources to a new job almost immediately. When things
are busy, the host/hostess will place customers on a waiting list, and
customers will be seated in roughly the order they came in. But there are
exceptions --- some parties are too big for some tables or might have special
requests (e.g. indoor vs outdoor seating, etc), so sometimes smaller or less
particular parties can jump ahead in the queue.
In the same way, the scheduler tries to allocate resources to jobs roughly
in the order they came in, but modified because the jobs have different sizes
(from single cores to thousands of cores) and can requestaGPUs or large
amounts of memory. Also, unlike most restaurants, the scheduler will allow in
some cases for multiple jobs to be placed on the same table. As an added
twist, jobs are required to specify the maximum amount of time for which they
will run, and the scheduler can use this to backfill
jobs.
There are many different schedulers available for HPC clusters. At UMD, we
have been using the Slurm scheduler since 2014. Slurm is an
open source product used at many clusters in the world, including most of
the TOP500 systems.
See also:
- Secure copy protocol (scp)
- The secure copy protocol and command (
scp
) are a
protocol and command for transfering files from one computer system
to another over the network. The secure in the name comes
from the fact that the authentication and the transferred file data
are both encrypted over the wire, thereby protecting the data from
wire tapping attacks.
- Scratch file system
- Please see the High-Performance File system
- Sequential Job or Process
- A sequential job or process is a job or process which does not do any
form of
parallelism, i.e. the code processes one instruction,
then the next, and so on. These are also referred to as
single core jobs or processes, as they can
efficiently run on a single CPU core.
Service node
Service node is a catch-all phrase for any of the
nodes in the cluster which provide various services
needed to keep the cluster running. Generally you will never directly
run processes on these nodes, but they provide essential services.
E.g. the scheduler, file servers, web servers, etc.
SHELL file system
Please see Medium-term storage
Solid-state drive
Solid-state drives(SSDs) are a form of mass storage which use integrated
circuits for the actual storage instead of spinning magnetic media. These
devices are of interest in HPC applications primarily because they have
much higher input and output speeds and much lower latency than traditional
spinning media. Solid state drives can either use traditional
hard disk drive
(HDD) interfaces like
SAS
or SATA, but they can
also use newer interfaces like
NVMe which can better
support the parallelism available in modern SSDs and therefore even further
improve performance. The downside is that currently SSDs are more expensive
per storage unit than traditional HDDs.
Secure Shell Protocol (SSH)
The Secure Shell Protocol is a cryptographic network protocol for
communicating between systems. It is also the name of the most common
remote login client application ssh
using this protocol.
Typically one uses the ssh
command to open a command line
terminal session on a remote system.
This client application is usually
installed by default (as ssh
) on
circuits for the actual storage instead of spinning magnetic media. These
devices are of interest in HPC applications primarily because they have
much higher input and output speeds and much lower latency than traditional
spinning media. Solid state drives can either use traditional
hard disk drive
(HDD) interfaces like
SAS
or SATA, but they can
also use newer interfaces like
NVMe which can better
support the parallelism available in modern SSDs and therefore even further
improve performance. The downside is that currently SSDs are more expensive
per storage unit than traditional HDDs.
SSH public key authentication
Solid-state drivesa(SSDs) are a form of mass storage which use integrated
circuits for the actual storage instead of spinning magnetic media. These
devices are of interest in HPC applications primarily because they have
much higher input and output speeds and much lower latency than traditional
spinning media. Solid state drives can either use traditional
hard disk drive
(HDD) interfaces like
SAS
or SATA, but they can
also use newer interfaces like
NVMe which can better
support the parallelism available in modern SSDs and therefore even further
improve performance. The downside is that currently SSDs are more expensive
per storage unit than traditional HDDs.
Striping (of a file)
High-performance file systems (like
BeeGFS or Lustre)
typically achieve their high performance by striping
the data in a file across multiple file servers, or at least
disk arrays on a file server. I.e., different portions or
chunks of a file will get written to different disk
arrays, and usually to disk arrays on different file servers.
Generally, there is a chunk size and a number of
stripes set for a file (either by default or specified
by the user), and the system will assign the file to a number
of disk arrays and/or file servers equal to the number of
stripes, and write the first chunk size amount of data to
the first such disk array, and when that is done write the
second chunk size amount of data to the second, and so forth.
After the chunk is written to the last of the assigned disk
arrays, things cycle back around to the first again.
This spreads out the I/O load among multiple disks, and usually
across multiple servers, so in theory the maximum I/O rate
is the sum of the I/O rates for the individual disk arrays.
Of course, in practice the data rates do not quite reach the
theoretical maximums, but process does allow for substantial
increases in I/O throughput.
See also:
Service Unit (SU)
A service unit (commonly abbreviated SU) is the basic unit used
in HPC clusters to measure an amount of computation. It
is basically defined as 1 SU = 1 walltime hour of use of a single CPU core,
although this definition can be tweaked somewhat. On clusters with many
different CPU models, an additional factor maybe included to account for the
different computational power of the different CPU models; e.g. often the
above formula only holds for the slowest CPU model on the cluster, and more
powerful CPU cores will have an additional factor (>1) applied to normalize.
E.g., if a job takes one hour on the slower core (so consumes 1 SU), and only
takes 30 minutes on the faster core, the faster core might have a factor of 2
added in the above formula (so the SU cost on that core would be twice the
number of CPU hours, or 2 * 1 core * 0.5 hour = 1 SU, so the job costs the
same on either core).
Similarly, the above formula will be tweaked to account for other
resources. E.g., memory is a restricted resource,
and so a job which is using twice the average memory per CPU core might be
charged as if twice as many cores were in use.
GPUs are a valuable resource, typically with each GPU having thousands of
cores. (E.g. an Nvidia A100 GPU has almost 7000 CUDA
cores and over 400 Tensor cores.) Although each GPU core is less powerful
than a single CPU core, the sheer
number of such cores makes the GPUs very powerful for some alorithms. We
do not charge 1 SU for each GPU core used for 1 hour, but
we do charge significantly more for the use of an entire GPU for 1 hour than
we do for a single CPU core. The exact factor depends on the
GPU being used.
Generally, your job will be charged an hourly SU rate given by the maximum of:
job (in hours) times the
- 1 SU/hour times the number of CPU cores allocated to the job
- 0.25 SU/hour per GB times the total the amount of memory allocated to the job (in GB)
for jobs in the standard or GPU partitions. For jobs in the bigmem partition, this
value is 0.0625 SU/hour per GB.
- 7 SU/hour per GPU times the number of a100_1g.5g GPUs allocated to the job
- 48 SU/hour per GPU times the number of A100 GPUs allocated to the job
- 144 SU/hour per GPU times the number of H100 GPUs allocated to the job
For example, a job requesting 6 CPU cores, no GPUs, and 8 GB/CPU-core would
be charged an hourly rate of 12 SU/hour (in this case, the total memory
allocated would be 6 CPU * 8 GB/CPU = 48 GB, and the maximal value in the
list above would be 48 GB * 0.25 SU/hr/GB = 12 SU/hour). Similarly, a job
requesting two A100 GPUs along with 64 CPU cores and 4 GB/CPU core would
be dominated by the GPU costs, and result in 48 SU/hour/GPU * 2 GPU = 96 SU/hour.
This rate is multiplied by the actual walltime the job used (in hours)
to calculate the SU cost of the job. Note that the actual walltime is
used; if your job requested 24 hours of walltime but completed in 20 hours,
your job is charged for 20 hours, not 24 hours. Note also that the hourly
SU rate is based on the amount of resources allocated to the job, not
the resources actually used. E.g., if you submit a single core job
requesting 8 GB or RAM and no GPUs, you will be charged for 8 GB of RAM
whether your job uses all 8 GB or not --- these resources are reserved
for your job (and therefore unavailable to other users) for the duration
of the job and so you will be charged for them. In particular, submitting
a job in exclusive mode (which disallows any other jobs from sharing the
node with your job) will result in your being charged for all the resource
on the node whether you are using them or not.
Compared to the amount of computation done on the cluster, an SU is a fairly
small unit, so one often talks in terms of kSU or even MSU, where
1 kSU = 1000 SU and 1 MSU = 1,000,000 SU.
See also:
Shared memory parallelism
Shared memory parallelism is a paradigm for
parallel computing in which
all of the parallel processes share a global, shared memory address space
which they use to communicate among each other. It contrasts with
distributed memory parallelism
techniques where no memory space is shared among all of the processes.
Shared memory parallelism is typically implemented by
threads using
OpenMP or Threading Building Blocks.
Note: in order to use this paradigm, all of the
processes using shared memory parallelism must be running on the
same node, otherwise they would not be able to share
memory.
See also:
SHELL filesystem
Please see medium term file system (SHELL).
Speedup
The speedup of a code when parallelized
is a measurement of how much
performance improvement one obtains when parallelizing a job. It is given
by the ratio of the runtime for a sequential
version of the algorithm and the parallelized
computation.
I.e., if the sequential code takes T_seq seconds, and the
parallel code takes T_par seconds, the speedup is given by
S = T_seq/T_par.
See also:
SU
Please see service unit
T
- task
- MPI jobs consist of a number of tasks, which is the
smallest unit of execution within MPI. All processes, etc. for a given
task run on the same node (other tasks can also be on the same or
different nodes, but the processes for a given task will not be split
across multiple nodes), and each task
uses the MPI for communication with the other tasks, so therefore the
tasks can all be running on different nodes (although
they can also run on the same node).
In a pure MPI job, the tasks are sequential, and
so the job will use a number of cores equal to the number
of tasks.
Hybrid jobs can also exist, in which each MPI task uses a
shared memory parallelism
technique (OpenMP or other
threading paradigms) for additional parallelism. In
this case, each task needs multiple CPU cores on the same
node. Although all cores for a task will always be
assigned on the same node, the default is one CPU core per task, so you
will need to specify the number
of CPU cores per task.
- Tebibyte (TiB)
- This is the binary unit corresponding to one
terabyte (TB) or roughly one trillion
bytes (B). It is
equivalent to 1 TiB = 2^40 B = 1024^4 B = 1024^3 kiB = 1099511627776 B ~ 1.1 TB.
Note that on this scale, the difference between the binary (TiB) and decimal
(TB) units is almost 10%.
See also:
- Terabyte (TB)
- This is the decimal unit corresponding to one trillion (10^12)
bytes.
The largest consumer hard drives available in 2007 were only 1 TB, and even
in 2022 most laptops only have drives in the 1/4 to 1 TB range.
See also:
- thread
- A thread, often called a "lightweight process", is a thread of execution,
which is the smallest sequence of instructions which the operating system
can manage.
From an HPC perspective, a multi-threaded process can
have multiple threads of execution running simultaneously, assuming that
the node the job is running on has available cores
for each thread. Technically, a multi-threaded process can run on fewer
cores than threads, but in that case you will not get full advantage of the
parallelism.
Generally for HPC purposes, you want a separate CPU core for
each thread, to get the maximum speedup.
See also:
- Threading Building Blocks
- Threading Building Blocks (aka TBB or oneTBB) is a C++ template library
developed by Intel for shared memory
parallelism.
See also:
U
V
W
- Walltime
- The term walltime refers to the amount of time that the job actually run.
The "wall" in the term means that this is normal, everyday time you are familiar
with, as for example might be measured by a clock on the wall. (This is to
distinguish from other, more specialized "times" sometimes used in refering
to processes running on a computer system.) I.e., a job which starts to run
at 1:05 PM and finishes at 3:27 PM on the same day will have a walltime of
3:27 - 1:05 = 2:22 or 2.37 hours.
X
Y
Z
Back to Top