Partition name | Max wallclock | Notes |
---|---|---|
standard | 7 days | All jobs that have no other special requirements |
debug | 15 minutes | Short test/debug jobs |
gpu | 7 days | Jobs requiring the use of GPUs. Further requirements are described here. |
bigmem | 7 days | Jobs requiring large amounts of memory (more than 512GB, up to 2TB) |
serial | 14 days | Intended for single node jobs. These nodes do not have infiniband, and so multinode jobs are likely to not perform well on them. But they have longer timelimits, and many have extra memory. |
scavenger | 14 days | Low priority jobs. Free, but can be terminated by standard jobs |
Jobs in the low-priority scavenger partition run at the lowest possible priority and are pre-emptible: even once such a job has started, if another job comes along and needs the resources it is using, the scavenger job will be terminated and put back in the queue. Your account is not charged for jobs in the scavenger queue, but to make good use of it your job needs to be able to checkpoint itself, so that it can make progress in the slices of time it gets between other jobs. This partition used to be referred to (inappropriately) as the serial queue; it is hoped the new name better reflects its purpose, which is to allow jobs that are able to do so to scavenge free CPU cycles wherever they can.
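As a rough sketch of what such a job might look like (the program name my_solver, its --resume-from option, and checkpoint.dat are hypothetical placeholders for whatever checkpoint/restart mechanism your own code provides), a scavenger job can be submitted with Slurm's --requeue option so that it is eligible to go back into the queue after being preempted, and can then resume from its last checkpoint each time it restarts:

```bash
#!/bin/bash
#SBATCH --partition=scavenger   # low-priority, pre-emptible partition
#SBATCH --requeue               # make the job eligible for requeue if it is preempted
#SBATCH --time=14-00:00:00      # scavenger limit from the table above
#SBATCH --ntasks=1

# Resume from the most recent checkpoint if one exists, otherwise start fresh.
# (my_solver and checkpoint.dat are placeholders for your own application.)
if [ -f checkpoint.dat ]; then
    ./my_solver --resume-from checkpoint.dat
else
    ./my_solver
fi
```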
See elsewhere for instructions on specifying a partition to run in. The exception is jobs in the scavenger partition, which have no fixed size or wallclock limit, but are pre-emptible and will be killed whenever anyone else needs the node. In all other cases, the scheduler will submit your job to the default standard partition; if you do not have special requirements, there is no need to specify a partition.
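As a generic Slurm illustration (see the instructions referenced above for site-specific details), a partition can be requested either on the sbatch command line or with a directive inside the job script:

```bash
# On the command line:
sbatch --partition=debug myjob.sh

# Or, equivalently, near the top of myjob.sh itself:
#   #SBATCH --partition=debug
```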
Note that these walltime limits are quite generous compared to those at many other universities' HPC clusters and similar facilities. A quick (not quite random) sampling from a Google search yields:
The scheduler is a process running on the head node which determines when and where jobs will run. It is responsible for seeing that your job gets the resources it requested so it can run, and for doing so in a manner that tries to get everyone's jobs scheduled and running in a reasonable amount of time. The following is a simplified overview of the scheduling process. There is a lot of complexity to the problem, but this overview should give you a basic understanding and help you see why specifying realistic requirements for your job will reduce the amount of time it spends in the queue waiting to be scheduled.
Jobs submitted by sbatch, etc., get placed into a queue, and the scheduler periodically checks the list of jobs in the queue, trying to find resources for them so they can run. Even if the cluster is lightly loaded and there are no other jobs in the queue, this might take a minute or two, but since jobs on the HPC typically run for hours this is a minor overhead. If the cluster is heavily loaded, jobs might spend hours or even days in the queued state.
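The queue can be inspected at any time with standard Slurm commands (squeue's output format varies a little between sites):

```bash
# Submit a job script; sbatch prints the job ID and returns immediately.
sbatch myjob.sh

# List your own jobs and their current state (PD = pending, R = running).
squeue -u $USER

# Show the scheduler's current estimate of when pending jobs will start.
squeue -u $USER --start
```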
The scheduler basically goes through the list in a FIFO (first in, first
out) fashion, that is, jobs are more or less processed in the order in which
they are submitted. But this is only a first approximation. Jobs will have
differing priorities; jobs submitted via high-priority allocations (e.g. allocations whose names end with -hi; see elsewhere for more information on high-priority accounts) run at a higher priority than jobs charged against standard-priority allocations. Jobs submitted to the debug partition also run at a higher priority than normal, since these are short jobs and that partition is for people trying to debug things.
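If you want to see how these factors combine for your own jobs, Slurm provides the sprio utility, which breaks a pending job's priority into its components (age, fairshare, partition, QOS, and so on); the exact columns depend on the site's configuration:

```bash
# Priority components for all of your pending jobs:
sprio -u $USER

# Or for a single job (123456 is a placeholder job ID):
sprio -j 123456
```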
Jobs also request varying resources. The scheduler does not know what resources your job actually requires; it only knows what you requested. If you do not request enough of a resource, e.g. memory, your job may start but then fail at some point, because the scheduler only gave it the amount that was requested. (If you do not specify a memory requirement at all, the scheduler assumes it does not need to worry about memory, and that whatever resources it assigns to your job will meet your memory needs.) If you do specify such requirements, the scheduler will only assign your job resources meeting your specifications, and if such resources are not currently available, other jobs that were submitted after yours might get scheduled before yours.
For example, the Zaratan cluster has a small number of large-memory (2 TB) nodes. If you specify in your job that you need that much memory, the scheduler will only assign one of those nodes to your job; if all of those nodes are in use, your job cannot be scheduled until one of them becomes free. Other jobs without such strict memory requirements can still run, however, and might be scheduled before your job (on nodes with insufficient memory for your job) even though they are of lower priority and/or were submitted after your job.
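As a rough sketch (the numbers and the program name my_analysis are placeholders, not site recommendations), the memory request is just another #SBATCH directive, and keeping it realistic widens the set of nodes your job can land on:

```bash
#!/bin/bash
#SBATCH --time=02:00:00
#SBATCH --ntasks=1
#SBATCH --mem=64G    # 64 GB per node: a request that many nodes can satisfy.
# A request only the 2 TB nodes can satisfy (e.g. --mem=1500G) restricts the
# job to those few nodes, and it will wait until one of them is free.

./my_analysis input.dat
```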
There is also the question of whether multiple jobs can share a node or not. By default on the Zaratan cluster, jobs can share nodes with other jobs. You can explicitly control this with the --exclusive and --oversubscribe flags, but the defaults are not unreasonable in most cases. This choice can affect how long it takes to schedule a job: if all the nodes in the cluster are in use, a job with exclusive mode set will find no nodes available to it, even though most of those nodes are running jobs with oversubscribe mode set. A job willing to share, on the other hand, might be able to run on one of the shared nodes if sufficient resources there are not being used.
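In Slurm terms (shown on the sbatch command line; the same options can be given as #SBATCH directives), the two modes look roughly like this:

```bash
# Ask for whole nodes: nothing else will run alongside this job, which can
# mean a longer wait for a completely free node.
sbatch --exclusive myjob.sh

# Explicitly mark the job as willing to share node resources with other jobs.
sbatch --oversubscribe myjob.sh
```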
The scheduler might also keep nodes in reserve for a large job. If the job at the top of the queue requires a large number of nodes, e.g. 50, chances are that there are not 50 idle nodes when it reaches the top of the queue. So as nodes become free, if they would be suitable for the large job to run on, the scheduler will earmark them for the large job. These nodes might be kept idle even though there are other jobs, behind the large job in the queue, which could run immediately on them. This is necessary, or else the large job would never run, because every time a few nodes were freed, a smaller job would gobble them up.
There is an exception to the above: the scheduler knows the walltime requested for every job. So let's return to our 50-node job, and say the scheduler has 30 nodes earmarked for it. All the other nodes that could be used by the large job currently have other jobs running on them, but looking at the walltimes of those jobs, the scheduler might compute that the next 20 nodes (to complete the set of 50 needed by the job) will be available in 6 hours. If there are jobs with a requested walltime under 6 hours that can make use of the earmarked nodes, the scheduler can let them use those nodes, since the nodes will be idle again BEFORE the large 50-node job can make use of them. This is referred to as backfill. The large 50-node job is not delayed any longer than necessary, but the smaller, shorter jobs run ahead of when they otherwise would have: a "win-win" situation. If there are not enough smaller, shorter jobs to fill those windows, the scheduler can also throw scavenger partition jobs into them, since it can kill those jobs at any time.
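You cannot request backfill explicitly, but you can see the scheduler's reasoning in the start-time estimates it computes from everyone's requested walltimes; for example (both are standard Slurm commands, and the estimates change as the queue changes):

```bash
# Validate a job script and print the scheduler's current estimate of when it
# would start, without actually submitting it:
sbatch --test-only myjob.sh

# Show estimated start times for jobs already pending in the queue:
squeue -u $USER --start
```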
Again, this is another reason why specifying accurate resource requirements
will help your job get scheduled more quickly. If your job should take about
4 hours to run, and so you request 5 hours so that there is some buffer
(since the scheduler will terminate your job as soon as its walltime
runs out, even if it was 99.99% finished), then your job could be considered
for that 6 hour window when the scheduler is trying to schedule the 50 node
job in the previous example. But if you said "I may as well request 8 hours
since that is the limit on the QoS," the scheduler is just going by what you
requested, and your job will NOT be considered for backfill in a 6 hour
window. If your job really needs around 8 hours, that is one thing; but if you are simply being inaccurate in your request, your job will spend needless time waiting in the pending queue.
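Continuing the example above (the time value is illustrative, and my_simulation is a placeholder for your own program), a realistic walltime request is just one more #SBATCH directive:

```bash
#!/bin/bash
#SBATCH --time=05:00:00   # job usually finishes in ~4 hours; 5 hours leaves a buffer
#SBATCH --ntasks=1

./my_simulation
```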