Queues and such

Partitions
Summarizing
How jobs waiting in the Queue get processed

Partitions

The Zaratan cluster provides a number of partitions to group nodes by function or to provide special capabilities. Different partitions may also allow different maximum job lengths. The partitions available on Zaratan include:

Partition name	Max wallclock	Notes
standard	7 days	All jobs that have no other special requirements
debug	15 minutes	Short test/debug jobs
gpu	7 days	Jobs requiring the use of GPUs. Further requirements are described here.
bigmem	7 days	Jobs requiring large amounts of memory (more than 512GB, up to 2TB)
serial	14 days	Intended for single-node jobs. These nodes do not have InfiniBand, so multi-node jobs are unlikely to perform well on them. However, they have longer time limits, and many have extra memory.
scavenger	14 days	Low priority jobs. Free, but can be terminated by standard jobs

Jobs in the low-priority scavenger partition run at the lowest possible priority and are preemptible. That is, even after a scavenger job has started, if another job comes along and needs the resources it is using, the scavenger job will be terminated and put back in the queue. Your account is not charged for jobs in the scavenger partition, but to make good use of it your job must be able to checkpoint itself, so that it can make progress in the slices of time it gets between other jobs. This partition used to be referred to (inappropriately) as the serial queue; it is hoped that the new name better reflects its purpose, which is to let jobs that are able to do so scavenge free CPU cycles.
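As a rough illustration of what "able to checkpoint itself" can mean, here is a minimal Python sketch. It is not the cluster's recommended method; the file name, the state layout, and the stop_after parameter (which only simulates preemption) are assumptions made for the example. The idea is simply to save progress after each unit of work and to resume from the saved state when the job is restarted.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write to a temporary file and rename it, so that being killed
    in the middle of a write cannot leave a corrupt checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def run(path, n_steps, stop_after=None):
    """Do up to n_steps units of work, resuming from the checkpoint at path.

    stop_after is hypothetical: it simulates preemption by returning
    early after that many steps in this invocation.
    """
    state = load_checkpoint(path)
    done_now = 0
    while state["next_step"] < n_steps:
        if stop_after is not None and done_now >= stop_after:
            return state  # simulated preemption
        state["total"] += state["next_step"]  # stand-in for real work
        state["next_step"] += 1
        save_checkpoint(path, state)
        done_now += 1
    return state
```

Calling run("ckpt.json", 10, stop_after=4) and then run("ckpt.json", 10) finishes the remaining six steps without repeating the first four. A real job would checkpoint less often than every step if writing the state is expensive.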

See elsewhere for instructions on specifying a partition to run in.

If you do not specify a partition, the scheduler will submit your job to the default standard partition. If you have no special requirements, there is no need to specify one.

Summarizing

Experience has shown that jobs using more than about 50% of the cluster tend to negatively impact other users, particularly with respect to wait times. While the Allocations and Advisory Committee (AAC) tries to encourage highly parallel jobs on the cluster, it also has a responsibility to the cluster's other users. To balance the needs of all users, the AAC now requires explicit prior approval for jobs consuming more than 50% of the cores of the cluster. If you need to run jobs that require more than half of the CPU cores in the cluster, you must request and receive approval from the AAC to get access to the verywide QoS. Restrictions on wall time, number of jobs, etc. will depend upon the agreement reached with the AAC. When making your request, please include the allocation account you wish to use and explain why such wide jobs are needed for your research.
Whether you belong to a contributing group or not, you can also submit jobs to the scavenger partition. Such jobs have no fixed size or wallclock limit, but they are preemptible and will be killed whenever anyone else needs the node.

Note that these walltime limits are quite generous compared to those of many HPC clusters at other universities. A quick (not quite random) sampling from a Google search yields:

New York University: 4 days maximum
University of Southern California: 2 weeks for 1 node, otherwise 1 day.
Penn State: 2 weeks for up to 32 cores (contributors), 4 days for up to 256 cores otherwise
UMBC: 5 days
TACC: Stampede: 2 days
TACC: Lonestar: 1 day
Princeton: Della: 6 days
Princeton: Hecate: 15 days

How jobs waiting in the Queue get processed

The scheduler is a process running on the head node which determines when and where jobs will run. It is responsible for seeing that your job gets the resources it requested so that it can run, and for doing so in a manner which tries to get everyone's jobs scheduled and running in a reasonable amount of time. The following is a simplified overview of the scheduling process. The problem is a complex one, but this overview should give you a basic understanding and help you see why specifying realistic requirements for your job reduces the time it spends in the queue waiting to be scheduled.

Jobs submitted by sbatch, etc. get placed into a queue, and the scheduler periodically checks the list of jobs in the queue, trying to find resources for them so they can run. Even if the cluster is lightly loaded and there are no other jobs in the queue, this might take a minute or two, but since jobs on the HPC cluster typically run for hours this is a minor overhead. If the cluster is heavily loaded, jobs might spend hours or even days in the queued state.

The scheduler basically goes through the list in FIFO (first in, first out) fashion; that is, jobs are processed more or less in the order in which they were submitted. But this is only a first approximation. Jobs have differing priorities: jobs submitted via high-priority allocations (e.g. allocations whose names end with -hi; see the documentation on high-priority accounts for more information) run at a higher priority than jobs charged against standard-priority allocations. Jobs submitted to the debug partition also run at a higher priority than normal, since they are short and that partition exists for people who are debugging.
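The "FIFO, adjusted by priority" idea can be sketched in a few lines of Python. This is only a toy model: the PendingJob class, the job names, and the numeric priority values are all made up for illustration, and the real scheduler weighs more factors than these two.

```python
from dataclasses import dataclass


@dataclass
class PendingJob:
    name: str
    submit_order: int  # lower means submitted earlier
    priority: int      # higher means considered sooner (illustrative values)


def scheduling_order(jobs):
    """Sort by priority first (highest first), then by submission order."""
    return sorted(jobs, key=lambda j: (-j.priority, j.submit_order))


if __name__ == "__main__":
    queue = [
        PendingJob("std_early", submit_order=1, priority=0),
        PendingJob("hi_alloc", submit_order=2, priority=10),
        PendingJob("debug_test", submit_order=3, priority=10),
        PendingJob("std_late", submit_order=4, priority=0),
    ]
    for job in scheduling_order(queue):
        print(job.name)
```

In this toy queue the two higher-priority jobs are considered before both standard jobs, and within each priority level the jobs keep their submission order.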

Jobs also have varying resources. The scheduler does not know what resources your job actually requires; it only knows what you requested. If you do not request enough of a resource, e.g. memory, it is possible that your job will run, but will fail at some point because the scheduler only gave the job the amount of memory that was requested. (If you do not specify an amount of memory that is required, the scheduler will assume that it does not need to worry about it and whatever resources it assigns to your job will meet your memory needs.) But if you specify such resources, the scheduler will only assign you resources meeting your specifications. And if such resources are not currently available, other jobs that were submitted after yours might get scheduled before yours.

For example, the Zaratan cluster has a small number of large (2 TB) memory nodes. If you specify in your job that you need that much memory, the scheduler will only assign one of those nodes to your job. But if all of those nodes are in use, your job cannot be scheduled until one of them becomes free. Meanwhile, other jobs which do not have such strict memory requirements can still run, and might be scheduled before your job (on nodes with insufficient memory for your job) even though they are of lower priority and/or were submitted after your job was.
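The effect described above can be mimicked with a small Python sketch. It is a deliberate oversimplification: each job takes a whole node, priorities and reservations are ignored, and the job names and memory sizes are invented for the example.

```python
def schedule_pass(queue, free_nodes_mem_gb):
    """Walk the queue in order and start each job on the first free node
    that has at least the requested memory. Jobs that fit nowhere wait.

    queue: list of (job_name, requested_mem_gb), oldest first
    free_nodes_mem_gb: memory (GB) of each currently free node
    Returns (started, waiting) as lists of job names.
    """
    free = list(free_nodes_mem_gb)
    started, waiting = [], []
    for name, mem_gb in queue:
        node = next((i for i, m in enumerate(free) if m >= mem_gb), None)
        if node is None:
            waiting.append(name)  # no free node is big enough
        else:
            free.pop(node)        # simplification: job takes the whole node
            started.append(name)
    return started, waiting


if __name__ == "__main__":
    # All large-memory nodes are busy; only two smaller nodes are free.
    queue = [("needs_2tb", 2000), ("small_a", 64), ("small_b", 128)]
    started, waiting = schedule_pass(queue, [512, 512])
    print("started:", started)
    print("waiting:", waiting)
```

Here the job asking for 2000 GB stays in the queue while the two smaller jobs submitted after it start on the free nodes.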

There is also the question of whether multiple jobs can share a node or not. By default on the Zaratan cluster, jobs can share nodes with other jobs. You can explicitly control this with the --exclusive and --oversubscribe flags, but the defaults are not unreasonable in most cases. This choice can affect how long it takes to schedule a job, however. If every node in the cluster is in use, a job with exclusive mode set will have no nodes available to it, even if most of those nodes are running jobs with oversubscribe mode set. A job with shared mode set, by contrast, might be able to run on one of the shared nodes if enough resources remain unused there.

The scheduler might also keep nodes in reserve for a large job. If the job at the top of the queue requires a large number of nodes (e.g. 50), chances are that there are not 50 nodes idle when it reaches the top of the queue. So as nodes become free, if they would be suitable for the large job to run on, the scheduler will earmark them for the large job. These nodes might therefore be kept idle even though there are other jobs, behind the large job in the queue, which could run immediately on them. This is required, or else the large job would never run, because every time a few nodes were freed a narrower job would gobble them up.

There is an exception to the above: the scheduler knows the walltime requested for all jobs. So let's return to our 50 node job, and let's say the scheduler has 30 nodes earmarked for it. All the other nodes that could be used by the large job currently have other jobs running on them. But looking at the walltimes of those jobs, the scheduler might compute that the next 20 nodes (to complete the set of 50 needed by the job) will be available in 6 hours. If there are jobs with a walltime under 6 hours which can make use of the earmarked nodes, the scheduler can let them do so, since those nodes will be idle again BEFORE the large 50 node job can make use of them. This is referred to as backfill. This way the large 50 node job is not delayed longer than necessary, but the smaller, shorter jobs can run ahead of when they otherwise would have. So it is a "win-win" situation. If there are not enough smaller, shorter jobs to make use of those windows, the scheduler can throw in scavenger partition jobs as well, since it can kill those jobs at any time.
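How the scheduler might arrive at a figure like "6 hours" can be sketched as follows. This is an assumption about the arithmetic, not a description of the scheduler's actual code: it takes the remaining requested walltime on each busy node and finds when enough of them will have run out. The function name and the sample numbers are made up.

```python
def hours_until_nodes_free(remaining_walltime_h, nodes_needed):
    """Worst-case wait until nodes_needed busy nodes have become free.

    remaining_walltime_h: one entry per busy node, giving the hours left
    on the requested walltime of the job running there.
    Jobs may finish early, so the real wait can be shorter.
    """
    if nodes_needed <= 0:
        return 0.0
    if nodes_needed > len(remaining_walltime_h):
        raise ValueError("not enough busy nodes to ever satisfy the request")
    # The n-th node to free up is the one with the n-th smallest remaining time.
    return float(sorted(remaining_walltime_h)[nodes_needed - 1])


if __name__ == "__main__":
    # 30 nodes are already earmarked; 20 more are needed.
    busy = [2] * 10 + [6] * 10 + [12] * 15
    print(hours_until_nodes_free(busy, 20))  # prints 6.0
```

With ten nodes freeing up within 2 hours and ten more within 6 hours, the twentieth node becomes available after at most 6 hours, which is the size of the backfill window for the earmarked nodes.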

Again, this is another reason why specifying accurate resource requirements will help your job get scheduled more quickly. If your job should take about 4 hours to run, and you request 5 hours so that there is some buffer (since the scheduler will terminate your job as soon as its walltime runs out, even if it was 99.99% finished), then your job could be considered for that 6 hour window when the scheduler is trying to schedule the 50 node job in the previous example. But if you said "I may as well request 8 hours since that is the limit on the QoS," the scheduler is just going by what you requested, and your job will NOT be considered for backfill in a 6 hour window. If your job really needs around 8 hours, that is one thing, but if your request is simply inaccurate, your job will spend needless time pending in the queue.
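The 5 hour versus 8 hour comparison can be written as a simple check. This is a hedged sketch of the eligibility test only; the function name and parameters are hypothetical, and a real scheduler considers more than node count and walltime.

```python
def can_backfill(job_nodes, job_walltime_h, idle_reserved_nodes, window_h):
    """True if the job fits on the earmarked idle nodes and its REQUESTED
    walltime ends before the large job is expected to need them.

    The check uses the requested walltime, not the actual runtime,
    because the request is all the scheduler knows.
    """
    fits_nodes = job_nodes <= idle_reserved_nodes
    fits_time = job_walltime_h <= window_h
    return fits_nodes and fits_time


if __name__ == "__main__":
    # 30 nodes earmarked for the 50 node job, 6 hour window.
    print(can_backfill(4, 5, idle_reserved_nodes=30, window_h=6))  # True
    print(can_backfill(4, 8, idle_reserved_nodes=30, window_h=6))  # False
```

The same 4 hour job is eligible when it requests 5 hours and ineligible when it requests 8, even though its actual runtime is identical in both cases.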
