If your job doesn't run and ends up in the BLOCKED JOBS section, you can use the checkjob command to get more information about why your job isn't running.
login-1:~: checkjob 4195
[ ... deleted for brevity ... ]
job is deferred. Reason: NoResources (cannot create reservation for job '4195' (intital reservation attempt))
Holds: Defer (hold reason: NoResources)
PE: 232.00 StartPriority: 200
cannot select job 4195 for partition DEFAULT (job hold active)
In this example, we see that the job was deferred because there are insufficient resources available to run the job. Once sufficient resources become available, the job will run automatically.
If instead you see the following as part of the checkjob output, it means that the job you are trying to run will exceed the allocation you have remaining. This may simply be because you did not specify a walltime as part of your job specification (a sketch of a job script with an explicit walltime follows the example output below). If your specifications are correct, you can either resubmit your job to your standard-priority account or to the free serial queue, or you can request an additional allocation from the committee.
login-1:~: checkjob 4204
[ ... deleted for brevity ... ]
job is deferred. Reason: BankFailure (cannot debit job account)
Holds: Defer (hold reason: BankFailure)
PE: 32.00 StartPriority: 200
cannot select job 4204 for partition DEFAULT (job hold active)
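If the problem is just a missing walltime, a minimal sketch of a job script that states the walltime and account explicitly is shown below; the account name, resource request, and program are only placeholders, so adjust them for your own job:
#!/bin/bash
#SBATCH -t 02:00:00     # wall time limit (hh:mm:ss)
#SBATCH -A my-account   # account to charge (placeholder)
#SBATCH -N 1            # number of nodes

./my_program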
If none of the above conditions apply, and your job is listed in the IDLE JOBS section, keep in mind that the squeue command lists jobs in priority order, with the highest-priority jobs listed first.
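For example, to see just your own pending jobs and where they sit, you can restrict squeue to your username and the pending state (these are standard squeue flags):
login-1:~: squeue -u $USER -t PENDING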
The debug partition is available for running short tests with reasonably fast turnaround. This is useful for verifying that your code, or your submit scripts, work as intended, especially if your "real" job would require a fair amount of time and/or nodes and would otherwise sit in the queue for a while.
Instructions on how to submit a job to the debug partition.
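As a rough sketch, a batch submission to the debug partition looks something like the following; the script name is a placeholder, and the 15-minute limit matches the debug limit described below:
login-1:~: sbatch -p debug --qos=debug -t 15 test_job.sh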
The individual compute nodes do not allow direct shell access except when the node is allocated to a job owned by you. If you need shell access to one or more nodes, you can request the scheduler assign some to you. This request gets put in the queue with all the batch job requests, and depending on the cluster usage at the moment you might get a prompt in seconds, or in minutes, or it might take hours or days.
The script sinteractive is provided to assist with this for most basic cases. It takes the following optional arguments:
-c NUMCPUS : specifies the number of CPU cores to request. Default is 1.
-a ACCOUNT : specifies the account to charge. Default is your default account.
-J NAME : specifies the name to use for the job. Default is "interactive".
-s SHELL : specifies the shell to start up on the assigned node. Default is your default login shell.
-t MINUTES : specifies the wall time limit for your interactive session, in minutes. Default is 60 (1 hour). You cannot request more than 8 hours (480 minutes) with this utility.
-d : if given, use the debug partition. The -t parameter is ignored, and the wallclock limit is set to 15 minutes.
-g GRES : specifies a generic resource required. If given, the resulting salloc will have --gres=GRES added.
-h : help. No interactive shell will be granted, but an explanation of these and some less common options will be given.
-D : dry-run. No interactive shell will be granted, but the salloc command that would have been run is printed out. Useful if you need to go beyond what the sinteractive script can do but want to use it as a starting point.
An example of using sinteractive:
login-2:~: sinteractive -t 120 -a test-hi
salloc: Granted job allocation 1561831
salloc: Waiting for resource configuration
salloc: Nodes compute-b19-14 are ready for job
DISPLAY is login-2.deepthought2.umd.edu:15.0
Try re-authenticating(K5). You have no Kerberos tickets
compute-b19-14:~:
[ do some work interactively ]
compute-b19-14:~: exit
logout
salloc: Relinquishing job allocation 1561831
salloc: Job allocation 1561831 has been revoked.
login-2:~:
The warning message
Try re-authenticating(K5). You have no Kerberos tickets
can be ignored. The batch mechanism will not forward your Kerberos tickets
to the compute node, but you probably do not need them there anyway. Should
you need them, you can issue the renew command and enter your
password to obtain Kerberos tickets.
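For instance, on the assigned node (the node name here just continues the example above):
compute-b19-14:~: renew
[ enter your password when prompted ]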
Issuing the command sinteractive -t 120 -a test-hi -g gpu will behave similarly, but you will be assigned a node with a GPU.
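Once the GPU session starts, a quick way to confirm the device is visible (assuming the node runs the usual NVIDIA tools, which is typical for GPU nodes) is:
compute-b19-14:~: nvidia-smi
[ table listing the GPU(s) and their utilization ]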
Although the sinteractive command can cover many of the more common situations, it is limited, and if you need more control over the request you will have to manually request an allocation and start an interactive shell on it. The -D flag runs sinteractive in dry-run mode: the script prints out what it would have run, but does not run it. This can be a useful starting point. Basically, you need to request the assignment of resources from the scheduler with the salloc command, and then use the srun command to start a process (typically a shell, if you wish to use the nodes interactively) on the node.
For example, if you want to request two separate nodes, try this:
login-1:~: salloc -N 2 -t 00:15:00 -p debug --qos=debug
salloc: Granted job allocation 13225
login-1:~: echo $SLURM_JOB_NODELIST
compute-b20-47,compute-b20-49
login-1:~: srun hostname
compute-b20-47.deepthought2.umd.edu
compute-b20-49.deepthought2.umd.edu
login-1:~: exit
exit
salloc: Relinquishing job allocation 13225
The salloc command will require that you specify a partition and QoS. You can use debug for both partition and QoS if your session is under 15 minutes. Otherwise, you will need to use the partition standard (or high-priority for the high-priority accounts) with one of the QoSes listed here.
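As a rough sketch, a longer interactive request against the standard partition might look like the following; the account and QoS names here are placeholders, so substitute your own account and one of the QoSes mentioned above:
login-1:~: salloc -N 1 -t 02:00:00 -p standard --qos=medium -A my-account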
Remember to exit the spawned subshell when you are done, to relinquish
the nodes that you have requested for the job.