--> Skip to main content

Monitoring and Managing Your Jobs, etc.

  1. Seeing what jobs are running/queued
  2. When will my job run
  3. Detailed information about your jobs
  4. Viewing output of jobs in progress
  5. Cancelling your jobs
  6. Monitoring the cluster

Seeing what jobs are running/queued

The slurm command to list what jobs are running is squeue, e.g.
login-1: squeue
           1243530 standard  test2.sh  payerle  R   18:47:23      2 compute-b18-[2-3]
           1244127 standard  slurm.sh    kevin  R    1:15:47      1 compute-b18-4
           1230562 standard  test1.sh  payerle PD       0:00      1 (Resources)
           1244242 standard  test1.sh  payerle PD       0:00      2 (Resources)
           1244095 standard  slurm2.sh   kevin PD       0:00      1 (ReqNodeNotAvail)

The ST column gives the state of the job, with the following codes:

The NODELIST(REASON) field will tell you on which nodes jobs that are currently running are running on. If the job is pending (i.e. not running), it will give a short explanation for why the job is not running (as of the last time the scheduler examined the job). Typically one might see something like:

Typically, if you see something note in the above list, there is a problem and you will want to contact systems staff to assist.

The squeue command also takes a wide range of options, including options to control what is output and how. See the man page (man squeue) for more information.

For example: if you add the following to your ~/.aliases file (assuming you are using a C-shell variant):

alias sqp 'squeue -S -Q -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %Q %R"'
when you next log in the command sqp will list jobs in the queue in order of descending priority.

When will my job start?

The scheduler tries to schedule all jobs as quickly as possible, subject to cluster policies, available hardware, allocation priority (contributers to the cluster get higher priority allocations), etc. Typically jobs run within a day or so, but this can vary and usage of the cluster can vary widely at times.

The command squeue command, with the appropriate arguments, can show you the scheduler's estimate of when a pending/idle job will start running. It is, of course, just the scheduler's best estimate, given current conditions, and the actual time a job starts might be earlier or later than that depending on factors such as the behavior of currently running jobs, the submission of new jobs, and hardware issues, etc.

To see this, you need to request that squeue show the %S field in the output format option, e.g.

login-1> squeue -o "%.9i %.9P %.8j %.8u %.2t %.10M %.6D %S"
      473  standard test1.sh  payerle PD       0:00      4 2014-05-08T12:44:34
      479  standard test1.sh    kevin PD       0:00      4 N/A
      489  standard tptest1.  payerle PD       0:00      2 N/A

Obviously, the times given are estimates. The job could start earlier if other jobs ahead of it in the queue do not use their full walltime, or could get delayed if jobs with a higher priority than yours are submitted before your start time.

Detailed information about your jobs

To get more detailed information about your job, you can use the scontrol show job JOBNUMBER command. This command provides much detail about your job, eg.

login-2> scontrol show job 486
JobId=486 Name=test1.sh
   UserId=payerle(34676) GroupId=glue-staff(8675)
   Priority=33 Account=test QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:03:00 TimeMin=N/A
   SubmitTime=2014-05-06T11:20:20 EligibleTime=2014-05-06T11:20:20
   StartTime=Unknown EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=standard AllocNode:Sid=pippin:31236
   ReqNodeList=(null) ExcNodeList=(null)
   NumNodes=2 NumCPUs=8 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)

Viewing output of jobs in progress

Slurm outputs the stdout and stderr streams for your job to the files you specified on the shared filesystem in real time. There is no need for an extra command like qpeek under the PBS/Moab/Torque environment.

Cancelling Your Jobs

Sometimes one needs to kill a job. To kill/cancel a job that is waiting in the queue, or is already running, use the scancel command:

login-1> scancel -i 122488
Cancel job_id=122488 name=test1.sh partition=standard [y/n]? y

Monitoring the Cluster

Notices of scheduled and unscheduled outages, issues, etc on the clusters will be announced on the appropriate mailing lists (e.g. HPCC Announce for the Deepthought* clusters) --- users are automatically subscribed to these lists when they get access to the cluster.

Sometimes you want a broader overview of the cluster. The squeue command can give you information on what jobs are running on the cluster. The sinfo -N command can show you attributes of the nodes on the cluster. But both of these use a text orientated display, which while providing fairly dense amount of information, is often difficult to digest.

The sview command uses real (not text mode) graphics to show the status of the cluster. As such it requires an X server running on the computer you are sitting at. This will present a graphical overview of the nodes in the cluster and their state, as well as the job queue.

PLEASE SET THE REFRESH INTERVAL to something like 300 seconds (5 minutes). Select Options| Set Refresh Interval. The application default is far too frequent and causes performance issues.

For an even prettier view, there are online pages for monitoring for the clusters at:

Back to Top