Displaying Cluster Information

Slurm offers a range of commands for interacting with the cluster. In this section, we will explore some examples of using the sinfo, squeue, scontrol, and sacct commands, which provide valuable insights into the cluster's configuration and status. For comprehensive information on all the commands supported by Slurm, please refer to the Slurm project website.

Command: sinfo

The sinfo command displays information about the cluster's state, its partitions (subdivisions of the cluster), nodes, and available computing resources. A multitude of options is available to specify exactly what information about the cluster should be displayed. For more precise control over the output, we can consult the documentation, which describes the various options and switches of the sinfo command.

Display general information about the cluster configuration:

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
gridlong*    up 14-00:00:0      4 drain* nsc-gsv001,nsc-lou001,nsc-msv001,nsc-vfp002
gridlong*    up 14-00:00:0      3  down* nsc-fp003,nsc-gsv003,nsc-msv006
gridlong*    up 14-00:00:0      1  drain nsc-vfp001
gridlong*    up 14-00:00:0      3  alloc nsc-lou002,nsc-msv[003,018]
gridlong*    up 14-00:00:0      3   resv nsc-fp[005-006],nsc-msv002
gridlong*    up 14-00:00:0     24    mix nsc-fp[002,004,007-008],nsc-gsv[002,004-007],nsc-msv[004-005,007-017,019-020]
gridlong*    up 14-00:00:0      1   idle nsc-fp001
e7           up 14-00:00:0      2 drain* nsc-lou001,nsc-vfp002
e7           up 14-00:00:0      1  drain nsc-vfp001
e7           up 14-00:00:0      1  alloc nsc-lou002

In the above output, you can see the available logical partitions, their state, the time limit for jobs in each partition, and the lists of compute nodes associated with them. The output can be customized using appropriate options to display specific information based on your requirements.
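
For example, we can choose exactly which columns are printed with the --format option; the field specifiers below (partition, availability, time limit, node count, state, node list) are standard sinfo format codes, and this particular selection is just one illustrative choice:

$ sinfo --format="%P %a %l %D %t %N"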

Display detailed information about compute nodes:

$ sinfo --Node --long
Tue Jan 05 11:06:02 2021
NODELIST    NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
nsc-fp001       1 gridlong*   allocated 16      2:8:1  64200        0   1000 intel,gp none
nsc-fp002       1 gridlong*   allocated 16      2:8:1  64200        0   1000 intel,gp none
nsc-fp003       1 gridlong*   allocated 16      2:8:1  64200        0   1000 intel,gp none
nsc-fp004       1 gridlong*   allocated 16      2:8:1  64200        0   1000 intel,gp none
nsc-fp005       1 gridlong*    reserved 16      2:8:1  64200        0   1000 intel,gp none
nsc-fp006       1 gridlong*    reserved 16      2:8:1  64200        0   1000 intel,gp none
nsc-fp007       1 gridlong*   allocated 16      2:8:1  64200        0   1000 intel,gp none
nsc-fp008       1 gridlong*   allocated 16      2:8:1  64200        0   1000 intel,gp none
nsc-gsv001      1 gridlong*    reserved 64     4:16:1 515970        0      1 AMD,bigm none
nsc-gsv002      1 gridlong*   allocated 64     4:16:1 515970        0      1 AMD,bigm none

The above output provides information about each compute node in the cluster, including its partition affiliation (PARTITION), current state (STATE), number of CPUs (CPUS), number of processor sockets (S), cores per socket (C), hardware threads per core (T), amount of system memory in megabytes (MEMORY), and any assigned features (AVAIL_FEATURES), such as the processor type or the presence of GPUs.
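
If we are only interested in nodes in a particular state, we can filter the listing with the --states option; the example below asks for currently idle nodes:

$ sinfo --Node --long --states=idle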

Cluster nodes may be reserved in advance for various reasons, such as maintenance, workshops, or specific projects. An example of displaying active reservations on the NSC cluster is as follows:

$ sinfo --reservation
RESV_NAME     STATE           START_TIME             END_TIME     DURATION  NODELIST
fri          ACTIVE  2020-10-13T13:57:32  2021-10-13T13:57:32  365-00:00:00  nsc-fp[005-006],nsc-gsv001,nsc-msv002

The above output shows the active reservations in the cluster, along with the duration of each reservation and the list of nodes it includes. Each reservation is associated with a user group that has exclusive access to it, so its members can run jobs on the reserved nodes without waiting behind jobs submitted by users outside the reservation.

Command: squeue

In addition to the cluster configuration, we are naturally interested in the state of the job queue. The squeue command lets us query jobs that are waiting in the queue, currently running, or have recently completed successfully or unsuccessfully (documentation).

Output of the current job queue status:

$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
387388  gridlong mc15_14T prdatlas PD       0:00      1 (Priority)
387372  gridlong mc15_14T prdatlas PD       0:00      1 (Resources)
387437  gridlong mc15_14T prdatlas PD       0:00      1 (Priority)
387436  gridlong mc15_14T prdatlas PD       0:00      1 (Priority)
385913  gridlong mc15_14T prdatlas  R   15:57:58      1 nsc-msv004
385949  gridlong mc15_14T prdatlas  R   13:47:49      1 nsc-msv017

From the output, we can retrieve the identifier of each individual job, the partition on which it is running, the job name, the user who launched it, and the current job status.

Some of the important job states are:

  • PD (PenDing) - the job is waiting in the queue,
  • R (Running) - the job is running,
  • CG (CompletinG) - the job is completing,
  • CD (CompleteD) - the job has completed,
  • F (Failed) - there was an error during execution,
  • S (Suspended) - the job execution is temporarily suspended,
  • CA (CAnceled) - the job has been canceled,
  • TO (TimeOut) - the job has been terminated due to a time limit.

The output also provides information about the total job runtime and the list of nodes on which the job is running, or the reason why the job has not started yet.
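
For jobs that are still pending, squeue can additionally report the scheduler's current estimate of their start times with the --start option (the estimate changes as the queue evolves):

$ squeue --start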

We are usually most interested in the status of jobs that we have launched ourselves. We can limit the output to jobs of a specific user using the --user option.

Example output of jobs owned by user gen012:

$ squeue --user gen012
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
381650  gridlong pmfuzzy_   gen012  R 7-03:13:06      1 nsc-msv020
381649  gridlong pmfuzzy_   gen012  R 7-03:15:06      1 nsc-msv018
381646  gridlong pmfuzzy_   gen012  R 7-03:18:28      1 nsc-msv008
381643  gridlong pmautocc   gen012  R 7-03:25:38      1 nsc-msv017
381641  gridlong pmautocc   gen012  R 7-03:28:26      1 nsc-msv007
381639  gridlong pmautocc   gen012  R 7-03:32:40      1 nsc-msv004

In addition, we can also limit the output to jobs in a specific state. This can be done using the --states option.

Example output of all currently pending (PD) jobs:

$ squeue --states=PD
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
387438  gridlong mc15_14T prdatlas PD       0:00      1 (Priority)
387437  gridlong mc15_14T prdatlas PD       0:00      1 (Priority)
387436  gridlong mc15_14T prdatlas PD       0:00      1 (Priority)
387435  gridlong mc15_14T prdatlas PD       0:00      1 (Priority)
387434  gridlong mc15_14T prdatlas PD       0:00      1 (Resources)
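
The filters can also be combined; for example, to list only the running jobs of a specific user (the user name is taken from the example above):

$ squeue --user gen012 --states=R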

Command: scontrol

Sometimes we require more detailed information about a specific partition, node, or job. This information can be obtained using the scontrol command (documentation). Below are some examples of how to use this command.

Example output of more detailed information about a specific partition:

$ scontrol show partition gridlong
PartitionName=gridlong
  AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
  AllocNodes=ALL Default=YES QoS=N/A
  DefaultTime=00:30:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
  MaxNodes=UNLIMITED MaxTime=14-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
  Nodes=nsc-msv0[01-12,14-15],nsc-gsv0[01-03],nsc-fp0[01-04,06-08],nsc-lou0[01-16],nsc-vfp0[01-02]
  PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
  OverTimeLimit=NONE PreemptMode=OFF
  State=UP TotalCPUs=4016 TotalNodes=42 SelectTypeParameters=NONE
  JobDefaults=(null)
  DefMemPerCPU=2000 MaxMemPerNode=UNLIMITED
  TRESBillingWeights=CPU=1.0,Mem=0.25G
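
If we omit the partition name, scontrol prints the configuration of all partitions in the cluster:

$ scontrol show partition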

Example output of more detailed information about the compute node nsc-lou003:

$ scontrol show node nsc-lou003
NodeName=nsc-lou003 Arch=x86_64 CoresPerSocket=64
  CPUAlloc=128 CPUTot=128 CPULoad=114.57
  AvailableFeatures=AMD,zen3
  ActiveFeatures=AMD,zen3
  Gres=(null)
  NodeAddr=nsc-lou003 NodeHostName=nsc-lou003 Version=20.11.9
  OS=Linux 5.15.81-1.el8.vega.x86_64 #1 SMP Tue Dec 6 15:38:27 CET 2022
  RealMemory=257355 AllocMem=254968 FreeMem=89742 Sockets=1 Boards=1
  State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1000 Owner=N/A MCS_label=N/A
  Partitions=gridlong
  BootTime=2023-03-13T11:24:12 SlurmdStartTime=2023-03-13T11:24:32
  CfgTRES=cpu=128,mem=257355M,billing=128
  AllocTRES=cpu=128,mem=254968M
  CapWatts=n/a
  CurrentWatts=0 AveWatts=0
  ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
  Comment=(null)

Example output of more detailed information about the job with ID 387489:

$ scontrol show job 387489
JobId=387489 JobName=mc15_14TeV_5005
  UserId=prdatlas002(19002) GroupId=prdatlas(19000) MCS_label=N/A
  Priority=34890 Nice=69 Account=prdatlas QOS=normal
  JobState=PENDING Reason=Priority Dependency=(null)
  Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
  RunTime=00:00:00 TimeLimit=1-13:52:00 TimeMin=N/A
  SubmitTime=2021-01-07T19:40:00 EligibleTime=2021-01-07T19:40:00
  AccrueTime=2021-01-07T19:40:00
  StartTime=Unknown EndTime=Unknown Deadline=N/A
  SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-01-07T19:52:02
  Partition=gridlong AllocNode:Sid=ctrl:1835997
  ReqNodeList=(null) ExcNodeList=(null)
  NodeList=(null)
  NumNodes=1 NumCPUs=8 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
  TRES=cpu=8,mem=40000M,node=1,billing=9
  Socks/Node=* NtasksPerN:B:S:C=8:0:*:* CoreSpec=*
  MinCPUsNode=8 MinMemoryCPU=5000M MinTmpDiskNode=0
  Features=(null) DelayBoot=00:00:00
  OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
  Command=/tmp/SLURM_job_script.rxZjWR
  WorkDir=/ceph/grid/session/C5FNDmdNpHynWeBumqyIfvmm6BJ3FoABFKDmXIwWDmABFKDmusGpDo
  StdErr=/ceph/grid/session/C5FNDmdNpHynWeBumqyIfvmm6BJ3FoABFKDmXIwWDmABFKDmusGpDo.comment
  StdIn=/dev/null
  StdOut=/ceph/grid/session/C5FNDmdNpHynWeBumqyIfvmm6BJ3FoABFKDmXIwWDmABFKDmusGpDo.comment
  Power=
  MailUser=(null) MailType=NONE

We can also check which users have permission to use reserved nodes:

$ scontrol show reservation
ReservationName=fri StartTime=2021-01-26T08:31:15 EndTime=2022-01-26T08:31:15 Duration=365-00:00:00
  Nodes=nsc-lou[001-002] NodeCnt=2 CoreCnt=256 Features=(null) PartitionName=(null) Flags=SPEC_NODES
  TRES=cpu=512
  Users=dsling01,dsling02,dsling03,root Groups=(null) Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
  MaxStartDelay=(null)

Command: sacct

With the sacct command, we can obtain more detailed accounting information about running and completed jobs.

For example, we can check the status of all jobs from the last day:

$ sacct --starttime $(date -d '1 day ago' +%D-%R) --format JobID,JobName,Elapsed,State,ExitCode
JobID    JobName    Elapsed      State ExitCode
------------ ---------- ---------- ---------- --------
2911393            perf   00:00:16  COMPLETED      0:0
2911393.ext+     extern   00:00:16  COMPLETED      0:0
2911393.0          perf   00:00:16  COMPLETED      0:0
2911546            perf   00:00:16  COMPLETED      0:0
2911546.ext+     extern   00:00:17  COMPLETED      0:0
2911546.0          perf   00:00:16  COMPLETED      0:0
2911581            perf   00:00:34  COMPLETED      0:0
2911581.ext+     extern   00:00:35  COMPLETED      0:0
2911581.0          perf   00:00:34  COMPLETED      0:0
2911658            perf   00:00:14  COMPLETED      0:0
2911658.ext+     extern   00:00:14  COMPLETED      0:0
2911658.0          perf   00:00:14  COMPLETED      0:0
2912298      singulari+   00:00:00 CANCELLED+      0:0
2912302      singulari+   00:00:47 CANCELLED+      0:0
2912302.ext+     extern   00:00:47  COMPLETED      0:0
2912302.0    singulari+   00:00:47 CANCELLED+      0:2
2912321      singulari+   00:00:10     FAILED      1:0
2912321.ext+     extern   00:00:10  COMPLETED      0:0
2912321.0    singulari+   00:00:10     FAILED      1:0
2912337      singulari+   00:00:02     FAILED    127:0
2912337.ext+     extern   00:00:03  COMPLETED      0:0

We can also inquire about the details of a specific job:

$ sacct --job=2911581 --format JobID,JobName,Elapsed,State,ExitCode
JobID    JobName    Elapsed      State ExitCode
------------ ---------- ---------- ---------- --------
2911581            perf   00:00:34  COMPLETED      0:0
2911581.ext+     extern   00:00:35  COMPLETED      0:0
2911581.0          perf   00:00:34  COMPLETED      0:0
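
The --format option accepts many additional accounting fields; for example, to get a rough picture of the resources a finished job actually used, we can request fields such as AllocCPUS, MaxRSS, and TotalCPU (the job ID is reused from the example above):

$ sacct --job=2911581 --format=JobID,JobName,AllocCPUS,MaxRSS,TotalCPU,Elapsed,State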

Exercise

You can find exercises to improve your knowledge of commands for querying cluster information at the following link.