Displaying Cluster Information
Slurm offers a range of commands for interacting with the cluster. In this section, we will explore some examples of using the sinfo, squeue, scontrol, and sacct commands, which provide valuable insight into the cluster's configuration and status. For comprehensive information on all the commands supported by Slurm, please refer to the Slurm project website.
Command: sinfo
The sinfo command displays information about the cluster's state, its partitions (subdivisions of the cluster), nodes, and available computing resources. A multitude of options is available for specifying which information about the cluster to display. For more precise control over the output, we can refer to the documentation, which details the various options and switches available with the sinfo command.
Display general information about the cluster configuration:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
all* up 2-00:00:00 7 drng wn[105,108,110,127,143,161-162]
all* up 2-00:00:00 1 drain wn051
all* up 2-00:00:00 2 resv wn[111-112]
all* up 2-00:00:00 48 mix wn[052,064-065,101-104,113-124,128-132,134-142,144-148,150-153,155-160]
all* up 2-00:00:00 16 alloc wn[012-016,053,061-062,106-107,109,125-126,133,149,154]
long up 14-00:00:0 2 drng wn[161-162]
long up 14-00:00:0 1 drain wn051
long up 14-00:00:0 10 mix wn[052,151-153,155-160]
long up 14-00:00:0 2 alloc wn[053,154]
gpu up 4-00:00:00 1 drain* wn224
gpu up 4-00:00:00 2 drain gwn03,wn221
gpu up 4-00:00:00 24 mix gwn[01-02,04-07],wn[201,203-211,214-218,220,222-223]
gpu up 4-00:00:00 4 alloc wn[202,212-213,219]
In the above output, you can see the available logical partitions, their state, the time limit for jobs in each partition, and the lists of compute nodes associated with them. The output can be customized using appropriate options to display specific information based on your requirements.
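The --format (or -o) option, for example, lets us choose exactly which columns are printed and for which partition; the column selection and field widths below are just one possible choice:
# custom columns for the gpu partition only
$ sinfo --partition=gpu --format="%.10P %.5a %.12l %.6D %.8t %N"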
Display detailed information about compute nodes:
$ sinfo --Node --long
Wed Oct 30 11:15:39 2024
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
gwn01 1 gpu mixed 64 2:16:2 256000 0 1 amd,geno none
gwn02 1 gpu mixed 64 2:16:2 256000 0 1 amd,geno none
gwn03 1 gpu drained 64 2:16:2 256000 0 1 amd,geno maint
gwn04 1 gpu mixed 64 2:16:2 256000 0 1 amd,geno none
gwn05 1 gpu mixed 64 2:16:2 256000 0 1 amd,geno none
gwn06 1 gpu mixed 64 2:16:2 256000 0 1 amd,geno none
gwn07 1 gpu mixed 64 2:16:2 256000 0 1 amd,geno none
wn012 1 all* allocated 16 1:8:2 336000 0 1 amd,rome none
wn013 1 all* allocated 16 1:8:2 336000 0 1 amd,rome none
wn014 1 all* allocated 16 1:8:2 336000 0 1 amd,rome none
wn015 1 all* allocated 16 1:8:2 320000 0 1 amd,rome none
wn016 1 all* mixed 16 1:8:2 336000 0 1 amd,rome none
...
The above output provides information about each compute node in the cluster, including its partition affiliation (PARTITION), current state (STATE), number of CPUs (CPUS), number of processor sockets (S), cores per socket (C), hardware threads per core (T), amount of system memory in megabytes (MEMORY), and any assigned features (AVAIL_FEATURES, truncated to AVAIL_FE in the output), such as the processor type, the presence of GPUs, and so on.
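If we are interested in only a handful of nodes, we can restrict the listing to them with the --nodes option; the node names below are taken from the output above:
# detailed view of the GPU nodes gwn01 through gwn07 only
$ sinfo --Node --long --nodes="gwn[01-07]"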
Cluster nodes may be reserved in advance for various reasons, such as maintenance, workshops, or specific projects. An example of displaying the active reservations in the Arnes cluster is as follows:
$ sinfo --reservation
RESV_NAME STATE START_TIME END_TIME DURATION NODELIST
fri ACTIVE 2024-10-10T13:22:20 2025-02-21T08:00:00 133-18:37:40 wn[111-112]
The above output shows the active reservations in the cluster, along with their duration and the list of nodes included in each reservation. Each reservation is associated with a group of users who have exclusive access to it; their jobs run on the reserved nodes without having to wait behind the jobs of users outside the reservation.
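Users listed in a reservation can direct their jobs to the reserved nodes with the --reservation option of sbatch or srun; the reservation name below matches the example above, while job.sh is just a placeholder script name:
# submit a batch job to the nodes reserved under the name "fri"
$ sbatch --reservation=fri job.sh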
Command: squeue
In addition to the cluster configuration, we are naturally interested in the state of the job queue. The squeue command allows us to inquire about jobs that are currently waiting in the queue, running, or have already completed successfully or unsuccessfully (documentation).
Output of the current job queue status:
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
52930230 all micmodel zkolenc PD 0:00 1 (Resources)
52934494 all 13502482 dn9134 PD 0:00 1 (Priority)
52934493 all 38884188 dn9134 PD 0:00 1 (Priority)
52934492 all 29829442 dn9134 PD 0:00 1 (Priority)
52934491 all 18316019 dn9134 PD 0:00 1 (Priority)
52934490 all 68789192 dn9134 PD 0:00 1 (Priority)
52934489 all 50851613 dn9134 PD 0:00 1 (Priority)
52934488 all 19730810 dn9134 PD 0:00 1 (Priority)
52934487 all 34141707 dn9134 PD 0:00 1 (Priority)
52934486 all 31707664 dn9134 PD 0:00 1 (Priority)
52934485 all 62596266 dn9134 PD 0:00 1 (Priority)
52934484 all 92148791 dn9134 PD 0:00 1 (Priority)
52934483 all 50552286 dn9134 PD 0:00 1 (Priority)
...
From the output, we can retrieve the identifier of each individual job, the partition on which it is running, the job name, the user who launched it, and the current job status.
Some of the important job states are:
- PD (PenDing) - the job is waiting in the queue,
- R (Running) - the job is running,
- CG (CompletinG) - the job is completing,
- CD (CompleteD) - the job has completed,
- F (Failed) - there was an error during execution,
- S (Suspended) - the job execution is temporarily suspended,
- CA (CAnceled) - the job has been canceled,
- TO (TimeOut) - the job has been terminated due to a time limit.
The output also provides information about the total job runtime and the list of nodes on which the job is running, or the reason why the job has not started yet.
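Like sinfo, squeue accepts a --format (or -o) option for choosing the columns; for example, the following selection (with arbitrary field widths) also prints the expected start time of pending jobs:
# job ID, partition, name, user, state, elapsed time, expected start time, reason/nodes
$ squeue --format="%.10i %.9P %.20j %.10u %.8T %.10M %.20S %R"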
We are usually most interested in the status of jobs that we have launched ourselves. We can limit the output to the jobs of a specific user using the --user option.
Example output of jobs owned by the user prdatlas006:
$ squeue --user=prdatlas006
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
52930355 all data22_1 prdatlas PD 0:00 1 (Priority)
52930498 all data23_1 prdatlas PD 0:00 1 (Priority)
52930487 all data23_1 prdatlas PD 0:00 1 (Priority)
52930459 all data22_1 prdatlas PD 0:00 1 (Priority)
52930457 all data22_1 prdatlas PD 0:00 1 (Priority)
52930456 all data22_1 prdatlas PD 0:00 1 (Priority)
52930455 all data22_1 prdatlas PD 0:00 1 (Priority)
52930454 all data22_1 prdatlas PD 0:00 1 (Priority)
52930453 all data22_1 prdatlas PD 0:00 1 (Priority)
...
In addition, we can limit the output to jobs in a specific state using the --states option.
Example output of all currently pending (PD) jobs:
$ squeue --states=PD
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
52930230 all micmodel zkolenc PD 0:00 1 (Resources)
52933624 all 34318835 dn9134 PD 0:00 1 (Priority)
52933625 all 37524431 dn9134 PD 0:00 1 (Priority)
52933626 all 19299379 dn9134 PD 0:00 1 (Priority)
52933627 all 63522836 dn9134 PD 0:00 1 (Priority)
52933628 all 20181126 dn9134 PD 0:00 1 (Priority)
52933629 all 23600582 dn9134 PD 0:00 1 (Priority)
52933630 all 40800209 dn9134 PD 0:00 1 (Priority)
52933631 all 29293675 dn9134 PD 0:00 1 (Priority)
52933632 all 11208460 dn9134 PD 0:00 1 (Priority)
52933633 all 25627117 dn9134 PD 0:00 1 (Priority)
52933634 all 77397217 dn9134 PD 0:00 1 (Priority)
52933635 all 34206014 dn9134 PD 0:00 1 (Priority)
...
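The filters can also be combined; for example, to list only our own jobs that are currently running:
# running (R) jobs belonging to the current user
$ squeue --user=$USER --states=R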
Command: scontrol
Sometimes we require more detailed information about a specific partition, node, or job. This information can be obtained using the scontrol command (documentation). Below are some examples of how to use this command.
Example output of more detailed information about a specific partition:
$ scontrol show partition all
PartitionName=all
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=YES QoS=N/A
DefaultTime=00:30:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=2-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=wn[012-016,051-053,061-062,064-065,101-162]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=8272 TotalNodes=74 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerCPU=2000 MaxMemPerNode=UNLIMITED
TRES=cpu=8272,mem=21630000M,node=74,billing=10561
TRESBillingWeights=CPU=1.0,Mem=0.5G
Example output of more detailed information about the compute node gwn01:
$ scontrol show node gwn01
NodeName=gwn01 Arch=x86_64 CoresPerSocket=16
CPUAlloc=32 CPUEfctv=64 CPUTot=64 CPULoad=31.38
AvailableFeatures=amd,genoa,gpu,h100
ActiveFeatures=amd,genoa,gpu,h100
Gres=gpu:2
NodeAddr=gwn01 NodeHostName=gwn01 Version=23.11.6
OS=Linux 4.18.0-553.22.1.el8_10.x86_64 #1 SMP Tue Sep 24 05:16:59 EDT 2024
RealMemory=256000 AllocMem=32768 FreeMem=232983 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=gpu
BootTime=2024-10-22T09:15:24 SlurmdStartTime=2024-10-22T09:15:53
LastBusyTime=2024-10-30T10:06:03
CfgTRES=cpu=64,mem=250G,billing=125,gres/gpu=2
AllocTRES=cpu=32,mem=32G,gres/gpu=2
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Example output of more detailed information about the job with ID 52930355:
$ scontrol show job 52930355
JobId=52930355 JobName=data22_13p6TeV_
UserId=prdatlas006(21006) GroupId=prdatlas(21000) MCS_label=N/A
Priority=240 Nice=21 Account=prdatlas QOS=normal
JobState=PENDING Reason=Priority Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=13:42:00 TimeMin=N/A
SubmitTime=2024-10-30T09:58:15 EligibleTime=2024-10-30T09:58:15
AccrueTime=2024-10-30T09:58:15
StartTime=Unknown EndTime=Unknown Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-10-30T11:38:18 Scheduler=Backfill:
Partition=all AllocNode:Sid=hpc:260900
ReqNodeList=(null) ExcNodeList=(null)
NodeList=
NumNodes=1 NumCPUs=8 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=8,mem=20496M,node=1,billing=10
Socks/Node=* NtasksPerN:B:S:C=8:0:*:* CoreSpec=*
MinCPUsNode=8 MinMemoryCPU=2562M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/tmp/SLURM_job_script.GZhe0X
WorkDir=/d/arc/session_ssd/9VhNDmIUKQ6nTkIgJmuwkNOnABFKDmABFKDmFhFKDmXsuLDm9qnwBn
StdErr=/d/arc/session_ssd/9VhNDmIUKQ6nTkIgJmuwkNOnABFKDmABFKDmFhFKDmXsuLDm9qnwBn.comment
StdIn=/dev/null
StdOut=/d/arc/session_ssd/9VhNDmIUKQ6nTkIgJmuwkNOnABFKDmABFKDmFhFKDmXsuLDm9qnwBn.comment
Power=
We can also check which users have permission to use reserved nodes:
$ scontrol show reservation
ReservationName=fri StartTime=2024-10-10T13:22:20 EndTime=2025-02-21T08:00:00 Duration=133-18:37:40
Nodes=wn[111-112] NodeCnt=2 CoreCnt=128 Features=(null) PartitionName=(null) Flags=IGNORE_JOBS,SPEC_NODES
TRES=cpu=256
Users=bcesnik,ma2144,lb1684,ab93070,nb91605,ac7723,nc04512,dd25660,ad8146,ug68213,sg5783,jg5045,ag18201,rh16274,vh82826,ii7102,hj3103,gk31332,sk0506,lk81496,bk8638,tm51226,am93067,nm16356,hm8067,lm63759,pm13167,mm5129,jo48340,mp04643,kr15575,gr2399,pr3601,js71723,ns4749,as0386,as8534,ks20412,tt59968,ft25802,zu0476,kv44391,mv74130,av22431,pz77679,nb9613,bb3110,tb5882,sc46604,mc7753,nc91840,je31875,lg7891,eh3501,mj4175,lj15233,ak66653,lk7075,mk5098,nk46668,tk97810,bk5254,ml70458,jl5985,jm9220,dm50679,kp69350,lp6573,ar3642,nr88276,jr7486,sr6111,ds33951,ms36332,ts92284,ls7424,at7379,tt4657,nu24080,pz28690,dz9232,dsluga,ratkop,patriciob,urosl,bb36988,ml2541,tc0588,ms6035 Groups=(null) Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
MaxStartDelay=(null)
Command: sacct
With the sacct command, we can obtain more information about running and already completed jobs.
For example, we can check the status of all of our jobs for the last three days:
$ sacct --starttime $(date -d '3 day ago' +%D-%R) --format JobID,JobName,Elapsed,State,ExitCode
JobID JobName Elapsed State ExitCode
------------ ---------- ---------- ---------- --------
52825111 SNNResNet+ 2-02:00:44 RUNNING 0:0
52825111.ba+ batch 2-02:00:44 RUNNING 0:0
52825111.ex+ extern 2-02:00:44 RUNNING 0:0
52825111.0 apptainer 2-02:00:43 RUNNING 0:0
52825135 test_steps 00:05:05 COMPLETED 0:0
52825135.ba+ batch 00:05:05 COMPLETED 0:0
52825135.ex+ extern 00:05:05 COMPLETED 0:0
52825135.0 hostname 00:00:00 COMPLETED 0:0
52825135.1 hostname 00:00:00 COMPLETED 0:0
52825135.2 hostname 00:00:00 COMPLETED 0:0
We can also inquire about the details of a specific job:
$ sacct --job=52825111 --format JobID,JobName,Elapsed,State,ExitCode
JobID JobName Elapsed State ExitCode
------------ ---------- ---------- ---------- --------
52825111 SNNResNet+ 2-01:59:22 RUNNING 0:0
52825111.ba+ batch 2-01:59:22 RUNNING 0:0
52825111.ex+ extern 2-01:59:22 RUNNING 0:0
52825111.0 apptainer 2-01:59:21 RUNNING 0:0
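Beyond timing and exit status, sacct can also report resource usage; the field selection below (maximum resident memory and consumed CPU time, printed in gigabytes) is just one useful combination of the fields listed in the sacct documentation:
# memory high-water mark and CPU time per job step, in gigabytes
$ sacct --job=52825111 --format=JobID,JobName,Elapsed,MaxRSS,TotalCPU,State --units=G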
Exercise
You can find exercises to improve your knowledge of commands for querying cluster information at the following link.