Running Jobs on the Cluster

In the following section, we will explore the srun and sbatch commands, which are used for job submission, and the scancel command, which allows us to cancel a running job.

Command: srun

The simplest way to run a job is by using the srun command. This command is followed by various options that allow us to specify the quantity and type of hardware resources required for our job, as well as other settings. You can find a detailed explanation of all available options in the documentation. We will explore some of the most commonly used options.

Using a Reservation

For the workshop, several nodes have been temporarily reserved, ensuring that no one else can launch their jobs on them. To select the reservation, use the --reservation=fri option. If the reservation does not exist, do not include this option.
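For example, assuming the fri reservation is active, the option is simply added in front of the command to be run (shown here only as a sketch; omit it if the reservation no longer exists):

$ srun --reservation=fri --ntasks=1 hostname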

To start, we will execute a simple system program called hostname, which displays the name of the node where it runs. Here is an example of running the hostname program on one of the compute nodes:

$ srun --ntasks=1 hostname
nsc-msv002.ijs.si

In the command line, we used the --ntasks=1 option. This option indicates that our job consists of a single task, and we want to launch a single instance of the hostname program. Slurm automatically assigns one of the CPU cores in the cluster and executes the job on it. The --ntasks=1 option can also be expressed in a shorter form as -n 1.
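For instance, the same job can be submitted with the short form of the option; the result is again the name of one of the compute nodes:

$ srun -n 1 hostname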

In the next step, we can try running multiple tasks within our job:

$ srun --ntasks=4 hostname
nsc-msv002.ijs.si
nsc-msv002.ijs.si
nsc-msv002.ijs.si
nsc-msv002.ijs.si

We immediately notice a difference in the output. Now, four identical tasks have been executed within our job. They were executed on four different CPU cores located on the same compute node (nsc-msv002.ijs.si).

Of course, we can also distribute our tasks across multiple nodes. We can do this in the following way:

$ srun --nodes=2 --ntasks=4 hostname
nsc-fp005.ijs.si
nsc-fp005.ijs.si
nsc-msv002.ijs.si
nsc-msv002.ijs.si

Now we have requested two nodes for our job and launched four tasks on them. From the output, we can see that Slurm distributed our tasks evenly across the two nodes.
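If we want to be explicit about how the tasks are spread out, srun also provides the --ntasks-per-node option. The following sketch requests two tasks on each of the two nodes; the exact node names in the output will of course depend on the cluster:

$ srun --nodes=2 --ntasks-per-node=2 hostname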

We can always interrupt our job during execution by using the key combination Ctrl + C.

Command: sbatch

The disadvantage of the srun command is that it blocks the command line until the job is completed. Moreover, launching more complex jobs with many settings directly on the srun command line quickly becomes inconvenient. In such cases, it is preferable to use the sbatch command, where we write the job and task settings in a bash script file.

#!/bin/bash
#SBATCH --job-name=my_job_name
#SBATCH --partition=gridlong
#SBATCH --ntasks=4
#SBATCH --nodes=1
#SBATCH --mem-per-cpu=100MB
#SBATCH --output=my_job.out
#SBATCH --time=00:01:00

srun hostname

job.sh

In the code snippet above, we have an example of such a script. The script starts with the line #!/bin/bash (the shebang), which indicates that it is a bash script. Following that, we have a series of job settings listed line by line, each prefixed with #SBATCH. We have already discussed some of the settings, such as specifying the reservation, the number of tasks, and the number of nodes, in the context of the srun command. Let's go over the remaining settings:

  • --job-name=my_job_name: specifies the name of the job, which will be displayed when we query the job status using the squeue command,
  • --partition=gridlong: selects the partition within which we want to run the job; if this setting is omitted, the default partition will be used (see the sinfo sketch after this list),
  • --mem-per-cpu=100MB: specifies the amount of system memory required by each task of the job, measured per CPU core,
  • --output=my_job.out: sets the name of the output file where the contents that would be printed to the standard output (screen) by the job are redirected,
  • --time=00:01:00: sets the time limit for the job in the format hours:minutes:seconds.
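To see which partitions exist on the cluster, we can use the sinfo command; in its output, the default partition is typically marked with an asterisk in the PARTITION column. A minimal sketch:

$ sinfo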

Hint

Setting a time limit is good practice, as it helps the job scheduler manage the job queue more efficiently. When the scheduler knows the expected duration of a job, it can allocate resources more effectively and prioritize jobs accordingly. By specifying a time limit, we make it easier for the scheduler to find available resources for our job sooner.
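Slurm also accepts longer time formats, such as days-hours:minutes:seconds. As a sketch, a one-day limit could be written as:

#SBATCH --time=1-00:00:00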

The last line of the script then runs our task, which is the same as in the previous examples (hostname).

Save the content in the code block above to a file, for example job.sh, and then launch the job by running the following command:

$ sbatch ./job.sh
Submitted batch job 387508

We can see that the command has provided us with the job ID and immediately returned control to the command line. Once the job is completed (which we can check using the squeue command), a file named my_job.out will be created in the current directory. This file contains the output produced during job execution.
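For example, to check whether the job is still waiting in the queue or already running, we can filter the squeue output by our own username (a minimal sketch):

$ squeue -u $USER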

$ cat ./my_job.out
nsc-msv002.ijs.si
nsc-msv002.ijs.si
nsc-msv002.ijs.si
nsc-msv002.ijs.si

The dual role of the srun command

The srun command plays two different roles. In the script above, it launches the individual tasks of the job inside the allocation created by sbatch, whereas in the earlier examples we used it on its own to both configure and execute the entire job.
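As an illustration (a sketch, not part of the workshop script), srun inside a batch script inherits the resources requested with the #SBATCH directives, so it can also launch just a subset of the allocated tasks:

#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --time=00:01:00

srun hostname              # runs on all 4 allocated tasks
srun --ntasks=2 hostname   # runs on only 2 of the allocated tasks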

Command: scancel

To stop jobs that were launched using the sbatch command, we can use the scancel command. We simply need to provide the appropriate job identifier (JOBID).

$ scancel 387508
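scancel also accepts other selectors. For example, assuming we want to cancel all of our own jobs, or all jobs with a particular name, the following sketches should work:

$ scancel --user=$USER
$ scancel --name=my_job_name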

Exercise

You can find exercises to reinforce your knowledge of job submission commands on the cluster at the following link.