Using HPC Scheduler on Ada Lovelace Cluster
After getting the basics of using the HPC, an interactive session lets you use the resources of the HPC to achieve results quicker than on a standard computer. The full benefit of the HPC, however, comes from running multiple jobs at the same time using its vast resources.
How much this helps depends on the work being done, but for example, if a program that takes 3 hours on your computer can be run in 1 hour on the HPC, you save 2 hours; running 100 of those jobs one after another, however, would still take 100 hours. Using the HPC scheduler lets you run several jobs concurrently to break that linear growth in time, so you could run 50 jobs concurrently in the same 1 hour and finish all 100 jobs in about 2 hours instead of 100.
To start using Slurm (the HPC scheduler) you’ll need to write a batch script that tells the HPC how many resources the job needs and which programs and files need to be run. Below is a brief explanation of the most common variables and a link to an example script to get the basics.
Slurm
To practice submitting a job to the HPC you can view the example here, which shows how to submit a basic Python script. I recommend reading about the Slurm variables below as you work your way through, to get a better understanding of the structure of the script.
There are various Slurm commands that can be used to see information about jobs currently running and queued, as well as variables that tell the HPC what resources are needed. Slurm batch files are made up of variables starting with #SBATCH, and the file name ends in .sh. An example of what a complete batch file looks like can be seen here, but below the individual variables are broken down and explained briefly.
Where square brackets are used, they represent a value that needs to be filled in; when entering the value, remove the brackets. The first example below would become #SBATCH -J test
Every batch file needs to start with
#!/bin/bash
After that you can start selecting the variables needed for the job.
#SBATCH -J [Name of Job]
Example : #SBATCH -J test
This variable gives the submitted job a custom identifier in the queue for easier management. If it is not set, the name defaults to the name of the batch file, such as example.sh.
#SBATCH -c [number of CPUs required]
Example : #SBATCH -c 1
This variable denotes the number of CPU cores needed for the task. The number needed depends on the task being run, and the number available depends on the type of node requested.
#SBATCH --mem=[amount of memory required]G
Example : #SBATCH --mem=4G
Similar to the variable above, this one denotes how much standard memory is required for the task being submitted. The amount needed depends on the task being run, and the amount available depends on the type of node requested.
#SBATCH -p [partition]
Example : #SBATCH -p workq
This variable tells the system which nodes to launch the job on, as not all nodes have GPUs and they have different amounts of memory. A list of the nodes and their partition identifiers can be found on the getting started on HPC page.
#SBATCH -G [number of GPUs required]
Example : #SBATCH -G 1
The majority of GPU tasks should only request 1 GPU unless the processing time is excessively high; the highest available on a single node is 2, on the H100 nodes.
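For example, a job needing a single GPU would pair -G with a GPU partition. The partition name below is only a placeholder; check the getting started on HPC page for the actual identifier.
#SBATCH -p [gpu partition]
#SBATCH -G 1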
#SBATCH -t HHH:MM:SS
Example : #SBATCH -t 2:00:00
This variable nominates how long the process will run for. It can be removed from the script if the run time is unknown, but the job will then fall back to the queue’s default time limit.
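Slurm also accepts a days-hours:minutes:seconds format, which is easier to read for longer jobs. For example, to request 1 day and 12 hours:
#SBATCH -t 1-12:00:00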
#SBATCH -o [output_file].out
Example : #SBATCH -o test.out
Example : #SBATCH -o /home/python/test.out
This variable tells the HPC where to put the output from the submitted files. By default, it saves the file to the same directory that the batch file was submitted from, but you can also specify a custom directory such as /home/<username>/<dir>/[output_file].out. If left out entirely, the output file is placed in the submit directory and named “slurm-<job number>.out”.
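Slurm also supports filename patterns in the output path: %j is replaced with the job number, which stops repeated runs from overwriting the same file. For example:
#SBATCH -o test_%j.out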
#SBATCH -e [error_file].err
Example : #SBATCH -e test.err
Example : #SBATCH -e /home/python/test.err
This variable works identically to the one above but for the error file. If left blank, errors are written to the same file as the output from -o.
#SBATCH --mail-type=[Variables]
Example : #SBATCH --mail-type=END,FAIL
This variable lets you choose which events trigger an email. The most common options used are BEGIN, END, FAIL, TIME_LIMIT_50/80/90 (sent when the job reaches that percentage of its time limit) and ALL.
#SBATCH --mail-user=[your email address]
Example : #SBATCH --mail-user=l.decosta@cqu.edu.au
This variable specifies the email address that the notifications selected above are sent to.
After these directives you can start putting in Unix commands to get the HPC to load the modules needed and execute your files. The first command is usually to change into the directory containing the script and the files you’re working with. If the Slurm script is in another location you’ll need to specify where the files are, so we recommend keeping the script in the same directory as the rest of your work.
cd $SLURM_SUBMIT_DIR
This changes the working directory to the one the Slurm batch file was submitted from.
module load [module]
Example : module load Anaconda3/2024.06-1
This loads the specified module(s) necessary for your programs to run. While running an interactive session you will need to enter this command into the terminal to load the modules you need; once your program runs in that environment, take note of which modules are needed and add them to the script so it can run non-interactively.
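For example, inside an interactive session the standard module commands below let you see what is available and confirm what to add to your script:
module avail # list every module available on the cluster
module list # list the modules currently loaded in your session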
Once your script is complete and saved, it can be submitted by running the following command in the terminal
sbatch example.sh
You can also set variables from the terminal by leaving them out of the batch file and adding them before the file name. For example, you can designate a name for the job by leaving it out of the script and supplying the name on submission
sbatch -J test example.sh
This lets you quickly change the name of the job without having to edit the batch file.
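Several options can be overridden this way at once, for example changing both the name and the memory of the job without touching the script:
sbatch -J test2 --mem=8G example.sh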
An example of a complete Slurm script can be found below.
#!/bin/bash
###### Select resources #####
#SBATCH -J Job1
#SBATCH -c 1
#SBATCH --mem=4G
#SBATCH -p workq
#
#### Output File #####
#SBATCH -o job1.out
#
#### Error File #####
#SBATCH -e Job1.err
#
##### Mail Options #####
#SBATCH --mail-type=BEGIN,END,FAIL,TIME_LIMIT_50 # will send email when halfway through time allotment
#SBATCH --mail-user=l.decosta@cqu.edu.au
#
##### Change to current working directory #####
cd $SLURM_SUBMIT_DIR
##### Execute Program #####
module load Python/3.12.3-GCCcore-13.3.0
python ./myprogram.py
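Assuming the script above is saved as example.sh in the same directory as myprogram.py, a typical run looks like this (the job number Slurm replies with will differ):
sbatch example.sh # Slurm responds with "Submitted batch job <job number>"
cat job1.out # view the program's output once the job has finished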
After the job is running, there are commands you can enter in the terminal to view information about its status or to cancel it.
squeue
entered into the terminal will show you the current jobs running and in the queue for the HPC.
squeue -u [username]
will show you the jobs for that user. Most useful for checking on your current jobs.
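A handy shorthand is the $USER environment variable, which the shell expands to your own username:
squeue -u $USER
In the output, the ST column shows the job state: R means the job is running and PD means it is pending in the queue.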
scancel [job_ID]
cancels the job specified
scancel -u [user]
cancels all active jobs for that user. This only works on your own account, so you can’t cancel other users’ jobs.
More information about Slurm commands can be found here and more specific information for the various programs can be found on the