Using OpenAI Whisper on the HPC
When you want to transcribe audio to text, there are lots of options available. This page will walk you through using CQUniversity's HPC Cluster to generate a transcription with OpenAI's Whisper speech-to-text model. This may not be the best option for the data you need to transcribe, as it involves some technical knowledge and setup. However, when it comes to ethical and data security concerns, it is one of the safest ways to generate a transcription: because everything runs in a closed environment behind CQUniversity's firewall, the data never leaves the CQUniversity network.
Getting Started
To get started you'll need:
- An HPC5 account – contact TaSAC if you need one.
- Audio files in a format that is compatible with Whisper; the main ones are WAV, FLAC, MP3 and OGG.
Supported audio files and conversion options
If you need to convert your audio files to one of the compatible formats, you can use VLC media player (https://www.videolan.org/vlc/). Open the VLC software, select the Media menu option on the top bar, then select Convert/Save from the list. Open the file you want to convert and choose one of the formats above.
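Alternatively, FFmpeg (which is loaded as a module later in this guide) can convert audio from the command line. A minimal sketch, assuming a hypothetical input file named interview.m4a:
ffmpeg -i interview.m4a interview.wav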
Using the HPC Facility to transcribe audio files
Connecting to the HPC System
To get started, you will need to connect to the HPC using either SSH or the Open OnDemand GUI and start a terminal session.
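If you connect over SSH, the command looks something like the following, where <hpc-login-host> is a placeholder for the HPC login node's address (check the HPC documentation or contact TaSAC for the actual hostname):
ssh <username>@<hpc-login-host>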
Suggested data setup
In this guide, we will create the directory (folder) "LLM" within your home directory. You can name it whatever you like, but you will then need to change into this directory and create a subdirectory called "audio".
mkdir LLM
cd LLM
mkdir audio
After creating the audio directory, you'll need to map the HPC drive so you can upload your audio files to this directory on the HPC to be processed.
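If you prefer the command line, scp can copy files from your local computer instead of a mapped drive. A minimal sketch, with <hpc-login-host> again a placeholder for the actual login node address:
scp my-recording.wav <username>@<hpc-login-host>:~/LLM/audio/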
Recommended HPC modules that need to be loaded
You will need to load some HPC modules within a shell/terminal session so that you can use the Whisper software. The two recommended HPC modules that need to be loaded to use Whisper are:
- Python/3.12.3-GCCcore-13.3.0-whisper
- FFmpeg/7.0.2-GCCcore-13.3.0
Both of these can be loaded with the 'module load' command:
module load Python/3.12.3-GCCcore-13.3.0-whisper
module load FFmpeg/7.0.2-GCCcore-13.3.0
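You can confirm the modules loaded correctly with the standard module command:
module list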
Writing HPC scripts to process audio files to be transcribed
You have two options for writing the scripts that submit audio files to be transcribed.
- You can write the scripts directly on the HPC facility using a text editor, such as pluma (GUI text editor – desktop menu -> Accessories -> pluma), nano (command-line text editor) or vi (recommended for advanced users).
- Write the scripts with a text editor on your local computer, such as Notepad++ (Windows – available on Company Portal for CQU PCs or downloadable from the Notepad++ website), or another "pure" text editor. Microsoft Word is not recommended. If you use your local computer to develop the HPC scripts, you will need to transfer these files to the HPC facility to be able to use them.
Ensure you have changed into the "LLM" directory before creating the following files.
Python Code example
Create the following file whisper-test.py
Note: model_id below is set to the recommended model at the time of writing. You should be able to use other models when newer versions are released:
- model_id="openai/whisper-large-v3"
Ensure you change <username> in the base_dir path in this Python code example to your HPC username.
import os
import torch
from pathlib import Path
from transformers import pipeline
# ----------------------------
# Set device (GPU or CPU)
# ----------------------------
device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(f"Device set to: {device.upper()}")
# ----------------------------
# Load whisper model pipeline
# ----------------------------
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",  # Model ID can be changed if/when new models are released
    device=0 if torch.cuda.is_available() else -1,
)
# ----------------------------
# Set paths
# ----------------------------
base_dir = Path("/home/<username>/LLM") # change <username> to your HPC5 username
audio_dir = base_dir / "audio"
transcript_dir = base_dir / "transcript"
# Create job-specific output folder
job_name = os.getenv("SLURM_JOB_NAME", "local")
job_id = os.getenv("SLURM_JOB_ID", "0000")
job_output_dir = transcript_dir / f"{job_name}_{job_id}"
job_output_dir.mkdir(parents=True, exist_ok=True)
print(f"Saving transcripts to: {job_output_dir}")
# ----------------------------
# Find all audio files recursively
# ----------------------------
audio_files = [p for p in audio_dir.rglob("*") if p.is_file()]  # This will try to transcribe every file; it can be narrowed to certain file types (see below)
print(f"Found {len(audio_files)} audio files to process.\n")
# ----------------------------
# Process each file
# ----------------------------
for file_path in audio_files:
    try:
        print(f"Transcribing: {file_path}")
        # Convert Path object to string
        result = pipe(str(file_path), return_timestamps=True)
        # Build relative output path
        relative_path = file_path.relative_to(audio_dir)
        output_file = job_output_dir / relative_path.with_suffix(".txt")
        output_file.parent.mkdir(parents=True, exist_ok=True)
        # Save transcript
        with open(output_file, "w", encoding="utf-8") as f:
            f.write(result["text"])
        print(f"Saved to: {output_file}\n")
    except Exception as e:
        print(f"❌ Failed to transcribe {file_path}: {e}\n")
HPC Slurm Submission script
We will create a generic HPC submission script, whisper-test.slurm, that we can use on the CQUniversity HPC facility.
This script will submit the Python file whisper-test.py, located in the /home/<username>/LLM/ directory, to the HPC scheduler. The submission script requests 100 CPU cores and 80 GB of memory.
#!/bin/bash
#SBATCH --job-name=whisper-test-cpu
#SBATCH --partition=workq
#SBATCH --cpus-per-task=100 # Adjust based on your workload
#SBATCH --mem=80G # Adjust memory as needed
#SBATCH --output=logs/%x-%j.out # Save logs to a folder named 'logs'
#SBATCH --mail-user=<put your email here>
#SBATCH --mail-type=END
####commands####
module load Python/3.12.3-GCCcore-13.3.0-whisper
module load FFmpeg/7.0.2-GCCcore-13.3.0
python whisper-test.py
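Note that Slurm will not create the logs folder referenced by the --output line, so create it inside the LLM directory before submitting, otherwise the job may fail to start or its output will not be written:
mkdir -p ~/LLM/logs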
Submitting a job to the HPC facility to process your audio files
Once both the Python script "whisper-test.py" and the submission script "whisper-test.slurm" have been created, you can submit the job using the following command. Both scripts need to be in the LLM directory for this to work; if you want to keep the Slurm script elsewhere, you just need to give the full path to the Python script on the last line of the submission script.
sbatch whisper-test.slurm
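While the job is queued or running, you can check its status with the standard Slurm command:
squeue -u $USER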
Once the job has completed, the output log should be viewable at the following location (Slurm expands %x to the job name and %j to the job ID):
/home/<username>/LLM/logs/whisper-test-cpu-<jobid>.out
The transcripts themselves are saved under /home/<username>/LLM/transcript/whisper-test-cpu_<jobid>/, as set up in the Python script.
The above process starts one job that will go through and transcribe all designated audio files. If you have a large number of audio files, or very large ones, you may want to submit multiple jobs, each potentially processing a single audio file. This allows multiple audio files to be transcribed at the same time, getting you results faster; a rough sketch of one approach is shown below. If you think you need the extra throughput, please contact us at eresearch@cqu.edu.au so we can assist in modifying the code to run parallel jobs.
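As an illustration only, one common way to do this is a Slurm job array. The sketch below assumes whisper-test.py has been modified to accept a single file path as a command-line argument (the version above has not), that the job is submitted from the LLM directory, and that there are ten files in the audio directory:
#!/bin/bash
#SBATCH --job-name=whisper-array
#SBATCH --partition=workq
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --output=logs/%x-%A_%a.out
#SBATCH --array=0-9   # one array task per audio file; adjust to match your file count

module load Python/3.12.3-GCCcore-13.3.0-whisper
module load FFmpeg/7.0.2-GCCcore-13.3.0

# Pick the Nth file in the audio directory for this array task
FILES=(audio/*)
python whisper-test.py "${FILES[$SLURM_ARRAY_TASK_ID]}"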