HPC JOB SCHEDULER
By default, anything you run on the CQ University HPC systems must be executed on the “compute nodes”, unless you are performing simple tasks or doing a “quick test”.
The CQ University HPC facilities are large, shared resources. Unlike personal computers, these systems are used by multiple users at the same time. Because HPC usage varies over time, an “HPC Scheduler” is needed. The scheduler checks whether the requested resources are available. If they are, it executes the job on one of the available compute resources; if not, the request is “queued” until resources become available.
If users execute large jobs on any of the “Login” nodes, those nodes slow down and other users' work is affected.
The CQ University HPC facilities use “PBS Pro” as the scheduler for resource management. Information on PBS commands can be found in the “PBS Commands” user guide.
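As a rough illustration of what a request to the scheduler looks like, the sketch below is a minimal PBS Pro job script. The job name, queue name (`workq`, taken from the example output further down this page), and resource values are assumptions chosen for illustration, and the `job_summary` helper is hypothetical; check the sample scripts for the settings recommended on this system.

```shell
#!/bin/bash
# Minimal PBS Pro job script (illustrative values only).
#PBS -N Test-run1
#PBS -q workq
#PBS -l select=1:ncpus=4:mem=10gb
#PBS -l walltime=01:00:00

# PBS sets PBS_O_WORKDIR to the directory qsub was run from; fall back to
# the current directory so the script also runs outside the scheduler.
cd "${PBS_O_WORKDIR:-$PWD}" || exit 1

# Hypothetical helper: report where the job landed.
# PBS_JOBID is set by the scheduler when the job starts.
job_summary() {
    echo "Job ${PBS_JOBID:-not-submitted} running on $(uname -n)"
}

job_summary

# Replace this placeholder with the program you actually want to run.
echo "Hello from the compute node"
```

Assuming the script is saved as `myjob.pbs` (a hypothetical file name), it would be submitted from a login node with `qsub myjob.pbs`. The scheduler then either starts it on a compute node or queues it until the requested CPUs and memory are free.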
To make the scheduler easier to use, a number of PBS sample scripts have been created (see here for sample information). Additionally, some simple scripts have been created to highlight current HPC usage and to assist with deleting HPC jobs.
Command | Usage | Example Output |
---|---|---|
qusers | This will provide an overall summary of HPC usage | Thu Sep 12 12:32:30 EST 2013 |
myjobs | This command provides information on your current HPC jobs, together with a comparison of the resources requested from the HPC Scheduler against the compute resources actually used, for all “R”unning jobs. | bellj@newton:~> myjobs Jobs running for bellj -------------------------------- pbsserver: Req'd Req'd Elap Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time --------------- -------- -------- ---------- ------ --- --- ------ ----- - ----- 407256.pbsserve bellj workq Test-run1 65575 4 32 -- 01:00 R 00:55 n005[0]/0*8+n005[0]/1*8+n008[0]/0*8+n008[0]/1*8 407257.pbsserve bellj workq Test-run2 44175 4 32 -- 01:00 R 00:55 n009[0]/0*8+n009[0]/1*8+n022[0]/1*8+n023[0]/0*8 407260.pbsserve bellj workq Test-run5 86742 4 32 -- 01:00 R 00:55 n027[0]/1*8+n028[0]/0*8+n028[0]/1*8+gn002[1]/0*8 407828.pbsserve bellj workq STDIN -- 1 4 10gb -- Q -- -- ======================================================================== Job D Job #CPU's CPU's (%) Memory (gb) Memory (gb) Name requested Utilisation requested in use ======================================================================== 407256.pbsserver Test-run1 32 99 0 1 407257.pbsserver Test-run2 32 98 0 1 407260.pbsserver Test-run5 32 98 0 0 |
deletemyjobs | This command will delete all your submitted jobs (both “R”unning and “Q”ueued) |
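
The internals of `deletemyjobs` are not shown on this page. As a hedged sketch of how a similar result could be achieved with the standard PBS Pro commands `qselect` and `qdel`, the functions below list and then delete every job owned by the current user. The function names `list_my_jobs` and `delete_my_jobs` are hypothetical and are not part of the site's tooling.

```shell
#!/bin/bash
# List the IDs of all jobs owned by the current user (running and queued).
list_my_jobs() {
    qselect -u "${USER:-$(id -un)}"
}

# Delete every job returned by list_my_jobs.
delete_my_jobs() {
    local job_ids
    job_ids=$(list_my_jobs) || return 1
    if [ -z "$job_ids" ]; then
        echo "No jobs to delete"
        return 0
    fi
    # Intentionally unquoted: qselect prints one job ID per line and qdel
    # accepts several IDs as separate arguments.
    qdel $job_ids
}
```

Sourcing this file only defines the functions. Nothing is deleted until `delete_my_jobs` is called, so running `list_my_jobs` first is a simple way to check what would be removed.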