Running a Job on HPC using Slurm
Slurm (Simple Linux Utility for Resource Management) is an open-source job scheduler that allocates compute resources on clusters for queued, researcher-defined jobs. Slurm is deployed at many national and international computing centers and is used by approximately 60% of the TOP500 supercomputers in the world.
You can learn more about Slurm and its commands from the official Slurm website.
Requesting Job Resources
ATTENTION: We recommend that you submit sbatch Slurm jobs with the #SBATCH --export=NONE option to establish a clean environment; otherwise, Slurm will propagate current environment variables to the job. This could impact the behavior of the job, particularly for MPI jobs.
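For example, adding the following line near the top of a job script (see the "Job Scripts" section below) gives the job a clean environment:
#SBATCH --export=NONE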
In order to use the HPC Slurm compute nodes, you must first log in to a head node, hpc-login3 or hpc-login2, and submit a job.
- To request an interactive job, use the salloc command.
- To submit a job script, use the sbatch command.
- To check on the status of a job already in the Slurm queue, use the squeue and sinfo commands.
Creating a Job Script
One option for running a job on the HPC cluster is to set up a job script. This script will request cluster resources and list, in sequence, the commands that you want to execute. A job script is a plain text file that can be edited with a UNIX editor such as vi, nano, or emacs. Information on UNIX editors can be found on the ITS website.
To properly configure a job script, you will need to know the general script format, the commands you wish to use, how to request the resources required for the job to run, and, possibly, some of the Slurm environment variables.
Common Slurm Commands
The following is a list of common Slurm commands that will be discussed in more detail on this page. To learn about all the options available for each command, enter man <command> while logged into the Slurm environment.
Command | Definition |
sbatch | Submits a job script for queued execution |
scancel | Cancels a job |
scontrol | Displays Slurm state information; several options are available only to root |
sinfo | Displays the state of partitions and nodes |
squeue | Displays the state of jobs |
salloc | Requests an allocation and starts an interactive job in real time |
Specifying Resources
Slurm has its own syntax for requesting compute resources. Below is a summary table of some commonly requested resources and the Slurm syntax for obtaining them. For a complete listing of request syntax, run the command man sbatch.
Syntax | Meaning |
sbatch/salloc | Submit batch/interactive job |
--ntasks=<number> | Number of processes to run (default is 1) |
--mem=<number> | Total memory (single node) |
--mem-per-cpu=<number> | Memory per processor core |
--constraint=<attribute> | Node property to request (e.g. avx, IB) |
--partition=<partition_name> | Request specified partition/queue |
For more details on Slurm syntax, see “Determining What Resources to Request” below or the Slurm documentation at slurm.schedmd.com/sbatch.html.
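As a brief sketch of how these options combine, a hypothetical job script named my_job.sl (a placeholder) could be submitted with 16 tasks, 2GB of memory per core, and InfiniBand nodes:
sbatch --ntasks=16 --mem-per-cpu=2GB --constraint=IB my_job.sl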
Determining What Resources to Request
Before you submit a job for batch processing, it is important to know what the requirements of your program are so that it can run properly. Each program and workflow has unique requirements so we advise that you determine what resources you need before you sit down to write your script.
Keep in mind that while increasing the amount of compute resources you request may decrease the time it takes to run your job, it will also increase the amount of time your job spends waiting in the queue. You may request whatever resources you need but be mindful that other researchers need to be able to use those resources as well.
Below are some tips for determining the number of resources to ask for in your job script. These are options defined for the sbatch and salloc commands. There are additional options that you can find by checking the man pages for each command.
Tasks and CPUs Per Task
In Slurm terminology, a task is an instance of a running program.
If your program supports communication across computers, or you plan on running independent tasks in parallel, request multiple tasks with the following option. The default value is 1.
--ntasks=<number>
For more advanced programs, you can request both multiple tasks and multiple CPUs per task.
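For example, a hybrid MPI/OpenMP program (a sketch; ./my_hybrid_program is a placeholder) might request 4 tasks with 2 CPUs each, passing the per-task CPU count to OpenMP through the standard SLURM_CPUS_PER_TASK variable:
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=2

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./my_hybrid_program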
Memory
All programs require a certain amount of memory to function properly. To see how much memory your program needs, you can check its documentation, or run it in an interactive session and profile it with the top command. To specify the memory for your job, use the --mem-per-cpu option.
--mem-per-cpu=<number>
Where <number> is the memory per processor core (add a unit suffix such as MB or GB). The default is 1GB.
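As a rough way to gauge memory use before settling on a value, you can launch your program in an interactive session and watch it with top (my_program is a placeholder for your executable):
salloc --ntasks=1 --time=0:30:00
./my_program &
top -u $USER    # the RES column shows resident memory per process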
Walltime
If you do not define how long your job will run, it will default to 30 minutes. The maximum walltime for the main partition (queue) is 24 hours. If you are running a job in the main partition and it will not complete in 24 hours, please contact HPC at [email protected]. If a job does not complete within the runtime specified in the script, it will terminate. To specify the walltime for your job, use the time option.
--time=<hh:mm:ss>
Where <hh:mm:ss> represents hours, minutes, and seconds requested.
GPUs
Some programs can take advantage of the unique hardware architecture of a graphics processing unit (GPU). You will have to check your program's documentation for compatibility. To request a GPU, add one of the following lines to your script:
--gres=gpu:<number>
or
--gres=gpu:<GPU_type>:<number>
where:
- <number> is the number of GPUs per node requested.
- <GPU_type> is one of the following: k20, k40, k80, or p100.
Use the chart below to determine which GPU_type to request:
GPU_type | Max Number of GPUs Per Node | GPU Model |
k20 | 2 | NVIDIA Tesla K20 |
k40 | 2 | NVIDIA Tesla K40 |
k80 | 4 | NVIDIA Tesla K80 (condo nodes) |
p100 | 2 | NVIDIA Tesla P100 |
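For example, a hypothetical interactive session with two P100 GPUs on one node could be requested with:
salloc --ntasks=2 --gres=gpu:p100:2 --time=1:00:00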
For interactive sessions utilizing GPUs, after salloc has run and you are on a compute node, you will need to use the srun command to execute your commands:
[hpc3025]$ nvidia-smi
No devices were found
[hpc3025]$ srun -n1 nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.25                 Driver Version: 390.25                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20m         On    | 00000000:08:00.0 Off |                    0 |
| N/A   29C    P8    25W / 225W |      0MiB /  4743MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
Network/Cluster
Some programs solve problems that can be broken into pieces and distributed across multiple computers that communicate over a network. This strategy often delivers greater performance. HPC has compute nodes on two separate networks, InfiniBand (56 Gbps) and Myrinet (10 Gbps). To see these performance increases, your application or code must be specifically designed to take advantage of these low-latency networks.
To request a specific network, you can add the following line to your resource request:
--constraint=<network>
where <network> is either myri or IB.
Other
Besides the network a compute node lives on, there may be other node features that you need to specify for your program to run efficiently. Below is a table of some commonly requested node attributes that can be requested with the --constraint option of the sbatch and salloc commands.
Constraint | What It Does |
avx/avx2 | Advanced Vector eXtensions, optimized math operations |
Xeon | Request compute nodes with Intel Xeon processors |
Opteron | Request compute nodes with AMD Opteron processors |
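Multiple constraints can be combined, with & for AND and | for OR. As a sketch, the following requests a node that is on InfiniBand and also supports AVX2:
#SBATCH --constraint="IB&avx2"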
Environment Variables
Any environment variables that are set in your shell when you run sbatch will be passed to your job. For this reason, if your program needs certain environment variables set to function properly, it's best to set them in your job script. This also makes it easier to reproduce your job results later, if necessary.
NOTE: Environment variables will override any options set in a batch script and command line options will override any previously set environment variables.
Information on some of the common input and output environment variables available within the sbatch job script are listed in the chart below. For additional information, see the man page for sbatch.
Environment Variable | Definition |
$SLURM_JOB_ID | ID of job allocation |
$SLURM_SUBMIT_DIR | Directory from which the job was submitted |
$SLURM_JOB_NODELIST | List of nodes allocated to the job |
$SLURM_NTASKS | Total number of tasks in the job |
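As a brief sketch, a job script might use these variables to record where and how it ran (./my_program is a placeholder for your executable):
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --time=00:10:00

echo "Job $SLURM_JOB_ID was submitted from $SLURM_SUBMIT_DIR"
echo "Running $SLURM_NTASKS tasks on: $SLURM_JOB_NODELIST"

cd $SLURM_SUBMIT_DIR
srun ./my_program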
Interactive Jobs
Interactive jobs use the salloc command to allocate resources and put you in an interactive shell on the compute node(s). Review the "Determining What Resources to Request" section above to determine which resources you may need to include as options for this command.
Interactive jobs can be a helpful debugging tool when creating job scripts for batch submission, described in the next section. They let you experiment on compute nodes with command options and environment variables, with immediate feedback, which can be helpful in determining your workflow.
salloc [options]
Recommendation: using the --ntasks option enables Slurm to allocate resources efficiently.
For testing, we recommend the following command as a starting point:
salloc --ntasks=8 --time=1:00:00 --mem-per-cpu=2GB
Examples of Interactive Jobs in Slurm
To request a job to run 8 tasks on an IB node:
salloc --ntasks=8 --constraint=IB
To request a job to run 8 tasks on a Myrinet node:
salloc --ntasks=8 --constraint=myri
Job Scripts
After determining what your workflow will be and the compute resources needed, you can create a job script and submit it. To submit a script for a batch run you can use the command sbatch as in:
sbatch <job_script>
Here is a sample job script. We’ll break this sample script down, line by line, so you can see how a script gets put together.
#!/bin/bash
#SBATCH --ntasks=8
#SBATCH --time=01:00:00
cd /home/rcf-proj/tt1/test/
source /usr/usc/python/3.6.0/setup.sh
python my.py
In general, a job script can be split into three different parts:
Line 1: Interpreter
#!/bin/bash
- Specifies the shell that will be interpreting the commands in your script.
- In the example above the bash shell, HPC’s default, is being used. To avoid confusion, this should match your login shell.
Lines 2-3: Slurm Options
#SBATCH --ntasks=8
#SBATCH --time=01:00:00
- Request cluster resources.
- Lines that begin with #SBATCH are ignored by the shell interpreter but read by the job scheduler.
- #SBATCH --ntasks=<number>: specifies the number of tasks (processes) that will run in this job. In this example, 8 tasks will run.
- #SBATCH --time=<hh:mm:ss>: sets the maximum runtime for the job. In this example, the maximum runtime is 1 hour.
NOTE: Since 8 processor cores in total are being requested for up to 1 hour, the job can consume up to 8 core-hours (8 cores × 1 hour). This is the unit of measurement that the job scheduler uses to keep track of compute time usage.
Lines 4-6: Job commands
cd /home/rcf-proj/tt1/test/
source /usr/usc/python/3.6.0/setup.sh
python my.py
- These lines provide the sequence of commands needed to run your job.
- These commands will be executed on the allocated resources.
- cd /home/rcf-proj/tt1/test/: Changes the working directory to /home/rcf-proj/tt1/test/
- source /usr/usc/python/3.6.0/setup.sh: Prepares the environment to run Python 3.6.0.
- python my.py: Runs the program on the allocated resources. In this example it runs python with my.py, located in the current directory /home/rcf-proj/tt1/test/, as its argument.
Example of a Simple MPI Script in Slurm: Hello World MPI
This is an example of a simple MPI program that runs on multiple processors. It demonstrates the use of Slurm’s interactive mode and HPC’s openmpi setup. The program is available at /home/rcf-proj/workshop/introSLURM/helloMPI/helloWorldMPI.c
helloWorldMPI.c
#include "mpi.h" #include #include int main (int argc, char *argv[]) { int i, rank, size, namelen; char name [MPI_MAX_PROCESSOR_NAME]; MPI_Init (&argc, &argv); MPI_Comm_size (MPI_COMM_WORLD, &size); MPI_Comm_rank (MPI_COMM_WORLD, &rank); MPI_Get_processor_name (name, &namelen); printf ("Hello World from rank %d running on %s!\n", rank, name); if (rank == 0 ) printf ("MPI World size = %d processes\n", size); MPI_Finalize (); }
Either use your preferred editor to create the file for testing, or copy it from /home/rcf-proj/workshop/introSLURM/helloMPI/helloWorldMPI.c on HPC. You will need to source the OpenMPI setup script for your shell, then compile and test the code. Here is an example, using a bash shell, that copies the file to your home directory and tests it there.
[ttrojan@hpc-login3 ~]$ cp /home/rcf-proj/workshop/introSLURM/helloMPI/helloWorldMPI.c ~
[ttrojan@hpc-login3 ~]$ source /usr/usc/openmpi/1.8.8/setup.sh
[ttrojan@hpc-login3 ~]$ mpicc -o helloWorldMPI helloWorldMPI.c
[ttrojan@hpc-login3 ~]$ ls -l helloWorldMPI
-rwxr-xr-x 1 ttrojan nobody 8800 Feb 21 14:32 helloWorldMPI
[ttrojan@hpc-login3 ~]$ salloc --ntasks=30
----------------------------------------
Begin SLURM Prolog Wed 21 Feb 2018 02:34:35 PM PST
Job ID: 767
Username: ttrojan
Accountname: lc_usc1
Name: bash
Partition: quick
Nodes: hpc[3264,4005]
TasksPerNode: 15(x2)
CPUSPerTask: Default[1]
TMPDIR: /tmp/767.quick
Cluster: uschpc
HSDA Account: false
End SLURM Prolog
----------------------------------------
[ttrojan@hpc3264 ~]$ source /usr/usc/openmpi/1.8.8/setup.sh
[ttrojan@hpc3264 ~]$ srun --ntasks=30 --mpi=pmi2 ./helloWorldMPI
Hello World from rank 10 running on hpc3264!
Hello World from rank 19 running on hpc4005!
Hello World from rank 11 running on hpc3264!
Hello World from rank 3 running on hpc3264!
Hello World from rank 17 running on hpc4005!
Hello World from rank 4 running on hpc3264!
Hello World from rank 7 running on hpc3264!
Hello World from rank 2 running on hpc3264!
Hello World from rank 12 running on hpc3264!
Hello World from rank 21 running on hpc4005!
Hello World from rank 26 running on hpc4005!
Hello World from rank 9 running on hpc3264!
Hello World from rank 13 running on hpc3264!
Hello World from rank 22 running on hpc4005!
Hello World from rank 6 running on hpc3264!
Hello World from rank 5 running on hpc3264!
Hello World from rank 20 running on hpc4005!
Hello World from rank 15 running on hpc4005!
Hello World from rank 18 running on hpc4005!
Hello World from rank 14 running on hpc3264!
Hello World from rank 23 running on hpc4005!
Hello World from rank 28 running on hpc4005!
Hello World from rank 8 running on hpc3264!
Hello World from rank 27 running on hpc4005!
Hello World from rank 16 running on hpc4005!
Hello World from rank 25 running on hpc4005!
Hello World from rank 1 running on hpc3264!
Hello World from rank 29 running on hpc4005!
Hello World from rank 24 running on hpc4005!
Hello World from rank 0 running on hpc3264!
MPI World size = 30 processes
[ttrojan@hpc3264 ~]$ logout
salloc: Relinquishing job allocation 767
[ttrojan@hpc-login3 ~]$
The srun command used in the example above runs the helloWorldMPI program on 30 tasks. Slurm provides information about the job in the prolog; most of it is self-explanatory. Only 1 CPU was used per task, and the job ran across 2 nodes. Note that for multi-node jobs, the TasksPerNode line shows how the tasks are distributed across the allocated nodes. In this example, 15 tasks ran on each of hpc3264 and hpc4005.
Monitoring Your Jobs
To monitor the status of your jobs in the Slurm partitions, use the squeue command. You will only have access to see your queued jobs. Options to this command will help filter and format the output to meet your needs. See the man page for more information.
Squeue Option | Action |
--user=<username> | Lists only entries belonging to username; viewing other users' jobs is restricted to administrators |
--jobs=<job_id> | List entry, if any, for job_id |
--partition=<partition_name> | Lists entries only belonging to partition_name |
Here is an example of using squeue.
[ttrojan@hpc-login3 ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
537 quick helloWor ttrojan R 0:47 2 hpc[0981,1407]
The output of squeue provides the following information:
Squeue Output Column Header | Definition |
JOBID | Unique number assigned to each job |
PARTITION | Partition id the job is scheduled to run, or is running, on |
NAME | Name of the job, typically the job script name |
USER | User id of the job |
ST | Current state of the job (see table below for meaning) |
TIME | Amount of time job has been running |
NODES | Number of nodes job is scheduled to run across |
NODELIST(REASON) | If running, the list of the nodes the job is running on. If pending, the reason the job is waiting |
Valid Job States
Code | State | Meaning |
CA | Canceled | Job was canceled |
CD | Completed | Job completed |
CF | Configuring | Job resources being configured |
CG | Completing | Job is completing |
F | Failed | Job terminated with non-zero exit code |
NF | Node Fail | Job terminated due to failure of node(s) |
PD | Pending | Job is waiting for compute node(s) |
R | Running | Job is running on compute node(s) |
TO | Timeout | Job terminated upon reaching its time limit |
Job in Queue
Sometimes a long queue time is an indication that something is wrong; other times the cluster may simply be busy. You can check how much longer your job is expected to wait in the queue with the command:
squeue --start --job <job_id>
Please note that this is only an estimate based on current and historical utilization and results can fluctuate. Here is an example of using squeue with the start and job options.
[ttrojan@hpc-login3 ~]$ squeue --start --job 384
JOBID PARTITION NAME USER ST START_TIME NODES SCHEDNODES NODELIST(REASON)
384 main star-lac ttrojan PD 2018-02-12T16:09:31 2 (null) (Resources)
In the above example, the job is in a pending state because no resources are currently available that would allow it to launch. The job is expected to start at approximately 16:09:31 on 02-12-2018. This is an estimate, as jobs ahead of it may complete sooner, freeing up the necessary resources. If you believe there is a problem with your job starting, and you have checked your scripts for typos, send email to [email protected] with your job ID and a description of the problem, and we can check to see if anything is wrong.
Job is Running
Another mechanism for obtaining job information is the command scontrol show job <job_id>. This provides more detail on the resources requested and reserved for your job. It can tell you the status of your job, but not the status of the programs running within it. Here is an example using scontrol.
[ttrojan@hpc-login3 ~]$ scontrol show job 390
JobId=390 JobName=star-ali
UserId=ttrojan(12345) GroupId=uscno1(01) MCS_label=N/A
Priority=1 Nice=0 Account=lc_usc1 QOS=lc_usc1_maxcpumins
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=00:30:00 TimeMin=N/A
SubmitTime=2018-02-12T15:39:57 EligibleTime=2018-02-12T15:39:57
StartTime=2018-02-12T16:09:31 EndTime=2018-02-12T16:39:31 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=quick AllocNode:Sid=hpc-login3:21524
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null) SchedNodeList=hpc[0981,1407]
NumNodes=2-2 NumCPUs=2 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=2,mem=2048,node=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=1G MinTmpDiskNode=0
Features=[myri|IB] DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/auto/rcf-00/ttrojan
Power=
[ttrojan@hpc-login3 ~]$
When your job is done, check the log files to make sure everything has completed without incident.
Job Organization
Slurm has some handy options to help you stay organized when you add them to the job script or the salloc command.
Syntax | Meaning |
--mail-user=<email> | Where to send email alerts |
--mail-type=<BEGIN|END|FAIL|REQUEUE|ALL> | When to send email alerts |
--output=<out_file> | Name of output file |
--error=<error_file> | Name of error file |
--account=<account_id> | Account to charge CPU time to |
--job-name=<job_name> | Job name (will display in squeue output) |
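Putting a few of these options together, a sketch of an organized job script header might look like the following (the email address and file names are placeholders):
#!/bin/bash
#SBATCH --job-name=myAnalysis
#SBATCH [email protected]
#SBATCH --mail-type=END,FAIL
#SBATCH --output=myAnalysis.out
#SBATCH --error=myAnalysis.err
#SBATCH --ntasks=8
#SBATCH --time=01:00:00

cd /home/rcf-proj/tt1/test/
python my.py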
Getting Job Usage Statistics
It can be helpful, when fine-tuning your job or future requests, to know what resources a job actually used. The command
sacct --jobs=<job_id>
can provide usage statistics for jobs that are running and for those that have completed.
Output can be filtered and formatted to provide specific information, including requested memory and peak memory used during job execution. See the man pages for more information.
[ttrojan@hpc-login3 ~]$ sacct --jobs=383 --format=User,JobID,account,Timelimit,elapsed,ReqMem,MaxRss,ExitCode
User JobID Account Timelimit Elapsed ReqMem MaxRSS ExitCode
--------- ------------- ------------ ------------- ------------ ------------ ----------- --------
ttrojan 383 lc_usc1 02:00:00 01:28:59 1Gc 0:0
383.extern lc_usc1 01:28:59 1Gc 0:0
[ttrojan@hpc-login3 ~]$
Canceling a Job
Whether your job is running or waiting in the queue, you can cancel it using the scancel <job_id> command. Use the squeue command if you do not recall the job ID.
[ttrojan@hpc-login3 ~]$ scancel 384
[ttrojan@hpc-login3 ~]$
Monitoring the Partitions in the Clusters
To see an overall status of the partitions and nodes in the clusters run the sinfo command. As with the other monitoring commands, there are additional options and formatting available.
ATTENTION: Jobs are automatically routed to the general queue partitions: quick, main, and large. Attempting to submit jobs directly to these partitions may cause your submission to fail.
[ttrojan@hpc-login3 ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
route_queue* up 1:00:00 4 idle hpc[0981,1407,3264,4005]
quick up 1:00:00 4 idle hpc[0981,1407,3264,4005]
main up 1-00:00:00 4 idle hpc[0981,1407,3264,4005]
large up 1-00:00:00 4 idle hpc[0981,1407,3264,4005]
long up 14-00:00:0 4 idle hpc[0981,1407,3264,4005]
haas up 14-00:00:0 2 idle hpc[1053-1054]
isi up 14-00:00:0 18 idle hpc[1540-1557]
scec up 14-00:00:0 37 idle hpc[4192-4226,4237-4238]
scec_hiprio up 14-00:00:0 37 idle hpc[4192-4226,4237-4238]
[ttrojan@hpc-login3 ~]$
Monitoring the Nodes in the Clusters
To get detailed information on a particular compute node, use the scontrol show node=<nodename> command.
[ttrojan@hpc-login3 ~]$ scontrol show node="hpc4005"
NodeName=hpc4005 Arch=x86_64 CoresPerSocket=8
CPUAlloc=16 CPUErr=0 CPUTot=16 CPULoad=1.01
AvailableFeatures=IB,avx,avx2,xeon,E5-2640v3,nx360
ActiveFeatures=IB,avx,avx2,xeon,E5-2640v3,nx360
Gres=(null)
NodeAddr=hpc4005 NodeHostName=hpc4005 Version=17.02
OS=Linux RealMemory=63000 AllocMem=16384 FreeMem=45957 Sockets=2 Boards=1
MemSpecLimit=650
State=ALLOCATED ThreadsPerCore=1 TmpDisk=0 Weight=16 Owner=N/A MCS_label=N/A
Partitions=route_queue,quick,main,large,long,testSharedQ,restrictedQ,preemptMeQ,preemptYouQ
BootTime=2018-02-08T04:08:36 SlurmdStartTime=2018-02-09T12:55:53
CfgTRES=cpu=16,mem=63000M
AllocTRES=cpu=16,mem=63000M
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
[ttrojan@hpc-login3 ~]$
Getting Help
If you need help using Slurm, please email us at [email protected].