Running computations on the Slurm cluster

What is the Slurm cluster?

Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. More about Slurm can be found on the official site.

Migrating from Torque/PBS to SLURM

Task                         | Torque/PBS                      | SLURM
-----------------------------|---------------------------------|------------------------------------------
Submit a job                 | qsub myjob.sh                   | sbatch myjob.sh
Delete a job                 | qdel 123                        | scancel 123
Show job status              | qstat                           | squeue
Show expected job start time | - (showstart in Maui/Moab)      | squeue --start
Show queue info              | qstat -q                        | sinfo
Show job details             | qstat -f 123                    | scontrol show job 123
Show queue details           | qstat -Q -f <queue>             | scontrol show partition <partition_name>
Show node details            | pbsnodes n0000                  | scontrol show node n0000
Show QoS details             | - (mdiag -q <QoS> in Maui/Moab) | sacctmgr show qos <QoS>
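
As a concrete illustration, a typical Torque submission that passed resource requests via -l translates to sbatch options as follows (myjob.sh and the resource numbers are just placeholders):

# Torque/PBS (old)
$ qsub -l walltime=12:00:00,mem=4gb -N myjob myjob.sh

# Slurm (new)
$ sbatch --time=12:00:00 --mem=4G --job-name=myjob myjob.sh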

Resource sharing and job prioritisation

To optimise the utilisation of the computing resources, certain resource-sharing and job prioritisation policies are applied to jobs submitted to the Slurm cluster. The implications for users can be seen from three aspects: cluster limits, job limits, and job priority.

Cluster limits

There are cluster-wide limits on resource usage and job submission per user. Those are:

limit                  | value (per user)
-----------------------|-----------------
number of running jobs | 200
number of queued jobs  | 2000
total memory           | 2560 GB
total CPU cores        | 200
total GPUs             | 2

Beyond these limits, the user is allowed to run a few high-priority interactive jobs (i.e. jobs submitted to the interactive partition) with the following limits:

partition   | runnable jobs | queued jobs | total memory | total CPU cores
------------|---------------|-------------|--------------|----------------
interactive | 2             | 4           | 128 GB       | 32
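
If these per-user limits are implemented as Slurm QoS settings (the exact QoS names may differ from what is shown above), you can inspect them yourself with sacctmgr, for example:

$ sacctmgr show qos format=Name,MaxWall,MaxJobsPU,MaxSubmitPU,MaxTRESPU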

Job limits

Each job is limited by a maximum amount of walltime and memory. Jobs with resource requirements beyond these limits will be rejected at submission.

partition   | max. walltime | max. memory
------------|---------------|------------
batch       | 72 hours      | 256 GB
interactive | 72 hours      | 64 GB
gpu         | 72 hours      | 256 GB
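
These per-partition limits can also be looked up in the partition definitions themselves; for example (the exact fields shown depend on the cluster configuration, and memory limits may be enforced via QoS rather than at the partition level):

$ scontrol show partition batch | grep -E 'MaxTime|MaxMemPer'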

Job priority

partition   | priority
------------|---------
batch       | normal
interactive | high
gpu         | normal

Job priority determines the order in which waiting jobs are started in the cluster. Job priority is calculated based on various factors. In the cluster at DCCN, mainly the following two factors are considered:

  1. The waiting time a job has spent in the queue: this factor adds one priority point for every additional minute a job waits in the queue.

  2. Partition priority: this factor is mainly used to boost interactive jobs (i.e. jobs submitted to the interactive partition) with an extra priority offset so that they are started sooner than other types of jobs.

The final job priority, combining the two factors, is used by the scheduler to order the waiting jobs. The first job on the ordered list is the next to start in the cluster.

Note: Job priority calculation is dynamic and not completely transparent to users. One should keep in mind that the cluster does not treat jobs as “first-come, first-served”.
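
To see how the scheduler has weighted these factors for your own pending jobs, you can ask Slurm's sprio command to break the priority down per factor:

$ sprio --user=$USER

or, for a single job:

$ sprio -j <job_id>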

The slurm module

Wrapper scripts, such as vncmanager, matlab, rstudio, pycharm, etc., are available via the environment module slurm.

$ module load slurm

Interactive job

The simplest way to submit an interactive job is using the sbash wrapper script as it takes care of the settings and options required for running graphical applications.

Hereafter is an example command to start an interactive job requesting 10 hours of walltime and 4 GB of memory:

$ sbash --time=10:00:00 --mem=4gb

The terminal will be blocked until the job starts on the compute node.

Similarly, you could also use the native Slurm command srun, for example:

$ srun --time=10:00:00 --mem=4gb -p interactive --pty bash -i

If you intend to run graphical applications, the interactive job should be submitted with an additional --x11 option. For example,

$ srun --x11 --time=10:00:00 --mem=4gb -p interactive --pty bash -i

If you additionally require a GPU, the interactive job should be submitted with the --partition=gpu --gres=gpu:1 options instead of -p interactive. For example,

$ srun --x11 --partition=gpu --gres=gpu:1 --time=01:00:00 --mem=4gb --pty bash -i

Batch job

  1. prepare a batch job script like the one below and save it to a file, e.g. slurm_first_job.sh:

    #!/bin/bash
    #SBATCH --job-name=myfirstjob
    #SBATCH --nodes=1
    #SBATCH --time=0-00:05:00
    #SBATCH --mail-type=FAIL
    #SBATCH --partition=batch
    #SBATCH --mem=5GB
    
    hostname
    
    echo "Hello from job: ${SLURM_JOB_NAME} (id: ${SLURM_JOB_ID})"
    
    sleep 600
    

    The script is essentially a bash script with a few comment lines right after the script’s shebang (i.e. the first line). These comment lines start with #SBATCH, followed by the same options supported by Slurm’s job submission program sbatch.

  2. submit the job script to Slurm

    $ sbatch slurm_first_job.sh
    Submitted batch job 951
    

A job id is returned after job submission. In the example above, the job id is 951.
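
If you need the job id programmatically, for example to query or cancel the job later from a submission script, the --parsable option makes sbatch print only the id:

$ jobid=$(sbatch --parsable slurm_first_job.sh)
$ scontrol show job $jobid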

In the example above, sbatch options were defined in the job script. You can, however, also pass them directly (overruling the options in the job script), e.g. like this:

$ sbatch --mem=1G --time=00:01:00 slurm_first_job.sh

You can even pass your script directly, using a so-called “Here” document (heredoc, defined by a start << EOF and end EOF):

$ sbatch --mem=1G --time=00:01:00 << EOF
#!/bin/bash
echo "Hello world! No script had to be written to disk to run me :-)"
EOF

Job status and information

One can use squeue --me to get an overview of one’s own running and pending jobs.

$ squeue --me
    JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    951   batch     myfirstj   honlee  R       0:05      1 dccn-c079

To get a job’s detailed information, one uses the command scontrol:

$ scontrol show job 951
JobId=951 JobName=myfirstjob
UserId=honlee(10343) GroupId=tg(601) MCS_label=N/A
Priority=829 Nice=0 Account=tg QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:03:16 TimeLimit=00:05:00 TimeMin=N/A
SubmitTime=2023-08-24T16:19:17 EligibleTime=2023-08-24T16:19:17
AccrueTime=2023-08-24T16:19:17
...

Note

squeue and scontrol can only be used to display the status/information of running and pending jobs. Use the command sacct to get information about historical jobs.

Once the job is completed, one should use the sacct command to get the information:

$ sacct -j 951
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
951          myfirstjob      batch         tg          1    TIMEOUT      0:0
951.batch         batch                    tg          1  CANCELLED     0:15
951.extern       extern                    tg          1  COMPLETED      0:0

sacct has an option --json to dump the output in JSON format. It can be used together with jq for further processing of the job information. For example, to get the nodes on which resources were allocated for the job:

$ sacct --json -j 951 | jq -r '.jobs[] | .nodes'
dccn-c079

Job deletion

To delete a running or pending job, one uses the scancel command:

$ scancel 951
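
scancel also accepts filters, which is convenient when more than one job has to be removed; for example, to cancel all of your own jobs, or only those with a given job name (use with care):

$ scancel --user=$USER
$ scancel --user=$USER --name=myfirstjob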

Output streams of the job

On the compute node, the job itself is executed as a process in the system. The default STDOUT and STDERR streams of the process are both redirected to a file named slurm-<job_id>.out in the directory from which the job was submitted. The file is available from the start of the job.
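
If you prefer different file names, or separate files for STDOUT and STDERR, the sbatch options --output and --error override this default; in the file name pattern, %j is replaced by the job id and %x by the job name. For example, adding the following lines to the job script writes the two streams to separate files (logs is a hypothetical directory that must exist before the job starts, as Slurm will not create it):

#SBATCH --output=logs/%x-%j.out
#SBATCH --error=logs/%x-%j.err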

Specifying resource requirement

Each job submitted to the cluster comes with a resource requirement. The job scheduler and resource manager of the cluster make sure that the needed resources are allocated to the job. To allow the job to complete successfully, it is important to specify the right and sufficient amount of resources at submission time. Some common examples are given below.

1 CPU core, 4 GB memory, and 12 hours wallclock time

$ sbatch -N 1 -c 1 --ntasks-per-node=1 --mem=4G --time=12:00:00 job.sh

4 CPU cores on a single node, 12 hours wallclock time, and 4 GB memory

$ sbatch -N 1 -c 4 --ntasks-per-node=1 --mem=4G --time=12:00:00 job.sh

1 CPU core, 500 GB of free local “scratch” disk space, 12 hours wallclock time, and 4 GB memory

$ sbatch -N 1 -c 1 --ntasks-per-node=1 --mem=4G --time=12:00:00 --tmp=500G job.sh

1 Intel CPU core, 4 GB memory, and 12 hours wallclock time

$ sbatch -N 1 -c 1 --ntasks-per-node=1 --mem=4G --time=12:00:00 --gres=cpu:intel job.sh

Here we ask for the allocated CPU core to be on a node with GRES cpu:intel.

4 CPU cores distributed over 2 nodes, 12 hours wallclock time, and 4 GB memory per node.

$ sbatch -N 2 -n 4 --mem=4G --time=12:00:00 job.sh

Here we use -n to specify the number of CPU cores we need, and -N to specify how many compute nodes the cores should be allocated from. In this scenario, the job (or the application the job runs) should take care of the communication between the processes distributed over multiple nodes. This is typical for MPI-like applications.
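
For illustration, a minimal sketch of such a multi-node job script is given below; my_mpi_app is a hypothetical MPI program, and how an MPI implementation is made available (e.g. via environment modules) depends on the software environment of the cluster:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --mem=4G
#SBATCH --time=12:00:00
#SBATCH --partition=batch

# srun launches the 4 tasks, distributed over the 2 allocated nodes
srun ./my_mpi_app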

1 GPU, interactive, with 12 hours wallclock time and 4 GB memory.

$ srun --partition=gpu --gres=gpu:1 --mem=4G --time=12:00:00 --pty /bin/bash

1 GPU, interactive, with a specific GPU specification, 12 hours wallclock time, and 4 GB memory.

$ srun --partition=gpu --gpus=nvidia_rtx_a6000:1 --mem=4G --time=12:00:00 --pty /bin/bash

2 GPUs, interactive, with a specific GPU specification, 12 hours wallclock time, and 4 GB memory.

$ srun --partition=gpu --gpus=nvidia_a100-sxm4-40gb:2 --mem=4G --time=12:00:00 --pty /bin/bash

Currently we have three types of GPUs available in the Slurm environment:

  • One node with 1x NVIDIA RTX A6000 48GB (specify as nvidia_rtx_a6000:1)

  • Three nodes with 1x NVIDIA A100 80GB each (specify as nvidia_a100_80gb_pcie:1)

  • Two nodes with 4x NVIDIA A100 40GB each (specify as nvidia_a100-sxm4-40gb:1)

This sums up to 12 GPUs in total.

The --partition=gpu option is needed. Without this option the job will fail.
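
To check which GPU types (GRES) are configured on which nodes, you can ask sinfo to print the GRES column of the gpu partition, for example:

$ sinfo -p gpu -o "%N %G"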

Estimating resource requirement

As we have mentioned, every job has attributes specifying the resources required for its computation. Based on those attributes, the job scheduler allocates resources to jobs. The more precisely these requirements are specified, the more efficiently the resources are used. Therefore, we encourage all users to estimate the resource requirements before submitting large numbers of jobs to the cluster.

The walltime and memory requirements are the most essential ones. Hereafter are three different ways to estimate those two requirements.

Note

Computing resources in the cluster are reserved for jobs in terms of size (e.g. the amount of requested memory and CPU cores) and duration (e.g. the requested walltime). Under-estimating the requirements causes the job to be killed before completion, so the resources already consumed by the job are wasted; over-estimating blocks resources from being used efficiently.

  1. Consult your colleagues

    If your analysis tool (or script) is commonly used in your research field, consulting your colleagues might just be an efficient way to get a general idea of the tool’s resource requirements.

  2. Monitor the resource consumption (with an interactive test job)

    A good way of estimating the walltime and memory requirements is to monitor their usage at run time. This approach is only feasible if you run the job interactively through a graphical interface. Nevertheless, it is encouraged to test your data analysis interactively once before submitting it to the cluster as a large number of batch jobs. Through the interactive test, one can easily debug issues and measure the resource usage.

    Upon the start of an interactive job, a resource consumption monitor is shown in the top-right corner of your VNC desktop. An example is shown in the following screenshot:

    ../../_images/slurm_interactive_jobinfo.png

    The resource monitor consists of three bars. From top to bottom, they are:

    • Elapsed walltime: the bar indicates the elapsed walltime consumed by the job. It also shows the remaining walltime. The walltime is adjusted according to the CPU speed.

    • Memory usage: the bar indicates the current memory usage of the job.

    • Max memory usage: the bar indicates the peak memory usage of the job.

  3. Check the epilogue information at the end of the job output stream

    For batch jobs, the epilogue script also writes the accounting information to the job’s output stream. One could also take it as a reference to determine the amount of resources needed for the computation.
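
    Independent of the epilogue output, the accounting database keeps the measured usage of finished jobs, so the peak memory (MaxRSS) and elapsed time of a completed test job can also be retrieved with sacct, for example:

    $ sacct -j <job_id> --format=JobID,Elapsed,Timelimit,MaxRSS,ReqMem,State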