Exercise: simple batch job
The aim of this exercise is to familiarize you with the Slurm client tools for submitting and managing cluster jobs. We will first create a script that calls the sleep command for a given period of time, and then submit the script as jobs to the cluster.
Tasks
Note
DO NOT just copy and paste the commands for the hands-on exercises! Typing (and occasionally making typos) is an essential part of the learning process.
make a script called run_sleep.sh with the following content:
#!/bin/bash
#SBATCH --job-name=sleep_1m
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:01:30
#SBATCH --mem=10MB
my_host=$( /bin/hostname )
time=$( date )
echo "$time: $my_host falls asleep ..."
sleep $1
time=$( date )
echo "$time: $my_host wakes up."
Note
An input argument of a bash script is accessible via the variable $n, where n is an integer referring to the n-th argument given to the script. In the script above, the value $1 on the line sleep $1 refers to the first argument given to the script. For instance, if you run the script as run_sleep.sh 10, the value of $1 is 10.
make sure the script runs locally
$ chmod +x run_sleep.sh
$ ./run_sleep.sh 1
Thu Oct 31 09:40:18 CET 2024: mentat006.dccn.nl falls asleep ...
Thu Oct 31 09:40:19 CET 2024: mentat006.dccn.nl wakes up.
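To illustrate the note on positional arguments, here is a minimal sketch that prints the number of arguments and the first argument of a function call; arguments passed to a bash function behave like arguments passed to a script. The function name show_args and the 10-second fallback are hypothetical and not part of the exercise.

```shell
#!/bin/bash
# show_args is a hypothetical helper used only for illustration.
show_args() {
    # $# holds the number of arguments, $1 the first one
    echo "number of arguments: $#"
    echo "first argument: ${1:-<none>}"
    # ${1:-10} falls back to 10 when no first argument is given (an assumed default)
    local duration="${1:-10}"
    echo "would sleep for ${duration} seconds"
}

show_args 60
show_args
```

A fallback of this kind could also be used on the sleep line of a script, so that it still runs when called without an argument.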
submit a job to run the script
$ sbatch $PWD/run_sleep.sh 60
Submitted batch job 46288492
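The job ID printed by sbatch is needed in the later steps, so it can be handy to capture it in a variable. The sketch below extracts the ID from a line in the format shown above; the function name job_id_from_sbatch is hypothetical. As an alternative, sbatch --parsable prints the job ID without the surrounding text.

```shell
#!/bin/bash
# job_id_from_sbatch is a hypothetical helper, not part of the Slurm tools.
# It reads sbatch output on stdin and prints the fourth word of the
# "Submitted batch job <ID>" line.
job_id_from_sbatch() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Demonstration with the message from the example above:
echo "Submitted batch job 46288492" | job_id_from_sbatch

# On the cluster this could be used as:
#   jobid=$(sbatch "$PWD/run_sleep.sh" 60 | job_id_from_sbatch)
```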
check the job status. For example,
$ squeue --me
   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
46288492     batch run_slee   lenobl  R       0:27      1 dccn-c061
Note
The squeue command lists the interactive and batch jobs that are currently queued or running on the cluster. The --me flag restricts the list to the jobs that you submitted.
or monitor it until it is complete
$ watch squeue --me
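For unattended use, the same monitoring can be sketched as a polling loop that returns once the job no longer appears in the queue. The function name wait_for_job and the default interval of 10 seconds are assumptions made for this illustration; the loop relies only on the -h (no header) and -j (job ID) options of squeue.

```shell
#!/bin/bash
# wait_for_job is a hypothetical helper: it polls squeue until the given
# job ID is no longer listed.
wait_for_job() {
    local jobid=$1
    local interval=${2:-10}   # seconds between polls (assumed default)
    # squeue -h suppresses the header and -j restricts the listing to one job;
    # errors (e.g. an unknown job ID) are discarded and treated as "not listed"
    while [ -n "$(squeue -h -j "$jobid" 2>/dev/null)" ]; do
        sleep "$interval"
    done
    echo "job $jobid is no longer in the queue"
}

# Usage on the cluster, with the job ID from the example above:
#   wait_for_job 46288492 5
```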
Tip
The watch command is used here to repeat the squeue command every 2 seconds. Press Control-c to quit the watch program when the job is finished.
examine the output file, e.g. slurm-46288492.out, and find out the resource consumption of this job. The job ID should be replaced accordingly.
$ grep -E 'Job ID|Job Exit Code|Used resources' slurm-46288492.out
Job ID: 46288492
Job Exit Code: 0:0
Used resources: cputime=00:01:00,walltime=00:01:00,memory=0
or retrieve information from the Slurm job accounting database
$ sacct -j 46288492
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
46288492     run_sleep+      batch       mhng          1  COMPLETED      0:0
46288492.ba+      batch                  mhng          1  COMPLETED      0:0
46288492.ex+     extern                  mhng          1  COMPLETED      0:0
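When the accounting information is to be processed by a script rather than read by eye, sacct can produce delimiter-separated output. The sketch below assumes the output of sacct -X -n -P -o JobID,State,ExitCode, i.e. one line per job allocation with fields separated by "|"; the function name summarize_job and the sample line are made up for this illustration.

```shell
#!/bin/bash
# summarize_job is a hypothetical helper: it reads "JobID|State|ExitCode"
# lines on stdin and prints one summary line per job.
summarize_job() {
    while IFS='|' read -r id state code; do
        echo "job $id: state=$state exit=$code"
    done
}

# Demonstration with a sample line in the assumed format:
printf '46288492|COMPLETED|0:0\n' | summarize_job

# On the cluster this could be used as:
#   sacct -j 46288492 -X -n -P -o JobID,State,ExitCode | summarize_job
```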
submit another job to run the script, with a longer duration of sleep. For example,
$ sbatch $PWD/run_sleep.sh 3600
Submitted batch job 46288593
Note
Compare the command in step 3 with the job parameters in the run_sleep.sh script. As we expect this job to run longer, the walltime requirement in the script (the #SBATCH --time line) must be extended to 1 hour and 10 minutes to account for this.
Ok, we don’t want to wait for the 1-hour job to finish. Let’s cancel the job. For example,
$ scancel 46288593
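The walltime requirement discussed in the note above is written in the HH:MM:SS form used on the #SBATCH --time line. As a minimal sketch, the conversion from a duration in seconds (the sleep time plus a safety margin) could look as follows; the function name seconds_to_walltime is hypothetical.

```shell
#!/bin/bash
# seconds_to_walltime is a hypothetical helper: it converts a number of
# seconds into the HH:MM:SS form used for the walltime request.
seconds_to_walltime() {
    local total=$1
    printf '%02d:%02d:%02d\n' \
        $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
}

# 3600 seconds of sleep plus a 600-second margin
seconds_to_walltime $(( 3600 + 600 ))   # prints 01:10:00

# the 90 seconds requested in the original script
seconds_to_walltime 90                  # prints 00:01:30
```

The resulting value could be placed on the #SBATCH --time line of the script, or passed on the command line as sbatch --time=01:10:00, which takes precedence over the value in the script.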