Exercise: simple batch job

The aim of this exercise is to get you familiar with the Slurm client tools for submitting and managing cluster jobs. We will first create a script that calls the sleep command for a given period of time, and then submit that script to the cluster as a job.

Tasks

Note

DO NOT just copy-n-paste the commands for the hands-on exercises!! Typing (and eventually making typos) is an essential part of the learning process.

  1. make a script called run_sleep.sh with the following content:

    #!/bin/bash
    #SBATCH --job-name=sleep_1m
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --time=00:01:30
    #SBATCH --mem=10MB
    
    my_host=$( /bin/hostname )
    
    time=$( date )
    echo "$time: $my_host falls asleep ..."
    
    sleep $1
    
    time=$( date )
    echo "$time: $my_host wakes up."
    

    Note

    An input argument of a bash script is accessible via the variable $n, where n is an integer referring to the n-th argument given to the script. In the script above, $1 on the line sleep $1 refers to the first argument given to the script. For instance, if you run the script as run_sleep.sh 10, the value of $1 is 10.
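
As a minimal sketch of how positional arguments behave, the snippet below uses a hypothetical helper function, sleep_duration (not part of run_sleep.sh). It reports how many arguments it received and falls back to an assumed default of 60 when no first argument is given:

```shell
# sleep_duration: hypothetical helper that shows how $# and $1 are filled in.
# ${1:-60} expands to the first argument, or to 60 if it is missing or empty.
sleep_duration() {
    echo "args=$# duration=${1:-60}"
}

sleep_duration 10    # prints: args=1 duration=10
sleep_duration       # prints: args=0 duration=60
```

Inside a function, $1 and $# refer to the function's own arguments, in the same way that they refer to the script's arguments at the top level of a script.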

  2. make sure the script runs locally

    $ chmod +x run_sleep.sh
    $ ./run_sleep.sh 1
    Thu Oct 31 09:40:18 CET 2024: mentat006.dccn.nl falls asleep ...
    Thu Oct 31 09:40:19 CET 2024: mentat006.dccn.nl wakes up.
    
  3. submit a job to run the script

    $ sbatch $PWD/run_sleep.sh 60
    Submitted batch job 46288492
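
When later steps need the job ID, it can be captured instead of retyped. The following is a minimal sketch, assuming the confirmation line has the form shown above. extract_job_id is a hypothetical helper name, and the submission is only attempted where sbatch and run_sleep.sh are actually available:

```shell
# extract_job_id: hypothetical helper that takes sbatch's confirmation line
# ("Submitted batch job <id>") and prints only the last word, i.e. the job ID.
extract_job_id() {
    local line=$1
    # Remove everything up to and including the last space.
    echo "${line##* }"
}

# Only try a real submission if the Slurm client tools and the script exist.
if command -v sbatch >/dev/null 2>&1 && [ -f "$PWD/run_sleep.sh" ]; then
    job_id=$(extract_job_id "$(sbatch "$PWD/run_sleep.sh" 60)")
    echo "Job ID: $job_id"
fi
```

Alternatively, sbatch --parsable prints the job ID without the surrounding text (followed by a semicolon and the cluster name on multi-cluster setups).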
    
  4. check the job status. For example,

    $ squeue --me
    JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    46288492     batch run_slee   lenobl  R       0:27      1 dccn-c061
    

    Note

    The squeue command lists the jobs, both interactive and batch, that are currently pending or running on the cluster. The --me flag restricts the list to the jobs that you submitted.
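
To pick a single value out of the squeue listing, its columns can be filtered with awk. This is a sketch under the assumption that the default column layout shown above is in use (state code in the fifth column) and that the job name contains no spaces. job_state is a hypothetical helper name:

```shell
# job_state: hypothetical helper that reads squeue output on stdin and prints
# the ST column (5th field) of the line whose JOBID (1st field) matches $1.
job_state() {
    awk -v id="$1" '$1 == id { print $5 }'
}

# Possible use on the cluster (requires Slurm):
#   squeue --me | job_state 46288492
# A layout-independent alternative is to ask squeue for the state only:
#   squeue -h -j 46288492 -o %T
```

Common state codes are PD (pending), R (running) and CG (completing).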

  5. or monitor it until it is complete

    $ watch squeue --me
    

    Tip

    The watch command is used here to repeat the squeue command every 2 seconds. Press Control-c to quit the watch program when the job is finished.
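
In a script, where watch is not practical, the same idea can be written as a polling loop. A minimal sketch, assuming that a job which has left the queue produces no output from squeue -h -j. wait_for_job and its default interval of 10 seconds are hypothetical choices:

```shell
# wait_for_job: hypothetical helper that blocks until the given job ID no
# longer appears in squeue, checking every <interval> seconds (default 10).
wait_for_job() {
    local job_id=$1 interval=${2:-10}
    # -h drops the header line, so any remaining output means the job is
    # still pending or running. Errors for unknown job IDs are discarded.
    while [ -n "$(squeue -h -j "$job_id" 2>/dev/null)" ]; do
        sleep "$interval"
    done
    echo "Job $job_id is no longer in the queue."
}

# Possible use on the cluster (requires Slurm):
#   wait_for_job 46288492 5
```

Keep the polling interval modest; very frequent squeue calls put unnecessary load on the scheduler.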

  6. examine the output file, e.g. slurm-46288492.out, to find out the resource consumption of this job. Replace the job ID in the file name with that of your own job.

    $ grep -E 'Job ID|Job Exit Code|Used resources' slurm-46288492.out
     Job ID:          46288492
     Job Exit Code:   0:0
     Used resources:  cputime=00:01:00,walltime=00:01:00,memory=0
    
  7. or retrieve information from the Slurm job accounting database

    $ sacct -j 46288492
    JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
    ------------ ---------- ---------- ---------- ---------- ---------- --------
    46288492     run_sleep+      batch       mhng          1  COMPLETED      0:0
    46288492.ba+      batch                  mhng          1  COMPLETED      0:0
    46288492.ex+     extern                  mhng          1  COMPLETED      0:0
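
The default sacct table truncates long values (note the trailing + signs). As a sketch of a more script-friendly query, the hypothetical helper job_summary below asks for one unpadded, pipe-delimited line per job. The selection of fields is only an example:

```shell
# job_summary: hypothetical helper that prints one accounting line for a job.
job_summary() {
    # -X: show the job allocation only (skip the .batch and .extern steps)
    # -n: no header line
    # -P: "parsable2" output, fields separated by "|" and not truncated
    sacct -j "$1" -X -n -P --format=JobID,JobName,Elapsed,State,ExitCode
}

# Possible use on the cluster (requires Slurm):
#   job_summary 46288492
```
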
  8. submit another job to run the script, this time with a longer sleep duration. For example,

    $ sbatch $PWD/run_sleep.sh 3600
    Submitted batch job 46288593
    

    Note

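    Compare the command in step 3 with the job parameters in the run_sleep.sh script. Because this job is expected to run for one hour, the walltime requested in the script (the --time parameter) must be extended, for example to 1 hour and 10 minutes (--time=01:10:00). Otherwise the job would be terminated once the current limit of 1 minute and 30 seconds is reached.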
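
The walltime arithmetic can be sketched as follows. walltime_for is a hypothetical helper that adds a safety margin (assumed here to be 600 seconds by default) to the sleep duration and formats the sum as HH:MM:SS, the format used by --time:

```shell
# walltime_for: hypothetical helper that prints (duration + margin) seconds
# as HH:MM:SS. The margin defaults to 600 seconds (an assumed value).
walltime_for() {
    local total=$(( $1 + ${2:-600} ))
    printf '%02d:%02d:%02d\n' $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
}

walltime_for 3600 600    # prints 01:10:00
walltime_for 60 30       # prints 00:01:30, the limit set in run_sleep.sh

# Possible use on the cluster (requires Slurm): pass the limit at submission.
#   sbatch --time="$(walltime_for 3600)" "$PWD/run_sleep.sh" 3600
```

Options given on the sbatch command line take precedence over the #SBATCH lines in the script, so the last line would be an alternative to editing run_sleep.sh.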

  9. OK, we don’t want to wait for the 1-hour job to finish. Let’s cancel the job. For example,

    $ scancel 46288593
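
Besides a job ID, scancel also accepts filters. As a sketch, the hypothetical helper cancel_pending_by_name below cancels only your own jobs that have a given job name and are still pending. It assumes that the USER environment variable holds your cluster user name:

```shell
# cancel_pending_by_name: hypothetical helper that cancels my jobs with the
# given job name, but only those that have not started running yet.
cancel_pending_by_name() {
    scancel --user="$USER" --name="$1" --state=PENDING
}

# Possible use on the cluster (requires Slurm), with the job name that is
# set by "#SBATCH --job-name" in run_sleep.sh:
#   cancel_pending_by_name sleep_1m
```

Filters like these act on every matching job, so check the list with squeue --me before cancelling.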