Exercise: distributed data analysis with R
Warning
This page requires an update for the Slurm cluster.
In this exercise, you will learn how to submit R jobs to the cluster using Rscript, the scripting front-end of R.
This exercise is divided into two tasks. The first task gets you familiar with the flow of running R scripts as batch jobs in the HPC cluster. The second is about bookkeeping the outputs (R data files) produced by R jobs running concurrently in the cluster.
Note
In this exercise, we will use commands in R and in the Linux shell. A command that starts with the prompt $ is a Linux shell command; a command that starts with > is to be typed in an R console.
Preparation
Follow the steps below to download the prepared R scripts.
$ mkdir R_exercise
$ cd R_exercise
$ wget https://github.com/Donders-Institute/hpc-wiki-v2/raw/master/docs/cluster_howto/exercise_R/R_exercise.tgz
$ tar xvzf R_exercise.tgz
$ ls
magic_cal_2.R magic_cal_3.R magic_cal.R
Load the environment for R version 3.2.2.
$ module unload R
$ module load R/3.2.2
$ which R
/opt/R/3.2.2/bin/R
Task 1: simple job
In this task, we use the script magic_cal.R. This script uses the magic library to generate a magic matrix of a given dimension, and calculates the sum of its diagonal elements. The matrix and the sum are both printed to the standard output.
run the script interactively, for a matrix of dimension 5
$ export R_LIBS=/opt/R/packages
$ Rscript magic_cal.R 5
WARNING: ignoring environment value of R_HOME
Loading required package: abind
     [,1] [,2] [,3] [,4] [,5]
[1,]    9    2   25   18   11
[2,]    3   21   19   12   10
[3,]   22   20   13    6    4
[4,]   16   14    7    5   23
[5,]   15    8    1   24   17
[1] 65
read and understand the magic_cal.R script
submit the script to the cluster as a batch job
$ echo "Rscript $PWD/magic_cal.R 5" | qsub -N "magic_cal" -l walltime=00:10:00,mem=256mb
11082769.dccn-l029.dccn.nl
wait for the job to finish, and check the output of the job. Do you get the same results as when running interactively?
submit five batch jobs that run magic_cal.R in parallel, with matrices of dimension 5, 6, 7, 8 and 9
$ for d in {5..9}; do echo "Rscript $PWD/magic_cal.R $d" | qsub -N "magic_cal_$d" -l walltime=00:10:00,mem=256mb; done
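Before submitting several jobs in a loop, it can help to preview the commands the loop would run. The sketch below only prints the submission commands and does not contact the scheduler. The helper name print_submit_cmd is hypothetical and not part of the exercise files. The "torque" form mirrors the qsub command used on this page; the "slurm" form is an assumed counterpart using sbatch with --wrap, and should be checked against the local Slurm setup before use.

```shell
#!/bin/bash
# Dry-run sketch: print the submission command for each matrix dimension
# instead of submitting it. Nothing here contacts the scheduler.

# print_submit_cmd is a hypothetical helper, not part of the exercise files.
# $1: matrix dimension
# $2: scheduler, "torque" (as on this page, default) or "slurm" (assumed form)
print_submit_cmd() {
    local dim="$1"
    local scheduler="${2:-torque}"
    local payload="Rscript $PWD/magic_cal.R $dim"
    case "$scheduler" in
        torque)
            # Same form as the qsub command used in this exercise.
            printf 'echo "%s" | qsub -N "magic_cal_%s" -l walltime=00:10:00,mem=256mb\n' "$payload" "$dim"
            ;;
        slurm)
            # Assumed Slurm counterpart; verify the options on your cluster.
            printf 'sbatch --job-name="magic_cal_%s" --time=00:10:00 --mem=256M --wrap="%s"\n' "$dim" "$payload"
            ;;
        *)
            echo "unknown scheduler: $scheduler" >&2
            return 1
            ;;
    esac
}

# Preview the five commands for dimensions 5 to 9.
# seq is used so the loop also works in shells without brace expansion.
for d in $(seq 5 9); do
    print_submit_cmd "$d" torque
done
```

Once the printed commands look right, each line can be pasted into the shell, or the echo-to-qsub pipeline can be run directly as in the loop above.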
Task 2: job bookkeeping and saving output objects
In the previous task, data objects were just printed to the standard output, and consequently captured as text in the output files of the jobs. Data stored in this way can hardly be reused in subsequent analyses. A better approach is to store the objects in an R data file (i.e. an RData file), using the save function of R.
Given that batch jobs in the cluster may be executed at the same time, writing objects from different jobs into the same file is not recommended, as concurrent writes may result in corrupted output. A better approach is to write the output of each job to a separate file. This implies that running batch jobs in parallel requires an additional bookkeeping strategy for the jobs as well as for the output files they produce.
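One possible bookkeeping strategy is to put a job-specific tag in each output file name. The sketch below illustrates the idea in the shell; it is not how the exercise scripts do it. The helpers job_tag and output_name are hypothetical. The sketch assumes the scheduler exports the job ID in the environment as PBS_JOBID (Torque/PBS) or SLURM_JOB_ID (Slurm), and falls back to the shell's process ID when neither is set, for example in an interactive run.

```shell
#!/bin/bash
# Sketch: derive a per-job output file name so that concurrent jobs
# never write to the same RData file.

# job_tag and output_name are hypothetical helpers, not part of the
# exercise files.

# job_tag prints the scheduler job ID if one is set in the environment
# (assumed: PBS_JOBID under Torque/PBS, SLURM_JOB_ID under Slurm),
# otherwise "pid" followed by the process ID of the shell.
job_tag() {
    if [ -n "${PBS_JOBID:-}" ]; then
        # Keep only the part before the first dot, e.g. 11082769.
        echo "${PBS_JOBID%%.*}"
    elif [ -n "${SLURM_JOB_ID:-}" ]; then
        echo "$SLURM_JOB_ID"
    else
        echo "pid$$"
    fi
}

# output_name <prefix>: print "<prefix>.<tag>.RData".
output_name() {
    echo "$1.$(job_tag).RData"
}

# Example: print the file name this shell or job would use.
output_name magic_cal_2
```

Because the job ID is unique per job, two jobs submitted from the same directory get different file names, and the ID in the name links each file back to the job that produced it.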
In this exercise, we are going to use the script magic_cal_2.R, in which functions are provided to
save objects into a data file, and
get job/process information that can be used for bookkeeping purposes.
Follow the steps below:
run the script interactively
$ Rscript magic_cal_2.R 5
WARNING: ignoring environment value of R_HOME
Loading required package: abind
saving objects magic_matrix,sum_diagonal to magic_cal_2.out.RData ...done
From the terminal output, you can see that two objects are saved into an RData file called magic_cal_2.out.RData. Later on, you can load the objects from this file into R or an R script. For example,
> load("magic_cal_2.out.RData")
> ls()
[1] "magic_matrix" "sum_diagonal"
> magic_matrix
     [,1] [,2] [,3] [,4] [,5]
[1,]    9    2   25   18   11
[2,]    3   21   19   12   10
[3,]   22   20   13    6    4
[4,]   16   14    7    5   23
[5,]   15    8    1   24   17
> q(save="no")
read and understand the magic_cal_2.R script, especially the functions at the top of the script.
try to run magic_cal_2.R as batch jobs, as we did in the previous task.
Tip
You probably noticed that the functions defined in magic_cal_2.R are so generic that they can be reused in different scripts. That is right! In fact, we have factored out those functions into /opt/cluster/share/R so that you can easily make use of them in the future.
The script magic_cal_3.R shows you how to load those functions in your R scripts. It also shows you how to construct the name of the RData file using the job information.