Distributed data analysis with R
Note
If you are missing the context of this exercise, please refer to Exercise: distributed data analysis.
Preparation
Use the commands below to download the exercise package and check its content.
$ cd
$ wget https://github.com/Donders-Institute/hpc-wiki-v2/raw/master/docs/cluster_howto/exercise_da/hpc_exercise_slurm.tgz
$ tar xvzf hpc_exercise_slurm.tgz
$ cd hpc_exercise_slurm
$ ls
subject_0 subject_1 subject_2 subject_3 subject_4 subject_5 ...
In the package, there are folders for subject data (i.e. subject_{0..5}). In each subject folder, there is a data file containing an encrypted string (a URL) pointing to the subject’s photo on the Internet.
In this fake analysis, we are going to find out who our subjects are, using a trivial “analysis algorithm” that performs the following two steps in each subject folder:
decrypting the URL string, and
downloading the subject’s photo.
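The two steps can be sketched in the shell as below. This is only an illustration: it assumes the “encryption” is plain base64 and uses a scratch folder named subject_demo and a made-up URL; the real scheme is implemented inside run_analysis.R and may differ.

```shell
# Sketch of the two analysis steps for one subject.
# ASSUMPTION: the encrypted string is base64-encoded; the actual scheme
# lives inside run_analysis.R and may be different.
mkdir -p subject_demo                                                # scratch folder for this sketch
printf 'https://example.org/photo.jpg' | base64 > subject_demo/data  # fake "encrypted" data file
url=$(base64 -d subject_demo/data)                                   # step 1: decrypt the URL string
echo "decrypted URL: $url"
# step 2 would then download the photo, e.g.:
#   wget -q -O subject_demo/photo.jpg "$url"
```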
Tasks
Before you start, change into the
hpc_exercise_slurm
directory and run
$ ./clean.sh
to remove previously produced results.
(optional) read the script
run_analysis.R
and try to get an idea of how to use it. Don’t spend too much time understanding every detail.
Tip
The script consists of an R function
analyze_subject_data
encapsulating the data-analysis algorithm. The function takes one input argument, the subject id.
Make a test run of
run_analysis.R
on subject 0
$ module load R
$ Rscript ./run_analysis.R 0
On success, you should see the output file created in the
subject_0
directory.
$ ls -l subject_0/photo.jpg
Run the clean script to remove the output of the test.
$ ./clean.sh
Let’s test again on another subject (i.e. subject 3) with a Slurm job.
$ sbatch --job-name=subject_3 --time=10:00 --mem=1gb --wrap="Rscript $PWD/run_analysis.R 3"
Note
Note that we use the
--wrap
option of the
sbatch
command here. It is needed because the executable
Rscript
is not really a (Bash or Python) script.
Wait until the job finishes, and check if you get the output in the
subject_3
directory.
$ ls -l subject_3/photo.png
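If you prefer not to use --wrap, the same submission can be done with a small batch script that wraps the Rscript call. A minimal sketch follows; the filename job_subject3.sh is arbitrary and not part of the exercise package.

```shell
# Equivalent to the --wrap form: put the Rscript call into a minimal
# Bash script and submit that script instead.
# NOTE: job_subject3.sh is a hypothetical name chosen for this sketch.
cat > job_subject3.sh <<'EOF'
#!/bin/bash
Rscript "$PWD/run_analysis.R" 3
EOF
chmod +x job_subject3.sh
# Submit it with the same resource options:
#   sbatch --job-name=subject_3 --time=10:00 --mem=1gb job_subject3.sh
```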
Run the clean script again before we perform the analysis on all subjects:
$ ./clean.sh
Run the analysis on all subjects in parallel:
$ for id in {0..5}; do sbatch --job-name=subject_$id --time=10:00 --mem=1gb --wrap="Rscript $PWD/run_analysis.R $id"; done
and check if you get the outputs (photos) of all 6 subjects.
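To quickly see how many of the six analyses have finished, you can count the produced photos. The sketch below matches photo.* to cover either file extension; jobs still running (or failed) simply do not show up in the count yet. You can also list your pending and running jobs with squeue -u $USER.

```shell
# Count how many subject folders already contain a downloaded photo.
n_done=$(ls subject_*/photo.* 2>/dev/null | wc -l)
echo "completed: $n_done / 6"
```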