Distributed data analysis with a bash script
Note
If you are missing the context of this exercise, please refer to Exercise: distributed data analysis.
Preparation
Use the commands below to download the exercise package and check its content.
$ cd
$ wget https://github.com/Donders-Institute/hpc-wiki-v2/raw/master/docs/cluster_howto/exercise_da/hpc_exercise_slurm.tgz
$ tar xvzf hpc_exercise_slurm.tgz
$ cd hpc_exercise_slurm
$ ls
subject_0 subject_1 subject_2 subject_3 subject_4 subject_5 ...
In the package, there are folders for subject data (i.e. subject_{0..5}). In each subject folder, there is a data file containing an encrypted string (a URL) pointing to the subject's photo on the Internet.
In this fake analysis, we are going to find out who our subjects are, using a trivial "analysis algorithm" that performs the following two steps in each subject folder:
decrypting the URL string, and
downloading the subject’s photo.
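Concretely, for subject_0 these two steps amount to something like the commands below (a minimal sketch; the decryption password dccn_hpc_tutorial is taken from the solution script at the end, and the .jpg suffix is an assumption, as the real script derives the suffix from the URL itself):
$ url=$( openssl enc -aes-256-cbc -d -in subject_0/data -pbkdf2 -k dccn_hpc_tutorial )
$ wget --no-check-certificate "$url" -O subject_0/photo.jpg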
Tasks
Before you start, get into the directory hpc_exercise_slurm and run
$ ./clean.sh
to remove previously produced results.
(optional) Read the script run_analysis.sh and try to get an idea of how to use it. Don't spend too much time understanding every detail.
Tip
The script consists of a BASH function analyze_subject_data encapsulating the data-analysis algorithm. The function takes one input argument: the subject id. In the main program (the last line), the function is called with the input $1. In BASH, the variable $1 refers to the first argument of a shell command, as illustrated below.
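For example, a hypothetical two-line script show_arg.sh (not part of the exercise package) simply echoes its first argument:
#!/bin/bash
## print the first command-line argument passed to this script
echo "the first argument is: $1"
Running it prints the argument back:
$ ./show_arg.sh 0
the first argument is: 0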
Run the analysis interactively on the dataset of subject_0:
$ ./run_analysis.sh 0
The command doesn’t return any output to the terminal. If it is successfully executed, you should see a photo in the folder
subject_0
.Tip
The script
run_analysis.sh
is writen to take one argument as the subject id. Thus the command above will perform the data analysis algorithm on the dataset ofsubject_0
interactively.run the analysis by submitting 5 parallel jobs; each runs on a dataset.
Tip
The command seq 1 N is useful for generating a list of integers between 1 and N. You could also use the brace expansion {1..N} as an alternative; both are demonstrated below.
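For example, both of the following produce the integers 1 to 5 (seq prints one integer per line, the brace expansion on a single line):
$ seq 1 5
1
2
3
4
5
$ echo {1..5}
1 2 3 4 5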
Wait until the jobs finish and check out who our subjects are. You should see a file photo.* in each subject's folder.
Solution
A complete version of the run_analysis.sh script:
#!/bin/bash

## This function mimics data analysis on a subject dataset.
## - It takes the subject id as argument.
## - It decrypts the data file containing an encrypted URL to the subject's photo.
## - It downloads the photo of the subject.
##
## To call this function, use
##
##     analyze_subject_data <the_subject_id>
function analyze_subject_data {

  ## get subject id from the argument of the function
  id=$1

  ## determine the root directory of the subject folders
  if [ -z "$SUBJECT_DIR_ROOT" ]; then
    SUBJECT_DIR_ROOT=$PWD
  fi

  subject_data="${SUBJECT_DIR_ROOT}/subject_${id}/data"

  ## data decryption password
  decrypt_passwd="dccn_hpc_tutorial"

  if [ -f "$subject_data" ]; then

    ## decrypt the data and get the URL to the subject's photo
    url=$( openssl enc -aes-256-cbc -d -in "$subject_data" -pbkdf2 -k $decrypt_passwd )

    if [ $? -eq 0 ]; then

      ## get the file suffix of the photo file
      ext=$( echo "$url" | awk -F '.' '{print $NF}' )

      ## download the subject's photo
      wget --no-check-certificate "$url" -o "${SUBJECT_DIR_ROOT}/subject_${id}/log" -O "${SUBJECT_DIR_ROOT}/subject_${id}/photo.${ext}"

      return 0

    else
      echo "cannot resolve subject data url: $subject_data"
      return 1
    fi

  else
    echo "data file not found: $subject_data"
    return 2
  fi
}

## The main program starts here
## - make this script take the subject id as its first command-line argument
## - call the data-analysis function above with the subject id as the argument

analyze_subject_data $1
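Note that the function falls back to the current working directory when the environment variable SUBJECT_DIR_ROOT is not set. You can therefore also run the script from another location by setting the variable explicitly, e.g. (assuming the package was unpacked in your home directory):
$ SUBJECT_DIR_ROOT=$HOME/hpc_exercise_slurm ./run_analysis.sh 2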
Submit jobs to the Slurm cluster:
$ for id in $( seq 1 5 ); do sbatch --job-name=subj_${id} --time=10:00 --mem=4gb $PWD/run_analysis.sh $id; done
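You can monitor the submitted jobs until they finish, for example with:
$ squeue -u $USER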