Distributed data analysis with a bash script

Note

If you are missing the context of this exercise, please refer to Exercise: distributed data analysis.

Preparation

Use the commands below to download the exercise package and check its content.

$ cd
$ wget https://github.com/Donders-Institute/hpc-wiki-v2/raw/master/docs/cluster_howto/exercise_da/torque_exercise.tgz
$ tar xvzf torque_exercise.tgz
$ cd torque_exercise
$ ls
subject_0  subject_1  subject_2  subject_3  subject_4  subject_5 ...

In the package, there are folders for subject data (i.e. subject_{0..5}). In each subject folder, there is a data file containing an encrypted string (URL) pointing to the subject’s photo on the Internet.
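
For example, listing one of the subject folders shows the encrypted data file:

$ ls subject_0
data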

In this fake analysis, we are going to find out who our subjects are, using a trivial “analysis algorithm” that performs the following two steps in each subject folder:

  1. decrypting the URL string, and

  2. downloading the subject’s photo.
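
For illustration, the two steps can also be done by hand on a single dataset. Below is a minimal sketch based on the openssl and wget calls in the solution script at the end of this page (the decryption password dccn_hpc_tutorial is taken from that script):

$ url=$( openssl enc -aes-256-cbc -d -in subject_0/data -k dccn_hpc_tutorial )
$ wget --no-check-certificate "$url" -O "subject_0/photo.${url##*.}"

Here ${url##*.} is a bash parameter expansion that strips everything up to the last dot, leaving the photo's file extension.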

Tasks

  1. before you start, change into the torque_exercise directory and run

    $ ./clean.sh
    

    to remove previously produced results.

  2. (optional) read the script run_analysis.sh and try to get an idea of how to use it. Don’t spend too much time on understanding every detail.

    Tip

    The script consists of a BASH function analyze_subject_data encapsulating the data-analysis algorithm. The function takes one input argument, the subject id. In the main program (the last line), the function is called with the input $1. In BASH, the variable $1 refers to the first argument given to a shell command.
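
    As a minimal sketch of this pattern (a hypothetical script, unrelated to the exercise data):

    #!/bin/bash
    ## a function taking the name to greet as its first argument
    function greet {
        name=$1
        echo "hello, $name"
    }

    ## pass the script's first argument on to the function
    greet $1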

  3. run the analysis interactively on the dataset of subject_0

    $ ./run_analysis.sh 0
    

    The command doesn’t return any output to the terminal. If it executes successfully, you should see a photo in the folder subject_0.

    Tip

    The script run_analysis.sh is written to take one argument, the subject id. Thus the command above performs the data-analysis algorithm interactively on the dataset of subject_0.
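
    For example, listing the folder after a successful run:

    $ ls subject_0
    data  log  photo.jpg

    The log file is written by wget; the photo's extension (jpg here is just an assumption) depends on the decrypted URL.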

  4. run the analysis by submitting 5 parallel jobs, each running on the dataset of one subject.

    Tip

    The command seq 1 N is useful for generating a list of integers between 1 and N. You could also use {1..N} as an alternative.
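
    For example, both loops below print the integers 1 to 5, one per line:

    $ for id in $( seq 1 5 ); do echo $id; done
    $ for id in {1..5}; do echo $id; done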

  5. wait until the jobs finish and check out who our subjects are. You should see a file photo.* in each subject’s folder.
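
    For example, to list all downloaded photos at once:

    $ ls subject_*/photo.*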

Solution

  1. a complete version of the run_analysis.sh script:

    #!/bin/bash

    ## This function mimics data analysis on a subject dataset.
    ##   - It takes the subject id as its argument.
    ##   - It decrypts the data file containing an encrypted URL to the subject's photo.
    ##   - It downloads the photo of the subject.
    ##
    ## To call this function, use
    ##
    ##   analyze_subject_data <the_subject_id>
    function analyze_subject_data {

        ## get the subject id from the argument of the function
        id=$1

        ## determine the root directory of the subject folders; in a Torque job,
        ## $PBS_O_WORKDIR points to the directory from which the job was submitted
        if [ -z "$SUBJECT_DIR_ROOT" ]; then
            if [ -z "$PBS_O_WORKDIR" ]; then
                SUBJECT_DIR_ROOT=$PWD
            else
                SUBJECT_DIR_ROOT=$PBS_O_WORKDIR
            fi
        fi

        subject_data="${SUBJECT_DIR_ROOT}/subject_${id}/data"

        ## data decryption password
        decrypt_passwd="dccn_hpc_tutorial"

        if [ -f "$subject_data" ]; then

            ## decrypt the data and get the URL to the subject's photo
            url=$( openssl enc -aes-256-cbc -d -in "$subject_data" -k "$decrypt_passwd" )

            if [ $? -eq 0 ]; then

                ## get the file suffix of the photo file
                ext=$( echo "$url" | awk -F '.' '{print $NF}' )

                ## download the subject's photo; wget's log goes to the subject folder
                wget --no-check-certificate "$url" -o "${SUBJECT_DIR_ROOT}/subject_${id}/log" -O "${SUBJECT_DIR_ROOT}/subject_${id}/photo.${ext}"

                return 0

            else
                echo "cannot decrypt subject data: $subject_data"
                return 1
            fi

        else
            echo "data file not found: $subject_data"
            return 2
        fi
    }

    ## The main program starts here
    ##  - the script takes the subject id as its first command-line argument
    ##  - it calls the data-analysis function above with the subject id as the argument

    analyze_subject_data "$1"
    
  2. submit jobs to the Torque cluster:

    $ for id in $( seq 1 5 ); do echo "$PWD/run_analysis.sh $id" | qsub -N "subject_$id" -l walltime=00:20:00,mem=1gb; done
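
    While the jobs are running, you can monitor their status with qstat, e.g.

    $ qstat -u $USER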