Distributed data analysis with R

Note

If you are missing the context of this exercise, please refer to Exercise: distributed data analysis.

Preparation

Use the commands below to download the exercise package and check its content.

$ cd
$ wget https://github.com/Donders-Institute/hpc-wiki-v2/raw/master/docs/cluster_howto/exercise_da/torque_exercise.tgz
$ tar xvzf torque_exercise.tgz
$ cd torque_exercise
$ ls
subject_0  subject_1  subject_2  subject_3  subject_4  subject_5 ...

In the package, there are folders for subject data (i.e. subject_{0..5}). In each subject folder, there is a data file containing an encrypted string (a URL) pointing to the subject’s photo on the Internet.

In this fake analysis, we are going to find out who our subjects are, using a trivial “analysis algorithm” that performs the following two steps in each subject folder:

  1. decrypting the URL string, and

  2. downloading the subject’s photo.
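The two steps above can be sketched in shell with a toy example. Note that the decryption method (base64 decoding), the folder name, and the URL below are all assumptions made purely for illustration; the real logic lives in run_analysis.R:

```shell
# Toy illustration of the two-step algorithm. The "encryption" is
# assumed here to be base64 encoding, and the folder name and URL
# are made up; the actual method is implemented in run_analysis.R.
mkdir -p subject_demo
printf 'https://example.org/subject.jpg' | base64 > subject_demo/data

# step 1: decrypt the URL string
url=$(base64 -d subject_demo/data)
echo "$url"    # -> https://example.org/subject.jpg

# step 2: download the subject's photo (commented out: needs network)
# wget -O subject_demo/photo.jpg "$url"
```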

Tasks

  1. Before you start, change into the torque_exercise directory and run

    $ ./clean.sh
    

    to remove previously produced results.

  2. (optional) Read the script run_analysis.R and try to get an idea of how it works. Don’t spend too much time understanding every detail.

    Tip

    The script consists of an R function analyze_subject_data that encapsulates the data-analysis algorithm. The function takes one input argument: the subject id.

  3. Make a test run of run_analysis.R on subject 0:

    $ module load R
    $ Rscript ./run_analysis.R 0
    

    On success, you should see the output file created in the subject_0 directory:

    $ ls -l subject_0/photo.jpg
    

    Run the clean script to remove the output of the test:

    $ ./clean.sh
    
  4. Let’s test again on another subject, this time with a Torque job.

    $ echo "Rscript $PWD/run_analysis.R 3" | qsub -l walltime=10:00,mem=1gb -N subject_3
    

    Note

    Think a bit about the construction of the shell command above:

    • what is the idea behind the command-line pipe (|)?

    • why is $PWD/ prepended to the script path?

    Wait for the job to finish, and check whether the output appears in the subject_3 directory.

    $ ls -l subject_3/photo.png
    

    Run the clean script again before we perform the analysis on all subjects:

    $ ./clean.sh
    
  5. Run the analysis on all subjects in parallel

    $ for id in {0..5}; do echo "Rscript $PWD/run_analysis.R $id" | qsub -l walltime=10:00,mem=1gb -N subject_$id; done
    

    and check whether you get the outputs (photos) of all six subjects.
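Once the jobs have finished, a quick loop can show what each subject folder produced. The photo.* glob below is an assumption: it covers the case that the downloaded photo's file extension differs per subject (e.g. .jpg vs .png):

```shell
# List the downloaded photo (if any) in each subject folder; a
# missing file means that subject's job has not finished or failed.
for id in {0..5}; do
    ls -l subject_$id/photo.* 2>/dev/null || echo "subject_$id: no photo yet"
done
```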