Distributed data analysis with Python

Note

If you are missing the context of this exercise, please refer to Exercise: distributed data analysis.

Preparation

Use the commands below to download the exercise package and check its content.

$ cd
$ wget https://github.com/Donders-Institute/hpc-wiki-v2/raw/master/docs/cluster_howto/exercise_da/torque_exercise.tgz
$ tar xvzf torque_exercise.tgz
$ cd torque_exercise
$ ls
subject_0  subject_1  subject_2  subject_3  subject_4  subject_5 ...

In the package, there are folders for subject data (i.e. subject_{0..5}). Each subject folder contains a data file with an encrypted string (a URL) pointing to the subject’s photo on the Internet.

In this fake analysis, we are going to find out who our subjects are, using a trivial “analysis algorithm” that performs the following two steps in each subject folder:

  1. decrypting the URL string, and

  2. downloading the subject’s photo.

Tasks

  1. Before you start, change into the torque_exercise directory and run

    $ ./clean.sh
    

    to remove previously produced results.

  2. (optional) Read the script run_analysis.py and try to get an idea of how to use it. Don’t spend too much time on understanding every detail.

    Tip

    The script consists of a Python function analyze_subject_data encapsulating the data-analysis algorithm. The function takes one input argument, the subject id.
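
    For orientation only, below is a minimal sketch of what such a function could look like; the data file name (data) and the decryption helper (decrypt_url) are hypothetical stand-ins for this sketch, not the actual implementation in run_analysis.py.

    import requests

    def decrypt_url(encrypted):
        """Hypothetical stand-in for the script's own decryption of the URL string."""
        raise NotImplementedError('see run_analysis.py for the actual decryption')

    def analyze_subject_data(subject_id):
        """Analyze one subject: decrypt the URL and download the photo."""
        subject_dir = 'subject_%s' % subject_id

        # step 1: read the encrypted string from the data file and decrypt it
        # (the file name 'data' is an assumption made for this sketch)
        with open('%s/data' % subject_dir) as f:
            url = decrypt_url(f.read().strip())

        # step 2: download the subject's photo into the subject folder
        response = requests.get(url)
        response.raise_for_status()
        with open('%s/photo.jpg' % subject_dir, 'wb') as f:
            f.write(response.content)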

  3. The run_analysis.py script makes use of a Python library called requests, which needs to be installed. We use Anaconda and a conda environment to install the library in the home directory.

    Create a new conda environment with the commands below:

    $ module load anaconda3
    $ conda create --name exercise
    ...
    

    Note

    You might see a warning like the one below:

    ==> WARNING: A newer version of conda exists. <==
      current version: 4.9.0
      latest version: 4.12.0
    
    Please update conda by running
    
      $ conda update -n base -c defaults conda
    ...
    
    Proceed ([y]/n)?
    

    You can ignore it and simply proceed with y.

    Activate the conda environment, and install the library requests:

    $ source activate exercise
    
    [exercise] $ conda install requests
    ...
    

    Tip

    You will see that the bash prompt is prefixed with the conda environment name, indicating that you are currently in the conda environment. Only within the conda environment do you have access to the requests library we just installed.
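
    As an optional sanity check, you can verify that the requests library is importable from the environment’s Python with a short snippet like the one below:

    # quick check: run with the environment's python, inside the activated conda environment
    import requests

    print(requests.__version__)  # prints the installed version if the installation succeeded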

  4. At this point, you can test-run the run_analysis.py script on one subject. Let’s test the analysis on subject 0:

    [exercise] $ ./run_analysis.py 0
    

    You should see the output file subject_0/photo.jpg when the analysis is done.

  5. Let’s test again on another subject, this time with a Torque job.

    In the command below, we simply specify an arbitrary (but sufficient) resource requirement of 10 minutes of walltime and 1 GB of memory.

    [exercise] $ echo "$PWD/run_analysis.py 1" | qsub -l walltime=10:00,mem=1gb -N subject_1
    

    Note

    Think a bit about the construction of the shell command above:

    • what is the idea behind the command-line pipe (|)?

    • why prepend $PWD/ in front of the script?

    You should see the output file subject_1/photo.jpg when the analysis is done. This time, you also see the stdout/stderr files produced by the job.

  6. Run the clean-up script again before we start the analysis in parallel.

    [exercise] $ ./clean.sh
    
  7. To run the analysis on all 6 subjects in parallel, we use a bash for-loop:

    [exercise] $ for id in {0..5}; do echo "$PWD/run_analysis.py $id" | qsub -l walltime=10:00,mem=1gb -N subject_$id; done
    

    and check that you get the outputs (photos) of all 6 subjects.
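
    If you prefer Python over bash for the job submission, the for-loop above could also be written as a short script along the lines of the sketch below; it assumes that qsub is available on your PATH and mirrors the resource requests of the one-liner.

    import os
    import subprocess

    # submit one Torque job per subject, mirroring the bash for-loop above
    for subject_id in range(6):
        command = '%s/run_analysis.py %d\n' % (os.getcwd(), subject_id)
        subprocess.run(
            ['qsub', '-l', 'walltime=10:00,mem=1gb', '-N', 'subject_%d' % subject_id],
            input=command.encode(),  # the command reaches qsub via stdin, like the echo ... | qsub pipe
            check=True,
        )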