Genetic Cluster Computer

Tutorial

Getting started

The GCC is a separate part of the LISA cluster at SURFsara, the Dutch national computing and networking services, and works in the same way as the rest of LISA. For a short introduction to working with the batch queuing system (BQS) on LISA, see below; for more detailed instructions, check the LISA user info.

  • How do I obtain a login-name?
    Apply for a login-name with Danielle Posthuma by filling out the online form. Your application will be evaluated briefly, after which you will receive an e-mail with your login details. Please note that the GCC is meant to be used only by researchers working in the area of genetics.
    You will log in to the interactive node, where you can upload files and scripts and try out short runs. From there you submit jobs to the job scheduler, which distributes your scripts to the compute nodes.
  • Do I have to pay?
    At the moment the GCC is still open to all researchers in the field of genetics, and we try to keep it that way. However, in times of heavy use we may have to impose some limitations on users from institutions that do not financially contribute to the GCC (i.e. non-VU users, non-PGC members, and groups not collaborating with the CTG-VU group). If you want to ensure processing time and your group or institute is willing to provide a financial contribution, please contact Danielle Posthuma.
  • How do I login to the system?
    You will receive login details by e-mail. Use an SSH client (e.g. PuTTY) to connect to lisa.surfsara.nl. For file transfer, use a secure FTP client (e.g. WinSCP or FileZilla).
  • If you have very little prior experience with Linux/UNIX systems or cluster computing, this PowerPoint presentation shows how to use the cluster step by step:
    HowTo_GCC.ppt (5.94 MB)
  • What software is installed?
    Currently installed software includes, among others:
    Genehunter • MeV • Mendel • Mx • R • SimWalk • Merlin • QTDT • Solar • Twinsim • Allegro • GASP • GASSOC • LOKI • Mapmaker • MEGA2 • PEDIG • SAGE • TRANSMIT • UNPHASED • Vitesse
    For more details regarding the software, see the software page.

    Before using a package you must first load its module; e.g. type module load vitesse if you want to use Vitesse (see the short example session after this list).
    Please direct requests for other software to Danielle Posthuma at danielle.posthuma@cncr.vu.nl. Software can also be installed locally.
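
For example, a first session could look like this (a minimal sketch; your login name and the exact module names available on LISA may differ):

> ssh username@lisa.surfsara.nl
> module avail            # list all installed software modules
> module load vitesse     # make vitesse available in this session
> module list             # show which modules are currently loaded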

Do I need extensive knowledge of UNIX or programming languages?

No, fortunately not. Extensive knowledge of UNIX/Linux commands is not needed, and lacking it should not keep you from using super computing power. A short manual of UNIX commands can be found, for example, at this website. The SURFsara helpdesk is also (almost) always available by e-mail at hic@surfsara.nl.
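
The handful of commands below covers most day-to-day work on the cluster (a general UNIX quick reference, not specific to the GCC):

> ls                 # list the files in the current directory
> cd myfiles         # change to the directory myfiles
> less results.out   # page through a file (press q to quit)
> cp job job2        # copy a file
> rm job2            # delete a file
> man qsub           # read the manual page of a command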

Submitting one serial analysis to the cluster

You need:

  1. Your normal script for the data analysis (for example an R script, or a shell script with commands for command-line software) + datafile
  2. A PBS submission job (a short shell script, here called job)

An example job looks as follows:

#PBS -lnodes=1
#PBS -lwalltime=1:00:00
cd $HOME/MyAnalysis || exit
module load plink
plink --bfile test --assoc

This asks the cluster to load the PLINK software and run plink on one node (#PBS -lnodes=1). You must also specify the maximum time the analysis is allowed to run, set to 1 hour in this example (#PBS -lwalltime=1:00:00), and the directory containing your script and data (the cd line).

Using the command

> qsub job
submits the job.

The command

> qstat -u [username]
will show the status of all your jobs, including their job IDs.

The command

> qdel [jobid]
deletes a job from the queue.

After the analysis has finished, the output appears in your working directory on the interactive node.
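
The batch system also captures everything the job printed to the screen. Assuming the default PBS naming scheme, two extra files appear in the directory from which you submitted, named after the job script plus the job ID (the ID below is only an example):

> ls
job  job.e123456  job.o123456   # job script plus its standard error and standard output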

Using multiple processors per node

Every node has multiple processors. If you ask for one node, you occupy all of its processors, even though most software on the GCC is not parallel. It is therefore more efficient to start multiple processes per node, as in the following example job:

#PBS -lnodes=1
#PBS -lwalltime=1:00:00
module load plink
cd $HOME/MyAnalysis || exit
plink --bfile test1 --assoc --out out1 &
plink --bfile test2 --assoc --out out2 &
wait

This asks the cluster to run two PLINK analyses (on the files test1 and test2) on one node, using two cores; the wait at the end keeps the job alive until both background analyses have finished. Most nodes have 8 or 12 cores, and you are strongly advised to use all of them, as in the sketch below.
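
If you have many input files, a shell loop saves typing. This sketch starts one PLINK run per core on an 8-core node; the file names test1 to test8 are assumptions:

#PBS -lnodes=1
#PBS -lwalltime=1:00:00
module load plink
cd $HOME/MyAnalysis || exit
for i in `seq 1 8` ; do
  plink --bfile test$i --assoc --out out$i &
done
wait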

Submitting, monitoring, and deleting this job again works with qsub job, qstat -u [username], and qdel [jobid], exactly as described above.

Running multiple serial analyses

The cluster is especially equipped to run many serial analyses, e.g. 100, 500, or 1000. You do not want to write 100, 500, or 1000 submission jobs by hand, so you need another short script that generates the submission jobs for you.

You need 3 documents:

  1. Your normal script for data analysis (i.e. an R script, shell script, Mx script, or command script for Vitesse, Merlin, Genehunter, etc.) + datafile(s) if needed. The script needs to contain some parameters that are changed for every analysis. Below are two examples that run many analyses on multiple nodes at the same time.
  2. A shell script that loops through your intended series of analyses.
  3. A submission-job generating shell-script.
Example 1: Running linkage simulations using Merlin

In this example we use

  • Datafiles in linkage/Merlin format, called mydata.ped, mylabels.dat, and markermap.map (not provided here)
  • A shell script that loops through the intended series of analyses, and includes the commands given to merlin, called tmpmerlin
  • A shell script that generates the necessary submission scripts, called
    jobmerlin

The tmpmerlin file contains:

#PBS -lnodes=1
#PBS -lwalltime=4:00:00
module load merlin
cd $HOME/simulations || exit
merlin -p mydata.ped -d mylabels.dat -m markermap.map --simulate -rSIMN1 > linkage_simSIMN1.out &
merlin -p mydata.ped -d mylabels.dat -m markermap.map --simulate -rSIMN2 > linkage_simSIMN2.out &
wait

The jobmerlin file contains:

#!/bin/bash
for p in `seq 1 500`
do
  (( q = p + 500 ))
  sed "s/SIMN1/${p}/g;s/SIMN2/${q}/g" tmpmerlin > runmerlin
  qsub runmerlin
done

On each pass through the loop this opens tmpmerlin, replaces SIMN1 by the first simulation number (1 to 500) and SIMN2 by the second (501 to 1000), saves the result as runmerlin, and submits runmerlin to the job scheduler before starting on the next pair. In total this runs 1000 simulations (500 jobs × 2 cores) in Merlin, using the simulation number as the random seed.
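
Once the jobs start finishing, a quick way to follow progress is to count the output files (a minimal sketch, assuming the output names used in tmpmerlin):

> cd $HOME/simulations
> ls linkage_sim*.out | wc -l   # number of simulations started or finished so far
> qstat -u [username]           # jobs still queued or running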

Example 2: Running simulations using R

In this example we use an R script called sjabloon.R; this file is not provided here, but it contains the words “xxnpermaxx” and “xxncaxx”, which are replaced by parameters 2 and 3 generated by the shell script. We need:

  • A shell script that loops through your intended series of analyses, called runloop. Example:

#!/bin/bash
for nrep in 5000 ; do 
  for nperma in 100 500 1000 ; do
     for nca in 100 200 500 1000 5000  ; do
       echo $nrep $nperma $nca
     done
  done
done |\
while read nrep1 nperma1 nca1 ; do
  read nrep2 nperma2 nca2
  ./subjob $nrep1 $nperma1 $nca1 $nrep2 $nperma2 $nca2
done
  • Submission-job generating script subjob. Example:
#!/bin/bash
# write the job header
cat <<eoj > tmpjob
#PBS -lnodes=1
#PBS -lwalltime=60:00:00
eoj
# append one analysis block for each parameter triple (nrep nperma nca)
while [ "$1" ] ; do
echo creating partial job for $1 $2 $3
cat <<eoj >> tmpjob
cd \$HOME/SIMS || exit
mkdir -p d$1.$2.$3
cd d$1.$2.$3 || exit
sed 's/xxnpermaxx/$2/;s/xxncaxx/$3/' ../sjabloon.R > final.$1.$2.$3
R CMD BATCH final.$1.$2.$3 &
eoj
shift 3
done
echo wait >> tmpjob
echo submitting job
qsub tmpjob

The command

> ./runloop &
generates a series of submission jobs, each of which runs up to two R analyses with a unique parameter combination, as specified in the shell script.
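
When all jobs have finished, each parameter combination has its own directory d<nrep>.<nperma>.<nca> under $HOME/SIMS containing the R output. A quick way to inspect the results (assuming the file names generated by subjob):

> cd $HOME/SIMS
> ls -d d*       # one directory per parameter combination
> ls d*/*.Rout   # the R output of every finished analysis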

Example session

Example session: running hundreds of analyses using the program R
We have

  • an R script (sjabloon.R) (ascii format)
  • runloop (ascii format)
  • subjob (ascii format)

The three files above can be made with Notepad or any other plain-text editor.

  1. Open FileZilla (or any other secure FTP program)
  2. Upload the three files, using ASCII transfer mode
  3. Open a secure shell (SSH) program (e.g. PuTTY)
  4. Log on to lisa.surfsara.nl
  5. Go to the directory where your files are (e.g. > cd myfiles)
  6. Make runloop and subjob executable by typing
    > chmod +x runloop
    > chmod +x subjob
  7. Call runloop by typing
    > ./runloop &
  8. Type
    > qstat -u [username]
    to check the status of your jobs

Your jobs will now be submitted. Wait until they are finished and find the output in your working directory.