Genetic Cluster Computer

Tutorial

Getting started

The GCC is a separate part of the LISA cluster at SURFsara, the Dutch national computing and networking services, and works in a similar way. For more information on working with the batch queuing system (BQS) on LISA, see the short introduction below or check the LISA user info for more detailed instructions.

  • How do I obtain a login name?
    Apply for a login name with Danielle Posthuma by filling out the online form. Your application will be briefly evaluated, after which you will receive an e-mail with your login details. Please note that the GCC is meant to be used only by researchers working in the area of genetics.
    You will log in to the interactive node, where you can upload files and scripts and try out short runs. From there you submit jobs to the job scheduler, which distributes your scripts to the compute nodes.
  • Do I have to pay?
    At the moment the GCC is still open to all researchers in the field of genetics, and we try to keep it that way. However, in times of heavy use we may have to impose limitations on users from institutions that do not financially contribute to the GCC (i.e. non-VU users, non-PGC members, and groups not collaborating with the CTG-VU group). If you want to ensure processing time and your group or institute is willing to provide a financial contribution, please contact Danielle Posthuma.
  • How do I login to the system?
    You will receive login details by e-mail. Use SSH2 (e.g. PuTTY) to connect to lisa.surfsara.nl. For file transfer, use secure FTP (e.g. WinSCP or FileZilla); see the example session after this list.
  • If you have very little prior experience with Linux/UNIX systems or cluster computing, here is a PowerPoint presentation that shows how to use the cluster in a step-wise manner:
    HowTo_GCC.ppt (5.94 MB)
  • What software is installed?
    Currently we have installed, among others:
    Genehunter • MeV • Mendel • Mx • R • SimWalk • Merlin • QTDT • Solar • Twinsim • Allegro • GASP • GASSOC • LOKI • Mapmaker • MEGA2 • PEDIG • SAGE • TRANSMIT • UNPHASED • Vitesse
    For more details regarding the software, see the software page.

    You need to type module load vitesse if you want to use e.g. Vitesse; see the example session after this list.
    Please direct requests for other software to Danielle Posthuma at: danielle.posthuma@cncr.vu.nl. Software can also be installed locally.
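
A minimal example session for logging in and uploading files from a Linux/Mac terminal (Windows users can use PuTTY and WinSCP as noted above; the username and file name below are placeholders):

ssh myuser@lisa.surfsara.nl                   # log in to the interactive node
scp mydata.tar.gz myuser@lisa.surfsara.nl:~/  # upload a file to your home directory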
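
Once logged in, software is made available through environment modules. A short sketch (vitesse is the only module name taken from this page; module avail and module list are standard commands of the module system):

module avail          # list all installed software modules
module load vitesse   # make Vitesse available in this session
module list           # show the modules currently loaded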

Do I need extensive knowledge of UNIX or programming languages?

No, fortunately not. Extensive knowledge of UNIX/Linux commands is not needed, and lack of it should not obstruct your access to supercomputing power. A short manual of UNIX commands can, for example, be found at this website. Also, the SURFsara helpdesk is (almost) always available by e-mail at hic@surfsara.nl.
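
To get you going, a handful of standard shell commands covers most day-to-day work on the cluster (the directory and file names below are placeholders):

ls                       # list the files in the current directory
cd MyAnalysis            # change into a directory
cp results.txt backup/   # copy a file
less output.log          # page through a file (press q to quit)
nano job                 # edit a file in a simple text editor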

Submitting one serial analysis to the cluster

You need:

  1. Your normal script for data analysis (for example an R script, or a shell script with commands for command-line software), plus the data file(s)
  2. A SLURM job script

An example job script looks as follows:

#!/bin/bash
#SBATCH -N 1                  # request one node
#SBATCH -t 1:00:00            # maximum runtime: one hour
cd $HOME/MyAnalysis || exit   # go to the analysis directory; abort if it does not exist
module load plink2
plink --bfile test --assoc

This will ask the cluster to load the plink software and run plink on one node (#SBATCH -N 1). You need to specify the maximum time the analysis will run, which is set at 1 hour in this example (#SBATCH -t 1:00:00), and the directory where your script and data are (using cd, as above).

Using the command

> sbatch job
submits the job.

The command

> squeue -u [username]
will show the status of all your jobs, including job IDs.

The command

> scancel [jobid]
deletes a job from the queue.

After analysis, the output appears in your working directory on the interactive node.
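
By default SLURM also writes the standard output and error of the job to a file named slurm-<jobid>.out in the directory you submitted from (the job ID below is a placeholder):

cat slurm-123456.out   # inspect the job's screen output after it has finished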

Using multiple processors per node

Every node has multiple processors. If you ask for one node, you occupy all of its processors, even though most software on the GCC is not parallel. It is therefore more efficient to start multiple processes per node, as in the following example job:

#!/bin/bash
#SBATCH -N 1
#SBATCH -t 1:00:00
module load plink2
cd $TMPDIR                                  # work in the node's local scratch space
cp $HOME/MyAnalysis/plink_bfile.??? ./      # copy the input files (.bed/.bim/.fam)
cp $HOME/MyAnalysis/plink_chunk*.snps ./    # copy the SNP lists defining the chunks
for i in {1..16}; do
(
    plink --bfile plink_bfile --extract plink_chunk$i.snps --make-bed --out plink_bfile_chunk$i
)&                                          # run each chunk in the background
done
wait                                        # wait until all 16 processes have finished
cp plink_bfile_chunk*.* $HOME/MyAnalysis/   # copy the results of all chunks back home

This will extract the SNPs for 16 chunks, defined in plink_chunkN.snps, on one node using 16 processes. Most nodes have 16 cores, and you are strongly advised to use all of them.
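
One possible way to create the plink_chunkN.snps files is to distribute a SNP list round-robin over 16 files. A sketch, assuming a hypothetical file snplist.txt with one SNP ID per line:

awk '{ print > ("plink_chunk" ((NR-1)%16+1) ".snps") }' snplist.txt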

Running multiple serial analyses with array job

The cluster is especially equipped to run multiple serial analyses, e.g. 100, 500, or 1000, but you do not want to submit 100, 500, or 1000 jobs manually. In this case, an array job is useful.

#!/bin/bash
#SBATCH -N 1
#SBATCH -t 1:00:00
#SBATCH --array=1-80:16        # array task IDs: 1, 17, 33, 49, 65
module load plink2
cd $TMPDIR
cp $HOME/MyAnalysis/plink_bfile.??? ./
cp $HOME/MyAnalysis/plink_chunk*.snps ./
start=$SLURM_ARRAY_TASK_ID     # first chunk handled by this task
end=$(($start+15))             # last chunk handled by this task
for i in $(seq $start $end); do
(
    plink --bfile plink_bfile --extract plink_chunk$i.snps --make-bed --out plink_bfile_chunk$i
)&
done
wait
cp plink_bfile_chunk*.* $HOME/MyAnalysis/   # copy the results of all 16 chunks back home

This will result in 5 array tasks, each running 16 processes.

The array index is defined by the flag --array=1-80:16, which means from 1 to 80 in steps of 16.
Therefore, $SLURM_ARRAY_TASK_ID will be 1, 17, 33, 49, and 65.
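
If the total number of chunks is not a multiple of 16, the last task would look for chunk files that do not exist. A sketch of how the end index could be clamped inside the job script, assuming a hypothetical total of 100 chunks (with #SBATCH --array=1-100:16):

NCHUNKS=100                    # hypothetical total number of chunks
start=$SLURM_ARRAY_TASK_ID
end=$(($start+15))
if [ $end -gt $NCHUNKS ]; then end=$NCHUNKS; fi   # do not run past the last chunk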