CASAVA on Biowulf

CASAVA is the part of Illumina's sequencing analysis software that performs alignment of a sequencing run to a reference genome and subsequent variant analysis and read counting.

Program Location

Casava executables are in /usr/local/casava/bin

The iGenomes data is maintained on Biowulf in /fdb/igenomes.

The necessary environment variables must be set first. The easiest way to do this is with the modules commands, as in the example below.

biowulf% module avail casava

---------------- /usr/local/Modules/3.2.9/modulefiles ----------------------
casava/1.8.0          casava/1.8.2(default)

biowulf% module load casava/1.8.2

biowulf% module list
Currently Loaded Modulefiles:
  1) casava/1.8.2

Example of a Casava job

The following examples make use of the built-in example files provided with the CASAVA package. The procedure is explained in detail in the Casava User Guide. You can copy the sample files and run this example in your own /data area, as shown below. This should be done on an interactive node.

The following section gives a brief example of running casava. Then the sections 'How to submit a batch job', 'How to submit a swarm job', and 'How to run an interactive job' use these example commands.

First allocate an interactive node with:
qsub -I -l nodes=1:c16:g24

This will allocate a node with 16 cores (c16) and 24 GB of memory (g24) which will be plenty for this example run.

Bcl Conversion and Demultiplexing

-- Convert *.bcl files into compressed FASTQ files
-- Separate multiplexed sequence runs by index
-- Demultiplexing needs a BaseCalls directory and a sample sheet to start a run.

Create a working directory for this project

$ mkdir casava_example_dir
$ cd casava_example_dir

# Copy and modify the SampleSheet.csv located at /usr/local/CASAVA_v1.8.0/share/CASAVA-1.8.0/examples/biowulf/SampleSheet.csv to your working directory.

$ cp /usr/local/CASAVA_v1.8.0/share/CASAVA-1.8.0/examples/biowulf/SampleSheet.csv .
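
For reference, a CASAVA 1.8 sample sheet is a plain CSV file whose first row names the columns. The sketch below uses the column layout of the CASAVA 1.8 format; the flowcell ID, sample ID, and project name are taken from this example's paths, while the remaining values are purely illustrative placeholders, not the contents of the example file:

```csv
FCID,Lane,SampleID,SampleRef,Index,Description,Control,Recipe,Operator,SampleProject
A805CKABXX,1,AR008,hg18,ACAGTG,example sample,N,R1,operator1,Demo
```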

The standard way to run bcl conversion and demultiplexing is to first create the necessary Makefile, which configures the run, and then run 'make' on the generated files, which executes the calculations.

1. Enter the following command to create a makefile for demultiplexing:

$ module load casava

$ configureBclToFastq.pl \
--input-dir /usr/local/CASAVA_v1.8.0/share/CASAVA-1.8.0/examples/\
Validation/110120_P20_0993_A805CKABXX/Data/Intensities/BaseCalls \
--output-dir Unaligned \
--force --ignore-missing-bcl --ignore-missing-stats \
--sample-sheet SampleSheet.csv

2. Change directory into the newly created Unaligned folder specified by --output-dir above

$ cd Unaligned

3. Run the 'make' command, specifying 16 threads (-j 16) since you have allocated a 16-core node. If you allocated a different type of node, you should modify this number to match the number of cores on the node.

$ make -j 16

-- The above process generates demultiplexed .fastq.gz files under Sample_AR008 and Sample_PhiX respectively

Sequence Alignment

1. Copy the configureAlignment configuration file, config.txt, and edit it.

cp /usr/local/CASAVA_v1.8.0/share/CASAVA-1.8.0/examples/biowulf/config.txt .

Edit the first two parameters, EXPT_DIR and OUT_DIR, to match the path of your working directory. If you have been following the example above exactly, they would be set to

EXPT_DIR  /data/$USER/casava_example_dir/Unaligned
OUT_DIR   /data/$USER/casava_example_dir/Aligned
where $USER is your own username.
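
If you prefer not to edit the file by hand, the two parameters can be rewritten with sed. This is a sketch assuming the working directory paths from the example above; adjust them to your own layout:

```shell
# Point EXPT_DIR and OUT_DIR at this example's working directory
# (paths taken from the example above; edit to match your own).
sed -i \
    -e "s|^EXPT_DIR.*|EXPT_DIR  /data/$USER/casava_example_dir/Unaligned|" \
    -e "s|^OUT_DIR.*|OUT_DIR   /data/$USER/casava_example_dir/Aligned|" \
    config.txt
```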

2. Enter the configureAlignment.pl command with --make

$ configureAlignment.pl config.txt --make

3. Change directory into the newly created Aligned folder, then run 'make' for the basic analysis

$ cd Aligned
$ make -j 16
As before, the number of threads is set to 16 because this is a 16-core node. If you allocated a different type of node, set this value to its number of cores.

The above process generates _export.txt.gz files in the Project_Demo tree under each Sample folder.
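
To see where those files landed, a find command like the following can be used (the paths assume the example directories created above):

```shell
# List the export files produced by the alignment step,
# one per lane/read under each Sample_* folder
find Aligned/Project_Demo -name '*_export.txt.gz'
```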

Variant Detection

-- The input files for CASAVA variant detection can be found in the Aligned directory generated in the configureAlignment step
-- The CASAVA build process is divided into several modules (or targets), each of which completes a major portion of the post-alignment analysis pipeline:

"sort" - bins aligned reads into separate regions of the reference genome, sorts these reads and optionally removes PCR duplicates (for paired-end reads) and finally converts these reads to BAM format.

"assembleIndels" - searches for clusters of poorly aligned and anomalous reads. These clusters of reads are de-novo assembled into contigs, which are aligned back to the reference to produce candidate indels.

"callSmallVariants" - uses the sorted BAM files and the candidate indels predicted by the assembleIndels module to perform local read realignment and genotype SNPs and indels under a diploid model. (Other targets handle tasks such as gene and exon read counting; see the Casava User Guide for the full list.)

Run the following command:

$ configureBuild.pl \
-id /data/$USER/casava_example_dir/Aligned/Project_Demo/Sample_AR008 \
-od /data/$USER/casava_example_dir/Aligned/Project_Demo/Sample_AR008/Build2 \
--samtoolsRefFile /usr/local/CASAVA_v1.8.0/share/CASAVA-1.8.0/examples/\
iGenomes/Homo_sapiens/UCSC/hg18/Sequence/Chromosomes/chr22.fa \
--refFlatFile /usr/local/CASAVA_v1.8.0/share/CASAVA-1.8.0/examples/\
iGenomes/Homo_sapiens/UCSC/hg18/Annotation/Genes/refFlat.txt.gz \
--workflowAuto -j 16 --targets all

This is the end of the test run, so you should exit from the interactive node:

p2338% exit

qsub: job 2533238.biobos completed


How to Run a Casava job on Biowulf
Creating directories and editing of configuration files can be performed on the login node. All the other commands, e.g. 'make -j 16', 'configureAlignment.pl' and 'configureBuild.pl' should be run as batch jobs. If you want to run these commands interactively for debugging purposes, you should allocate an interactive node first as in the example above.

Submitting a Batch Job

The time-consuming or memory-intensive steps of casava can be put into a script file and submitted to the batch system. For example, 'make -j 16' is the time-consuming step here, and is suitable for a batch or swarm job.

1. Create a script file like the one below:

# This file is casavafile
#PBS -N casava
#PBS -m be
#PBS -k oe

cd /data/$USER/casava/run1/
make -j 16

2. Submit the script using the 'qsub' command on Biowulf.

$ qsub -l nodes=1:g24:c16 PathToAboveScript

Note: In this example, the job was submitted to a node with 24 GB of memory (g24) and 16 cores (c16), so '-j 16' was given to the 'make' command in the batch script, meaning 16 processors will be used to run 'make'. If you request a different kind of node with a different number of cores, change the number in '-j 16' accordingly.

Submitting a swarm of jobs

Sometimes users have several sets of data under different directories, with the same analytical steps performed on each dataset. The 'swarm' utility can be used to submit many similar jobs like this.

Before running swarm, the appropriate configuration files should be created in each directory.

To submit a swarm job, create a swarm command file like this, called, say, 'cmdfile'.

cd /data/userid/casava1; make -j 16
cd /data/userid/casava2; make -j 16
cd /data/userid/casava3; make -j 16
cd /data/userid/casava20; make -j 16
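
When the run directories share a common naming scheme, a command file like this can be generated with a short loop rather than typed by hand. This is a sketch; the casava* glob is an assumption, so adjust it to your own directory names:

```shell
# Emit one 'cd ...; make' line per casava run directory
for d in /data/$USER/casava*/; do
    echo "cd $d; make -j 16"
done > cmdfile
```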

Note that each command (a single line in the file above) will require 16 cores. This value must also be given to the swarm command. Submit this job with:

swarm -t 16 -g 24 -f cmdfile
-t 16: tells swarm that each command requires 16 cores
-g 24: tells swarm that each command requires 24 GB of memory
-f cmdfile: tells swarm the name of the command file

For more information regarding running swarm, see swarm.html