Biowulf at the NIH
Novocraft package on Biowulf

Novoalign is an aligner for single-end and paired-end reads from the Illumina Genome Analyser. It finds globally optimal alignments using the full Needleman-Wunsch algorithm with affine gap penalties, while running at the same speed as, or faster than, aligners that are limited to two mismatches and no insertions or deletions.

Novoalign indexes for some common genome assemblies such as hg18 and hg19 are available in /fdb/novoalign. If there are other genomes you want indexed, please email staff@helix.nih.gov

Several versions of Novocraft, NovoalignMPI, and NovoalignCSMPI (color space alignment) are maintained on this system. The available versions can be listed with the module commands, as in the example below:

[user@biowulf]$ module avail novo

----------------------------- /usr/local/Modules/3.2.9/modulefiles --------------------------
novocraft/2.07.13   novocraft/2.08.01    novocraft/2.08.02    novocraft/2.08.03


[user@biowulf]$ module load novocraft

[user@biowulf]$ module list
Currently Loaded Modulefiles:
  1) novocraft/2.08.03

[user@biowulf]$ module unload novocraft

[user@biowulf]$ module load novocraft/2.07.13

[user@biowulf]$ module list
Currently Loaded Modulefiles:
  1) novocraft/2.07.13

Running a single Novoalign batch job

1. Create a batch script along the following lines:

#!/bin/bash
#PBS -N novoalign
#PBS -k oe

# load the latest version of novocraft
module load novocraft

# cd to the appropriate directory
cd /data/user/mydir

# generate an index file named 'celegans' for the sequence file elegans.dna.fa
novoindex celegans elegans.dna.fa

# align the reads in file s_1_sequence.txt against the indexed genome of C. elegans
novoalign -c 4 -f s_1_sequence.txt -d celegans -o SAM > out.sam

Note: The Helix staff maintains some novoalign index files in /fdb/novoalign. If you want us to provide index files for other genomes, please email staff@helix.nih.gov

2. On the Biowulf login node, submit the job:

qsub -l nodes=1:g8 scriptname

 

Submitting a swarm of Novoalign jobs

1. Create a swarm file along the following lines:

cd /data/user/novo; novoalign -c 4 -d celegans -f sim1.fastq sim1r.fastq -o SAM > out1.sam
cd /data/user/novo; novoalign -c 4 -d celegans -f sim2.fastq sim2r.fastq -o SAM > out2.sam
cd /data/user/novo; novoalign -c 4 -d celegans -f sim3.fastq sim3r.fastq -o SAM > out3.sam
[....]

Submit this swarm with:

swarm -t 4 -g 8 -f swarmfile --module novocraft/2.08.03

The '-t 4' flag tells swarm that each command in the file above will use 4 threads. This number must match the '-c 4' flag in the novoalign command.
The '-g 8' flag tells swarm that each command in the file will require 8 GB of memory. You should modify this as needed.
The '-f swarmfile' flag tells swarm what commands to run.
The '--module novocraft/2.08.03' flag tells swarm to set up the paths for Novocraft version 2.08.03.

For more information regarding running swarm, see swarm.html
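
When there are many read pairs, the swarm file can be generated rather than typed by hand. The following is a minimal sketch, assuming the reads follow the simN.fastq / simNr.fastq naming used in the example above. The function name make_swarmfile and its argument order are hypothetical and are not part of swarm or Novocraft. It only prints the command lines, so nothing is run until the resulting file is passed to swarm.

```shell
#!/bin/bash
# Print one novoalign command line per read pair, in swarm-file format.
# Arguments (all illustrative):
#   $1 working directory, $2 index name, $3 threads per command, $4 number of pairs
make_swarmfile() {
    local dir=$1 index=$2 threads=$3 count=$4
    local i
    for (( i = 1; i <= count; i++ )); do
        # Assumes pairs are named sim<i>.fastq and sim<i>r.fastq
        echo "cd ${dir}; novoalign -c ${threads} -d ${index} -f sim${i}.fastq sim${i}r.fastq -o SAM > out${i}.sam"
    done
}

# Print the three lines shown in the example swarm file
make_swarmfile /data/user/novo celegans 4 3
```

Redirect the output to a file (for example, 'make_swarmfile /data/user/novo celegans 4 3 > swarmfile') and pass the same thread count to swarm with '-t', so that it matches the '-c' value in every generated line.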

Running a NovoalignMPI or NovoalignCSMPI batch job

NovoalignMPI and NovoalignCSMPI use MPICH2. This requires that an MPD password file is generated for each user.

1. Create the MPD password file

[user@biowulf]$ echo 'password=<password>' > ~/.mpd.conf
[user@biowulf]$ chmod 600 ~/.mpd.conf

2. Create a batch script along the lines of the one below:

#!/bin/bash
#PBS -N novoalignmpi
#PBS -k oe

# load the latest version of novoalignMPI
module load novocraft

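# start the MPD daemons on the hosts listed in the PBS node file
mpdboot -f $PBS_NODEFILE -n `cat $PBS_NODEFILE | wc -l`
# run novoalignMPI with $np processes ($np is passed in by 'qsub -v np=...')
mpiexec -np $np `which novoalignMPI` -d file.index -f sim1.fastq sim2.fastq -o SAM  > out.sam
# shut down the MPD daemons
mpdallexit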

3. Submit the job on the Biowulf head node:

qsub -v np=$np -l nodes=($np-1):g24:c24,mem=0 /path/to/qsub/script/above

The parameter $np is the number of MPI processes, and should be replaced by an actual number in the command above. The value must be at least 2. One process is the master process, which uses very little CPU time, so you can set $np to one more than the number of nodes. Each of the other processes will auto-thread to use all the cores on a node.

The flag "mem=0" tells the batch system to ignore memory usage. In the past, Novocraft MPI jobs were being killed by the batch system because of an incorrect memory calculation. This flag bypasses that problem. For example:

qsub -v np=4 -l nodes=3:g24:c24,mem=0 /FullPathToQsubScriptAbove

In this example, three g24 nodes are requested, so np is set to 3+1=4. One of the 4 processes will be the master process, which uses very little CPU. The other 3 processes will each auto-thread to use all 24 cores on a node.
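
The rule "np is one more than the number of nodes" is easy to get wrong when typing the qsub line by hand. The sketch below builds the command from a node count. The function name mpi_qsub_command is hypothetical, the g24:c24 node properties are copied from the example above, and the function only prints the command instead of submitting it.

```shell
#!/bin/bash
# Print the qsub command for a NovoalignMPI job on a given number of g24 nodes.
# Arguments (illustrative): $1 number of nodes, $2 full path to the batch script
mpi_qsub_command() {
    local nodes=$1 script=$2
    if (( nodes < 1 )); then
        echo "error: at least one node is required" >&2
        return 1
    fi
    # One extra process for the master, which uses very little CPU
    local np=$(( nodes + 1 ))
    echo "qsub -v np=${np} -l nodes=${nodes}:g24:c24,mem=0 ${script}"
}

# Reproduces the three-node example above, with np=4
mpi_qsub_command 3 /FullPathToQsubScriptAbove
```

Check the printed line before running it. Printing rather than submitting keeps the helper safe to use on any machine.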

Novoindex memory usage

Novoindex can use a lot of memory, so it is worth estimating the memory usage before submitting the job, to avoid overloading the memory on a node. (Thanks to Colin Hercus of Novocraft for this information.)

The memory used for an indexed genome is
N/2 + 4^(k+1) + 4N/s
where N is the length of the reference genome, k the index k-mer length and s the indexing step size. Note that the second term must be converted to the same units as the first and third.

For example, for a 6G reference sequence, with default values k=15 and s=2, the index size would be
6G/2 + 4^16 + 4*6G/2 = 3G + 4G + 12G = 19G

It might be better to set the options to -k 15 -s 3, which gives an index of about 15G,
or to -k 14 -s 3 for an index size of about 12G.

Changing k and s can affect run time, so it might be worth testing a few values to find the best trade-off between memory and run time.
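
The estimate can be scripted so that several k and s combinations are compared before a job is submitted. This is a minimal sketch, assuming the second term of the formula is 4 raised to the power k+1 (which is what gives 4G for k=15) and that sizes are counted in bytes with 1G = 1024^3. The function names estimate_index_bytes and to_gib are hypothetical and not part of Novocraft.

```shell
#!/bin/bash
# Estimated novoindex memory in bytes: N/2 + 4^(k+1) + 4N/s
# Arguments (illustrative): $1 genome length N, $2 k-mer length k, $3 step size s
estimate_index_bytes() {
    local n=$1 k=$2 s=$3
    echo $(( n / 2 + 4 ** (k + 1) + 4 * n / s ))
}

# Convert bytes to whole GiB (integer division, rounds down)
to_gib() {
    echo $(( $1 / (1024 ** 3) ))
}

# Hypothetical 6G reference sequence, as in the example above
genome_bytes=$(( 6 * 1024 ** 3 ))

# Compare the three k/s combinations discussed above
for opts in "15 2" "15 3" "14 3"; do
    read -r k s <<< "$opts"
    bytes=$(estimate_index_bytes "$genome_bytes" "$k" "$s")
    echo "k=$k s=$s: $(to_gib "$bytes") GiB"
done
```

For the 6G reference this prints 19, 15 and 12 GiB for the three settings. The result is only an estimate of the index size, so leave some headroom when choosing a node.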

Documentation

Loading the appropriate module and then typing the command with no parameters will give you the latest information about each command. The Novoalign PDF documentation may not be updated as often as the software versions.

[user@biowulf]$ module load novocraft

[user@biowulf]$ novoindex
# novoindex (2.8) - Universal k-mer index constructor.
# (C) 2008 - 2011 NovoCraft Technologies Sdn Bhd
# novoindex 
# Creating 16 indexing threads.
Error: Please supply an index filename and at least one sequence file.


Usage:
    novoindex  -k 99 -s 9 -m indexfile sequencefiles....
Where:
    -k   99        is the k-mer length to be used for the index. Typically 14.
    -s   9         is the step size for the index. Typical values are from 1 to 3.
    -t   9         sets number of threads to use for indexing.
    -m             sets lower case masking on. Lower case sequence will not be indexed.
    -b             sets bisulphite indexing and alignment mode for methylation experiments.
    -c            sets ABI SOLiD Colour space indexing mode.
    -n   name      sets the an internal name for the reference sequence index. This is
                   used in report headers and as the AS: field in SAM SQ record.
                   Defaults to the indexfile name.
    indexfile      is the filename for the indexed reference sequence generated by novoindex.
    sequencefiles  a list of sequence files in fasta format to be included in the index.

Example:
              novoindex -k 14 -s 1 celegans.ndx elegans.dna.fa

 If k or s are not specified a suitable value will be chosen by novoindex.

 (c) 2008 NovoCraft Technologies Sdn BHd

Novocraft website - see the 'Documentation' tab