Novoalign is an aligner for single-end and paired-end reads from the Illumina Genome Analyser. Novoalign finds globally optimal alignments using the full Needleman-Wunsch algorithm with affine gap penalties, while performing at the same or better speed than aligners that are limited to two mismatches and no insertions or deletions.
Novoalign indexes for some common genome assemblies such as hg18 and hg19 are available in /fdb/novoalign. If there are other genomes you want indexed, please email firstname.lastname@example.org
Several versions of Novocraft, NovoalignMPI, and NovoalignCSMPI (color space alignment) are maintained on this system. The available versions can be listed with the module commands, as in the example below:
[user@biowulf]$ module avail novo
----------------------------- /usr/local/Modules/3.2.9/modulefiles --------------------------
novocraft/2.07.13 novocraft/2.08.01 novocraft/2.08.02 novocraft/2.08.03
[user@biowulf]$ module load novocraft
[user@biowulf]$ module list
Currently Loaded Modulefiles:
  1) novocraft/2.08.03
[user@biowulf]$ module unload novocraft
[user@biowulf]$ module load novocraft/2.07.13
[user@biowulf]$ module list
Currently Loaded Modulefiles:
  1) novocraft/2.07.13
1. Create a batch script along the following lines:
#!/bin/bash
#PBS -N novoalign
#PBS -k oe

# load the latest version of novocraft
module load novocraft

# cd to the appropriate directory
cd /data/user/mydir

# generate an index file named 'celegans' for the sequence file elegans.dna.fa
novoindex celegans elegans.dna.fa

# align the reads in file s_1_sequence.txt against the indexed genome of C. elegans
novoalign -c 4 -f s_1_sequence.txt -d celegans -o SAM > out.sam
Note: The Helix staff maintains some novoalign index files in /fdb/novoalign. If you want us to provide index files for other genomes, please email email@example.com
2. On the Biowulf login node, submit the job:
1. Create a swarm file along the following lines:
cd /data/user/novo; novoalign -c 4 -d celegans -f sim1.fastq sim1r.fastq -o SAM > out1.sam
cd /data/user/novo; novoalign -c 4 -d celegans -f sim2.fastq sim2r.fastq -o SAM > out2.sam
cd /data/user/novo; novoalign -c 4 -d celegans -f sim3.fastq sim3r.fastq -o SAM > out3.sam
[....]
Submit this swarm with:
swarm -t 4 -g 8 -f swarmfile --module novocraft/2.08.03
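The '-t 4' flag tells swarm that each command will run 4 threads, matching the '-c 4' option given to novoalign.
The '-g 8' flag tells swarm that each command in the file will require 8 GB of memory. Adjust this value to suit your data.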
The '-f swarmfile' tells swarm what commands to run.
The '--module novocraft/2.08.03' flag tells swarm to set up the paths for Novocraft version 2.08.03.
For more information regarding running swarm, see swarm.html
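When there are many read pairs, the swarm file from step 1 can be generated rather than typed by hand. The following is a minimal sketch, assuming the inputs follow the simN.fastq / simNr.fastq naming used in the example above; the function names make_swarm_line and make_swarm_lines are hypothetical helpers, not part of swarm or Novocraft.

```shell
#!/bin/bash
# Sketch only: print one swarm command line per read pair.
# Assumes pairs are named simN.fastq / simNr.fastq and results go to outN.sam.
# make_swarm_line and make_swarm_lines are illustrative names, not swarm features.

# $1 = working directory, $2 = pair number, $3 = novoindex index name
make_swarm_line() {
    dir=$1; n=$2; index=$3
    echo "cd ${dir}; novoalign -c 4 -d ${index} -f sim${n}.fastq sim${n}r.fastq -o SAM > out${n}.sam"
}

# Emit the lines for pairs 1 to 3; extend the list for more pairs.
make_swarm_lines() {
    for pair in 1 2 3; do
        make_swarm_line /data/user/novo "$pair" celegans
    done
}

make_swarm_lines
```

Redirecting the output (for example 'make_swarm_lines > swarmfile') produces a file that can be passed to 'swarm -f'.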
NovoalignMPI and NovoalignCSMPI use MPICH2, which requires each user to create an MPD password file.
1. Create the MPD password file
biowulf$ echo 'password=<password>' > ~/.mpd.conf
biowulf$ chmod 600 ~/.mpd.conf
2. Create a batch script along the lines of the one below:
#!/bin/bash
#PBS -N novoalignmpi
#PBS -k oe

# load the latest version of novoalignMPI
module load novocraft

mpdboot -f $PBS_NODEFILE -n `cat $PBS_NODEFILE | wc -l`
mpiexec -np $np `which novoalignMPI` -d file.index -f sim1.fastq sim2.fastq -o SAM > out.sam
mpdallexit
3. Submit the job on the Biowulf head node:
qsub -v np=$np -l nodes=($np-1):g24:c24,mem=0 /path/to/qsub/script/above
The parameter $np is the number of MPI processes and must be replaced by an actual number in the command above. The value must be at least 2. One process is the master process, which uses very little CPU time, so set np to one more than the number of nodes. Each of the remaining processes will auto-thread to use all the cores on a node.
The flag "mem=0" tells the batch system to ignore memory usage. In the past, Novocraft MPI jobs were being killed by the batch system because of an incorrect memory calculation. This flag bypasses that problem. For example:
qsub -v np=4 -l nodes=3:g24:c24,mem=0 /FullPathToQsubScriptAbove
In this example, three g24 nodes are requested, so np is set to 3+1=4. One of the 4 processes is the master process, which uses very little CPU. The other 3 processes each auto-thread to use all 24 cores on a node.
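The relationship between the node count and np can be captured in a small wrapper. This is a sketch under the assumptions of the example above (g24:c24 nodes, mem=0); build_qsub_cmd is a hypothetical helper name, and it only prints the command so that it can be checked before submitting.

```shell
#!/bin/bash
# Sketch only: derive np from the number of nodes and print the qsub command.
# build_qsub_cmd is an illustrative name; the node properties (g24:c24) and
# mem=0 are copied from the example above.

# $1 = number of nodes, $2 = full path to the batch script
build_qsub_cmd() {
    nodes=$1; script=$2
    np=$(( nodes + 1 ))    # one extra process for the lightweight master
    echo "qsub -v np=${np} -l nodes=${nodes}:g24:c24,mem=0 ${script}"
}

build_qsub_cmd 3 /FullPathToQsubScriptAbove
```

Called with 3 nodes, this prints the same command as the example above.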
Novoindex can use a lot of memory, so it is worth estimating the memory usage before submitting the job to avoid overloading the memory on a node. (Thanks to Colin Hercus of Novocraft for this information.)
The memory used for an indexed genome is
N/2 + 4^(k+1) + 4N/s
where N is the length of the reference genome, k the index k-mer length and s the indexing step size. Note that the second term must be converted to the same units as the first and third.
For example, for a 6 GB reference sequence, with default values k=15 and s=2, the index size would be
6G/2 + 4^16 + 4*6G/2 = 3G + 4G + 12G = 19G
It might be better to set the options to -k 15 -s 3 for an index of ~15G (3G + 4G + 8G),
or -k 14 -s 3 for an index of ~12G (3G + 1G + 8G).
Changing k and s can affect run time, so it might be worth testing a few values to find the best trade-off between memory and run time.
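The estimate can be scripted to compare several k and s values before submitting a job. The sketch below reads the second term of the formula as 4 raised to the power k+1, which is what the 4G term for k=15 in the worked example implies. The function names estimate_index_bytes and report are hypothetical helpers and are not part of Novocraft.

```shell
#!/bin/bash
# Sketch only: estimate the novoindex index size in bytes as
#   N/2 + 4^(k+1) + 4N/s
# estimate_index_bytes and report are illustrative names, not Novocraft tools.

# $1 = N (reference length in bases), $2 = k (k-mer length), $3 = s (step size)
estimate_index_bytes() {
    n=$1; k=$2; s=$3
    # Compute 4^(k+1) by repeated multiplication (k+1 iterations).
    pow=1; i=0
    while [ "$i" -le "$k" ]; do
        pow=$(( pow * 4 ))
        i=$(( i + 1 ))
    done
    echo $(( n / 2 + pow + 4 * n / s ))
}

# Print the estimate for a 6G reference, rounded down to whole G (10^9 bytes).
report() {
    bytes=$(estimate_index_bytes 6000000000 "$1" "$2")
    echo "k=$1 s=$2: about $(( bytes / 1000000000 ))G"
}

report 15 2
report 15 3
report 14 3
```

Under this reading of the formula, the three calls report about 19G, 15G and 12G respectively. Actual memory use should still be checked against a test run.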
Loading the appropriate module and then typing the command with no parameters will give you the latest information about each command. The Novoalign PDF documentation may not be updated as often as the software versions.
[user@biowulf]$ module load novocraft
[user@biowulf]$ novoindex
# novoindex (2.8) - Universal k-mer index constructor.
# (C) 2008 - 2011 NovoCraft Technologies Sdn Bhd
# novoindex
# Creating 16 indexing threads.
Error: Please supply an index filename and at least one sequence file.
Usage:
    novoindex -k 99 -s 9 -m indexfile sequencefiles....
Where:
    -k 99   is the k-mer length to be used for the index. Typically 14.
    -s 9    is the step size for the index. Typical values are from 1 to 3.
    -t 9    sets number of threads to use for indexing.
    -m      sets lower case masking on. Lower case sequence will not be indexed.
    -b      sets bisulphite indexing and alignment mode for methylation experiments.
    -c      sets ABI SOLiD Colour space indexing mode.
    -n name sets the an internal name for the reference sequence index. This is used in report headers and as the AS: field in SAM SQ record. Defaults to the indexfile name.
    indexfile is the filename for the indexed reference sequence generated by novoindex.
    sequencefiles a list of sequence files in fasta format to be included in the index.
Example:
    novoindex -k 14 -s 1 celegans.ndx elegans.dna.fa
If k or s are not specified a suitable value will be chosen by novoindex.
(c) 2008 NovoCraft Technologies Sdn BHd