Biowulf at the NIH
Prinseq on Helix & Biowulf

PRINSEQ is a tool that generates summary statistics of sequence and quality data and is used to filter, reformat and trim next-generation sequence data. It is designed particularly for 454/Roche data, but can also be used for other types of sequence data.

[Prinseq website]

Running Prinseq on Helix

The following sample command filters out sequences containing N from a FASTQ file.

helix% module load prinseq

helix% prinseq-lite -verbose -fastq test.fq -ns_max_n 0 -out_good test_no_ns -out_bad test_with_ns

This will separate the input FASTQ file into two files: test_no_ns.fastq, containing the sequences without N, and test_with_ns.fastq, containing all sequences with at least one N. You can replace the value for -ns_max_n with, for example, 2 to remove sequences with more than 2 Ns. Alternatively, you can use "-ns_max_p 1" to remove sequences in which more than 1% of the bases are N.
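As a quick sanity check on the split, the reads containing N can be counted directly in each output file. The sketch below is not part of prinseq: count_n_reads is a hypothetical helper, demo.fq is a made-up three-read file, and the code assumes the common four-lines-per-record FASTQ layout (no wrapped sequence lines).

```shell
# Hypothetical helper (not a prinseq command): count FASTQ records whose
# sequence line contains at least MIN_N N characters (default 1).
# Assumes four lines per record, with the sequence on line 2 of each record.
count_n_reads() {
    awk -v min_n="${2:-1}" '
        NR % 4 == 2 {
            seq = toupper($0)
            n = gsub(/N/, "", seq)   # gsub returns the number of replacements
            if (n >= min_n) hits++
        }
        END { print hits + 0 }
    ' "$1"
}

# Tiny made-up input: r1 has no N, r2 has two Ns, r3 has three Ns.
printf '@r1\nACGT\n+\nIIII\n@r2\nANGN\n+\nIIII\n@r3\nNNNA\n+\nIIII\n' > demo.fq
count_n_reads demo.fq      # reads with at least one N   -> 2
count_n_reads demo.fq 3    # reads with more than 2 Ns   -> 1
```

Under these assumptions, running count_n_reads on test_no_ns.fastq after the -ns_max_n 0 command would be expected to print 0.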

Running a Prinseq job on Biowulf

Set up a batch script along the following lines. This script filters out low-complexity sequences using the 'dust' algorithm and converts the sequences to uppercase. The output is in FASTQ format. Because prinseq appends the .fastq extension to the -out_good name (as with test_no_ns in the Helix example), the file is written as out1.fq.fastq. See the documentation for other options.

#!/bin/bash
# this file is called myjob.bat
module load prinseq
cd /data/$USER/mydir
prinseq-lite -verbose -fastq test.fq -lc_method dust -lc_threshold 40 -seq_case upper -out_good out1.fq

Submit this job with:

qsub -l nodes=1 myjob.bat
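If a directory holds several FASTQ files, the same filter can be applied to each in a loop inside the batch script. The following is a minimal sketch, not a site-provided script: filter_all_dust, the *.fq pattern, the _dust output suffix and the PRINSEQ override variable are all assumptions made for illustration. Only the prinseq-lite options already shown above are used.

```shell
# Hypothetical helper: run the dust filter on every *.fq file in a directory.
# PRINSEQ may be overridden (for example PRINSEQ=echo) to preview the
# commands without running them; by default prinseq-lite is called.
filter_all_dust() {
    dir="${1:-.}"
    for fq in "$dir"/*.fq; do
        [ -e "$fq" ] || continue        # skip if the pattern matched nothing
        prefix="${fq%.fq}_dust"         # prinseq adds the .fastq extension itself
        ${PRINSEQ:-prinseq-lite} -verbose -fastq "$fq" \
            -lc_method dust -lc_threshold 40 -seq_case upper \
            -out_good "$prefix"
    done
}

# In a batch script this would follow "module load prinseq", e.g.:
#   filter_all_dust /data/$USER/mydir
```

Calling it once with PRINSEQ=echo prints the commands that would run, which is a cheap way to check the file names before submitting the job.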

Running prinseq interactively on Biowulf

Allocate an interactive node, load the prinseq module, and run the command. A sample session:

biowulf% qsub -I -l nodes=1
qsub: waiting for job 6395753.biobos to start
qsub: job 6395753.biobos ready

[susanc@p1465 ~]$ module load prinseq

[susanc@p1465 ~]$ cd /data/$USER/mydir

[susanc@p1465 ~]$ prinseq-lite -verbose -fastq test.fq -lc_method dust -lc_threshold 40 -seq_case upper -out_good out1.fq

[susanc@p1465 ~]$ exit
qsub: job 6395753.biobos completed



Prinseq documentation at