Biowulf at the NIH
Sickle on Biowulf

Program introduction

Sickle is a windowed adaptive quality trimming tool for FASTQ files. Most modern sequencing technologies produce reads with deteriorating quality towards the 3'-end, and some towards the 5'-end as well. Incorrectly called bases in these regions negatively impact assemblies, mapping, and other downstream bioinformatics analyses.

Sickle uses sliding windows, together with quality and length thresholds, to determine when quality is sufficiently low to trim the 3'-end of a read, and when quality is sufficiently high to trim the 5'-end. It also discards reads that fall below the length threshold. Sickle takes the quality values and slides a window across them whose length is 0.1 times the length of the read. If this length is less than 1, the window is set equal to the length of the read. Otherwise, the window slides along the quality values until the average quality in the window rises above the threshold, at which point the algorithm determines where within the window the rise occurs and cuts the read and quality strings there for the 5'-end cut. Then, when the average quality in the window drops below the threshold, the algorithm determines where in the window the drop occurs and cuts both the read and quality strings there for the 3'-end cut. If the length of the remaining sequence is less than the minimum length threshold, the read is discarded entirely. 5'-end trimming can be disabled.
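The windowed average described above can be illustrated on a single read's quality string. The snippet below is a hypothetical sketch (it is not part of sickle, and the example quality string is made up): it decodes a Phred+33 quality string and prints the average quality in each window of width 0.1 times the read length.

```shell
# Hypothetical sketch of the sliding-window quality average (not sickle itself).
# Assumes Phred+33 encoding; the quality string below is an invented example.
qual='IIIIHHHGGGFFFEEEDDCA'          # example 20-base quality string
echo "$qual" | awk '
BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
{
    n = length($0)
    w = int(0.1 * n); if (w < 1) w = 1   # window is 0.1 x read length, min 1
    for (s = 1; s + w - 1 <= n; s++) {
        sum = 0
        for (i = s; i < s + w; i++)
            sum += ord[substr($0, i, 1)] - 33   # Phred+33 decoding
        printf "window %d: avg quality %.1f\n", s, sum / w
    }
}'
```

With a quality threshold of, say, 20, sickle would cut where these window averages cross the threshold, as described above.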

The easiest way to set up the environment for sickle is by typing 'module load sickle'.
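A full paired-end invocation looks something like the following. The file names are placeholders, and the thresholds shown (-q 20, -l 45) are illustrative choices, not recommendations:

```shell
# Example sickle paired-end run (file names are placeholders):
# -t    quality encoding (e.g. sanger for modern Illumina data) - required
# -q/-l quality and length thresholds
# -s    receives reads whose mate was discarded
# -x    would disable 5'-end trimming
sickle pe -f input_file1.fastq -r input_file2.fastq -t sanger \
       -o trimmed_file1.fastq -p trimmed_file2.fastq \
       -s trimmed_singles.fastq -q 20 -l 45
```

Run 'sickle pe' with no arguments to see the full list of options.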


Submitting a Single Batch Job

1. Create a script file along the lines of the one below:

#!/bin/bash
# This file is FileName
#PBS -N RunName
#PBS -m be
#PBS -k oe

module load sickle

cd /data/user/somewhereWithInputfile
sickle pe -f input_file1.fastq -r input_file2.fastq........

2. Submit the script using the 'qsub' command on Biowulf.

qsub -l nodes=1 /data/username/theScriptFileAbove

This will submit the job to a node with at least 1 GB of memory and 2 cores. If your sickle command requires more than 1 GB of memory, you can submit to a node with larger memory, e.g.

qsub -l nodes=1:g8 /data/username/theScriptFileAbove

will submit to a node with 8 GB of memory (g8). Use 'freen' to see available node types.

Submitting a Swarm of Jobs

Using the 'swarm' utility, one can submit many jobs to the cluster to run concurrently.

Set up a swarm command file (eg /data/username/cmdfile). Here is a sample file:

cd /data/user/run1/; sickle pe -f file1.fastq -r file2.fastq ......
cd /data/user/run1/; sickle pe -f file3.fastq -r file4.fastq ......
cd /data/user/run1/; sickle pe -f file5.fastq -r file6.fastq ......

Submit this swarm with:
swarm -f cmdfile --module sickle
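For many samples, the command file can be generated with a short loop rather than by hand. The sketch below is a hypothetical helper: it assumes read pairs are named sample_1.fastq / sample_2.fastq in a single run directory, and the sickle options shown are illustrative:

```shell
# Hypothetical helper: build a swarm command file, one line per read pair.
# Assumes pairs are named <sample>_1.fastq / <sample>_2.fastq in $rundir.
rundir=/data/user/run1
: > cmdfile                          # start with an empty command file
for f1 in "$rundir"/*_1.fastq; do
    [ -e "$f1" ] || continue         # skip if the glob matched nothing
    f2=${f1%_1.fastq}_2.fastq
    echo "cd $rundir; sickle pe -f $f1 -r $f2 -t sanger -o trimmed_$(basename "$f1") -p trimmed_$(basename "$f2") -s singles_$(basename "$f1") -q 20 -l 45" >> cmdfile
done
```

The resulting cmdfile is then submitted with 'swarm -f cmdfile --module sickle' as above.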

If each sickle command requires more than 1 GB of memory, you need to run swarm with the -g flag. e.g. if each command (a single line in the file above) requires 8 GB of memory, you would use:

swarm -g 8 -f cmdfile --module sickle

You may need to do a few test runs to determine how much memory your job needs. Set up a single sickle command in a batch file and submit this using qsub. The output from the job will list the amount of memory used.

For more information on running swarm, see swarm.html.


Running an Interactive Job

Users may occasionally need to run jobs interactively. Such jobs should not be run on the Biowulf login node. Instead, allocate an interactive node as described below, and run the interactive job there.

biowulf% qsub -I -l nodes=1
qsub: waiting for job 2236960.biobos to start
qsub: job 2236960.biobos ready

[user@p4]$ cd /data/user/myruns
[user@p4]$ module load sickle
[user@p4]$ sickle pe -f input_file1.fastq -r input_file2.fastq ......
[user@p4]$ exit
qsub: job 2236960.biobos completed
[user@biowulf ~]$

You can add a node property to the qsub command to request a specific type of interactive node. For example, if you need a node with 8 GB of memory to run a job interactively, type:

biowulf% qsub -I -l nodes=1:g8