EMBOSS on Biowulf

EMBOSS stands for "The European Molecular Biology Open Software Suite". Within EMBOSS you will find hundreds of programs (applications) covering many areas of sequence analysis.

EMBOSS on Biowulf is intended for processing a large number of sequences, such as hundreds or thousands of query sequences, with the programs in EMBOSS. If you have just a few query sequences, you should use the EMBOSS web interface or the command line on Helix. Please contact the Helix Systems staff (301-594-6248) if you have questions about your EMBOSS jobs.

Submitting EMBOSS jobs on Biowulf

The EMBOSS programs are typically used to perform one or more tasks on a large number of sequences. The swarm program on Biowulf is ideally suited for large numbers of independent simultaneous jobs like this.

1. Set up the EMBOSS environment. csh or tcsh users should add the following lines to the end of their /home/username/.cshrc file.

setenv PLPLOT_LIB /usr/local/emboss/plplot/lib
set path=( /usr/local/emboss/bin ${path} )
setenv emboss_acdroot /usr/local/emboss/share/EMBOSS/acd
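
Or, for bash/ksh/sh users, insert the following at the end of your .bashrc file:

PLPLOT_LIB=/usr/local/emboss/plplot/lib
PATH=/usr/local/emboss/bin:${PATH}
emboss_acdroot=/usr/local/emboss/share/EMBOSS/acd
export PLPLOT_LIB PATH emboss_acdroot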
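
A quick way to confirm that the settings above took effect is to inspect the environment from a script. The following is a minimal Python sketch, not part of EMBOSS or Biowulf; the function name emboss_env_problems is hypothetical, and the expected values are simply the paths listed above.

```python
import os

# Values copied from the setup lines above.
EXPECTED_VARS = {
    "PLPLOT_LIB": "/usr/local/emboss/plplot/lib",
    "emboss_acdroot": "/usr/local/emboss/share/EMBOSS/acd",
}
EMBOSS_BIN = "/usr/local/emboss/bin"


def emboss_env_problems(env=None):
    """Return a list of human-readable problems; an empty list means all checks passed.

    Hypothetical helper: it only compares strings and does not touch the file system.
    """
    env = os.environ if env is None else env
    problems = []
    for name, value in EXPECTED_VARS.items():
        if env.get(name) != value:
            problems.append(f"{name} should be {value}")
    # PATH is a list of directories joined by os.pathsep (':' on Linux).
    if EMBOSS_BIN not in env.get("PATH", "").split(os.pathsep):
        problems.append(f"PATH should contain {EMBOSS_BIN}")
    return problems


if __name__ == "__main__":
    for problem in emboss_env_problems():
        print(problem)
```

Run it in a fresh login shell; no output means the three settings are in place.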

2. Set up the swarm command file, with one line for each command that you wish to run. For example, to pull 2500 sequences out of the database, you would run the EMBOSS 'seqret' command 2500 times. Create a file called 'cmdfile' which contains 2500 lines, one for each command, e.g.:

seqret -sequence 'genbank:ab1681*' -outseq 'outseq1'
seqret -sequence 'swissprot:P16310' -outseq 'outseq2'
seqret -sequence 'genpept:M31661' -outseq 'outseq3'
seqret -sequence 'refseqnt:nc_011*' -outseq 'outseq4'

Each command line in cmdfile should appear exactly as it would be entered on the command line.
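
When the list of queries already exists in some other form, the command file can be generated instead of typed. The following is a minimal Python sketch under stated assumptions: the function names seqret_lines and write_command_file and the out_prefix parameter are hypothetical, and the queries are the ones from the example above. shlex.quote adds quotes only where the shell needs them (for instance around '*'), so some lines come out unquoted but mean the same thing to the shell.

```python
import shlex
from pathlib import Path


def seqret_lines(queries, out_prefix="outseq"):
    """Return one seqret command line per query, with outputs numbered from 1."""
    lines = []
    for i, query in enumerate(queries, start=1):
        # Quote the query so wildcards such as '*' reach seqret unexpanded.
        sequence = shlex.quote(query)
        outseq = shlex.quote(f"{out_prefix}{i}")
        lines.append(f"seqret -sequence {sequence} -outseq {outseq}")
    return lines


def write_command_file(lines, path="cmdfile"):
    """Write the command lines to the swarm command file, one per line."""
    Path(path).write_text("".join(line + "\n" for line in lines))


if __name__ == "__main__":
    queries = ["genbank:ab1681*", "swissprot:P16310", "genpept:M31661"]
    write_command_file(seqret_lines(queries))
```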

3. Submit this swarm job to the cluster.

swarm has one required flag, '-f', and two optional flags that you will most likely need to specify when submitting a swarm job: '-t' and '-g'.

-f: the name of the swarm command file created above (required)
-t: number of processors per node to use for each command line in the swarm command file (optional)
-g: GB of memory needed for each command line in the swarm command file (optional)

By default, each command line is executed on 1 processor core of a node and uses 1 GB of memory. If this is not what you want, you will need to specify the '-t' and '-g' flags when you submit the job on Biowulf.

For example, if each command line needs 10 GB of memory instead of the default 1 GB, tell swarm by including the '-g 10' flag:

biowulf% swarm -g 10 -f cmdfile
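
If the submission is driven from a script, the same command can be assembled programmatically. The following is a minimal Python sketch that uses only the three flags described above; the function name swarm_command and its parameter names are hypothetical, and the sketch only builds the command rather than running it.

```python
import shlex


def swarm_command(cmdfile, gb_per_command=None, threads_per_command=None):
    """Build the swarm argument list from the -f, -g and -t flags described above.

    Optional flags are left out when not given, so swarm falls back to its defaults.
    """
    argv = ["swarm"]
    if gb_per_command is not None:
        argv += ["-g", str(gb_per_command)]
    if threads_per_command is not None:
        argv += ["-t", str(threads_per_command)]
    argv += ["-f", cmdfile]
    return argv


if __name__ == "__main__":
    # Prints: swarm -g 10 -f cmdfile
    print(shlex.join(swarm_command("cmdfile", gb_per_command=10)))
```

On Biowulf the resulting list could be handed to subprocess.run; printing it first makes it easy to check the flags before anything is submitted.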

For more information regarding running swarm, see swarm.html

Useful tip: It is time-consuming and error-prone to create a large swarm command file by hand, so you will probably want to write a simple csh or perl script to build it. If you are unfamiliar with shell scripting, "Introduction to scripting with bash" may be useful. The following is an example using csh to build a command file:

        helix% cd my_sequence_directory
        helix% touch cmdfile
        helix% foreach file (*)
        foreach> echo "patmatmotifs $file $file.out" >> cmdfile
        foreach> end
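
The same loop can be written in other scripting languages. The following is a minimal Python sketch of the idea, not a replacement for the csh version; the function names patmatmotifs_lines and build_command_file are hypothetical. It additionally assumes that the command file itself and any existing .out files in the directory should be skipped, which the csh loop above does not do.

```python
import shlex
from pathlib import Path


def patmatmotifs_lines(directory, cmdfile_name="cmdfile"):
    """Return one patmatmotifs command line per sequence file in `directory`."""
    lines = []
    for path in sorted(Path(directory).iterdir()):
        # Assumption: skip subdirectories, the command file, and earlier results.
        if not path.is_file() or path.name == cmdfile_name or path.suffix == ".out":
            continue
        infile = shlex.quote(path.name)
        outfile = shlex.quote(path.name + ".out")
        lines.append(f"patmatmotifs {infile} {outfile}")
    return lines


def build_command_file(directory, cmdfile_name="cmdfile"):
    """Write the command file into `directory` and return the number of commands."""
    lines = patmatmotifs_lines(directory, cmdfile_name)
    target = Path(directory) / cmdfile_name
    target.write_text("".join(line + "\n" for line in lines))
    return len(lines)


if __name__ == "__main__":
    count = build_command_file("my_sequence_directory")
    print(f"wrote {count} commands")
```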

