The input for RepeatMasker is a fasta-format sequence file. Multiple sequences can be contained within a single file.
On Helix/Biowulf, RepeatMasker has been configured to use NCBI Blast (i.e. RMBlast) as the default search engine, as the sample session below shows. Other search engines available are Crossmatch and HMMER. To use an alternative search engine, use the -e flag.
-e(ngine) [crossmatch|ncbi|hmmer] Use an alternate search engine to the default.
Note that wublast, abblast, and decypher are not configured as search engines on Helix/Biowulf. The only valid choices are:
-e crossmatch
-e ncbi
-e hmmer
Please contact email@example.com if you have a particular need for a different search engine.
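Because only three engine names are accepted, a wrapper script could check the name before starting a run. The following is a minimal sketch, not part of the Helix/Biowulf setup: valid_engine is a hypothetical helper name, myfile.fasta is a placeholder, and the script prints the command instead of running it.

```shell
#!/bin/bash
# Hypothetical helper: succeeds only for the engine names listed above.
valid_engine() {
    case "$1" in
        crossmatch|ncbi|hmmer) return 0 ;;
        *) return 1 ;;
    esac
}

engine="hmmer"          # placeholder choice for illustration
input="myfile.fasta"    # placeholder input file

if valid_engine "$engine"; then
    # Print the command that would be run; remove 'echo' to execute it.
    echo repeatmasker -e "$engine" "$input"
else
    echo "unsupported search engine: $engine" >&2
fi
```

An engine such as wublast would fall through to the error branch, which is the behaviour described in the note above.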
In this sample session, a sequence is obtained by using the EMBOSS seqret program. This sequence is then analyzed with RepeatMasker using the default (NCBI Blast) search engine, and then again using Crossmatch as the search engine.
helix% emboss
[...]
[user@helix ~]$ seqret
Reads and writes (returns) sequences
Input (gapped) sequence(s): genbank:ay001401
output sequence(s) [ay001401.fasta]:
[user@helix ~]$ repeatmasker ay001401.fasta
RepeatMasker version open-4.0.0
Search Engine: NCBI/RMBLAST [ 2.2.27+ ]
Master RepeatMasker Database: /usr/local/apps/RepeatMasker/4.0.0/Libraries/RepeatMaskerLib.embl ( Complete Database: 20120418 )

Building species libraries in: /usr/local/apps/RepeatMasker/4.0.0/Libraries/20120418/homo_sapiens
   - 1860 ancestral and ubiquitous sequence(s) for homo sapiens
   - 9 lineage specific sequence(s) for homo sapiens

analyzing file ay001401.fasta
Checking for E. coli insertion elements
identifying Simple Repeats in batch 1 of 1
identifying full-length ALUs in batch 1 of 1
identifying full-length interspersed repeats in batch 1 of 1
identifying remaining ALUs in batch 1 of 1
identifying most interspersed repeats in batch 1 of 1
identifying long interspersed repeats in batch 1 of 1
identifying ancient repeats in batch 1 of 1
identifying retrovirus-like sequences in batch 1 of 1
identifying tough LINE1s in batch 1 of 1
identifying Simple Repeats in batch 1 of 1

No repetitive sequences were detected in ay001401.fasta
[user@helix ~]$ repeatmasker -e crossmatch ay001401.fasta
RepeatMasker version open-4.0.0
Search Engine: Crossmatch [ 1.090518 ]
Master RepeatMasker Database: /usr/local/apps/RepeatMasker/4.0.0/Libraries/RepeatMaskerLib.embl ( Complete Database: 20120418 )

analyzing file ay001401.fasta
Checking for E. coli insertion elements
identifying Simple Repeats in batch 1 of 1
identifying full-length ALUs in batch 1 of 1
identifying full-length interspersed repeats in batch 1 of 1
identifying remaining ALUs in batch 1 of 1
identifying most interspersed repeats in batch 1 of 1
identifying long interspersed repeats in batch 1 of 1
identifying ancient repeats in batch 1 of 1
identifying retrovirus-like sequences in batch 1 of 1
identifying tough LINE1s in batch 1 of 1
identifying Simple Repeats in batch 1 of 1

No repetitive sequences were detected in ay001401.fasta
helix%
Set up a batch script along the following lines:
#!/bin/bash
cd /data/mydir
repeatmasker -e hmmer -species human myfile.fasta
Submit this job with:
qsub -l nodes=1 myjob.bat
Set up a swarm command file containing one line for each of your RepeatMasker runs. Typically, only the input sequence name will change from line to line, but in the example below, different parameters are being applied to each sequence.
Sample swarm command file
--------file sample.com-------------------------------
repeatmasker -gccalc /data/username/protein1/file1.seq
repeatmasker -s /data/username/protein/file2.seq
repeatmasker -q /data/username/protein/file3.seq
....
------------------------------------------------------
Submit this set of runs to the batch system by typing
swarm -f sample.com
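When only the input file name changes from line to line, the command file can be generated rather than typed. The following is a minimal sketch under stated assumptions: make_swarm_file is a hypothetical helper name, the inputs are assumed to end in .seq and to have no whitespace in their paths, and every run uses the default options.

```shell
#!/bin/bash
# Hypothetical helper: write one repeatmasker command per *.seq file
# found in SEQDIR into the swarm command file OUTFILE.
make_swarm_file() {
    local seqdir="$1" outfile="$2" f
    : > "$outfile"                      # create or truncate the command file
    for f in "$seqdir"/*.seq; do
        [ -e "$f" ] || continue         # skip the literal pattern if nothing matches
        echo "repeatmasker $f" >> "$outfile"
    done
}
```

For example, `make_swarm_file /data/username/protein sample.com` would produce a file that can then be submitted with `swarm -f sample.com`.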
If you have over 1000 repeatmasker commands, they should be bundled with the '-b' flag to swarm. '-b 25' will send 25 of the commands to a single processor, and then submit two such bundles as a single swarm job. This hugely decreases the number of individual jobs and therefore decreases the overhead for such large numbers of small jobs. (More information about swarm options)
Thus, to run repeatmasker on 5000 sequences, you would set up a swarm command file with one line per sequence as above. This file would be submitted to the swarm program using:
swarm -b 50 -f sample.com
swarm will send 50 commands to a single processor, and 50 x 2 = 100 commands as a single batch job to a node. The total number of jobs will be 5000 / 100 = 50 swarm jobs.
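The job-count arithmetic can be checked for other bundle sizes with a few lines of shell. This is an illustrative sketch only: swarm_jobs is a hypothetical helper name, and it assumes, as in the text, that each swarm job carries two bundles.

```shell
#!/bin/bash
# Hypothetical helper: swarm_jobs COMMANDS BUNDLE
# Prints the number of swarm jobs, assuming two bundles per job.
swarm_jobs() {
    local commands="$1" bundle="$2"
    local per_job=$(( bundle * 2 ))                  # commands carried by one job
    echo $(( (commands + per_job - 1) / per_job ))   # integer division, rounded up
}

swarm_jobs 5000 50   # prints 50
```

Rounding up accounts for a final, partly filled job when the number of commands is not an exact multiple of the commands per job.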
As always, jobs can be monitored using the Biowulf cluster monitors. Click on 'List status of running jobs only', and then your username or job number on the resultant page to view your own jobs only, as in the image on the right.
--------file sample.com-------------------------------
repeatmasker -pa 4 -gccalc /data/username/protein1/file1.seq
repeatmasker -pa 4 -s /data/username/protein/file2.seq
repeatmasker -pa 4 -q /data/username/protein/file3.seq
....
------------------------------------------------------
swarm -f sample.com -t 4
This may not be faster than omitting the '-pa' flag and simply running swarm normally. (The Biowulf staff would be interested in any user's comparisons of the two methods; please email us at firstname.lastname@example.org)
%repeatmasker
NAME
    RepeatMasker - Mask repetitive DNA

SYNOPSIS
    RepeatMasker [-options] <seqfiles(s) in fasta format>

DESCRIPTION
    The options are:

    -h(elp)
        Detailed help

    Default settings are for masking all type of repeats in a primate sequence.

    -w(ublast)
        Use WU-blast, rather than cross_match as engine
    -pa(rallel) [number]
        The number of processors to use in parallel (only works for batch files or sequences over 50 kb)
    -s  Slow search; 0-5% more sensitive, 2-3 times slower than default
    -q  Quick search; 5-10% less sensitive, 2-5 times faster than default
    -qq Rush job; about 10% less sensitive, 4->10 times faster than default
        (quick searches are fine under most circumstances)

    repeat options
    -nolow /-low
        Does not mask low_complexity DNA or simple repeats
    -noint /-int
        Only masks low complex/simple repeats (no interspersed repeats)
    -norna
        Does not mask small RNA (pseudo) genes
    -alu
        Only masks Alus (and 7SLRNA, SVA and LTR5)(only for primate DNA)
    -div [number]
        Masks only those repeats < x percent diverged from consensus seq
    -lib [filename]
        Allows use of a custom library (e.g. from another species)
    -cutoff [number]
        Sets cutoff score for masking repeats when using -lib (default 225)
    -species <query species>
        Specify the species or clade of the input sequence. The species name must be a valid NCBI Taxonomy Database species name and be contained in the RepeatMasker repeat database. Some examples are:

          -species human
          -species mouse
          -species rattus
          -species "ciona savignyi"
          -species arabidopsis

        Other commonly used species:

          mammal, carnivore, rodentia, rat, cow, pig, cat, dog, chicken, fugu, danio, "ciona intestinalis" drosophila, anopheles, elegans, diatoaea, artiodactyl, arabidopsis, rice, wheat, and maize

    Contamination options
    -is_only
        Only clips E coli insertion elements out of fasta and .qual files
    -is_clip
        Clips IS elements before analysis (default: IS only reported)
    -no_is
        Skips bacterial insertion element check
    -rodspec
        Only checks for rodent specific repeats (no repeatmasker run)
    -primspec
        Only checks for primate specific repeats (no repeatmasker run)

    Running options
    -gc [number]
        Use matrices calculated for 'number' percentage background GC level
    -gccalc
        RepeatMasker calculates the GC content even for batch files/small seqs
    -frag [number]
        Maximum sequence length masked without fragmenting (default 51000)
    -maxsize [nr]
        Maximum length for which IS- or repeat clipped sequences can be produced (default 4000000). Memory requirements go up with higher maxsize.
    -nocut
        Skips the steps in which repeats are excised
    -noisy
        Prints cross_match progress report to screen (defaults to .stderr file)
    -nopost
        Do not postprocess the results of the run ( i.e. call ProcessRepeats ). NOTE: This options should only be used when ProcessRepeats will be run manually on the results.

    output options
    -dir [directory name]
        Writes output to this directory (default is query file directory, "-dir ." will write to current directory).
    -a(lignments)
        Writes alignments in .align output file; (not working with -wublast)
    -inv
        Alignments are presented in the orientation of the repeat (with option -a)
    -cut
        ***NOT AVAILABLE IN THIS RELEASE*** Saves a sequence (in file.cut) from which full-length repeats are excised
    -small
        Returns complete .masked sequence in lower case
    -xsmall
        Returns repetitive regions in lowercase (rest capitals) rather than masked
    -x  Returns repetitive regions masked with Xs rather than Ns
    -poly
        Reports simple repeats that may be polymorphic (in file.poly)
    -ace
        Creates an additional output file in ACeDB format
    -gff
        Creates an additional Gene Feature Finding format output
    -u  Creates an additional annotation file not processed by ProcessRepeats
    -xm Creates an additional output file in cross_match format (for parsing)
    -fixed
        Creates an (old style) annotation file with fixed width columns
    -no_id
        Leaves out final column with unique ID for each element (was default)
    -e(xcln)
        Calculates repeat densities (in .tbl) excluding runs of >25 Ns in the query

SEE ALSO
    Crossmatch, Blast, MaskerAid

COPYRIGHT
    Copyright 2004 Arian Smit, Institute for Systems Biology

AUTHOR
    Arian Smit (email@example.com)
    Robert Hubley (firstname.lastname@example.org)