
RepeatMasker on Biowulf

    RepeatMasker is a program that screens DNA sequences for interspersed repeats and low-complexity DNA sequences. The output of the program is a detailed annotation of the repeats present in the query sequence, as well as a modified version of the query sequence in which all annotated repeats have been masked (default: replaced by Ns). On average, almost 50% of a human genomic DNA sequence will currently be masked by the program. Sequence comparisons in RepeatMasker are performed by the program cross_match, an efficient implementation of the Smith-Waterman-Gotoh algorithm developed by Phil Green. The RepeatMasker program was developed at Washington University by Arian Smit.
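To check how much of a sequence was actually masked, the Ns in the masked output can be counted. The following is a minimal sketch, not part of RepeatMasker: it assumes default N-masking (not -x or -xsmall) and that the masked sequence is a FASTA file (RepeatMasker writes it with a .masked suffix). The function name masked_percent is hypothetical, and any Ns already present in the input are counted as well.

```shell
# Hypothetical helper: percentage of residues that are N in a FASTA file.
# Usage: masked_percent file1.seq.masked
masked_percent() {
    awk '
        /^>/ { next }                      # skip FASTA header lines
        {
            total  += length($0)           # residues on this line (before editing it)
            masked += gsub(/[Nn]/, "")     # gsub returns how many Ns it replaced
        }
        END {
            if (total > 0) printf "%.1f\n", 100 * masked / total
            else print "0.0"
        }
    ' "$1"
}
```

For a typical human genomic sequence the text above suggests a value approaching 50; a much lower value may simply mean the wrong -species setting was used.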

    RepeatMasker documentation

    RepeatMasker uses the Repbase libraries. Users should review the Repbase academic license agreement before using RepeatMasker on the Helix Systems.

    The input for RepeatMasker is a fasta-format sequence file. Multiple sequences can be contained within a single file.

    How to Run RepeatMasker on many sequences

    Use the swarm utility. Set up a swarm command file containing one line for each RepeatMasker run. Sample swarm command file:
    --------file sample.com-------------------------------
    repeatmasker -gccalc /data/username/protein1/file1.seq
    repeatmasker -s /data/username/protein/file2.seq
    repeatmasker -q /data/username/protein/file3.seq
    ....
    ------------------------------------------------------
    

    Submit this set of runs to the batch system by typing

    swarm -f sample.com
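For more than a handful of sequences, the command file can be generated rather than typed by hand. The following is a minimal sketch, assuming all input files sit in one directory and end in .seq. The function name make_swarm_file and the directory layout are illustrative only and are not part of swarm or RepeatMasker.

```shell
# Hypothetical helper: write one repeatmasker command per *.seq file.
# Usage: make_swarm_file INPUT_DIR OUTPUT_FILE [repeatmasker options...]
make_swarm_file() {
    local indir=$1 outfile=$2
    shift 2
    local opts="$*"                  # e.g. "-q" or "-s -gccalc"; may be empty
    : > "$outfile"                   # start with an empty command file
    local f
    for f in "$indir"/*.seq; do
        [ -e "$f" ] || continue      # skip the literal pattern if nothing matches
        # ${opts:+$opts } inserts the options plus a space only when opts is set
        printf 'repeatmasker %s%s\n' "${opts:+$opts }" "$f" >> "$outfile"
    done
}

# Example (assumed path):
#   make_swarm_file /data/username/protein sample.com -q
#   swarm -f sample.com
```

Paths containing spaces would need extra quoting in the generated lines; the sketch does not handle that case.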
    

    If you have over 1000 repeatmasker commands, they should be bundled with the '-b' flag to swarm. '-b 25' will send 25 of the commands to a single processor, and then submit two such bundles as a single swarm job. This greatly reduces the number of individual jobs, and therefore the overhead of handling large numbers of small jobs. (More information about swarm options)

    Thus, to run repeatmasker on 5000 sequences, you would set up a swarm command file with one line per sequence as above. This file would be submitted to the swarm program using:

    swarm -b 50 -f sample.com
    
    swarm will send 50 commands to each processor, so each node receives 50 x 2 = 100 commands as a single batch job. The total number of jobs will be 5000 / 100 = 50 swarm jobs.
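The bundling rule and the job arithmetic above can be sketched as two small shell functions. This is an illustration under the assumptions stated in the text (2 processors per node, bundling recommended above 1000 commands). The function names are hypothetical, and the second function only prints a swarm command; it does not submit anything.

```shell
# Hypothetical helper: number of swarm jobs for NUM_COMMANDS bundled -b BUNDLE_SIZE.
# Assumes 2 processors per node, as described above.
swarm_job_count() {
    local ncmds=$1 bundle=$2
    local per_job=$(( bundle * 2 ))                # commands handled by one node
    echo $(( (ncmds + per_job - 1) / per_job ))    # integer ceiling division
}

# Hypothetical helper: print (not run) a swarm command for a command file,
# adding -b only when the file holds more than 1000 repeatmasker commands.
suggest_swarm_cmd() {
    local file=$1 bundle=$2
    local n
    n=$(grep -c '^repeatmasker ' "$file" || true)  # count command lines
    if [ "$n" -gt 1000 ]; then
        echo "swarm -b $bundle -f $file"
    else
        echo "swarm -f $file"
    fi
}

swarm_job_count 5000 50    # the example above; prints 50
```

When the number of commands is not an exact multiple of the commands per node, the last job is only partly filled, which is why the count is rounded up.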

    As always, jobs can be monitored using the Biowulf cluster monitors. Click on 'List status of running jobs only', and then on your username or job number on the resulting page to view only your own jobs.


    Parallelization: RepeatMasker has its own parallelization option, the -pa(rallel) parameter. This can parallelize a job across the 2 processors in one node, but not across multiple nodes. Thus, when running RepeatMasker on the Biowulf cluster, the maximum value for this parameter should be 2. Since each job will then use both processors of the node, swarm must be set to run only one job per node so that the node does not get overloaded.
    --------file sample.com-------------------------------
    repeatmasker -pa 2 -gccalc /data/username/protein1/file1.seq
    repeatmasker -pa 2 -s /data/username/protein/file2.seq
    repeatmasker -pa 2 -q /data/username/protein/file3.seq
    ....
    ------------------------------------------------------
    
    Submit these jobs by typing
    swarm -f sample.com -n 1
    
    This approach may not be faster than omitting the '-pa' flag and simply running swarm normally. (The Biowulf staff would be interested in any user's comparisons of the two methods; please email us at staff@helix.nih.gov)
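To compare the two methods on the same inputs, an existing command file can be converted to the '-pa' form with a short filter instead of being edited by hand. The following is a minimal sketch; the function name add_pa_flag and the output file name are hypothetical, and lines that already contain '-pa' are not detected.

```shell
# Hypothetical helper: add "-pa N" to every repeatmasker line read from stdin.
# Usage: add_pa_flag 2 < sample.com > sample_pa.com
add_pa_flag() {
    local n=$1
    # Lines that do not start with "repeatmasker " pass through unchanged.
    sed "s/^\([[:space:]]*repeatmasker\) /\1 -pa $n /"
}

# The converted file would then be submitted one job per node, as described above:
#   swarm -f sample_pa.com -n 1
```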


    RepeatMasker options

    Typing 'repeatmasker' at the biobos prompt produces the following brief description of the RepeatMasker options. More information can be obtained by typing 'repeatmasker -h'.

    %repeatmasker
    NAME
        RepeatMasker - Mask repetitive DNA
    
    SYNOPSIS
          RepeatMasker [-options] 
    
    DESCRIPTION
        The options are:
    
        -h(elp)
            Detailed help
    
        Default settings are for masking all type of repeats in a primate
        sequence.
    
        -w(ublast)
            Use WU-blast, rather than cross_match as engine
    
        -pa(rallel) [number]
            The number of processors to use in parallel (only works for batch
            files or sequences over 50 kb)
    
        -s  Slow search; 0-5% more sensitive, 2-3 times slower than default
    
        -q  Quick search; 5-10% less sensitive, 2-5 times faster than default
    
        -qq Rush job; about 10% less sensitive, 4->10 times faster than default
            (quick searches are fine under most circumstances) repeat options
    
        -nolow /-low
            Does not mask low_complexity DNA or simple repeats
    
        -noint /-int
            Only masks low complex/simple repeats (no interspersed repeats)
    
        -norna
            Does not mask small RNA (pseudo) genes
    
        -alu
            Only masks Alus (and 7SLRNA, SVA and LTR5)(only for primate DNA)
    
        -div [number]
            Masks only those repeats < x percent diverged from consensus seq
    
        -lib [filename]
            Allows use of a custom library (e.g. from another species)
    
        -cutoff [number]
            Sets cutoff score for masking repeats when using -lib (default 225)
    
        -species 
            Specify the species or clade of the input sequence. The species name
            must be a valid NCBI Taxonomy Database species name and be contained
            in the RepeatMasker repeat database. Some examples are:
    
              -species human
              -species mouse
              -species rattus
              -species "ciona savignyi"
              -species arabidopsis
    
            Other commonly used species:
    
            mammal, carnivore, rodentia, rat, cow, pig, cat, dog, chicken, fugu,
            danio, "ciona intestinalis" drosophila, anopheles, elegans,
            diatoaea, artiodactyl, arabidopsis, rice, wheat, and maize
    
        Contamination options
    
        -is_only
            Only clips E coli insertion elements out of fasta and .qual files
    
        -is_clip
            Clips IS elements before analysis (default: IS only reported)
    
        -no_is
            Skips bacterial insertion element check
    
        -rodspec
            Only checks for rodent specific repeats (no repeatmasker run)
    
        -primspec
            Only checks for primate specific repeats (no repeatmasker run)
    
        Running options
    
        -gc [number]
            Use matrices calculated for 'number' percentage background GC level
    
        -gccalc
            RepeatMasker calculates the GC content even for batch files/small
            seqs
    
        -frag [number]
            Maximum sequence length masked without fragmenting (default 51000)
    
        -maxsize [nr]
            Maximum length for which IS- or repeat clipped sequences can be
            produced (default 4000000). Memory requirements go up with higher
            maxsize.
    
        -nocut
            Skips the steps in which repeats are excised
    
        -noisy
            Prints cross_match progress report to screen (defaults to .stderr
            file)
    
        -nopost
            Do not postprocess the results of the run ( i.e. call ProcessRepeats
            ). NOTE: This options should only be used when ProcessRepeats will
            be run manually on the results.
    
        output options
    
        -dir [directory name]
            Writes output to this directory (default is query file directory,
            "-dir ." will write to current directory).
    
        -a(lignments)
            Writes alignments in .align output file; (not working with -wublast)
    
        -inv
            Alignments are presented in the orientation of the repeat (with option -a)
    
        -cut ***NOT AVAILABLE IN THIS RELEASE***
            Saves a sequence (in file.cut) from which full-length repeats are excised
    
        -small
            Returns complete .masked sequence in lower case
    
        -xsmall
            Returns repetitive regions in lowercase (rest capitals) rather than masked
    
        -x  Returns repetitive regions masked with Xs rather than Ns
    
        -poly
            Reports simple repeats that may be polymorphic (in file.poly)
    
        -ace
            Creates an additional output file in ACeDB format
    
        -gff
            Creates an additional Gene Feature Finding format output
    
        -u  
            Creates an additional annotation file not processed by ProcessRepeats
    
        -xm 
            Creates an additional output file in cross_match format (for parsing)
    
        -fixed
            Creates an (old style) annotation file with fixed width columns
    
        -no_id
            Leaves out final column with unique ID for each element (was default)
    
        -e(xcln)
            Calculates repeat densities (in .tbl) excluding runs of >25 Ns in the query
    
    SEE ALSO
            Crossmatch, Blast, MaskerAid
    
    COPYRIGHT
        Copyright 2004 Arian Smit, Institute for Systems Biology
    
    AUTHOR
        Arian Smit 
        Robert Hubley 
    

This document is available as http://biowulf.nih.gov/apps/repeatmasker/index.html