VCFtools contains a Perl API (Vcf.pm) and a number of Perl scripts that perform common tasks with VCF files, such as file validation, merging, intersecting, and computing complements. The Perl tools support all versions of the VCF specification (3.2, 3.3, and 4.0); nevertheless, users are encouraged to use the latest version, VCFv4.0. VCFtools has mainly been used with diploid data, but the Perl tools aim to support polyploid data as well.
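As an illustration, each of the tasks mentioned above corresponds to one of the Perl scripts. The sketch below uses hypothetical file names (a.vcf.gz, b.vcf.gz); the multi-file tools expect bgzip-compressed, tabix-indexed input:

```shell
module load vcftools                 # puts the Perl scripts and Vcf.pm on PATH/PERL5LIB
vcf-validator a.vcf.gz               # file validation
vcf-merge a.vcf.gz b.vcf.gz > merged.vcf       # file merging
vcf-isec a.vcf.gz b.vcf.gz > shared.vcf        # intersection of two files
vcf-isec -c a.vcf.gz b.vcf.gz > uniq.vcf       # complement: records in a but not b
```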
VCFtools is maintained and developed by Adam Auton, Petr Danecek, and collaborators. See the VCFtools paper for details.
Please note that tabix and bgzip are both installed under the same directory.
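The Perl tools generally expect their input to be bgzip-compressed and tabix-indexed. A minimal sketch, assuming a hypothetical file named input.vcf:

```shell
bgzip input.vcf              # compresses to input.vcf.gz (use bgzip, not plain gzip)
tabix -p vcf input.vcf.gz    # creates the index input.vcf.gz.tbi
```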
It is important that the paths be set up correctly for VCFtools. This can be done by typing 'module load vcftools', as in the example below.
To run bcftools/htslib commands, run 'module load bcftools_htslib' instead.
1. Create a script file containing lines similar to those below. Modify the paths before running. Remember that the two environment variables $PERL5LIB and $PATH must be set correctly first; loading the vcftools module takes care of this.
#!/bin/bash
# This file is vcftools
#
#PBS -N vcftools
#PBS -m be
#PBS -k oe

module load vcftools
cd /data/user/somewhereWithInputfile
vcf-compare inputFile1 inputFile2
2. Submit the script using the 'qsub' command on Biowulf.
qsub -l nodes=1:g4 /data/username/theScriptFileAbove
The job has been submitted to a node with 4 GB of memory ('g4' in the command above). Use 'freen' to see available node types.
Using the 'swarm' utility, one can submit many jobs to the cluster to run concurrently.
The command 'module load vcftools' can be included in the swarm command file, as in the example below. It can also be added to your .bashrc or .cshrc file, and then it will not need to be included in the swarm command file.
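For example, to avoid repeating the command in every swarm command file, the module load can be appended to the shell startup file. A sketch for bash users (csh/tcsh users would edit ~/.cshrc instead):

```shell
# Append the module load to ~/.bashrc so every new shell
# (and therefore every swarm command) finds vcftools on the PATH.
echo 'module load vcftools' >> "$HOME/.bashrc"
```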
Set up a swarm command file (eg /data/username/cmdfile). Here is a sample file:
module load vcftools; cd /data/user/somewhereWithInputfile1; vcf-compare inputFile1 inputFile2
module load vcftools; cd /data/user/somewhereWithInputfile2; vcf-compare inputFile1 inputFile2
module load vcftools; cd /data/user/somewhereWithInputfile3; vcf-compare inputFile1 inputFile2
[....etc....]
module load vcftools; cd /data/user/somewhereWithInputfile15; vcf-compare inputFile1 inputFile2
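A repetitive command file like the one above can also be generated with a short loop; the directory and file names below are the placeholders from the sample file:

```shell
# Write one line per input directory into the swarm command file.
cmdfile=cmdfile
: > "$cmdfile"                     # start with an empty file
for i in $(seq 1 15); do
    echo "module load vcftools; cd /data/user/somewhereWithInputfile$i; vcf-compare inputFile1 inputFile2" >> "$cmdfile"
done
```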
Submit this swarm with:
swarm -f cmdfile
By default, each line of the command file above is executed on one processor core of a node and may use up to 1 GB of memory. If each VCFtools command requires more than 1 GB of memory, specify the memory required with the '-g #' flag to swarm, where # is the number of gigabytes of memory required.
For example, if each of the vcftools commands in the swarm command file above requires 10 GB of memory, then you will need to submit the swarm job with:
biowulf> $ swarm -g 10 -f cmdfile
For more information about running swarm, see the swarm documentation (swarm.html).
Users may sometimes need to run jobs interactively. Such jobs should not be run on the Biowulf login node. Instead, allocate an interactive node as described below and run the interactive job there.
[user@biowulf ~]$ qsub -I -l nodes=1
qsub: waiting for job 2236960.biobos to start
qsub: job 2236960.biobos ready
pXXXX> $ cd /data/user/myruns
pXXXX> $ module load vcftools
pXXXX> $ cd /data/userID/vcftools/run1
pXXXX> $ vcf-compare file1.gz file2.gz
pXXXX> $ ..........
pXXXX> $ exit
qsub: job 2236960.biobos completed
[user@biowulf ~]$
If you want a specific type of node, you can specify that on the qsub command line. For example, to request a node with 24 GB of memory, use 'qsub -I -l nodes=1:g24'.