eqtlbma on Biowulf

The eQtlBma package implements Bayesian statistical methods to detect eQTLs jointly in multiple subgroups (e.g. tissues). Key features are to borrow information across subgroups, to explicitly model heterogeneity (qualitatively and quantitatively), and to borrow information across genes to estimate hyper-parameters from the data ("empirical Bayes").

[eqtlbma website]

Running a single eqtlbma job on Biowulf

Set up a batch script along the following lines:

#!/bin/bash
# this file is called eqtlbma.bat

module load eqtlbma

cd $PBS_O_WORKDIR

eqtlbma \
--geno file_with_paths_to_genotype_files.txt \
--scoord file_with_SNP_coordinates.bed.gz \
--pheno file_with_paths_to_phenotype_files.txt \
--fcoord file_with_gene_coordinates.bed.gz \
--out output_eqtlbma \
--outss \
--step 3 \
--gridL file_with_large_grid.txt \
--gridS file_with_small_grid.txt \
--bfs all
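
The --geno and --pheno arguments each take a small text file that maps each subgroup (e.g. tissue) to its data file. As a rough sketch, assuming the two-column subgroup/path layout described in the eqtlbma manual (the subgroup names and paths below are placeholders):

tissue1  /data/user/mydir/genotypes_tissue1.txt.gz
tissue2  /data/user/mydir/genotypes_tissue2.txt.gz

The --pheno file follows the same pattern, one expression matrix per subgroup; consult the eqtlbma manual for the exact format expected by your version.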
Submit this job with:
qsub -l nodes=1 eqtlbma.bat
This will submit the job to a node with the default 2 GB of memory. If your job requires more memory, specify this on the qsub command line; e.g., if your job needs 20 GB of memory, submit to a g24 node:
qsub -l nodes=1:g24 eqtlbma.bat
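
Before submitting, it can save a failed run to verify that every data file listed in the mapping files actually exists. A minimal sketch in bash, assuming the two-column layout above (the file name is the same placeholder):

while read -r subgroup path; do
  [ -e "$path" ] || echo "missing: $path"    # flag any dangling path
done < file_with_paths_to_genotype_files.txt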

Running a set of eqtlbma processes

When dealing with many genes (e.g. 20,000) and SNPs (e.g. 5 million), the eqtlbma documentation recommends splitting the analysis into batches and then launching them in parallel.

To split the analysis 100 ways, you need to split all the genes into 100 lists (with 20,000 genes, about 200 genes per list). Starting with the initial BED file, you can do the splitting either on an interactive node or via a batch job; running the splitting on the Biowulf login node is not recommended for large BED files.

biowulf% qsub -l nodes=1 -I
qsub: waiting for job 4897291.biobos to start
qsub: job 4897291.biobos ready

[user@p301 ~]$ cd /data/user/mydir
[user@p301 ~]$ export nbBatches="100"
[user@p301 ~]$ mkdir lists_genes
[user@p301 ~]$ cd lists_genes
[user@p301 ~]$ zcat ../gene_coords.bed.gz | cut -f1-4 | split \
-l $(echo "scale=0; $(zcat ../gene_coords.bed.gz | wc -l)/${nbBatches}" | bc -l) \
--suffix-length=3 --numeric-suffixes=1 --additional-suffix=.bed \
--filter='gzip > $FILE.gz' - list_genes_

[user@p301 ~]$ exit
Job 4897291.biobos ended
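
Before building the swarm file, it is worth confirming that the split produced the expected number of lists and that no genes were lost. These checks are light enough for the login node; the expected counts assume 100 batches:

biowulf% cd /data/user/mydir/lists_genes
biowulf% ls list_genes_*.bed.gz | wc -l       # expect 100 lists
biowulf% zcat list_genes_*.bed.gz | wc -l     # expect the original gene count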

Create a swarm command file with one eqtlbma command per gene list, using commands like the following (or put them in a script):

biowulf% cd /data/user/mydir
biowulf% for i in $(seq -w 1 100); do
echo "eqtlbma \
--geno file_with_paths_to_genotype_files.txt \
--scoord file_with_SNP_coordinates.bed.gz \
--pheno file_with_paths_to_phenotype_files.txt \
--fcoord lists_genes/list_genes_${i}.bed.gz \
--out output_eqtlbma_${i} \
--step 5 \
--outraw \
--qnorm \
--maf 0.05 \
--gridL file_with_large_grid.txt \
--gridS file_with_small_grid.txt \
--bfs all \
--mvlr \
--fitnull \
--nperm 10000 \
--trick 2 \
--pbf all" >> swarmfile.txt
done
biowulf%
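
Because the shell removes the backslash-newline pairs inside the double quotes, each echo writes a single long line, which is the one-command-per-line format swarm expects. A quick check:

biowulf% wc -l swarmfile.txt        # expect 100 commands
biowulf% head -n 1 swarmfile.txt    # spot-check the first command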
Now submit this swarm with
swarm -f swarmfile.txt --module eqtlbma
If each of these commands requires more than 1 GB of memory, specify the required memory with the -g flag; e.g.,
swarm -g 3 -f swarmfile.txt --module eqtlbma
will tell swarm that each command requires 3 GB of memory. You should experiment with a single such job to determine the required memory.
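
One way to run that experiment is to execute the first command from the swarm file under GNU time on an interactive node and read off the peak memory; a sketch:

[user@p301 ~]$ module load eqtlbma
[user@p301 ~]$ /usr/bin/time -v bash -c "$(head -n 1 swarmfile.txt)"

Look for 'Maximum resident set size' near the end of the report and pass a value comfortably above it (in GB) to swarm's -g flag.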

Running an eqtlbma job interactively
For debugging purposes, it may be useful to run the occasional interactive job.

biowulf% qsub -I -l nodes=1:g4
qsub: waiting for job 4897291.biobos to start
qsub: job 4897291.biobos ready

[user@p301 ~]$ cd /data/user/mydir
[user@p301 ~]$ eqtlbma \
--geno file_with_paths_to_genotype_files.txt \
--scoord file_with_SNP_coordinates.bed.gz \
--pheno file_with_paths_to_phenotype_files.txt \
--fcoord lists_genes/list_genes_001.bed.gz \
--out output_eqtlbma_001 \
--step 5 \
--outraw \
--qnorm \
--maf 0.05 \
--gridL file_with_large_grid.txt \
--gridS file_with_small_grid.txt \
--bfs all \
--mvlr \
--fitnull \
--nperm 10000 \
--trick 2 \
--pbf all
[user@p301 ~]$ exit
Job completed
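
The output files all share the prefix passed to --out; the exact set written depends on the options used, so a simple listing is a reasonable first check (the file name here follows the placeholders above):

biowulf% ls -lh output_eqtlbma_001*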

Documentation

Type 'module load eqtlbma' and then 'info eqtlbma'

eqtlbma manual