NIH Helix Systems
Steven Fellini
sfellini@nih.gov
CIT
1 Dec 2005
This page is at
http://biowulf.nih.gov/easy.html
The Biowulf Home Page is at
http://biowulf.nih.gov
| Helix (SGI) | Biowulf (cluster) |
| one computer system with CPUs, memory and disks | many systems (nodes) |
| proprietary hardware and software | commodity hardware and open software (Linux) |
| moderate number of CPUs (8-32) | 2000+ CPUs |
| shared memory | distributed memory |
| large memory (8-32 GB) | smaller memory (1-4 GB) |
| computation on login system | computation on computational nodes |
| system runs several applications simultaneously | node dedicated to one computation |
| interactive | queuing system (batch) |
| nodes (2p) | processors | memory | networks |
| 805 | AMD Opteron 2.8, 2.2 & 2.0 GHz | 2 & 4 GB | Infiniband, Myrinet & Gigabit ethernet |
| 388 | Intel Xeon 2.8 GHz | 1, 2 & 4 GB | Myrinet, Gigabit ethernet & Fast ethernet |
| 203 | AMD Athlon 1.8 & 1.4 GHz | 1 & 2 GB | Myrinet & Fast ethernet |
| Location | Type | Creation | Backups | Performance | Amount of Space | Accessible from (*) |
| /home | network (NFS) | with Biowulf account | yes | high | 200 MB (quota) | B,C |
| /scratch (nodes) | local | created by user | no | best | 6 - 30 GB dedicated while node is allocated | C |
| /scratch (biowulf) | network (NFS) | created by user | no | low | 120 GB shared | B,H,N |
| /data | network (NFS) | with Biowulf account | yes | high | based on quota (48 GB default) | B,C,H,N |
$ ls -l foo.tmp
-rw-r--r-- 1 steve wheel 2 Mar 11 2005 foo.tmp
[steve@biobos steve]$ rm foo.tmp
[steve@biobos steve]$ ls -l foo.tmp
ls: foo.tmp: No such file or directory
[steve@biobos steve]$ cd .snapshot
[steve@biobos .snapshot]$ ls
_hourly.0  _hourly.3  _nightly.0   _nightly.11  _nightly.2  _nightly.5  _nightly.8  _weekly.1
_hourly.1  _hourly.4  _nightly.1   _nightly.12  _nightly.3  _nightly.6  _nightly.9  _weekly.2
_hourly.2  _hourly.5  _nightly.10  _nightly.13  _nightly.4  _nightly.7  _weekly.0  _weekly.3
[steve@biobos .snapshot]$ cd _nightly.0
[steve@biobos _nightly.0]$ ls -l foo.tmp
-rw-r--r-- 1 steve wheel 2 Mar 11 2005 foo.tmp
[steve@biobos _nightly.0]$ cp foo.tmp /home/steve
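The manual recovery in the session above can be wrapped in a small shell function. This is only a sketch: `restore_from_snapshot` is a hypothetical helper, not a Biowulf command, and it assumes the layout seen above, where each snapshot is a subdirectory of `.snapshot` holding a copy of the file under the same name.

```shell
#!/bin/bash
# Sketch: copy the most recently modified snapshot copy of a file back into place.
# restore_from_snapshot is a hypothetical helper, not a Biowulf command.
restore_from_snapshot() {
    local name="$1"                   # file name as it appears inside each snapshot
    local snapdir="${2:-.snapshot}"   # assumed layout: $snapdir/<snapshot>/<file>
    local dest="${3:-.}"              # directory that receives the recovered copy
    local newest="" candidate

    for candidate in "$snapdir"/*/"$name"; do
        [ -f "$candidate" ] || continue
        # -nt compares modification times; keep the newest copy seen so far.
        # If the file never changed, all copies tie and the first match is kept.
        if [ -z "$newest" ] || [ "$candidate" -nt "$newest" ]; then
            newest="$candidate"
        fi
    done

    if [ -z "$newest" ]; then
        echo "no snapshot contains $name" >&2
        return 1
    fi
    cp -p "$newest" "$dest/" && echo "restored from $newest"
}
```

From the directory that held the file, `restore_from_snapshot foo.tmp` would copy the newest available version back into the current directory.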
Not Suitable:
Phylogenetic/Linkage Analysis
Open a connection to biowulf.nih.gov (or helix.nih.gov)
Change directory to /data/username/
Put your files into that directory.
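Once the files are in place, it can help to confirm that the directory holds what you expect before starting a large job. The sketch below assumes the input files are in FASTA format (header lines beginning with `>`); `count_fasta_records` is a hypothetical helper name, and the check is not something EasyBlast requires.

```shell
#!/bin/bash
# Sketch: report how many FASTA records each file in a directory holds.
# count_fasta_records is a hypothetical helper name.
count_fasta_records() {
    local dir="$1" f n total=0
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        # FASTA header lines begin with '>'; grep -c prints the count
        # (0 if none, in which case grep exits non-zero, hence "|| true").
        n=$(grep -c '^>' "$f" || true)
        printf '%s\t%s\n' "$n" "$f"
        total=$((total + n))
    done
    printf '%s\ttotal\n' "$total"
}
```

For example, `count_fasta_records /data/username/blast/myseqs` prints one line per file followed by a total.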
biobos% easyblast
EasyBlast: Blast for large numbers of sequences
Enter the directory which contains your input sequences: /data/username/blast/myseqs
Enter the directory where you want your Blast output to go: /data/username/blast/results
** WARNING: There are already files in /data/username/blast/results which will be deleted by this job.
** Continue? (y/n) :y
BLAST programs:
blastn - nucleotide query sequence against nucleotide database
blastp - protein query sequence against protein database
blastx - nucleotide query translated in all 6 reading frames
against a protein database
tblastn - protein query sequence against a nucleotide database
translated in all 6 reading frames
tblastx - 6-frame translations of a nucleotide query sequence
against the 6-frame translations of a nucleotide database
blastpgp - PSI-BLAST protein query against protein database
Which program do you want to run: blastn
The following nucleotide databases are available:
(or enter your own database with full pathname)
nt - all nonredundant Genbank+EMBL+DDBJ+PDB (no EST, STS, GSS or HTG)
hs_genome - human genome assembly (Build 33, 14 Apr 2003)
est_human - nonredundant Genbank+EMBL+DDBJ EST human sequences
est_mouse - nonredundant Genbank+EMBL+DDBJ EST mouse sequences
est_others - nonredundant Genbank+EMBL+DDBJ EST all other organisms
patnt - from the patent division of Genbank
pdbnt - from the 3-dimensional structures
htgs - high throughput genome sequences
ecoli.nt - ecoli genomic sequences
mito.nt - mitochondrial sequences
yeast.nt - yeast (Saccharomyces cerevisiae) genomic sequences
drosoph.nt - drosophila sequences
hs.fna - RefSeq human sequences
other_genomic - non-human genomic sequences
mouse_genome - mouse genome
mouse_masked - mouse genome, masked
Database to run against: nt
Want a summary file in the output directory? (y/n, default y) : n
http://biowulf.nih.gov/apps/blast.html has a full list of available parameters.
Any additional Blast parameters (e.g. -v 10):
Checking node situation....
Submitting to 20 nodes. Job number is 709061.biobos
Monitor your job at http://biowulf.nih.gov/cgi-bin/queuemon?709061.biobos
Monitoring your job
Use the URL that EasyBlast gives you to watch your job run. You can see the full range of Biowulf monitors at http://biowulf.nih.gov/sysmon/
Blue: no load
Green: load ~ 1
Yellow: load ~ 2 (i.e. fully utilized)
Red: load > 2 (problem?)
What to expect:
Contact the Helix staff (staff@helix.nih.gov or 4-6248) if you have questions or concerns about your job.
What EasyBlast does
Note: To run against your own database, enter its full pathname, e.g.
Database to run against: /data/username/blast_db/my_own_db
where my_own_db is a Blast database formatted with the formatdb program (available in /usr/local/blast/formatdb).
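As a rough illustration of the formatting step, the sketch below builds a formatdb command line for a nucleotide database. `formatdb_cmd` is a hypothetical helper and `myseqs.fasta` in the usage note is a made-up input file; the options used are the legacy NCBI formatdb ones (`-i` input file, `-p F` for nucleotide, `-n` database base name). The function only prints the command, so nothing is run until you choose to.

```shell
#!/bin/bash
# Sketch: build a formatdb command for a nucleotide database.
# formatdb_cmd is a hypothetical helper; the formatdb path is the one quoted above.
FORMATDB=/usr/local/blast/formatdb

formatdb_cmd() {
    local fasta="$1" dbname="$2"
    # -i input FASTA file, -p F = nucleotide (T would be protein),
    # -n base name of the database files that formatdb writes
    printf '%s -i %s -p F -n %s\n' "$FORMATDB" "$fasta" "$dbname"
}
```

`formatdb_cmd myseqs.fasta /data/username/blast_db/my_own_db` prints the command line; piping that output to `sh` would execute it.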
Blastn
Query: 1000 nucleotide EST sequences
Database: NCBI nt nucleotide database (1,431,631 sequences, 1.8 Gb)
Serial Blast runs with GCG's Netblast against NCBI server: ~18 hrs
Silicon Graphics R14000, 4 processors (nimbus.nih.gov): 5.5 hrs
10 Biowulf p2800 nodes: 33 mins
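To put these timings in proportion, the speedup is simply the baseline time divided by the Biowulf time. The short sketch below does that arithmetic with the figures quoted above, converted to minutes; `speedup` is a hypothetical helper name.

```shell
#!/bin/bash
# speedup BASELINE_MINUTES CLUSTER_MINUTES -> ratio with one decimal place
speedup() {
    awk -v base="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", base / fast }'
}

speedup 1080 33   # Netblast (~18 hrs = 1080 min) vs. 10 Biowulf nodes -> 32.7
speedup 330 33    # SGI, 4 processors (5.5 hrs = 330 min) vs. 10 Biowulf nodes -> 10.0
```

So the 10-node run is roughly 33 times faster than the serial Netblast runs and 10 times faster than the 4-processor SGI system for this query set.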
Repeatmasker is NOT parallelized on Biowulf. The advantage of using the Biowulf cluster is that you can run Repeatmasker on a large number of sequences at the same time.
repeatmasker NM_00110*
repeatmasker NM_00111*
repeatmasker NM_00112*
repeatmasker NM_00113*
repeatmasker NM_00114*
repeatmasker NM_00115*
repeatmasker NM_00116*
repeatmasker NM_00117*
repeatmasker NM_00118*
repeatmasker NM_00119*
swarm -f swarmcmd
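A command file like the one above does not have to be typed by hand. The following sketch writes one `repeatmasker` line per accession prefix; `make_swarmcmd` is a hypothetical helper, and the prefixes are simply the ones used in this example.

```shell
#!/bin/bash
# Sketch: write one "repeatmasker <prefix>*" line per accession prefix.
# make_swarmcmd is a hypothetical helper, not part of swarm.
make_swarmcmd() {
    local out="$1"; shift
    local prefix
    : > "$out"                      # start with an empty command file
    for prefix in "$@"; do
        # The trailing * is written literally, so it is expanded when the
        # command runs, not when this file is generated.
        printf 'repeatmasker %s*\n' "$prefix" >> "$out"
    done
}
```

In bash, `make_swarmcmd swarmcmd NM_0011{0..9}` reproduces the ten-line file shown above, which can then be submitted with `swarm -f swarmcmd`.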
#
# this file is cmdfile
#
myprog -param a < infile-a > outfile-a
myprog -param b < infile-b > outfile-b
myprog -param c < infile-c > outfile-c
myprog -param d < infile-d > outfile-d
myprog -param e < infile-e > outfile-e
myprog -param f < infile-f > outfile-f
myprog -param g < infile-g > outfile-g
2. Submit the job via the 'swarm' command.
swarm -f cmdfile
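Because every line of the command file runs as a separate job, a mistyped input name is only discovered after submission. One way to catch this earlier is sketched below: it lists any file named after a `<` redirection that does not exist. `check_cmdfile_inputs` is a hypothetical helper, not part of swarm, and it assumes the redirections are written with spaces around `<` (as in the example above) and that paths are relative to the current directory.

```shell
#!/bin/bash
# Sketch: report input files named after "<" in a command file that do not exist.
# check_cmdfile_inputs is a hypothetical helper, not part of swarm.
check_cmdfile_inputs() {
    local cmdfile="$1" missing
    # Skip comment lines, print the word following each "<",
    # and keep only the names that are not existing files.
    missing=$(awk '/^[ \t]*#/ { next }
                   { for (i = 1; i < NF; i++) if ($i == "<") print $(i + 1) }' "$cmdfile" |
        while read -r infile; do
            [ -f "$infile" ] || printf '%s\n' "$infile"
        done)
    if [ -n "$missing" ]; then
        echo "missing input files:" >&2
        echo "$missing" >&2
        return 1
    fi
}
```

A typical use would be `check_cmdfile_inputs cmdfile && swarm -f cmdfile`, so the job is submitted only when every input is present.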
#!/bin/bash -v
# This file name is my_script
#
#PBS -N run1
#PBS -m be
#PBS -k oe
# Note: np is not set in this script; it must already be defined in the job's environment.
PATH=/usr/local/mpich/bin:$PATH; export PATH
mpirun -machinefile $PBS_NODEFILE -np $np ght < test.in
qsub -l nodes=1 my_script
You can test the job interactively:
biobos% qsub -I -l nodes=1
qsub: waiting for job 664776.biobos to start
qsub: job 664776.biobos ready
[user@p2 ~]$ cd /data/username/mydir
[user@p2 mydir]$ setenv PATH /usr/local/mpich/bin:$PATH
[user@p2 mydir]$ mpirun -machinefile $PBS_NODEFILE -np 1 /usr/local/bin/ght < test.in
************************************************************************
*                                                                      *
*       GENEHUNTER-TWOLOCUS - A modified version of GENEHUNTER         *
*                          (version 1.3)                               *
*                                                                      *
************************************************************************
Type 'help' or '?' for help.
Can't find help file - detailed help information is not available.
See installation instructions for details.
running on 1 nodes
npl:1> 'photo' is on: file is 'two02.out'
npl:2> Fri Nov 25 13:54:28 2005
npl:3> Single point mode is now 'off'
npl:4> Count recs is now 'off'
npl:5> Haplotype output is now 'off'
npl:6> Unaffected children are now used.
npl:7> Currently analyzing a maximum of 9 bits per pedigree
npl:8> Large pedigrees are now used but trimmed.
npl:9> The current analysis type is 'BOTH'
[...]
[user@p2 /data/user/mydir] exit
logout
qsub: job 668223.biobos completed
[susanc@biobos ~]$
You must exit the node after an interactive run!
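The batch script shown earlier passes `$np` to mpirun without setting it. One possible way to fill that in, sketched here under assumptions, is to count the entries in the file that PBS names in `$PBS_NODEFILE`. `count_procs` is a hypothetical helper, and whether one entry corresponds to a node or to a processor depends on how PBS is configured at the site, so treat the result as a starting point and not as the recommended Biowulf setting.

```shell
#!/bin/bash
# Sketch: derive an MPI process count from the PBS node file.
# count_procs is a hypothetical helper; PBS sets PBS_NODEFILE inside a job.
count_procs() {
    local nodefile="$1"
    # Count non-empty lines; "n + 0" makes an empty file print 0.
    awk 'NF { n++ } END { print n + 0 }' "$nodefile"
}
```

Inside a job script this could be used as `np=$(count_procs "$PBS_NODEFILE")` ahead of the mpirun line. For the single-node interactive session above it would yield 1 if the node file has one entry, matching the `-np 1` used there.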