Biowulf at the NIH
RSS Feed
Biowulf User Guide

About:

Image of biowulf cluster

The NIH Biowulf cluster is a GNU/Linux parallel processing system designed and built at the National Institutes of Health and managed by the Helix Systems Staff. Biowulf consists of a main login node and 2300 compute nodes with a combined processor core count of over 12000. The computational nodes are connected to high-speed networks and have access to high-performance fileservers.


Citing Biowulf and Scientific Publications

The continued growth and support of NIH's Biowulf cluster is dependent upon its demonstrable value to the intramural program. In the past we have been successful in obtaining support by citing published work which involved the use of our systems.

Accounts

Biowulf accounts require a pre-existing Helix account. Once you have a Helix account, you can apply for a Biowulf account by filling out the Biowulf account request form. You will need to authenticate with your NIH login name and password, and your Principal Investigator (PI) will need to approve your request before the account is set up.

Biowulf accounts not accessed within a 60 day period are automatically locked as a security precaution. A user can unlock their own account in this situation by going HERE (restricted to NIH) and authenticating with their NIH login name and password. If this is not possible, send email to staff@helix.nih.gov

Logging in

The Biowulf system is accessed from the NIH network via ssh to biowulf.nih.gov. We recommend the PuTTY SSH client for Windows users as it is free, feature-rich, and widely used. Mac and Unix/Linux users can use the ssh client available from the command line in a terminal window.

Xwindows connection software is also recommended for GUI/visualization applications that are run on the login or computational nodes (e.g. x-povray or interactive SAS) display on your desktop workstation. We provide free Xwindows software for Macs and PCs to Helix Systems users.

Password Information

Users log in to Helix and Biowulf using their NIH domain username and password.

Shells

The Bourne-Again SHell (bash) is the default shell on most Linux systems including Biowulf. Other available shells, including csh and tcsh, are listed in /etc/shells. To change your default shell, use the command:

chsh -s shell_name

where shell_name is the full pathname of the desired shell (e.g., /bin/csh).

Home directories

Biowulf home directories are in a shared NFS (Network File System) filespace, therefore the access to files in your home directory is identical from any node in Biowulf. Note that your Biowulf home directory is the same directory as your Helix home directory.

Mail

E-mail can be sent from any node on Biowulf. However, all mail sent out from Biowulf will have a return address of helix.nih.gov (rather than the address of the node). In addition, mail sent to a user on a Biowulf node will be automatically forwarded to that user on helix.nih.gov. In other words, you can send, but not receive mail on Biowulf.

If you do not read mail on helix, be sure to set up a .forward file in your helix home directory.

You can communicate with the Biowulf/Helix staff by sending email to: staff@biowulf.nih.gov

Biowulf Hardware Configuration

Diagram of Biowulf's network

Node configurations:

# of nodes processor cores per node memory network
24
16 x 2.6 GHz Intel E5-2670
hyperthreading enabled
20 MB secondary cache
256 GB
1 Gb/s Ethernet
328
12 x 2.8 GHz Intel X5660
hyperthreading enabled
12 MB secondary cache
24 GB
1 Gb/s Ethernet
352
8 x 2.67 GHz Intel X5550
hyperthreading enabled
8 MB secondary cache
320 x 24 GB
32 x 72 GB
1 Gb/s Ethernet
250
8 x 2.8 GHz Intel E5462
12 MB secondary cache
8 GB
1 Gb/s Ethernet
16 Gb/s Infiniband
1
32 x 2.5 GHz AMD 0pteron 8380
512 KB secondary cache
512 GB
1 Gb/s Ethernet
232
4 x 2.8 GHz AMD Opteron 290
1 MB secondary cache
8 GB 1 Gb/s Ethernet
471
2 x 2.8 GHz AMD Opteron 254
1 MB secondary cache
40 x 8 GB
226 x 4 GB
91 x 2 GB
1 Gb/s Ethernet
160 x 8 Gb/s Infinipath
289
4 x 2.6 GHz AMD Opteron 285
1 MB secondary cache
8 GB 1 Gb/s Ethernet
389
2 x 2.2 GHz AMD Opteron 248
1 MB secondary cache
129 x 2 GB
66 x 4 GB
1 Gb/s ethernet
91
2 x 2.0 GHz AMD Opteron 246
1 MB secondary cache
48 x 1 GB
43 x 2 GB
1 Gb/s Ethernet

All nodes are connected to a 1 Gb/s switched Ethernet network while sub-sets of nodes are on high-performance Infinipath or Infiniband networks. The hostnames for these interfaces are p2 through p2576. The hostname for biowulf's internal interface is biowulf-e0 (i.e., nodes wishing to communicate with the login machine, should use the name biowulf-e0 rather than biowulf).

The ethernet switches are Brocade (Foundry) FCX 648S, and FESx 448 switches. The backbone switch is a Brocade (Foundry) RX-32 using 10 Gb/s fiber links to the FCX/FESxs edge switches.

For those applications that can benefit from increased bandwidth and lower latency than Ethernet, there are two high performance switched networks available:

The login node (biowulf.nih.gov) is an HP DL380 (two 2.67 GHz Xeon X5550, eight cores) with 24 GB RAM. It is a minimal system designed for log-in, light-weight tasks, 64-bit compiling/development and the submission of jobs to the batch system for runs on the cluster. No compute intensive jobs are permitted on the login node.

Disk Storage

There are several options for disk storage on Biowulf; please review this section carefully to decide where to place your data. Contact the Biowulf systems staff if you have any questions.

Except where noted, there are no quotas, time limits or other restrictions placed on the use of space on the Biowulf system, but please use the space responsibly; even hundreds of terabytes won't last forever if files are never deleted. Disk space on the Biowulf system should never be used as archival storage.

Users who require more than the default disk storage quota should fill out the online storage request form.

Summary of file storage options
  Location Creation Backups Speed Space Available from
/home network (NFS) with Helix account yes high 8 GB default quota B,C,H
/scratch (nodes) local created by user no medium 30 - 130 GB dedicated while node is allocated (**) C
/scratch (biowulf) network (NFS) created by user no low 500 GB shared B,H
/data network (NFS) with Biowulf account yes high 100 GB default quota B,C,H
H = helix, B = biowulf login node, C = biowulf compute nodes
(**) 3 TB on DL-785

/scratch (nodes)

Each Biowulf node has a directly attached disk containing a /scratch filesystem. Note that scratch space is not backed up to tape, and thus, users should store any programs and data of importance in their home directories. Use of /scratch on the batch nodes should be for the duration of your job only. It is your job's responsibility to check for sufficient disk space. Your job may delete any and all files from /scratch to make space available (use the clearscratch command). Please use /scratch instead of /tmp for storage of temporary files.

/scratch on the login node (biowulf.nih.gov) is actually a shared (NFS) filesystem, accessible from both the login node and Helix. Files on this filesystem which have not been accessed for 14 days are automatically deleted by the system.

/data

These are RAID-6 filesystems mounted over NFS or GPFS from one of the following, all of which are configured for high availability: eight NetApp FAS8040 controllers, a DataDirect Networks S2A9900 storage system with eight fileservers, a DataDirect Networks SFA10K storage system with eight fileservers and two DataDirect Networks SFA12K storage systems with eight fileservers each. This storage offers high performance NFS/GPFS access, and is exported to Biowulf over a dedicated high-speed network. /data is accessible from all computational nodes as well as Biowulf and Helix, and will be the filesystem of choice for most users to store their large datasets. Biowulf users are assigned an initial quota of 100 GB on /data; please contact the Biowulf staff if you need to increase your quota.

Note: your /data directory is actually physically located on filesystems named /spin1, /gs1, /gs2, /gs3 or /gs4. The /data directory consists of links to one of those filesystems. ALWAYS refer to your data directory through the /data links as opposed to the physical location because the physical location is subject to change based on administrator needs. In other words, use /data/yourname rather than (for example) /gs1/users/yourname in your scripts.

In addition to all data being regularly backed up to tape, snapshots are available that allow users to recover files that have been inadvertently deleted. For more information on backups and snapshots, please refer to the File Backups section on the Helix website.

If a user's data directory is located on the /spin1 storage area, it is possible that the size of snapshots in the user's data directory may cause a user to have less available storage space than he believes he should.

Each data directory on /spin1 has its own snapshot space that is separate from a user's regular file or data space. However, if a user deletes too much data within a short period of time (a week or less), it is possible that the snapshot space for that data directory can be exceeded. When this occurs, a user's regular file space (allocated quota space) will be used as additional snapshot space. Unfortunately there is no way for a user to either determine whether he has encountered this situation or to actually delete his snapshots. However, if a user does not see an expected increase in available space in the output of the 'checkquota' command following the cleanup of large files, the possibility exists that he has encountered this situation and should contact the Helix/Biowulf staff for assistance.

GPFS

Users having access to a data directory on /gs2, /gs3 or /gs4 MUST use the 'gpfs' resource when submitting jobs via 'qsub' or 'swarm' or when allocating a node to themselves. Data directories on /gs2, /gs3 or /gs4 are only accessible on Helix, Helixdrive, the Biowulf login nodes, the "e2666" and "x2800" nodes, and the high memory g24/g72/g256/VLM (very large memory) nodes.

To submit jobs via 'swarm' with the "gpfs" resource:

% swarm -R gpfs ...additional swarm options...

To submit jobs via 'qsub' with the "gpfs" resource:

% qsub -l nodes=1:gpfs ...additional qsub options...

The "gpfs" resource is NOT compatible with the older o2000, o2200, o2600 or o2800 nodes.

Checking your disk storage usage

Use the checkquota command to determine how much disk space you are using:

$ checkquota

Mount                   Used      Quota  Percent    Files    Limit 
/data:               70.4 GB   100.0 GB   70.41%   307411  6225917
/data(SharedDir):    11.4 GB    20.0 GB   56.95%     6273  1245180
/home:                2.0 GB     8.0 GB   25.19%    11125      n/a
mailbox:             74.7 MB     1.0 GB    7.29%         

 

Using the Biowulf Cluster

The Biowulf Cluster is highly heterogeneous with respect to processor speed, memory size and networks. Nodes can contain between 2 and 32 processor cores with memory sizes from 1G to 512G, on message-passing networks that range from very fast (16 Gb/s) to moderate (1 Gb/s) in bandwidth. This can initially create confusion on the part of users in deciding on what nodes the need to run their work. However, once the cluster hardware is understood it should be quite easy for users to ascertain what nodes are suitable for their purpose without over-using resources or denying resources to others.

Here are some common errors that beginners make when using the Biowulf cluster:

Common Terms:

Clarification of terms used in this guide.

Node categories on the Biowulf cluster:

Using the Batch System on Biowulf

Batch (or queueing) systems allow the user to submit jobs to the cluster without regard for what resources (cpu, memory, networks) are available. When those resources become available the batch system will start the job. There are many batch systems available; the one used on the Biowulf cluster is Altair's PBS Pro.

Submitting Jobs

In order to run a job under batch you must submit a script which contains the commands to be executed. The command to do this is qsub. The qsub command on Biowulf is heavily customized and this documentation should be used in lieu of that provided by Altair. The batch script (or batch command file) is simply a Linux shell script with optional commands to PBS (called directives).

Sample Batch Script

A very minimal batch script example:

#!/bin/bash
#
# this file is myjob.sh
#
#PBS -N MyJob
#PBS -m be
#
cd $PBS_O_WORKDIR
myprog < /data/me/mydata

One could simply run this as a shell script on the command line, however when executed by the batch system, the lines beginning with #PBS have special meaning and are PBS directives. These directives can also be used on the qsub command line. The most commonly used PBS directives are:

-N: name of the batch job.
-m: send mail when the job begins ("b") and when it ends ("e").
-k: keep the STDOUT ("o") and STDERR ("e") files in the /home/user directory. By default, these files will appear in the working directory. 
-W: dependencies. Set one job to complete only after one or more jobs start or complete. This is a complex directive; type 'man qsub' to read the man pages.

The environment variable $PBS_O_WORKDIR is the directory which was the current directory when the qsub command was issued.

Note that if the "myprog" program above ordinarily sends its output (STDOUT) to the terminal, that output will appear in a file called MyJob.o<jobnumber>. Any errors (STDERR) will appear as MyJob.e<jobnumber>.

Submitting a Job with qsub

The simplest form of the qsub command is:

% qsub -l nodes=1 myjob.bat

The resource list (-l) here consists of the number of nodes (1). You must specify a number of nodes, even if it is one.

The batch system allocation is done by nodes (not processors). When you are allocated a node by the batch system, your job is the only job running on the node. Since a node may contain between 2 and 24 processing cores, you should ensure that you utilize the node fully (see the swarm program below).

Parallel jobs running on more than 2 cpus will need to specify more than 1 node. Additionally, since you will want a parallel job running on nodes of identical clock speed, you will probably want to specify a node type as well:

qsub -l nodes=8:o2800 myparalleljob

Node types currently supported are (from fastest to slowest):

x2800   2.8 GHz Xeon
e2666   2.67 GHz Xeon
o2800   2.8 GHz Opteron
o2600   2.6 GHz Opteron
o2200   2.2 GHz Opteron
o2000   2.0 GHz Opteron

Nodes can also be selected based on memory size. There are two ways to select nodes based on memory

mN: the amount of memory per processor core, where N is 1, 2, or 4
gN: the total amount of memory in a node, where N is 4, 8, 24 or 72
Specifying memory per core. Use the "m" property where is the size of memory in gigabytes. Most nodes in the cluster are "m1" (ie, 1 GB/core). You should use this property for most jobs requiring a specific memory size. It is particularly useful for submitting swarms, since 'swarm' may place your jobs on nodes with differing numbers of cores. This property ensures that each process will have access to the same amount of memory regardless of the number of cores on the node.

Specifying memory per node. Use the "g" property where is the total memory in the node in gigabytes. This property should be used when your job requires the entire memory of the node for a single process.

Examples:

m2      2 GB/core
m4      4 GB/core
g4      4 GB/node
g72     72 GB/node

(Note: Since the primary purpose of the g72 nodes is to provide a large memory, it is permissible to underutilize the number of processors. That is, it's ok to run less than 8 processes per node).

Nodes can also be selected based on network:

gige    gigabit ethernet
ib      infiniband

Properties based on cores per node:

c2      2 cores/node
c4      4 cores/node
c16     16 cores/node
c24     24 cores/node

Note: Don't make additional specifications unless you need to! In most cases, the less you specify the better. The more flexibility the scheduler has in allocating nodes for you job, the less likely your job will wind up having to wait for specific resources.

Not all combinations of processor, memory and network are available. The chart below shows available configurations (in green) along with the PBS node properties (in bold). To see how many nodes are available in each configuration, use the freen command (below).

Processor Cores m1
1 GB mem. per core
m2
2 GB mem. per core
m4
4 GB mem. per core
g4
4 GB total mem.
g8
8 GB total mem.
g24
24 GB total mem.
g72
72 GB total mem.
Network
x2800
2.8 GHz Xeon
c24




gige
Gigabit
Ethernet
e2666
2.67 GHz Xeon
c16



o2800:dc
2.8 GHz Opteron
c4




o2600:dc
2.6 GHz Opteron
c4



o2800
2.8 GHz Opteron
c2



o2200
2.2 GHz Opteron
c2



o2000
2.0 GHz Opteron
c2





2.8 GHz Xeon 8





ib
Infiniband
2.8 GHz Opteron 2





ipath
Infinipath

Additional Notes:
Very Large Memory Jobs

For jobs requiring between 72 and 512 GB of memory there are twenty-five very large memory (VLM) nodes: 24 are configured with 256 GB and a single one with 512 GB. Each node has 32 cores and the 512 GB node has a 3 TB local /scratch. In contrast to most nodes, these nodes can be sharedsg amongst jobs. To allocate VLM nodes the memory size requested, and the number of cores (if > 1) requested must be specified to qsub. Note that on shared nodes, the batch system will kill the job if it goes over the requested memory or requested number of cores.

% qsub -l nodes=1,mem=128gb myvlmjob

% qsub -l nodes=1,mem=384gb,ncpus=4 bigjob.bat
Jobs requiring between 72-250 GB allocate a 256 GB node and jobs larger than 250 GB allocate the 512 GB node.

Note that there is a default walltime limit of 168 hrs for the 256 GB and 512 GB nodes. If you have a running job that is getting close to the walltime limit, please send email to staff@biowulf.nih.gov asking for the walltime limit to be extended for this job. Users cannot increase the walltime limit themselves.

Other Options for qsub
-r y|n       this switch controls whether your job is restartable.  If 
             one of the nodes that your job is running on should hang,
             the operations staff will often restart the job.  If you
             do not want your job to be restarted, use "-r n" with qsub.

-v varlist   this switch allows you to pass one or more variables from
             the qsub command line to your batch script.  For instance,
             with "qsub -v np=4 myjob", the myjob script can use the $np
             variable with a value of 4.  You can also list an environment
             variable without a value, and that envvar will be exported
             with its current value.

-I           for interactive jobs, see below.

-V           export all environment variables to the batch script.  Useful
             when running X11 clients on interactive nodes, see below.

See the qsub man page for additional options.

Allocating Nodes for Interactive Use with PBS

Using the batch system for interactive use may sound like an oxymoron, but this allows you to run compute-intensive processes without overloading the Biowulf login node.

The following example shows how to allocate an interactive node:

biobos$ qsub -I -V -l nodes=1
qsub: waiting for job 2011.biobos to start
qsub: job 2011.biobos ready
p139$ exit
logout
qsub: job 2011.biobos completed
biobos$ 

Note the following:

Other PBS Commands

See the man pages for details on the following PBS commands:

qstat -u [username]            list jobs belonging to username
qdel [jobid]                   delete job jobid
qdel -Wforce [jobid]           delete a job on a hung node
qselect -u [user] -s [state]   select jobs based on criteria

For example, to delete all of your current jobs:

qdel `qselect -u myname`

To delete only your queued jobs:

qdel `qselect -u myname -s Q`
Resource Limits

There are no time limits for jobs running in the batch queue. However, while debugging a program, or if there is otherwise a possibility that your job could "runaway" due to a programming error, please use the walltime switch to limit the time your job can run before it is terminated by the batch system. For example, to limit your job to 72 hours use "-l walltime=72:00:00" as an argument to the qsub command.

Scheduling on Biowulf is done using a Fair Share algorithm. This means that, when more jobs are waiting to run than can be started, the next job to run will be the one belonging to the user with the least amount of system usage during the previous 7 days. This should allow users to submit as many jobs to the queue as they would like without concern that they will take an unfair amount of processing time.

The batch system enforces a maximum number of jobs and cpus (or cores) allocated to each user. This number can vary depending on system load and other factors. To see the current limits, use the batchlim command:

$ batchlim
Batch system max jobs  per user:  4000
Batch system max cores per user:  1280

            Max Cores
Queue        Per User
----------- ---------
norm            1280
ib               384
vol              256
norm3            128
g72              128
ipath            128
gpu               96
volib             64

            Max Mem    Max Mem
            Per User   on System
            ---------- -----------
DL-785 mem  512gb      512gb

Most user jobs run in the norm queue, so in the example above, the maximum per user allocation is 1280 cores.

Note: You should never specify a queue when submitting a job. Jobs will be automatically directed to the appropriate queue by the batch system. You should only specify the minimum resources needed by your job, e.g. memory or cores.

Running Multiple Serial Jobs and the swarm Command

Because each Biowulf node has between two and 24 processor cores, but the PBS batch system allocates nodes rather than cores, submitting multiple single process jobs to the system results in a poor utilization of processors (i.e., only one processor per node).

In the case of simply running two processes on a single dual-processor node, the following example uses the wait command to prevent the batch command script from exiting before the application processes have finished:

#!/bin/bash
#
# this file is myjob.sh
#
#PBS -N MyJob
#PBS -m be
#PBS -k oe
#
myprog -arg arg1 < infile1 > outfile1 &
myprog -arg arg2 < infile2 > outfile2 &
wait

Note how this script runs 2 instances of a program by putting them in the background (using the ampersand "&"), and then using the shell wait command to make the script wait for each background process before exiting.

When running many single-threaded jobs, setting up many batch command files can be cumbersome. The swarm command can be used to automatically generate batch command files and submit them to the batch system.

The swarm Command

swarm allows you to submit an arbitrary number of serial jobs to the batch system by simply creating a command file with one command per line. swarm automatically creates batch command files and submits them to the batch system. Two commands are submitted for each node, making optimal use of the processors.

Here is an example command file:

myprog -param a < infile-a > outfile-a
myprog -param b < infile-b > outfile-b
myprog -param c < infile-c > outfile-c
myprog -param d < infile-d > outfile-d
myprog -param e < infile-e > outfile-e
myprog -param f < infile-f > outfile-f
myprog -param g < infile-g > outfile-g

The command file is submitted using the following command:

swarm -f cmdfile

The result is 4 jobs submitted to the batch system, 3 jobs with 2 processes each, and the last with a single process.

When submitting a very large swarm (1000s of processes), the bundle option to swarm should be used:

swarm -f cmdfile -b 40

If cmdfile contains 2500 commands, approximately 63 bundles of 40 commands each will be created and submitted as 32 batch jobs (2 bundles per job, one for each processor on a node).

Note1: swarm will correctly allocate the correct number of processes to 2-, 4-, 16- or 24-core nodes.

Note2: if the programs you are running via swarm either (1) require more than 1 GB memory, or (2) are multi-threaded, you must specify additional information to swarm.

See the swarm documentation for more details.

Resubmitting part of a swarm: If a few jobs in your swarm fail to complete, you can determine the batch script associated with those jobs using the 'submitinfo' command. You can then resubmit just those jobs. For example, 2 jobs in this swarm, with job ids 4104732 and 4104738, did not complete successfully due to a typo in the original swarm command file. You can edit the batch scripts and resubmit just these two jobs.

biowulf% submitinfo 4104732
4104739:  (/home/susanc) qsub -l nodes=1:c16 ./.swarm/ctrl-swarm48n32646

biowulf% submitinfo 4104738
4104738:  (/home/susanc) qsub -l nodes=1:c16 ./.swarm/ctrl-swarm47n32646

[...edit the 2 batch scripts ctrl-swarm48n32646 and ctrl-swarm47n32646...]

biowulf% qsub -l nodes=1:c16 /home/susanc/.swarm/ctrl-swarm48n32646
410836.biobos
biowulf% qsub -l nodes=1:c16 /home/susanc/.swarm/ctrl-swarm47n32646
410837.biobos

The multirun command

If you wish to submit more than 2 single-threaded jobs but want them under control of a single job, then an mpi "shell" can be used (note: In many cases this will not be an optimal use of resources. Unless all processes exit at roughly the same time, idle nodes will not be freed by the batch system until the last process has exited).

The basic procedure is as follows (generation of these scripts can be done automatically by writing a higher order script):

1. Create an executable shell script which will run multiple instances of your program. Which will run depends on the "mpi task id" of the instance.

#!/bin/tcsh
#
# this file is run6.sh
#
switch ($MP_CHILD)
 case 0:
 your_prog with args0
breaksw
case 1:
your_prog with args1
breaksw
case 2:
your_prog with args2
breaksw
case 3:
your_prog with args3
breaksw
case 4:
your_prog with args4
breaksw
case 5:
your_prog with args5
breaksw
endsw

2. Use mpirun in your batch command file to run the mpi shell program (multirun):

#!/bin/tcsh
#
# this file is myjob.sh
#
#PBS -N MyJob
#PBS -m be
#PBS -k oe
#
set path=(/usr/local/mpich/bin $path)
mpirun -machinefile $PBS_NODEFILE -np 6 \
 /usr/local/bin/multirun -m /home/me/run6.sh

3. Submit the job to the batch system:

qsub -l nodes=3 myjob.sh

This job will run 6 instances of the program on 3 nodes.

Job Dependencies

You may want to run a set of jobs sequentially, so that the second job runs only after the first one has completed. This can be accomplished using PBS's job dependencies options to qsub. For example, if you have two jobs, Job1.bat and Job2.bat, you can utilize job dependencies as in the example below.

biowulf% qsub -l nodes=1 Job1.bat
112323.biobos

biowulf% qsub -W depend=afterany:112323.biobos -l nodes=1 Job2.bat
112324.biobos

The first job was submitted with the normal qsub command. The second job was submitted with the flag '-W depend=afterany:112323.biobos'. This tells the batch system to make Job2 dependent on the completion of Job1. 'afterany' indicates that Job2 will run after any exit status of Job1, i.e. whether Job1 completed successfully or exited with an error.

'qstat' will show the status of the jobs.

biowulf% qstat -u username

112323.biobos  username  norm     Job1   8203   1   1   22gb   --  R 00:01
112324.biobos  username  norm     Job2    --    1   1   22gb   --  H   -- 
Note that the second job is in 'H' (hold) state. 'qstat -f 112324' will provide detailed information about the job dependency.

Once job 112323.biobos completes, job 112324.biobos will be released from H into Q state by the batch system, and then will run as the appropriate node(s) become available.

Exit status: The exit status of a job is the exit status of the last command that was run in the batch script. An exit status of '0' means that the batch system thinks the job completed successfully. It does not necessarily mean that all commands in the batch script completed successfully.

There are several options for the '-W depend' flag that depend on the status of Job1. e.g.

-W depend=afterany:Job1Job2 will start after Job1 completes with any exit status
-W depend=after:Job1Job2 will start any time after Job1 starts
-W depend=afterok:Job1Job2 will run only if Job1 completed with an exit status of 0
-W depend=afternotok:Job1Job2 will run only if Job1 completed with a non-zero exit status

Making several jobs depend on the completion of a single job is trivial. This is accomplished in the example below:

biowulf% qsub -l nodes=1 Job1.bat
112323.biobos

biowulf% qsub -W depend=afterany:112323.biobos -l nodes=1 Job2.bat
112324.biobos

biowulf% qsub -W depend=afterany:112323.biobos -l nodes=1 Job3.bat
112325.biobos

Making a job depend on the completion of several other jobs: example below.

biowulf% qsub -l nodes=1 Job1.bat
112323.biobos

biowulf% qsub -W depend=afterany:112323.biobos -l nodes=1 Job2.bat
112324.biobos

biowulf% qsub -W depend=afterany:112323.biobos:112324.biobos -l nodes=1 Job3.bat
112325.biobos
Licensed Software

Some licensed products such as Matlab are available on the Biowulf cluster. There are a limited number of licenses. Thus, each user who wants to submit a job that requires a license needs to specify that resource on the qsub command line. For example, a job requiring a Matlab license would be submitted as

biowulf% qsub -l nodes=1,matlab=1 jobscript

The individual application pages contain details about the resources that need to be specified.

The utility 'licenses' can be used on Biowulf to get a list of total and free licenses. Some licenses are limited so that an individual user can only run a certain number of simultaneous jobs; these limits are also listed in the 'licenses' output.

[user@biowulf]$ licenses
 License                     Resource         Total    Free  Per-User Limit
---------------------------------------------------------------------------
Mathematica                  math                 7      5      4
Matlab                       matlab              23     20      5
Bioinformatics Toolbox       matlab-bio           2      0      -
Compiler                     -                    1      1      -
Curve Fitting Toolbox        matlab-curve         3      2      2
Identification Toolbox       matlab-ident         2      2      -
Image Toolbox                matlab-image         4      3      2
Neural Network Toolbox       matlab-neural        2      2      -
Optimization Toolbox         matlab-opt           4      3      2
PDE Toolbox                  matlab-pde           2      2      -
Signal Toolbox               matlab-signal        6      6      3
SimBiology Toolbox           matlab-simbio        1      1      -
Simulink                     matlab-simulink      1      1      -
Statistics Toolbox           matlab-stat          7      4      4
Symbolic Toolbox             matlab-symb          2      2      -
Wavelet Toolbox              matlab-wave          2      2      -
SAS                          sas                 48     48     48

Monitoring Your Jobs

Once a batch job has been submitted it can be monitored using both command line and web-based tools.

Using either the jobload command or the user job monitor (see below), you can determine the overall behavior of your job based on the load of each node. Perfectly behaving jobs will have loads of near 100% on all nodes. If the nodes in a parallel job are running at loads below 75% (or are green), the job probably isn't scaling well, and you should rerun the job with fewer nodes. Loads of less than 75% for non-parallel jobs may mean that not all processors are being used, or that the job is i/o bound. Contact the Biowulf staff for help in deciding whether your job is making best use of its resources.

The sections below provide information about various ways of monitoring your jobs.

qstat

The qstat command reports the status of jobs submitted to the batch system. 'qstat -a' shows the status of all batch jobs; 'qstat -n' shows, in addition, the nodes assigned to each job. Qstat with the "-f" switch gives detailed information about a specific job, including the assigned nodes and resources used. See the man page for 'qstat' for more details.

freen

The freen command can be used to determine the currently available nodes (free/total). By default, a qsub request with no node specifications will get you a node from the 'Default Pool', i.e. a node with 2 cores. If you specify more than 2 cores (e.g. 'c2', 'c4', 'c24') or a larger amount of memory (e.g. 'm4', 'g8','g24') you will be allocated a node from the OnDemand Pool.

% freen
Free Nodes by memory-per-core
      Cores     m0.5      m1      m2      m4   Total
------------------ DefaultPool -----------------------
o2800    2       /       /     13/211    /     13/211
o2200    2       /    117/300  62/62     /    179/362
o2000    2      7/42   38/38     /       /     45/80 
------------------ Infiniband  -----------------------
ib       8       /     24/250    /       /     24/250
ipath    2       /       /     69/126    /     69/126
------------------  OnDemand   -----------------------
e2666    8       /       /      0/320    /      0/320
o2800:dc 4       /       /     19/226    /     19/226
o2800    2       /     53/90     /     40/40   93/130
o2600:dc 4       /      0/41   37/212    /     37/253


Free nodes by total node memory
      Cores       g4      g8     g24     g72   Total
------------------ DefaultPool -----------------------
o2800    2     13/211    /       /       /     13/211
o2200    2     62/62     /       /       /     62/62 
------------------  OnDemand   -----------------------
e2666    8       /       /      0/320  11/32   11/352
e2400    8       /       /      0/42     /      0/42 
o2800:dc 4       /     19/226    /       /     19/226
o2600:dc 4       /     37/253    /       /     37/253

--------------------  DL-785  ------------------------
Available: 31 processors, 504.2 GB memory

jobload

The jobload command is used to report the load on nodes for a job or for a user:

$ jobload juser
Jobs for  juser     Node    Cores   Load    

2383856.biobos      p1922     4     100%
2383858.biobos      p1926     4     100%
2383859.biobos      p1918     4     100%
2383862.biobos      p1920     4      99%
2383863.biobos      p1928     4      79%
2383865.biobos      p1919     4      85%
Core Total: 24     User Average:     93%

All but 2 nodes allocated to this user are running to capacity.

$ jobload 869186
869186.biobos       Node    Load    
                    p554    100%
                    p555    100%
                    p557    100%
                    p564    100%
                    p565    100%
                    p566    101%
                    p567    100%
                    p568    100%
            Job Average:    100%

This shows a well-behaved parallel job.

# jobload 763334
763334.biobos       Node    Load    
                    p1495    25%
                    p1497     0%
                    p1498     0%
                    p1500     0%
                    p1501     0%
                    p1503     0%
                    p1504     0%
                    p1507     0%
            Job Average:      3%

This last example shows a improperly running parallel job.

Monitoring memory usage

If your jobs require less than 0.5 GB of memory per process, they will run successfully on any Biowulf node. If they require a minimum per-core or total node memory, you will need to specify the node type to ensure that the jobs run on a node with sufficient memory.

Running jobload with the -m option will show the memory usage of your jobs.

biowulf% jobload -m user1
Jobs for  user1     Node    Cores   Load    Memory  
                                          Used/Total (G)
1898152.biobos      p1747     4      25%    5.7/8.1 
1898156.biobos      p1752     4      25%    4.7/8.1 
1898157.biobos      p1753     4      25%    5.9/8.1 

The jobs above for user1 are appropriately sized for the allocated nodes.

biowulf% jobload -m user2
Jobs for  user2      Node    Cores   Load    Memory  
                                          Used/Total (G)
2016779.biobos      p889      8      12%    2.2/74.2
2016780.biobos      p895      8      12%    2.1/74.2
2016781.biobos      p887      8      12%    2.1/74.2
2016782.biobos      p881      8      12%    2.1/74.2

The jobs for user2 have allocated the g72 nodes, but are only using ~2 GB of memory on each node. This is an inappropriate use of resources.

Some jobs may fluctuate in their memory usage. The batch system will track the resources used for each job, and if the -m e flag to qsub is used, this information will be emailed to the user at the end of the job. The value resources_used.mem will report the peak memory used by the job. This value should be used to allocate nodes with appropriate memory.

For example, the job below was submitted with qsub -m be -l nodes=1:g8 myscript. At the end of the job, the user will get a mail message along the following lines:

From: adm 
Date: October 13, 2010 9:39:29 AM EDT
To: user3@biobos.nih.gov
Subject: PBS JOB 2024340.biobos

PBS Job Id: 2024340.biobos
Job Name:   memhog2.bat
Execution terminated
Exit_status=0
resources_used.cpupercent=111
resources_used.cput=00:18:32
resources_used.mem=7344768kb
resources_used.ncpus=1
resources_used.vmem=7494560kb
resources_used.walltime=00:17:05

The batch system reports that the job has used 7344768kb, or 7.3 GB. This job is appropriately sized for the 'g8' nodes.

Estimating memory

Before you submit a large swarm, you will want to estimate the memory requirements of each command (one line in the swarm command file). If you simply submit your swarm without any knowledge of the memory requirements, swarm will set up your jobs so that each command in your swarm gets 1 GB of memory. If your commands require more than 1 GB, the allocated nodes will get overloaded, and the jobs will be automatically deleted by the batch system or will hang the nodes. Therefore it is very important to estimate the memory for your commands.

Set up a swarm job containing just one swarm command, i.e. only one line. For example:

# this file is called swarm.bat
cd /data/$USER/mydir; prog1 -a 1 < infile1; prog2 -b 200 
Submit this to a node with reasonably large memory. For example,
swarm -g 20 -f swarm.bat

After the job completes, the swarm output file will report the memory used. e.g.

-------- running PBS epilogue script (5948947.biobos p1852 username) --------

Show some job stats:

5948947.biobos elapsed time:           200 seconds
5948947.biobos walltime:         00:00:00 hh:mm:ss
5948947.biobos memory limit:        20.00 GB
5948947.biobos memory used:          7.3 GB
5948947.biobos cpupercent used:     80.00 %

housekeeping: p1852 
------------- done PBS epilogue script -------------
This job used 7.3 GB of memory. If all the commands in the swarm are similar, it would be reasonable to submit the whole swarm requesting 8 GB of memory for each command.
swarm -g 8 -f whole_swarm.bat

jobcheck

The jobcheck command can be used to check the status of exited jobs. In particular it can be used to identify jobs that have been deleted due to their having exceeded node memory limits.

Show info for a particular job (the "D" indicates the job was probably deleted due to exceeding memory):

$ jobcheck 1019880
   Job ID        Job name      Walltime cpu%  MemLim MemUsed Estat
------------ ---------------- --------- ---- ------- ------- -----
1019880      sw3n27536         19:15:45  106   22592   22598   271 D

Show all jobs exiting with a non-zero exit status for the past 2 days:

$ jobcheck -d 2
Showing jobs for the past 2 days.
   Job ID        Job name      Walltime cpu%  MemLim MemUsed Estat
------------ ---------------- --------- ---- ------- ------- -----
1019880      sw3n27536         19:15:45  106   22592   22598   271 D
1019899      sw12n29227         1:00:25   96   22592   22721   271 D
1024544      sw1n25060          0:03:32  420   22592    8835   271

Show all jobs which have exited during the past day:

$ jobcheck -d 1 -a
Showing ALL jobs including those with a zero exit status.
Showing jobs for the past 1 days.
   Job ID        Job name      Walltime cpu%  MemLim MemUsed Estat
------------ ---------------- --------- ---- ------- ------- -----
1019880      sw3n27536         19:15:45  106   22592   22598   271 D
1024543      sw1n23515          4:08:50 1210   22592   10148     0
1024544      sw1n25060          0:03:32  420   22592    8835   271
1024588      sw1n26558          5:02:38 1409   22592   10339     0
1024599      sw1n27070          1:31:56  504   22592    6436     0

There are other options as well:

$ jobcheck
Usage:  jobcheck jobnumber
        jobcheck [-a] -d days
        jobcheck [-a] -s start [-e end]

where:  -a  show all jobs including zero exit status
        -d  last N days to search
        -s  datetime to begin search in 'YYYY-MM-DD hh:mm' format
        -e  datetime to end search (optional)
        -v  display start/end times

Jobcheck exit status:

Exit Status Description
0 Job completed normally without errors
265 unknown, rare
271 job did not complete normally, e.g. user may have deleted it
271 D memory limits were exceeded and the batch system deleted the job to prevent the compute node from hanging
-1 job hung the compute node and was deleted

cluster monitor (web-based)

The web page at http://biowulf.nih.gov/sysmon lists several ways of monitoring the Biowulf cluster. The "matrix" view of the system shows the load of each node in discrete blocks of 100. For instance, block 17, row 3, column 4 (node p1734) shows a green dot, indicating that the system is 50% loaded (in this case a load of 2, where a load of 4 would be ideal). A healthy job would show yellow dots, an over-loaded job shows red and unused or severely under-loaded systems are indicated with a blue dot.

Biowulf load monitor showing the loads of each node on the cluster.

The color of the dot indicates the load percentage:

Rolling over a dot will expose the node name and the system load (2-processor systems are at 100% load when the system load is 2, 4 processor systems are at 100% at 4, 8-processors systems at 8, etc).

Clicking on the dot corresponding to a node will result in a display of the process, cpu, disk and memory status for that node (output from the top and df commands).

The status of specific batch jobs can be monitored by first clicking on List Status of Batch Jobs which gives output similar to qstat, and then clicking on the Job ID of the job of interest. This results in the display of a matrix which contains dots only for the nodes allocated to the job. The loads of those nodes can be monitored in same way as for the system as a whole.
Biowulf load monitor showing a job monitor matrix.
Biowulf load monitor showing a job monitor matrix.

Finally, the sum total of nodes allocated across all jobs to a user can be monitored by clicking on Username.

Again, for both JobID and Username monitoring, clicking on a dot corresponding to a node results in a display of information about the node.

 

Running Parallel Jobs

On clusters like Biowlf, nodes are physically separated from one another with the exception of a high-level network link (Ethernet, Infiniband etc). Since it is impossible to "share" memory between processes on separate nodes, processes must explicitly pass memory, or "messages" between themselves over the network link via a common protocol. In most cases, the protocol used is MPI (Message Passing Interface).

There are four possible network links available on Biowulf that parallel applications can be built for (see node configurations chart above for numbers and types of nodes) :

For more information on MPI on Biowulf, see the Libraries and Development page. This page pertains mostly to developers but is also useful to users that want to know a little more about the compilers, MPI implementations and libraries available on Biowulf.

Running Ethernet MPI Applications

The Biowulf staff maintains two major Ethernet MPI implementations for our users. The first is MPICH2, an implementation developed at Argonne National Laboratories. The second is OpenMPI, a popular implementation with very active support and development. MPI-based software built and maintained by the Biowulf staff will use one of these implementations. The instructions here are generic, if you're running a program maintained by us, check the Applications Page for specific instructions on launching that application.

Running OpenMPI Applications

Running an application built with OpenMPI for Ethernet requires only an "mpirun" command. On Biowulf, a qsub wrapper for launching such an application will look something like this:

#!/bin/bash
#PBS -N test_job
#PBS -k oe

cd $PBS_O_WORKDIR

/usr/local/OpenMPI/current/gnu/eth/bin/mpirun -n $np -machinefile $PBS_NODEFILE \
	/full/path/to/executable > output_file

The qsub submission will look something like this:

[janeuser@biowulf ~]$ qsub -v np=32 -l nodes=16:gige /path/to/qsub/script/qsub.sh

OpenMPI is very intelligent about run-time linking however it is necessary to provide the full path to mpirun when executing parallel programs. This explicitly tells OpenMPI to use the run-time from the same installation regardless of the system's default library paths. This is extremely helpful on systems like Biowulf which support many interconnects and architectures.

Refer to Using the Biowulf Cluster for complete information on using the batch system.

Running MPICH2 Applications:

Using MPICH on Biowulf requires one element of special setup: the mpd.conf file.

Setting the MPD password file

If this is the first time you've used MPICH2 you'll need to create a file in your home directory (~/.mpd.conf) that consists of a single line. This command should do the trick:

[janeuser@biowulf ~]$ echo 'password=<password>' > ~/.mpd.conf

This password is used internally by the process controllers and does not have to be remembered. Indeed, you'll quickly forget that it exists. Once it's created you'll need to set the permissions so that only you can read it:

[janeuser@biowulf ~]$ chmod 600 ~/.mpd.conf

Running the MPI processes

Our MPICH2 installations use the MPD process manager to launch and control MPI processes. This requires three discrete steps to run a job: First, start the mpd managers on each node in your job with the "mpdboot" command. Then launch your job with "mpiexec." The last step is to tear-down your mpd managers with "mpdallexit."

Note: you will need to set your PATH to include the MPI installation used to build the application (the applications page entry for the program you're trying to run will tell you this if you did not build the program yourself). While users familiar with MPI will note that this is not strictly true, it is the only sure way to know you're getting the correct versions of the process managers and mpiexe scripts.

A minimal qsub script will look something like this:

#!/bin/bash
#PBS -N test_job
#PBS -k oe

export PATH=/usr/local/mpich2-intel64/bin:$PATH
mpdboot -f $PBS_NODEFILE -n `cat $PBS_NODEFILE | wc -l`
mpiexec -n $np /full/path/to/executable
mpdallexit

The qsub submission will look something like this:

[janeuser@biowulf ~]$ qsub -v np=8 -l nodes=4:gige /path/to/qsub/script/qsub.sh

Infinipath and Infiniband MPI

A note on Infiniband vs. Infinipath: The Biowulf cluster has two different high-speed networks on two subsets of cluster nodes that are based on the Infiniband standard: Infinipath and Infiniband (often referred to as IPath and IB respectively). While the hardware that drives the two distinct networks is very similar, the underling software is not. Programs compiled for one will not work on the other (see the Development page under "MPI" for information on building for these target networks).

The Infinipath network was designed specifically as a message-passing network: the manufacturer implemented only those portions of the Infiniband standard necessary to create a network for parallel applications. The result is an ultra low-latency network with relatively high bandwidth (~8 Gb/s) that they chose to call "Infinipath" (Pathscale being the company that designed the product).

The Infiniband network is the more recent addition to the Biowulf cluster and it implements the full Infiniband standard. This network is also very low-latency (though theoretically not as low as the Infinipath network) with very high bandwidth (~16 Gb/s). The real-word performance difference between these two networks will often (perhaps usually) be negligible.

Running Infinipath MPI apps

The Infinipath nodes on Biowulf use OpenMPI and a vendor-supplied MPI implementation which is grounded in code from the MPICH1 project. OpenMPI is the recommended implementation for new development and staff supplied programs will usually be built with it.

OpenMPI for Infinipath

To execute an MPI job on the Infinipath cluster via OpenMPI, you need to explicitly call the appropriate mpirun program:

#!/bin/bash
#PBS -N MyJob
#PBS -k oe

cd $PBS_O_WORKDIR

/usr/local/OpenMPI/current/gnu/ipath/bin/mpirun -machinefile $PBS_NODEFILE \
-np $np MyProg < MyProg.in >& MyProg.out

The qsub line will look something like this:

% qsub -v np=64 -l nodes=32:ipath myjob.bat

OpenMPI is very intelligent about run-time linking however it is necessary to provide the full path to mpirun when executing parallel programs. This explicitly tells OpenMPI to use the run-time from the same installation regardless of the system's default library paths. This is extremely helpful on systems like Biowulf which support many interconnects and architectures.


Pathscale Infinipath MPI

depriciated: use OpenMPI above unless you have a dependancy on Pathscale-related run-times. This MPI uses /usr/bin/mpirun, found in the defaul PATH, to launch jobs. A qsub wrapper for an Infinipath job using the depriciated Infinipath MPI implimentation will look something like this:

#!/bin/bash
#PBS -N MyJob
#PBS -k oe

cd $PBS_O_WORKDIR

mpirun -machinefile $PBS_NODEFILE -np $np MyProg < MyProg.in >& MyProg.out

A request for Infinipath nodes will look like this:

% qsub -v np=64 -l nodes=32:ipath myjob.bat

Refer to Using the Biowulf Cluster for complete information on using the batch system.

Running Infiniband MPI apps

The Biowulf staff maintains two MPI implementations for the Infiniband network. This first is OpenMPI, the second is MVAPICH2, an implementation derived from MPICH2.

OpenMPI for Infiniband

To execute an MPI job on the Infiniband cluster via OpenMPI, you need to explicitly call the appropriate mpirun program:

#!/bin/bash
#PBS -N MyJob
#PBS -k oe

cd $PBS_O_WORKDIR

/usr/local/OpenMPI/current/gnu/ib/bin/mpirun -machinefile $PBS_NODEFILE \
-np $np MyProg < MyProg.in >& MyProg.out

The qsub line will look something like this:

% qsub -v np=128 -l nodes=16:ib myjob.bat

OpenMPI is very intelligent about run-time linking however it is necessary to provide the full path to mpirun when executing parallel programs. This explicitly tells OpenMPI to use the run-time from the same installation regardless of the system's default library paths. This is extremely helpful on systems like Biowulf which support many interconnects and architectures.


Running Infiniband MPI apps using MVAPICH2

As with the Ethernet MPICH implementation the user will need set up an mpd.conf file that includes a password for use between process managers. You can set up this file with the following commands. It's not necessary to maintain this file and it is not advisable to use a password that you may use anywhere else. A simple word or even blank string is all that's necessary since this is a closed cluster and process managers are assumed to be trusted.

% echo 'password=password' > ~/.mpd.conf
% chmod 400 ~/.mpd.conf

A minimal batch command file for an Infiniband MPI job might look like this:

#!/bin/bash
#PBS -N MyJob
#PBS -k oe

cd $PBS_O_WORKDIR

mpdboot -f $PBS_NODEFILE -n `cat $PBS_NODEFILE | wc -l`
mpiexec -n $np MyProg < MyProg.in >& MyProg.out
mpdallexit

The batch job is submitted with a qsub command that might look something like this. Note that the "ib" property is used to request an Infiniband node and that the IB nodes have 8 cores each.

% qsub -v np=64 -l nodes=8:ib myjob.bat

Refer to Using the Biowulf Cluster for complete information on using the batch system.

Programming Tools and Libraries
See the Programming Tools and Libraries page.
Benchmarking

Particularly with parallel programs, it's important that you benchmark your job to make sure subsequent runs perform with an acceptable degree of efficiency. Specific benchmarking information can be found, if applicable, on the applications page, if the program is maintained by us.

Users who run parallel programs on multiple nodes must run their own benchmarks to determine the optimal number of nodes. To run a benchmark, a short job should be run on 1, 2, 4, 8, 16, 32 etc. processors. The '/usr/bin/time' command should be used in the batch script to measure the wallclock usage, e.g.

/usr/bin/time  mpirun -machinefile $PBS_NODEFILE -np $np /usr/local/gromacs/bin/mdrun_mpi \
      -s topol.tpr -o traj.trr -c out.after_md -v >  output
The wallclock time will then be reported in the job standard output. The efficiency should be calculated using the formula below for each data point.
                 time for 1 processor
Efficiency = --------------------------- * 100
              n * time for n processors
The optimal number of processors is the point where the efficiency is at least 70%. An example of the results from a benchmark can be found at the Biowulf NAMD benchmark page.

Biowulf nodes are heterogeneous with respect to architecture (x86_64, i686), processor speed, memory size, and networks. If you are running benchmarks, you will want to specify processor speed and memory size so that all your benchmark runs are on the same type of node:

qsub -l nodes=4:o2200:g4 myjob.bat   # run on 2.8 GHz Opteron/4 GB nodes

qsub -l nodes=4:o2800:m2 myjob.bat   # run on 2.8 GHz Opteron/2 GB/core nodes