
NIH Helix Systems
Biowulf User Information

  • About the NIH Biowulf Cluster
  • Biowulf Account Information
  • Logging In
  • Hardware Configuration
  • Disk storage
  • Using the Biowulf Cluster
  • Batch Queuing System
  • Monitoring Your Jobs
  • Parallel Jobs
  • Programming Tools/Libraries
  • Benchmarking

    About the NIH Biowulf Cluster

    The NIH Biowulf cluster is a Beowulf parallel processing system designed and built at the National Institutes of Health. Managed by the Helix Systems Staff, Biowulf consists of a main login/administrative node and 1900 compute nodes (4800+ processors) running the Linux operating system. The computational nodes are interconnected by a high speed network and have access to four high-performance NFS RAID fileservers.

    Note: For FY2006, some nodes were funded by the National Cancer Institute (NCI). For FY2004, the cluster was partially funded by the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) and the National Human Genome Research Institute (NHGRI). NHGRI also partially funded the cluster for FY2003.

    Please send comments or problem reports to staff@biowulf.nih.gov.

    Biowulf and Scientific Publications

    The continued growth and support of NIH's Biowulf cluster is dependent upon its demonstrable value to the intramural program. In the past we have been successful in obtaining support by citing published work which involved the use of our systems.

    Accounts

    Biowulf accounts require a pre-existing Helix account. A Biowulf account can be obtained by registering on the web (http://biowulf.nih.gov/account_request.html). Biowulf accounts not accessed within a six-month period are automatically locked as a security precaution. To have a locked account unlocked, send email to staff@biowulf.nih.gov.


    Logging in

    The Biowulf system is accessed from the NIH network via ssh to biowulf.nih.gov. This login node runs a secure shell daemon, and users are encouraged to log in to the cluster using ssh. Access to any of the processing nodes is gained from biowulf.nih.gov using rlogin. The node names are assigned as p#, where # is the node number. Note: Any login session to a compute node which has not been allocated to you will be automatically killed by the batch system.

    Xwindows connection software will usually provide better terminal emulation, and is necessary for any graphics connections from the login or computational nodes (e.g. x-povray or interactive SAS). Unix workstations (including Macs with OS X) can handle Xwindows by default. We provide Free Xwindows emulation software for Macs and PCs to Helix Systems users.

    Password Information

    An initial password is assigned to each user upon creation of an account. Your initial password is the same as your password on Helix at the time the Biowulf account was created. To change your Biowulf password, use the UNIX command 'passwd', which will prompt you for your old password and then a new password. It is required that you select a password that consists of at least seven characters with at least one numeric character, an upper case letter and a lower case letter. None of your previous ten passwords can be used. Your username and password are uniform across all nodes of the Biowulf system. Note: Passwords on helix and biowulf are not synchronized at this time.

    NIH password policy requires that users change their password every 180 days. Whenever you reset your password, the 180 day expiration is also reset. For 2 weeks before your password expires, you will be warned when you log in to Biowulf.

    Shells

    The Bourne-again shell ('bash') is the default shell. Other available shells, including csh and tcsh, are listed in /etc/shells.

    To change your default shell, use the command 'chsh -s shell_name', where shell_name is the full pathname of the desired shell (e.g., /bin/csh).

    Home Directories

    Biowulf home directories are in a shared NFS (Network File System) filespace. Therefore the access to files in your home directory is identical from any node in Biowulf.

    Mail

    E-mail can be sent from any node on Biowulf. However, all mail sent out from Biowulf will have a return address of helix.nih.gov (rather than the address of the node). In addition, mail sent to a user on a Biowulf node will be automatically forwarded to that user on helix.nih.gov. In other words, you can send, but not receive mail on Biowulf.

    If you do not read mail on helix, be sure to set up a .forward file in your helix home directory.

    You can communicate with the Biowulf/Helix staff by sending email to staff@biowulf.nih.gov


    Hardware configuration


    The Biowulf cluster is a distributed memory parallel computer consisting of a total of over 4800 Opteron and Xeon processors with an aggregate floating-point performance of 18 TFLOPS. One node is a 32-processor/96 GB SGI Altix 300 system. This special node should be used only for jobs which require more than 4 GB memory. See the Firebolt page (http://biowulf.nih.gov/firebolt.html) for information on submitting jobs to the Altix system. The table below lists the current hardware configuration of the nodes.

    # of nodes  processors per node             memory        network
    ----------  ------------------------------  ------------  -----------------------

    232         4 x 2.8 GHz AMD Opteron 290     8 GB          1 Gb/s ethernet
                (1 MB secondary cache)

    471         2 x 2.8 GHz AMD Opteron 254     40 x 8 GB     1 Gb/s ethernet
                (1 MB secondary cache)          226 x 4 GB    160 x 8 Gb/s Infiniband
                                                91 x 2 GB

    289         4 x 2.6 GHz AMD Opteron 285     8 GB          1 Gb/s ethernet
                (1 MB secondary cache)

    389         2 x 2.2 GHz AMD Opteron 248     129 x 2 GB    1 Gb/s ethernet
                (1 MB secondary cache)          66 x 4 GB     80 x 2 Gb/s Myrinet

    91          2 x 2.0 GHz AMD Opteron 246     48 x 1 GB     1 Gb/s ethernet
                (1 MB secondary cache)          43 x 2 GB     48 x 2 Gb/s Myrinet

    390         2 x 2.8 GHz Xeon                119 x 1 GB    64 x 2 Gb/s Myrinet
                (512 kB secondary cache)        207 x 2 GB    100 Mb/s ethernet
                                                64 x 4 GB

    1           32 x 1.4 GHz SGI Itanium 2      96 GB         1 Gb/s ethernet

    All nodes are interconnected by either 1 Gb/s or 100 Mb/s switched ethernet. The hostnames for these interfaces are p2 through p1930. The hostname for biowulf's internal interface is biowulf-e0 (i.e., nodes wishing to communicate with the login machine should use the name biowulf-e0 rather than biowulf).

    The ethernet switches are Foundry FESx 448, FastIron 1500, 800, II+ and II switches. The backbone switch is a Foundry BigIron MG-8 using 10 Gb/s fiber links to the FESx 448s and trunked Gigabit ethernet links to the FastIrons.

    In addition, for those applications that can benefit from the increased bandwidth and low latency, there are two very high performance switched networks available: 184 dual-processor nodes have an additional connection to Infiniband using QLogic InfiniPath adapters and connected to a Voltaire ISR 9288 switch. Infiniband provides 10 Gb/s bandwidth and the adapters connect directly to the Opteron HyperTransport bus allowing for very low latencies. In addition, 165 dual-processor nodes have a second connection to Myrinet. This high performance network has a bandwidth of 2 Gb/s with a latency of around 10 µs. The nodes are interconnected by two 128-port Myricom Clos-128 switches.

    The login node (biowulf.nih.gov) is an HP/Compaq Proliant ML370 (dual-processor 3.2 GHz Xeon) with 4 GB RAM, Compaq SA 5300 RAID-1 Controller, 36GB 15k rpm disk drives, and 3 additional Syskonnect Gigabit Ethernet interfaces (64-bit/66-MHz).

    Disk Storage

    There are several options for disk storage on Biowulf; please review this section carefully to decide where to place your data. Contact the Biowulf systems staff if you have any questions.

    Except where noted, there are no quotas, time limits or other restrictions placed on the use of space on the Biowulf system, but please use the space responsibly; even hundreds of gigabytes won't last forever if files are never deleted. Disk space on the Biowulf system should never be used as archival storage.

    Summary of file storage options:

    Location                           Creation              Backups  Performance  Amount of Space                 Accessible from (*)
    ---------------------------------  --------------------  -------  -----------  ------------------------------  -------------------
    /home, network (NFS)               with Biowulf account  yes      high         500 MB (quota) (**)             B,C,D
    /scratch (nodes), local            created by user       no       best? (***)  30 - 130 GB, dedicated while    C
                                                                                   node is allocated
    /scratch (biowulf), network (NFS)  created by user       no       low          120 GB, shared                  B,H,N,D
    /data, network (NFS)               with Biowulf account  yes      high         based on quota (48 GB default)  B,C,H,N,D

    (*) H = helix, N = nimbus, D = doublehelix nodes, B = biowulf login node, C = biowulf computational nodes
    (**) The home directory quota is the sum of helix (SGI) and biowulf (LINUX) home directories.
    (***) This applies to fast ethernet-connected nodes - i/o to /data directory may be better for gigabit ethernet-connected nodes.

    Local Disk (/scratch)

    Each Biowulf node has a directly attached disk containing a /scratch filesystem. Note that scratch space is not backed up to tape, and thus, users should store any programs and data of importance in their home directories. Use of /scratch on the batch nodes should be for the duration of your job only. It is your job's responsibility to check for sufficient disk space. Your job may delete any and all files from /scratch to make space available. Please use /scratch instead of /tmp for storage of temporary files.
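
    Because a job is responsible for checking /scratch for free space, a batch script might begin with a check along these lines. This is a minimal sketch, not a Biowulf tool: the helper names free_kb and has_space and the 10 GB threshold are illustrative assumptions, and the code relies only on the portable output format of "df -Pk".

```shell
# Print the free space, in kilobytes, of the filesystem holding a directory.
# "df -P" prints one line per filesystem; column 4 is the available space.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed only if the directory has at least the requested number of KB free.
has_space() {
    avail=$(free_kb "$1")
    [ -n "$avail" ] && [ "$avail" -ge "$2" ]
}

# Example use inside a batch script (10485760 KB = 10 GB, illustrative):
#   has_space /scratch 10485760 || { echo "not enough /scratch space" >&2; exit 1; }
```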

    /scratch on biowulf.nih.gov is actually a shared (NFS) filesystem, accessible from all Helix Systems. Files on this filesystem which have not been accessed for 14 days are automatically deleted by the system.

    /data

    This is a RAID-4 filesystem mounted over NFS from four Network Appliance FAS960 and two FAS3050 Filers configured for high availability. This system offers high performance NFS access, and is exported to Biowulf over a dedicated high-speed network. /data is accessible from all computational nodes as well as Biowulf, and will be the filesystem of choice for most users to store their large datasets. Biowulf users are assigned an initial quota on /data; please contact the Biowulf staff if you need to increase your quota. /data is also accessible from the other Helix Systems: helix and nimbus.

    Note: your /data directory is actually physically located on filesystems named /data1 through /data6. The /data directory consists of links to one of those filesystems. Please refer to your data directory through the /data links as opposed to the physical location because the physical location is subject to change. In other words, use /data/yourname rather than (for example) /data6/b/yourname in your scripts.

    Hierarchical Storage Management of /data Directories

    Files which are greater than 4 megabytes in size, and have not been accessed in 12 months will be subject to migration to on-line secondary storage, also known as "near-line" storage.

    You will still be able to access migrated files exactly as you did before they were migrated. This is due to the fact that a symbolic link is created in place of the original file. As with files on /data, migrated files will be accessible from all Helix Systems, including the computational nodes of the Biowulf Cluster.

    For example, before migration:

    % ls -l
    -rw-r----- 1 joeu joeu 5715508 Dec 13 1999 jobout.dat

    After migration:

    % ls -l
    lrwxrwxrwx 1 root root 34 Feb 12 16:13 jobout.dat -> /dest2/data4/joeu/jobout.dat

    or, using the "-L" switch (which "follows" the link):

    % ls -lL
    -rw-r----- 1 joeu joeu 5715508 Dec 13 1999 jobout.dat

    One consequence of using symbolic links as "place holders" for the original file is that if you wish to delete a file that has been migrated, you should first delete the migrated copy, and then the link that points to it. Continuing the example above:

    % rm /dest2/data4/joeu/jobout.dat
    % rm jobout.dat
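
    The two-step deletion can be wrapped in a small shell function so that the same command works for migrated and non-migrated files. This is a sketch under assumptions: rm_migrated is a hypothetical helper name, not a Biowulf command, and it relies on the readlink utility being available (as it is on typical Linux systems).

```shell
# Remove a file that may have been migrated to near-line storage.
# If the path is a symbolic link, remove the file it points to first,
# then the link itself; otherwise just remove the file.
rm_migrated() {
    if [ -L "$1" ]; then
        target=$(readlink "$1") || return 1
        # Resolve a relative link target against the link's own directory.
        case $target in
            /*) ;;
            *) target=$(dirname "$1")/$target ;;
        esac
        rm -f -- "$target"
    fi
    rm -f -- "$1"
}

# Example use:
#   rm_migrated jobout.dat
```
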
    Differences between primary and near-line storage:
  • the disk technology implementing secondary storage is slower than that of primary storage.
  • the technology implementing secondary storage is not configured as high availability storage. That is, if the fileserver fails, the storage will not be accessible until the fileserver returns to service. Note that data will not be affected, only access to it until the fileserver returns to service.

    Checking your disk storage usage

    Use the checkquota command to determine how much disk space you are using:
    $ checkquota
    Mount      Used         Quota        Percent    Files
    /data7:    5.95 GB      48.00 GB     12.40%     5007
    /home:     359.39 MB    500.00 MB    71.88%     3510
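
    To be warned before a quota fills up, the checkquota output can be filtered for mounts above a chosen percentage. The sketch below assumes the column layout of the sample output shown here (mount, used, quota, percent, files); over_quota is a hypothetical helper name and the threshold of 70 is only an example.

```shell
# Read checkquota-style output on stdin and print each mount whose usage
# percentage is at or above the threshold given as argument (default 70).
over_quota() {
    awk -v limit="${1:-70}" '
        NR > 1 && $6 ~ /%$/ {
            pct = $6
            sub(/%$/, "", pct)       # strip the trailing percent sign
            mount = $1
            sub(/:$/, "", mount)     # strip the trailing colon
            if (pct + 0 >= limit + 0) print mount, pct "%"
        }'
}

# Example use:
#   checkquota | over_quota 70
```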

    Using the Biowulf Cluster

    The Biowulf Cluster is highly heterogeneous with respect to processor speed, memory size and networks. And with the advent of multi-core processor chips, nodes can contain either 2 or 4 processors.

  • node - a node is a computer containing one or more processors. Sometimes simply called a "machine" or "box".
  • processor - a processor is a cpu, each able to run a single application process at any given time. Also referred to as a "core".
  • process - the execution instance of an application. An MPI program will typically consist of 2 or more processes. Some program processes can have more than one thread of execution.

    There are 3 categories of nodes in the Biowulf cluster:

  • login nodes - these are nodes used for program development, compiling, debugging and submission of jobs to the batch system. There is currently one login node (biowulf.nih.gov). Please do not run application code on login nodes.
  • batch nodes - the majority of nodes in the cluster are used for production jobs run under the control of the batch queueing system. These nodes are dedicated to your job.
  • interactive nodes - these are shared nodes reserved for jobs which request interactive nodes using the "-I" switch to qsub.

    Using the Batch System on Biowulf

    Batch (or queueing) systems allow the user to submit jobs to the cluster without regard for what resources (cpu, memory, networks) are available. When those resources become available the batch system will start the job.

    There are many batch systems available; the one used on the Biowulf cluster is Altair's PBS Pro.

    Submitting Jobs

    In order to run a job under batch you must submit a script which contains the commands to be executed. The command to do this is qsub. The qsub command on Biowulf is heavily customized and this documentation should be used in lieu of that provided by Altair.

    The batch script (or batch command file) is simply a Linux shell script with optional commands to PBS (called directives).

    Sample Batch Script

    #!/bin/bash
    #
    # this file is myjob.sh
    #
    #PBS -N MyJob
    #PBS -m be
    #PBS -k oe
    #
    cd $PBS_O_WORKDIR
    myprog < /data/me/mydata
    

    One could simply run this as a shell script on the command line, however when executed by the batch system, the lines beginning with #PBS have special meaning. These directives are as follows:

    -N: name of the batch job.
    -m: send mail when the job begins ("b") and when it ends ("e").
    -k: keep the STDOUT ("o") and STDERR ("e") files.
    
    Other directives can be found on the qsub man page.

    The environment variable $PBS_O_WORKDIR is the directory which was the current directory when the qsub command was issued.

    Note that if the "myprog" program above ordinarily sends its output (STDOUT) to the terminal, that output will appear in a file called MyJob.o<jobnumber>. Any errors (STDERR) will appear as MyJob.e<jobnumber>.

    Submitting a Job with qsub

    The simplest form of the qsub command is:
    qsub -l nodes=1 myjob.sh
    
    The resource list (-l) here consists of the number of nodes (1). You must specify a number of nodes, even if it is one.

    The batch system allocation is done by nodes (not processors). When you are allocated a node by the batch system, your job is the only job running on the node. Since a node may contain 2 or 4 processors, you should ensure that you utilize the node fully (see the swarm program below).

    Parallel jobs running on more than 2 cpus will need to specify more than 1 node. Additionally, since you will want a parallel job running on nodes of identical clock speed, you will probably want to specify a node type as well:

    qsub -l nodes=8:o2800 myparalleljob
    
    Node types currently supported are (from fastest to slowest):
    o2800	2.8 GHz Opteron
    o2200	2.2 GHz Opteron
    o2000	2.0 GHz Opteron
    p2800	2.8 GHz Xeon
    
    Nodes can also be selected based on memory size:
    m2048	2 GB (1 GB/processor)
    m4096	4 GB (2 GB/processor)
    m8192	8 GB (4 GB/processor)
    
    (Note: All nodes have at least 1 GB. Also, the above memory properties are normalized for dual-processor nodes, i.e., 4 GB 2p nodes and 8 GB 4p nodes both have m4096 properties).
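
    The normalization amounts to taking the memory per processor and doubling it. A short sketch of that arithmetic follows; mem_property is a hypothetical helper written for illustration, not a Biowulf command.

```shell
# Compute the memory property (m2048, m4096, ...) for a node, normalized
# to a dual-processor node: memory per processor, in MB, times 2.
# Arguments: total node memory in GB, number of processors on the node.
mem_property() {
    echo "m$(( $1 * 1024 * 2 / $2 ))"
}

mem_property 4 2    # 4 GB node with 2 processors: prints m4096
mem_property 8 4    # 8 GB node with 4 processors: prints m4096
```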

    Nodes can also be selected based on network:

    gige	gigabit ethernet
    ib	infiniband
    myr2k	myrinet
    
    (Note: only the p2800 nodes are not gige connected).

    Other properties:

    dc	dual-core node (4p)
    x86-64	64-bit node
    
    Note: Don't make additional specifications unless you need to! In most cases, the less you specify the better. The more flexibility the scheduler has in allocating nodes for your job, the less likely your job will wind up having to wait for specific resources.

    Additional Notes:

  • At this time, any node the system allocates to you will be a 2 processor node. You have to explicitly request a 4-processor node with the "dc" property. The exception to this is when nodes are allocated by the swarm command, which ensures that the correct number of processes are started on each type of node.
  • Your job may not run properly if your startup file (.bashrc, .cshrc, etc) contains commands which attempt to set terminal characteristics or do output to STDOUT. Such commands can be skipped by testing the environment variable PBS_ENVIRONMENT:
    # csh/tcsh startup file (.cshrc)
    if ( ! $?PBS_ENVIRONMENT ) then
       # commands that set terminal characteristics or write to STDOUT go here
    endif
    
  • If your default shell is csh or tcsh, your job.o###### files will contain the warning:
    Warning: no access to tty (Bad file descriptor).
    Thus no job control in this shell.
    
    This warning can be ignored. Alternatively, you can change your shell to bash (with the chsh command) or tell qsub to use a different shell with:
    qsub -l nodes=1 -S /bin/sh myscript
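
    For users of the default bash shell, the startup-file test on PBS_ENVIRONMENT described in the notes above could be written as follows in .bashrc. This is a sketch: in_pbs_job is a hypothetical function name, and the placeholder comment stands for whatever terminal-related commands your own startup file contains.

```shell
# Succeed when running under the batch system, which sets PBS_ENVIRONMENT.
# An unset or empty value means this is an ordinary login shell.
in_pbs_job() {
    [ -n "${PBS_ENVIRONMENT:-}" ]
}

if ! in_pbs_job; then
    # Commands that set terminal characteristics or write to STDOUT go here.
    :
fi
```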
    

    Other Options for qsub

    -r y|n       this switch controls whether your job is restartable.  If 
                 one of the nodes that your job is running on should hang,
                 the operations staff will often restart the job.  If you
                 do not want your job to be restarted, use "-r n" with qsub.
    
    -v varlist   this switch allows you to pass one or more variables from
                 the qsub command line to your batch script.  For instance,
                 with "qsub -v np=4 myjob", the myjob script can use the $np
                 variable with a value of 4.  You can also list an environment
                 variable without a value, and that envvar will be exported
                 with its current value.
    
    -I           for interactive jobs, see below.
    
    -V           export all environment variables to the batch script.  Useful
                 when running X11 clients on interactive nodes, see below.
    
    See the qsub man page for additional options.

    Allocating Nodes for Interactive Use with PBS

    Using the batch system for interactive use may sound like an oxymoron, but doing so allows you to allocate an arbitrary number of nodes without interfering with other jobs.

    The following example shows how to allocate 4 nodes:

    biobos$ qsub -I -V -l nodes=4
    qsub: waiting for job 2011.biobos to start
    qsub: job 2011.biobos ready
    p139$ cat $PBS_NODEFILE
    p139
    p138
    p137
    p136
    p139$ exit
    logout
    qsub: job 2011.biobos completed
    biobos$
    Note the following:
  • The qsub command, used with the -I switch, automatically logs you in to the first allocated node.
  • The -V switch exports all current environment variables to the batch session. This is required in order to run an X11 client on the interactive node.
  • As in batch, the $PBS_NODEFILE variable is the name of a file which contains the allocated nodes.
  • The nodes are unallocated when the interactive session is ended using the exit command.
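
    Inside an allocated session, the node file can be read to find out which nodes belong to the job. The helpers below are a hedged sketch (list_nodes and count_nodes are hypothetical names); duplicates are removed in case a node is listed more than once.

```shell
# Print each distinct node listed in a PBS node file, one per line.
# Uses $PBS_NODEFILE when no file name is given as an argument.
list_nodes() {
    sort -u "${1:-$PBS_NODEFILE}"
}

# Count the distinct nodes allocated to the session.
count_nodes() {
    list_nodes "$@" | wc -l | tr -d ' '
}

# Example use inside an allocated session:
#   for node in $(list_nodes); do echo "allocated: $node"; done
```
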

    Other PBS Commands

    See the man pages for details on the following PBS commands:
    qstat -u <username>            list jobs belonging to username
    qdel <jobid>	               delete job jobid
    qdel -Wforce <jobid>           delete a job on a hung node
    qselect -u <user> -s <state>   select jobs based on criteria
    
    For example, to delete all of your current jobs:
    qdel `qselect -u myname`
    
    To delete only your queued jobs:
    qdel `qselect -u myname -s Q`
    

    Resource Limits

    There are no time limits for jobs running in the batch queue. However, while debugging a program, or if there is otherwise a possibility that your job could "runaway" due to a programming error, please use the walltime switch to limit the time your job can run before it is terminated by the batch system. For example, to limit your job to 72 hours use "-l walltime=72:00:00" as an argument to the qsub command.
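
    The walltime value has the form hours:minutes:seconds. As a small illustration, the hypothetical helper below builds that argument from a whole number of hours. Passing it as a second -l option is standard PBS usage, but because qsub on Biowulf is customized, treat the commented example as an assumption to verify.

```shell
# Build the qsub resource argument for a wall-clock limit in whole hours.
# Pass a plain decimal number without leading zeros (for example 8, not 08).
walltime_arg() {
    printf 'walltime=%02d:00:00\n' "$1"
}

# Example use (myjob.sh is the sample script name used in this document):
#   qsub -l nodes=1 -l "$(walltime_arg 72)" myjob.sh
```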

    Scheduling on Biowulf is done using a Fair Share algorithm. This means that, when more jobs are waiting to run than can be started, the next job to run will be the one belonging to the user with the least amount of system usage during the previous 7 days. This should allow users to submit as many jobs to the queue as they would like without concern that they will take an unfair amount of processing time.

    The batch system enforces a maximum number of cpus (or cores) allocated to each user. This number can vary depending on system load and other factors. To see the current limits, use the batchlim command:

    $ batchlim
                 Max CPUs    Max CPUs
                 Per User   Available
                ---------- -----------
    ib          48         n/a        
    norm        160        n/a        
    nist1       32         172        
    norm3       32         100        
    nist2       16         48         
    
    Most user jobs run in the norm queue, so in the example above, the maximum per user allocation is 160 cores (which could be 80 x 2p nodes or 40 x 4p nodes).

    Note: You should never specify a queue when submitting a job.

    Running Multiple Serial Jobs and the swarm Command

    Because each Biowulf node has two or four processors, but the PBS batch system allocates nodes rather than processors, submitting multiple single process jobs to the system results in a poor utilization of processors (i.e., only one processor per node).

    In the case of simply running two processes on a single dual-processor node, the following example uses the wait command to prevent the batch command script from exiting before the application processes have finished:

    #!/bin/bash
    #
    # this file is myjob.sh
    #
    #PBS -N MyJob
    #PBS -m be
    #PBS -k oe
    #
    myprog -arg arg1 < infile1 > outfile1 &
    myprog -arg arg2 < infile2 > outfile2 &
    wait

    Note how this script runs 2 instances of a program by putting them in the background (using the ampersand "&"), and then using the shell wait command to make the script wait for each background process before exiting.
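
    A bare wait returns a zero exit status no matter how the background processes ended. If the batch script should notice a failed process, each one can be waited for by its process ID instead. The following is a sketch of that variation; run_pair is a hypothetical helper name, and the commands are passed as strings so that their redirections are preserved.

```shell
# Start two commands in the background, wait for both, and return a
# non-zero status if either one failed.
run_pair() {
    sh -c "$1" &
    pid1=$!
    sh -c "$2" &
    pid2=$!
    status=0
    wait "$pid1" || status=1    # exit status of the first process
    wait "$pid2" || status=1    # exit status of the second process
    return "$status"
}

# Example use, with the commands from the script above:
#   run_pair 'myprog -arg arg1 < infile1 > outfile1' \
#            'myprog -arg arg2 < infile2 > outfile2' || echo "a process failed" >&2
```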

    When running many single-threaded jobs, setting up many batch command files can be cumbersome. The swarm command can be used to automatically generate batch command files and submit them to the batch system.

    The swarm Command

    swarm allows you to submit an arbitrary number of serial jobs to the batch system by simply creating a command file with one command per line. swarm automatically creates batch command files and submits them to the batch system. Two commands are submitted for each node, making optimal use of the processors.

    Here is an example command file:

    myprog -param a < infile-a > outfile-a
    myprog -param b < infile-b > outfile-b
    myprog -param c < infile-c > outfile-c
    myprog -param d < infile-d > outfile-d
    myprog -param e < infile-e > outfile-e
    myprog -param f < infile-f > outfile-f
    myprog -param g < infile-g > outfile-g

    The command file is submitted using the following command:

    swarm -f cmdfile
    The result is 4 jobs submitted to the batch system, 3 jobs with 2 processes each, and the last with a single process.
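
    For larger parameter sets, the command file itself can be generated with a loop rather than typed by hand. A minimal sketch, reusing the placeholder program name "myprog" and the file naming of the example above; make_cmdfile is a hypothetical helper name.

```shell
# Write a swarm command file with one command per line, one line per
# parameter value. Arguments: output file name, then the parameter values.
make_cmdfile() {
    out=$1
    shift
    : > "$out"    # create or truncate the command file
    for p in "$@"; do
        echo "myprog -param $p < infile-$p > outfile-$p" >> "$out"
    done
}

# Example use:
#   make_cmdfile cmdfile a b c d e f g
#   swarm -f cmdfile
```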

    When submitting a very large swarm (1000s of processes), the bundle option to swarm should be used:

    swarm -f cmdfile -b 40

    If cmdfile contains 2500 commands, approximately 63 bundles of 40 commands each will be created and submitted as 32 batch jobs (2 bundles per job, one for each processor on a node).
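
    The bundle and job counts follow from two rounded-up divisions, which can be checked with shell arithmetic. This sketch only reproduces the numbers quoted above for dual-processor nodes; it does not call swarm, and the variable names are illustrative.

```shell
# Ceiling division for positive integers.
ceil_div() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

commands=2500       # lines in the command file
bundle_size=40      # value given to the -b option
procs_per_node=2    # one bundle per processor on a dual-processor node

bundles=$(ceil_div "$commands" "$bundle_size")
njobs=$(ceil_div "$bundles" "$procs_per_node")
echo "$bundles bundles, $njobs batch jobs"    # prints: 63 bundles, 32 batch jobs
```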

    Note: swarm allocates the correct number of processes to 2- or 4-processor nodes.

    See the swarm documentation (http://biowulf.nih.gov/swarm.html) for more details.

    The multirun Command

    If you wish to submit more than 2 single-threaded jobs but want them under control of a single job, then an mpi "shell" can be used (note: In many cases this will not be an optimal use of resources. Unless all processes exit at roughly the same time, idle nodes will not be freed by the batch system until the last process has exited).

    The basic procedure is as follows (generation of these scripts can be done automatically by writing a higher order script):

    1. Create an executable shell script which will run multiple instances of your program. Which instance runs depends on the "mpi task id" of the process.

    #!/bin/tcsh
    #
    # this file is run6.sh
    #
    switch ($MP_CHILD)
    case 0:
    your_prog with args0
    breaksw
    case 1:
    your_prog with args1
    breaksw
    case 2:
    your_prog with args2
    breaksw
    case 3:
    your_prog with args3
    breaksw
    case 4:
    your_prog with args4
    breaksw
    case 5:
    your_prog with args5
    breaksw
    endsw

    2. Use mpirun in your batch command file to run the mpi shell program (multirun):

    #!/bin/tcsh
    #
    # this file is myjob.sh
    #
    #PBS -N MyJob
    #PBS -m be
    #PBS -k oe
    #
    set path=(/usr/local/mpich/bin $path)
    mpirun -machinefile $PBS_NODEFILE -np 6 \
    /usr/local/bin/multirun -m /home/me/run6.sh

    3. Submit the job to the batch system:

    qsub -l nodes=3 myjob.sh

    This job will run 6 instances of the program on 3 nodes.
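
    The procedure notes that these scripts can be generated by a higher order script. One possible sketch of such a generator is shown below: it emits a tcsh script in the form of step 1 with one case per task. make_multirun_script is a hypothetical name, and "your_prog with argsN" remains the placeholder from the example, to be replaced with real commands.

```shell
# Emit a tcsh script with one "case" per MPI task id.
# Argument: the number of program instances to run.
make_multirun_script() {
    n=$1
    echo '#!/bin/tcsh'
    echo 'switch ($MP_CHILD)'
    i=0
    while [ "$i" -lt "$n" ]; do
        echo "case $i:"
        echo "your_prog with args$i"
        echo 'breaksw'
        i=$((i + 1))
    done
    echo 'endsw'
}

# Example use:
#   make_multirun_script 6 > run6.sh
#   chmod +x run6.sh
```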


    Monitoring Your Jobs

    Once a batch job has been submitted it can be monitored using both command line and web-based tools.

    Using either the jobload command or the user job monitor (see below), you can determine the overall behavior of your job based on the load of each node. Perfectly behaving jobs will have loads of near 2.0 on all nodes. If the nodes in a parallel job are running at loads below 1.5 (or are green), the job probably isn't scaling well, and you should rerun the job with fewer nodes. Loads of less than 1.5 for non-parallel jobs may mean that only one processor is being used, or that the job is i/o bound. Contact the Biowulf staff for help in deciding whether your job is making best use of its resources.

    The sections below provide information about various ways of monitoring your jobs.

    qstat

    The qstat command reports the status of jobs submitted to the batch system. 'qstat -a' shows the status of all batch jobs; 'qstat -n' shows, in addition, the nodes assigned to each job. 'qstat -f' gives detailed information about a specific job, including the assigned nodes and resources used. See the man page for 'qstat' for more details.

    freen

    The freen command can be used to determine the currently available nodes (free/total):
    % freen
    
            m1024   m2048   m4096   Total
    ------------- GeneralPool -------------
    o2800   /       17/91   1/241   18/332
    o2200   /       84/223  27/44   111/267
    o2000   /       6/40    /       6/40
    p2800   31/80   125/204 63/63   219/347
    p1800   31/33   4/4     /       35/37
    -------------   Myrinet   -------------
    o2200   /       11/67   /       11/67
    o2000   16/48   /       /       16/48
    p2800   22/38   /       /       22/38
    ------------- Infiniband  -------------
    o2800   /       /       4/60    4/60
    -------------  Reserved   -------------
    o2800   /       /       14/15   14/15
    o2600   /       /       0/39    0/39
    o2200   /       /       4/14    4/14
    

    jobload

    The jobload command is used to report the load on nodes for a job or for a user:
    $ jobload juser
    Jobs for  juser     Node    Load    
    455157.biobos       p488    100%
    455260.biobos       p331    100%
    455261.biobos       p405    100%
    455262.biobos       p451     99%
    455263.biobos       p452    100%
    455265.biobos       p1425   100%
    561609.biobos       p506     99%
    699962.biobos       p416     50%
    812953.biobos       p744     50%
    
    User Average:       88%
    
    All but 2 nodes allocated to this user are running to capacity.

    $ jobload 869186
    869186.biobos       Node    Load    
                        p554    100%
                        p555    100%
                        p557    100%
                        p564    100%
                        p565    100%
                        p566    101%
                        p567    100%
                        p568    100%
                Job Average:    100%
    
    This shows a well-behaved parallel job.
    $ jobload 763334
    763334.biobos       Node    Load    
                        p1495    25%
                        p1497     0%
                        p1498     0%
                        p1500     0%
                        p1501     0%
                        p1503     0%
                        p1504     0%
                        p1507     0%
                Job Average:      3%
    
    This last example shows an improperly running parallel job.

    cluster monitor (web-based)

    The web page at http://biowulf.nih.gov/sysmon lists several ways of monitoring the Biowulf cluster. The "matrix" view of the system shows the load of each node. The position on the matrix corresponds to the node number, for instance row 14 column 3 is p143. Biowulf itself is (0,0).


    The color of the dot indicates the load:

  • white - the node is down
  • blue - idle (load < 0.2)
  • green - load >0.2 <1.2
  • yellow - load >1.2 <2.2
  • red - load >2.2

    Clicking on the dot corresponding to a node will result in a display of the process, cpu, disk and memory status for that node (output from the top and df commands).
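
    The color thresholds listed above can be expressed as a small function, for instance to label load values collected by other means. This is a sketch: load_color is a hypothetical helper, the treatment of exact boundary values (0.2, 1.2, 2.2) is an assumption, and "white" (node down) cannot be derived from a load value.

```shell
# Map a node load to the color used in the matrix view.
load_color() {
    awk -v load="$1" 'BEGIN {
        if (load < 0.2)      print "blue"      # idle
        else if (load < 1.2) print "green"
        else if (load < 2.2) print "yellow"
        else                 print "red"
    }'
}

# Example use:
#   load_color 1.97
```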

    The status of specific batch jobs can be monitored by first clicking on List Status of Batch Jobs which gives output similar to qstat, and then clicking on the Job ID of the job of interest. This results in the display of a matrix which contains dots only for the nodes allocated to the job. The loads of those nodes can be monitored in the same way as for the system as a whole.


    Finally, the sum total of nodes allocated across all jobs to a user can be monitored by clicking on Username.


    Again, for both JobID and Username monitoring, clicking on a dot corresponding to a node results in a display of information about the node.


    Parallel Jobs

    Due to the distributed memory architecture of clusters, parallel programming on the system is generally done by explicit message passing. Most programs are written in C or Fortran using the MPI message passing library (see below).

    Small parallel jobs of up to 4 processes that use shared memory can be run on a single node (4-process jobs require dual-core nodes).

    Running MPICH Jobs under Batch

    An example of a batch command file is shown below:

    #!/bin/bash
    # This file is mpi-hello.sh
    #
    #PBS -N MyJob
    #PBS -m be
    #PBS -k oe
    PATH=/usr/local/mpich/bin:$PATH; export PATH
    mpirun -machinefile $PBS_NODEFILE -np $np mpi-hello

    The batch job is submitted with the qsub command:

    qsub -v np=16 -l nodes=8 mpi-hello.sh

    In this example a batch file named mpi-hello.sh is submitted to the batch queue with a job name of "MyJob". Many switches to qsub can be specified either on the command line or in the script itself by using the special comment string #PBS (also called a directive). The job runs a 16-process MPI program named mpi-hello. The "-m" switch with the argument "be" specifies that mail should be sent to the user when the job begins and when it ends. The "-k" switch specifies which output files should be kept; the argument "oe" specifies that both the stdout and stderr files are kept.

    Other important features of the example above:

    1. The PATH is set to use MPICH.
    2. The special environment variable $PBS_NODEFILE is set by the batch system to specify the name of the node file, and this variable is used as the argument for the -machinefile switch.
    3. The -l switch to qsub is used to specify a list of resources required. Here 8 nodes are being requested. Note that you must specify nodes=N even if N = 1.
    4. The qsub command allows user-defined variables to be passed to the batch control file by using the "-v" switch. Here this switch is used to define a variable named np which makes it convenient to specify the number of mpi processes to run on the qsub command line. This variable is then used as the argument to the -np switch to mpirun.
    5. Note finally that since the nodes are dual processor, the number of nodes requested is one-half the number of processes that will run.
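
    The relationship between process count and node count in item 5 can be sketched as follows. The helper name nodes_for_np is hypothetical, and the sketch assumes every allocated node is dual-processor, as stated above. It rounds up, so an odd process count still gets enough nodes. The qsub line is only printed, not executed.

```shell
# Number of dual-processor nodes needed for a given MPI process count,
# rounding up so that an odd count still gets enough nodes.
nodes_for_np() {
    echo $(( ($1 + 1) / 2 ))
}

# Print (rather than run) the matching qsub command line.
# mpi-hello.sh is the batch file from the example above.
np=16
echo "qsub -v np=$np -l nodes=$(nodes_for_np "$np") mpi-hello.sh"
```

    For np=16 this prints the same command line as the example above.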

    Myrinet Jobs under Batch

    (please contact the biowulf staff before proceeding)

    Here is an example of a complete batch command file for running a job over Myrinet/GM:

    #!/bin/csh
    #PBS -N MyJob
    #PBS -k oe
    #
    # csh
    set path = (/usr/local/mpich-gm2k/bin /usr/local/bin $path .)
    # for bash, use...
    # PATH=/usr/local/mpich-gm2k/bin:/usr/local/bin:$PATH:.

    #
    cd $PBS_O_WORKDIR
    mpirun -machinefile $PBS_NODEFILE -np $np MyProg < MyProg.in >& MyProg.out
    #

    The batch job is submitted with the qsub command:

    qsub -v np=8 -l nodes=4:myr2k myjob.bat

    Note the node specifier "4:myr2k", which results in a request for 4 nodes with the "myr2k" property (i.e., connected to Myrinet 2000).

    Infiniband Jobs under Batch

    Before submitting jobs to the IB nodes, you must copy a file to your ~/.ssh directory:
    % cd ~/.ssh
    % cp /usr/local/etc/ssh_config_ib config
    % chmod 600 config
    
    (If you already have an ssh config file, you should append the contents of /usr/local/etc/ssh_config_ib to it.)
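
    Both cases (no existing config file, or one that must be appended to) can be handled by one small function. This is a sketch under stated assumptions: the function name install_ib_ssh_config is made up here, and it does not check whether the settings were already appended, so calling it twice adds them twice.

```shell
# Install the InfiniBand ssh settings into an ssh config file.
# $1 = source file, $2 = destination config file.
install_ib_ssh_config() {
    src=$1
    dest=$2
    if [ -f "$dest" ]; then
        # Existing config: append rather than overwrite.
        cat "$src" >> "$dest"
    else
        cp "$src" "$dest"
    fi
    chmod 600 "$dest"
}

# On Biowulf this would be called as:
#   install_ib_ssh_config /usr/local/etc/ssh_config_ib ~/.ssh/config
```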

    Use the "ib" property when submitting to the infiniband nodes, for example:

    qsub -v np=16 -l nodes=8:ib myibjob.bat
    

    IB Node Allocation

    Jobs will be automatically directed to one of two pools, one reserved for "large" (64-128 processor) jobs, and one reserved for "small" (<64 processor) jobs. The systems staff will be able to dynamically adjust the size of the pools to allow for demand.

    You can determine the total number of processors in each pool by using the 'batchlim' command:

    $ batchlim
    (only the relevant lines are shown here)
            Max CPUs   Max CPUs
            Per User   Available
           ---------- -----------
    ib2         48         74
    ib          128        196
    
    The "ib2" pool is for small jobs, the "ib" pool for large jobs. The batchlim output shows that there are 196 and 74 processors respectively assigned to the ib and ib2 pools. The maximum per-user limits are 128 and 48 processors respectively.
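
    To pull a single number out of the batchlim output, a short awk filter is usually enough. The sketch below assumes the three-column layout shown above (pool name, per-user limit, CPUs available). The function name pool_available is hypothetical, and the sample text is copied from the output above so that the example runs without access to the cluster.

```shell
# Print the "Max CPUs Available" value for one pool, reading
# batchlim-style output on standard input.
# Assumed columns: pool name, per-user limit, CPUs available.
pool_available() {
    awk -v pool="$1" '$1 == pool { print $3 }'
}

# Sample lines copied from the batchlim output above.
sample='ib2         48         74
ib          128        196'

echo "$sample" | pool_available ib2    # prints 74
```

    On the cluster itself the same filter would be fed from the real command, for example: batchlim | pool_available ib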


    Programming Tools & Libraries

    All nodes run the Redhat Linux operating system, including the GNU EGCS C compiler (gcc) and GNU Fortran (g77).

    Three potentially higher-performance compiler suites are also available:

    Portland Group compilers include: C (pgcc), C++ (pgCC), Fortran 77 (pgf77), and Fortran 90 (pgf90). This compiler suite also includes the PGPROF profiler (pgprof) and PGDBG debugger (pgdbg), each with an X window interface. Documentation for the Portland Group compilers is available on-line:

    http://biowulf.nih.gov/doc/pgdoc.

    Intel compiler software includes a C/C++ compiler (icc), a Fortran compiler (ifort) and an application debugger (idb).  Documentation for Intel's compilers and debugger is on-line:

    C/C++:
    http://biowulf.nih.gov/doc/intel/cc10/Doc_Index.htm
    Fortran:
    http://biowulf.nih.gov/doc/intel/fc10/Doc_Index.htm
    Application Debugger:
    http://biowulf.nih.gov/doc/intel/idb10/Doc_Index.htm

    Pathscale Compilers are available for users who want optimized binaries on our 64-bit Opteron nodes. A debugger is available, as well as an option generator that analyzes source code and suggests compiler options. Documentation for the Pathscale Compilers can be found here (PDF format):

    User Guide
    http://biowulf.nih.gov/doc/pathscale/UserGuide.pdf
    Debugger
    http://biowulf.nih.gov/doc/pathscale/PathDB_UserGuide.pdf
    Optimization Generator
    http://helixweb.cit.nih.gov/doc/pathscale/pathOGen.txt

    Compiling with the Intel, Portland Group or Pathscale compilers requires that the user set up their shell environment by sourcing the appropriate script.  The following table shows the various compiler versions available on Biowulf, environment set-up scripts (one for C-Shell users and one for Bash users) and the location/availability of an MPICH and/or MPICH-GM installation for each compiler.

    GCC 4.1.2 is available for those users who need support for the newer compilers (gfortran in particular), require gcj (for compiling Java code into native binaries) or need any of the additional features found in the GCC 4 suite.

    Compilers and set-up scripts available on Biowulf:

    C/C++                 Front-end     Environment Setup                           MPICH Home                MPICH-GM Home

    GCC 3.2.3             gcc           Default                                     /usr/local/mpich          /usr/local/mpich-gm2k

    GCC 3.2.3 x86_64      gcc           Default on x86_64 nodes                     Available on request      n/a

    PGI 7.1               pgcc (C)      % source /usr/local/pgi/pgivars.sh          /usr/local/mpich-pg       /usr/local/mpich-gm2k-pg
                          pgCC (C++)    % source /usr/local/pgi/pgivars.csh

    PGI 7.1 x86_64        pgcc (C)      % source /usr/local/pgi/pgivars.sh          /usr/local/mpich-pg64     n/a
                          pgCC (C++)    % source /usr/local/pgi/pgivars.csh

    Intel v10.1           icc           % source /usr/local/intel/intelvars.sh      /usr/local/mpich-i        /usr/local/mpich-gm2k-i
                                        % source /usr/local/intel/intelvars.csh     (C only)                  (C only)

    Intel v10.1 x86_64    icc           % source /usr/local/intel/intelvars.sh      /usr/local/mpich-i64      n/a
                                        % source /usr/local/intel/intelvars.csh     (C only)

    Pathscale 2.5         pathcc (C)    % source /usr/local/pathscale/pathvars.sh   /usr/local/mpich-ps       /usr/local/mpich-gm2k-ps
    (32- & 64-bit)        pathCC (C++)  % source /usr/local/pathscale/pathvars.csh  /usr/local/mpich-ps64

    GCC 4                 gcc (C)       % source /usr/local/gcc/gccvars.sh          /usr/local/mpich-gcc4     /usr/local/mpich-gm2k-gcc4
                          g++ (C++)     % source /usr/local/gcc/gccvars.csh

    GCC 4 x86_64          gcc (C)       % source /usr/local/gcc/gccvars.sh          /usr/local/mpich-gcc4_64  n/a
                          g++ (C++)     % source /usr/local/gcc/gccvars.csh


    Fortran77/90          Front-end     Environment Setup                           MPICH Home                MPICH-GM Home

    GCC 3.2.3             g77           Default                                     /usr/local/mpich          /usr/local/mpich-gm2k

    GCC 3.2.3 x86_64      g77           Default on x86_64 nodes                     Available on request      n/a

    PGI 7.1               pgf77         % source /usr/local/pgi/pgivars.sh          /usr/local/mpich-pg       /usr/local/mpich-gm2k-pg
                          pgf90         % source /usr/local/pgi/pgivars.csh

    PGI 7.1 x86_64        pgf77         % source /usr/local/pgi/pgivars.sh          /usr/local/mpich-pg64     n/a
                          pgf90         % source /usr/local/pgi/pgivars.csh

    Intel v10.0           ifort         % source /usr/local/intel/intelvars.sh      /usr/local/mpich-i        /usr/local/mpich-gm2k-i
                                        % source /usr/local/intel/intelvars.csh

    Intel v10.0 x86_64    ifort         % source /usr/local/intel/intelvars.sh      /usr/local/mpich-i64      n/a
                                        % source /usr/local/intel/intelvars.csh

    PathScale 2.5         pathf90       % source /usr/local/pathscale/pathvars.sh   /usr/local/mpich-ps       /usr/local/mpich-gm2k-ps
    (32- & 64-bit)                      % source /usr/local/pathscale/pathvars.csh  /usr/local/mpich-ps64

    GCC 4                 gfortran      % source /usr/local/gcc/gccvars.sh          /usr/local/mpich-gcc4     /usr/local/mpich-gm2k-gcc4
    (Fortran 77,90 & 95)                % source /usr/local/gcc/gccvars.csh

    GCC 4 x86_64          gfortran      % source /usr/local/gcc/gccvars.sh          /usr/local/mpich-gcc4_64  n/a
    (Fortran 77,90 & 95)                % source /usr/local/gcc/gccvars.csh

    The Valgrind tool-set is available for debugging and memory profiling.  Documentation and information on Valgrind can be found at http://valgrind.org/.

    There are 2 MPI library implementations available on Biowulf: MPICH and MPICH-GM. MPICH is a popular implementation of the MPI library. Interprocess communication using MPICH occurs over TCP/IP sockets, and so works over virtually any network technology. However, communicating over TCP/IP has significant overhead, resulting in relatively high latencies which can limit the scalability and performance of codes with demanding communication requirements.

    MPICH-GM is a version of MPICH written to make calls directly to the Myrinet GM protocol layer, thereby bypassing the TCP/IP protocol stack. This should result in significantly improved latency and bandwidth. Check the hardware table above for the nodes with connections to Myrinet.

    A User Guide for MPICH and man pages are available on-line. The user guide and guide to MPE (MPICH MPI Extensions) are also available in the form of postscript files:

    User's Guide for mpich, v 1.2.0 /usr/local/doc/mpich-1.2.0-usersguide.ps.gz
    User's Guide for MPE, v 1.2.0 /usr/local/doc/mpich-1.2.0-mpeguide.ps.gz

    After deciding which MPI library you wish to use, it is very important to make sure your PATH is properly set (see below).

    The instructions provided for running MPI in this section are for running MPI jobs interactively. Instructions for running jobs in batch are given under "Parallel Jobs" above.

    The table below indicates the current software versions available on the Biowulf cluster.

     
                          Biowulf            computational      computational      computational      computational      computational
                                             nodes              nodes              nodes              nodes              nodes
                                             (Xeon)             (Opteron)          (Opteron)          (Opteron/Myrinet)  (Opteron/IB)

    Redhat                RHEL 3.9 WS        RHEL3-WS           RHEL3-WS           CentOS 5.0 (**)    RHEL3-WS           CentOS 4.2 (**)
                          32-bit             32-bit             64-bit (*)         64-bit (*)         32-bit             64-bit (*)
    Linux kernel          2.4.21-37.EL       2.6.10             2.6.9              2.6.18-8.el5       2.6.10             2.6.9-22.EL
    C library             glibc-2.3.2        glibc-2.3.2        glibc-2.3.2        glibc-2.5          glibc-2.3.2        glibc-2.3.4
    GNU gcc & g77         3.2.3              3.2.3              3.2.3              4.1.1              3.2.3              3.4.4
    Portland Group        6.1                                   6.1 (64-bit)                                             6.1 (64-bit)
    C, C++ and Fortran
    Intel                 10.1                                  10.1 (64-bit)                                            10.1 (64-bit)
    C, C++ and Fortran
    PathScale             2.5                                   2.5 (64-bit)                                             2.5 (64-bit)
    C, C++ and Fortran
    PathScale InfiniPath                                                                                                 3.1
    FFTW                  3.1
    MPICH                 1.2.7
    MPICH-GM              1.2.6..14a
    GM                    2.0.6
    PBS Pro               8.0
    MySQL                 5.0.16
    Sun Java 2 SDK        1.6.0
    wxPython              2.5.2
    ACML                  3.0.0
    (includes BLAS and
    LAPACK routines)

    (*) includes 32-bit compatibility libraries
    (**) CentOS is a clone of RHEL

    Compiling MPI Programs

    Biowulf has a number of compilers and MPI libraries to choose from. The table below summarizes which compilers are available for each MPI library, and the libraries under which they were built. If you require a particular combination not listed, contact the Biowulf systems staff.


    MPICH MPICH-GM InfiniPath
    GNU gcc x x
    GNU g77 x x
    Portland Group pgcc x x
    Portland Group pgCC x x
    Portland Group pgf77 x x
    Portland Group pgf90 x x
    Intel icc x
    Intel ifort x
    PathScale pathcc x x
    PathScale pathCC x x
    PathScale pathf90 x x

    There are 2 steps in compiling your code:

    1. Set your PATH environment variable.
    2. Compile the program.

    1. Setting your PATH

    Verify that your path environment variable is correctly set. The following paths should be used to select the libraries and compilers:


                              MPICH                       MPICH-GM                        InfiniPath

    GNU Compilers             /usr/local/mpich/bin or     /usr/local/mpich-gm2k/bin or    -
                              /usr/local/mpich-gnu/bin    /usr/local/mpich-gm2k-gnu/bin

    Portland Group Compilers  /usr/local/mpich-pg/bin     /usr/local/mpich-gm2k-pg/bin    -
                              /usr/local/mpich-pg64/bin

    Intel Compilers           /usr/local/mpich-i/bin      -                               -

    PathScale Compilers       /usr/local/mpich-ps/bin     -                               no special PATH (/usr/bin)
                              /usr/local/mpich-ps64/bin
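
    Selecting the directory by compiler family and MPI library can be scripted with a simple lookup. The sketch below is illustrative only: the function name mpi_bin_dir and the keywords (gnu, pgi, intel, pathscale, mpich, mpich-gm) are made up for this example, only the 32-bit directories from the table above are included, and combinations the table marks with "-" are rejected.

```shell
# Print the MPI bin directory for a compiler family ($1) and MPI library ($2).
# Paths are taken from the table above; the keywords are hypothetical.
mpi_bin_dir() {
    case "$1:$2" in
        gnu:mpich)        echo /usr/local/mpich/bin ;;
        gnu:mpich-gm)     echo /usr/local/mpich-gm2k/bin ;;
        pgi:mpich)        echo /usr/local/mpich-pg/bin ;;
        pgi:mpich-gm)     echo /usr/local/mpich-gm2k-pg/bin ;;
        intel:mpich)      echo /usr/local/mpich-i/bin ;;
        pathscale:mpich)  echo /usr/local/mpich-ps/bin ;;
        *)                return 1 ;;    # combination not listed
    esac
}

# Prepend the chosen directory to PATH only if the lookup succeeded.
if dir=$(mpi_bin_dir pgi mpich); then
    PATH=$dir:$PATH
    export PATH
fi
```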

    2. MPI compile commands

    MPI programs are compiled using "wrapper scripts". Which one you use depends on the underlying language being compiled:

                              C wrapper    C++ wrapper  Fortran 77 wrapper  Fortran 90 wrapper

    GNU compilers             mpigcc       mpiCC        mpif77
    Portland Group compilers  mpicc        mpiCC        mpif77              mpif90
    Intel compilers           mpicc        mpiCC        mpif77              mpif90
    PathScale compilers       mpicc        mpiCC                            mpif90

    The simplest syntax for compiling using the mpich wrappers is:

    mpicc -o prog prog.c

    An Example

    Here is an example of compiling a Fortran program under MPICH and using the Portland Group F77 compiler:
    biowulf$ export PATH=/usr/local/mpich-pg/bin:$PATH
    biowulf$ mpif77 -o MyPi pi3.f
    Linking:
    biowulf$ ls -l MyPi
    -rwxr-xr-x 1 steve wheel 414340 Feb 16 15:11 MyPi
    biowulf$

    Compiling MPI for Infiniband

    Compiling for infiniband requires a special environment due to the specific software requirements of PathScale InfiniPath. Therefore, compiling MPI applications for infiniband requires that you log on to an infiniband node by using an interactive batch job:
    % qsub -l nodes=1:ib -I
    
    (Note: if all ib nodes are allocated to jobs, you'll have to wait until one becomes free. Please do not log in directly to ib nodes.)

    You do not need to set any special PATH to access the mpi wrapper scripts. Be sure to "exit" from the interactive job when finished to return the node to the available pool of nodes.

    Note: because the software environment is different from that on any other nodes, executables built on ib nodes will probably not run on any other nodes.

    Using MPICH Interactively

    Once you have compiled your code, proceed as follows.

    1. Allocate the appropriate number of interactive nodes:

    qsub -I -V -l nodes=#

    The batch system will allocate '#' nodes, and will log you onto the 'master' node.

    2. Start up your program with mpirun:

    mpirun -machinefile $PBS_NODEFILE -np 8 myprog

    3. Exit from the nodes to complete the batch job. (IMPORTANT!)

    exit

    Note: make sure your PATH is set correctly! The path selects mpirun in the same way it selects the MPI compile command. $PBS_NODEFILE is an environment variable that is automatically set by the batch system; it names a file containing the list of nodes allocated to your job (you can 'cat $PBS_NODEFILE' to check).

  • MPICH allows you to start the MPI job from the host biowulf; however, you must specify -nolocal to prevent the job from running on biowulf itself.
  • -np 8 specifies running 8 processes. When running on dual-processor nodes, this should be at most twice the number of nodes requested with -l nodes=#.
  • mpirun adds 2 switches to your application's command line arguments: -p4pg and -p4wd. Your program need not use these arguments, but if it processes command line arguments, it must be prepared to ignore them.
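
    For applications that are launched through a wrapper script, the two extra switches can be filtered out before the remaining arguments are processed. The sketch below shows the idea in shell; a C or Fortran program would apply the same logic in its own argument loop. It assumes each switch is followed by exactly one value. The function name strip_p4_args and the sample values are hypothetical.

```shell
# Print the command-line arguments one per line, dropping the
# -p4pg and -p4wd switches together with the value that follows each.
# Assumption: each of the two switches takes exactly one value.
strip_p4_args() {
    while [ $# -gt 0 ]; do
        case "$1" in
            -p4pg|-p4wd)
                # Skip the switch and, if present, its value.
                shift
                if [ $# -gt 0 ]; then shift; fi
                ;;
            *)
                printf '%s\n' "$1"
                shift
                ;;
        esac
    done
}

# Sample arguments; the file and directory names are made up.
strip_p4_args input.dat -p4pg /tmp/pgfile -p4wd /home/user -v
```

    With the sample arguments, only input.dat and -v are printed.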
    Libraries

    The GNU Scientific Library (GSL) is a collection of routines for numerical computing. The routines are written from scratch by the GSL team in C, and are meant to present a modern Applications Programming Interface (API) for C programmers, while allowing wrappers to be written for very high level languages. For more information about GSL, see the GSL Reference Manual. It is also available in gzipped postscript (500K).


    Benchmarking

    Biowulf nodes are heterogeneous with respect to architecture (x86_64, i686), processor speed (between 1.8 GHz and 2.8 GHz), memory size (between 1 and 4 GB), and networks (1 Gb/s, 100 Mb/s). The batch system will not distinguish among nodes with differing CPU clock speeds, memory sizes or networks. This has no consequences for most production codes (other than varying runtimes). If you are running benchmarks, however, you will want to specify processor speed and memory size:

    qsub -l nodes=4:p2800:m2048 myjob.bat    (run on 2.8 GHz Xeon/2 GB nodes)
    qsub -l nodes=4:o2800:m4096 myjob.bat    (run on 2.8 GHz Opteron/4 GB nodes)
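
    When a benchmark is repeated across several node types, the node specifier can be generated rather than typed by hand. This is a sketch under an assumption drawn only from the two examples above: the "p" prefix selects Xeon nodes and the "o" prefix selects Opteron nodes. The function name bench_nodespec is hypothetical, and the qsub line is printed rather than executed.

```shell
# Build a node specifier such as 4:p2800:m2048.
# $1 = node count, $2 = CPU family (xeon|opteron),
# $3 = clock speed in MHz, $4 = memory size in MB.
# Assumption: "p" means Xeon and "o" means Opteron, as the examples suggest.
bench_nodespec() {
    case "$2" in
        xeon)    prefix=p ;;
        opteron) prefix=o ;;
        *)       return 1 ;;
    esac
    echo "$1:${prefix}$3:m$4"
}

echo "qsub -l nodes=$(bench_nodespec 4 opteron 2800 4096) myjob.bat"
```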



    This document is available as http://biowulf.nih.gov/user_guide.html