N I H H e l i x S y s t e m s

B i o w u l f: A Beowulf for Bioscience

Steven Fellini
sfellini@nih.gov


Susan Chacko
susanc@helix.nih.gov


CIT

November 4, 2009

This page is at http://biowulf.nih.gov/seminar.html
This Biowulf User Guide is at http://biowulf.nih.gov/user_guide.html
See also http://biowulf.nih.gov/development.html
The Biowulf Home Page is at http://biowulf.nih.gov


Morning Schedule: Afternoon Schedule:

What is Biowulf?

  • A large-scale Linux cluster for NIH scientists
  • Linux clusters (also known as Beowulf-type clusters): commodity computers interconnected by (mostly) commodity network running the Linux (Unix) operating system
  • Funded by the NIH Management Fund and CIT, built and supported by Helix Systems Staff, CIT (1999)

  • Production facility (high availability, data integrity and backup)
  • General purpose scientific (not dedicated to any one application type)
  • Access to shared central storage (Helix)
  • Research on Biowulf (http://helix.nih.gov/research.html)

    • Next-generation sequence assembly/mapping
    • Molecular dynamics
    • Linkage analysis
    • DNA/Protein sequence analysis
    • NMR spectral analysis
    • Statistical analysis
    • Microarray data analysis
    • Protein folding
    • PET and EPR imaging
    • Free energy calculations
    • Rendering

    Cluster Basics



    Important concepts

  • A node is one computer out of the 2100+ computers in the Biowulf cluster.
  • A processor (or core) is one cpu on a node. Each Biowulf node has 2, 4, or 8 processors.
  • A process is one command or program that will run on 1 processor.
  • The Biowulf login node, and all the computational nodes, run Linux.
  • interactive vs batch processing. The Biowulf cluster is primarily a batch system. (PBS).

  • shared vs distributed memory. The cluster is a distributed memory system.
    heterogenous system
  • serial programs (swarms) vs. parallel programs (message passing)
  • parallel job scaling (you must benchmark parallel code)

    When do you need to use Biowulf?

    • Long jobs (e.g. molecular dynamics)
    • Many jobs (e.g. bioinformatics with 50+ sequences)
    • Large-memory jobs (e.g. Gaussian requiring 8GB memory)
    • Parallel jobs (e.g. NAMD on 16 processors)

    What jobs are unsuitable for Biowulf?

    • "Serial" jobs.
    • Small numbers of short jobs.
    • Interactive jobs

    Accounts & Passwords

  • Every user must have his/her own account. NO SHARING of accounts.
  • requires pre-existing Helix account (http://helix.nih.gov/new_users/accounts.html).
  • registering for a Biowulf account (http://helix.nih.gov/register/biowulf.html).
  • Passwords -- initially the same as your Helix password, but not thereafter. We can sync your Helix and Biowulf password upon request.
  • Default shell: bash
  • Connecting to Biowulf

    ssh to biowulf.nih.gov (see http://helix.nih.gov/new_users/connect.html)
    No major computation on the login node!
    Email goes to user@helix.nih.gov.
    Important: forward your Helix email if you don't normally read it! (see http://helix.nih.gov/docs/online/email/email.html#forward)

    Storage & Backups

  • /home/user -- 1 GB, shared between Helix and Biowulf /home/user.
  • /data/user -- 100 GB.

    Location Creation Backups Amount of Space Accessible
    from (*)
    /home network (NFS) with Helix accountyes 1 GB
    (quota)
    B,C
    /scratch (nodes)local created by userno 30 - 166 GB dedicated
    while node is allocated
    C
    /scratch (biowulf)network (NFS)created by userno 500 GB sharedB,H
    /data network (NFS) with Biowulf accountyesbased on quota
    (100 GB default)
    B,C,H
    (*) H = helix, B = biowulf login node, C = biowulf computational nodes

  • checkquota command
  • don't use /tmp on login node
  • use /data/yourname, not /data3/c/yourname
  • clearscratch - to clear out local /scratch on nodes
  • Quota increases - request at https://helix.nih.gov/nih/storage_request.html
  • Snapshots. (see http://helix.nih.gov/new_users/backups.html)

    Running Programs on Biowulf

    Demo: set up and submit a simple batch job.
    #!/bin/tcsh
    #
    # this file is myjob.sh
    #
    #PBS -N MyJob
    #PBS -m be
    #PBS -k oe
    #
    myprog -a 100 < /data/me/mydata
    
    qsub -l nodes=1 myjob.sh
    

    Notes:

  • Jobs are queued.
  • Nodes are allocated exclusively.
  • Most jobs should be run in batch.
  • Output from myprog will appear in MyJob.o##### and MyJob.e######.

    Monitoring Batch Jobs

  • qstat
  • cluster monitor
  • jobload

  • Demo: fully utilizing the node.
    #!/bin/bash
    #
    # this file is myjob.sh
    #
    #PBS -N MyJob
    #PBS -m be
    #PBS -k oe
    #
    myprog -a 100 < infile1 > outfile1 &
    myprog -a 200 < infile2 > outfile2 &
    wait
    

    Notes:

  • If a job ends unexpectedly, check the standard error/output files for clues.
  • Also check your quota!
  • qdel - to delete a job.

    Interactive Batch

    qsub -I -l nodes=1
    
    Demo: Matlab on an interactive node.

    Installed Applications

    http://biowulf.nih.gov/apps

    Demo: easyblast



    Biowulf Hardware Configuration (2100 nodes, 6300 processors)


    Fileservers Network Switches
  • NetApp FAS960c (2 Filers)
  • NetApp 3050c (4 Filers)
  • NetApp R200 NearStore
  • DDN Gridscaler (8 fileservers)

  • Foundry MG-8 (core)
    10 GbE & GbE
  • Foundry EdgeSwitch FES448
    GbE
  • Foundry FastIron
    100Base-T & GbE
  • Myricom Clos-128
    Myrinet 2000
  • Voltaire ISR 9288
    Infiniband
  • Voltaire ISR 2012
    Infiniband
  • Node configurations

    # nodes node type processors memory sizes networks
    250Quad-core Xeon8 x 2.8 GHz
    Intel Xeon (E5462) EMT64
    8 GBInfiniband DDR 16 Gb/s
    1 GbE
    521Dual-core Opteron4 x 2.8, 2.6 GHz
    AMD Opteron (290, 285)
    8 GB1 GbE
    951Opteron2 x 2.8, 2.2, 2.0 GHz
    AMD Opteron (254, 248, 246)
    8, 4, 2, 1 GB1 GbE
    Infinipath 8 Gb/s (136 nodes)
    Myrinet 2 Gb/s (109 nodes)
    380Xeon2 x 2.8 GHz
    Intel Xeon (32-bit)
    4, 2, 1 GB100 Mb/s ethernet
    Myrinet 2 Gb/s (40 nodes)

    Large Memory Nodes

    # nodes node type processors memory sizes networks
    16HP DL-160G68 x 2.8 GHz
    Intel Xeon (X5550)
    72 GB1 GbE
    1SGI Altix32 x 1.4 GHz
    Intel Itanium 2
    96 GB1 GbE
    1HP DL-78532 x 2.5 GHz
    AMD Opteron (8380)
    512 GB1 GbE

  • 2-4 GB swap space
  • 30-136 GB local scratch disk
  • 22 different node configurations
  • previous generations of nodes: 450 MHz, 550 MHz and 866 MHz Pentiums,
    and 1.4 GHz and 1.8 GHz Athlons

  • Jobs on the Biowulf Cluster

  • The batch system allocates nodes (2, 4 or 8 processors/node)
  • Each allocated processor should run a single process
  • small numbers of serial jobs (~1-4)

    Do you need a cluster???

    swarms of serial jobs

    #!/bin/bash
    #
    # this file is myjob.sh
    #
    #PBS -N MyJob
    #PBS -m be
    #PBS -k oe
    #
    myprog -a 100 < infile1 > outfile1 &
    myprog -a 200 < infile2 > outfile2 &
    wait
    

    Using swarm

    #
    # this file is cmdfile
    #
    myprog -param a < infile-a > outfile-a
    myprog -param b < infile-b > outfile-b
    myprog -param c < infile-c > outfile-c
    myprog -param d < infile-d > outfile-d
    myprog -param e < infile-e > outfile-e
    myprog -param f < infile-f > outfile-f
    myprog -param g < infile-g > outfile-g
    

    swarm -f cmdfile
    

    32 lines (processes) => 16 jobs
    2000 lines (processes) => 1000 jobs

    # bundle option
    #
    swarm -b 20 -f cmdfile
    
    2000 lines (processes) => 100 bundles => 50 jobs

    swarmdel



    parallel jobs

    Issues:
  • N processes running on N cpus
  • distributed memory vs. shared memory
  • message passing libraries (MPI, PVM)
  • communication and scalabilty
  • high-performance interconnects
  • Parallel (MPI), multi-node applications only
  • Benchmarking application is essential
  • Low latency more important than bandwidth
  • Bypasses TCP/IP stack
  • Requires compilation against special libraries
  • Two high performance interconnects: Myrinet and Infiniband

  • gigabit
    ethernet
    MyrinetInfinipath
    (Pathscale)
    Infiniband
    bandwidth
    (full duplex)
    1 Gb/s2 Gb/s8 Gb/s16 Gb/s
    latency"high""low""low""low"
    TCP/IP?yesnonono
    # nodesall
    64-bit
    150136250
    # cores-3002722000

  • benchmarks
  • GROMACS
  • NAMD
  • charmm

  • large memory jobs (>4 GB)

    memory 2.8 GHz
    Opterons

    (single-core)
    dual- and
    quad-core
    nodes
    2.8 GHz
    Xeons

    (quad-core)
    SGI Altix 350
    (1.4 GHz Itanium2)
    HP DL-785
    (2.5 GHz Opteron)
    8 GB 40 770 - - -
    72 GB - - 16 - -
    96 GB
    (shared)
    - - - 1 (32 processors) -
    512 GB
    (shared)
    - - - - 1 (32 processors)

    See also Firebolt web page (http://biowulf.nih.gov/firebolt.html) for running jobs on the Altix.

    Node Selection and Allocation

  • Node priorities (fastest nodes have highest priorities)
  • Fair Share (1 week half-life)
  • PBS Node Properties

    property selects
    e28002.8 GHz Xeon/EMT64 processorreserved (ib)
    o28002.8 GHz Opteron processor
    o26002.6 GHz Opteron processorreserved (dual-core)
    o22002.2 GHz Opteron processor
    o20002.0 GHz Opteron processor
    k82.2 GHz Opteron processor
    2.0 GHz Opteron processor
    p28002.8 GHz Xeon processor
    m20482 GB memory
    m40964 GB memory
    m81928 GB memoryreserved
    g7272 GB memoryreserved
    gigeGigabit ethernet network
    myr2kMyrinet networkreserved
    ipathInfinipath networkreserved
    ibInfiniband networkreserved
    x86-6464-bit node
    dcDual-core processors
    8 GB memory
    reserved
    altixSGI Altix processorreserved

    Notes:

  • specify only the properties your job requires
  • nodes with reserved properties are allocated only if explicitly specified
  • memory properties are per 2 processors
  • if a property list can't ever be satisfied, the job will remain queued forever
  • ncpus= and mem= required for Altix jobs
  • mem= required for DL-785 jobs
  • the freen command
  • Single-core vs. Dual/Quad-core nodes

  • dual- and quad-core nodes are reserved
  • swarm and multiblast will automatically "do the right thing"
  • most SMP parallel-code scales poorly! (exception: NAMD)
  • Examples

    qsub -l nodes=1 myjob
    qsub -l nodes=1:x86-64 my64bitjob
    qsub -l nodes=8:o2800:gige -v np=16 namdjob
    qsub -l nodes=16:ipath -v np=32 bignamd.sh
    qsub -l nodes=16:ipath -v np=128 biggernamd.sh
    qsub -l nodes=1:m4096 bigjob.bat
    swarm -l nodes=1:m4096 -f bigjobs
    qsub -l nodes=1,mem=384gb verybigmem.bat
    

    Limits

    $ batchlim
                 Max CPUs    Max CPUs
                 Per User   Available
                ---------- -----------
    norm        300        n/a        
    nist2       32         n/a        
    nist1       64         n/a        
    ib          256        n/a        
    norm3       40         100        
    ipath       128        n/a        
    altix       8          31         
    
                Max Mem    Max Mem
                Per User   on System
                ---------- -----------
    altix       48gb       91 gb
    DL-785      512gb      512gb
    


    System Software

      Biowulf
    login node
    (Opteron)
    compute
    nodes (Xeon)
    compute
    nodes (Opteron, EMT64)
    hardware 64-bit 32-bit 64-bit
    system
    software/
    compilers
    32-bit 32-bit 64-bit
    application
    software
    32-bit 32-bit 32-bit
    64-bit
    Linux distrib RHEL 5.3 CentOS 5.3 CentOS 5.3
    Linux kernel 2.6.18-128.7.1.el5PAE 2.6.18-92.el5 2.6.18-92.el5
    C library glibc-2.5 glibc-2.5 glibc-2.5

    Note: CentOS is a clone of RHEL

    Compiling Code

  • 32-bit code: compile on biowulf login node
  • 64-bit code: compile on a 64-bit node
    qsub -I -l nodes=1:x86-64
    
  • default Linux compilers: GNU 4.x
  • potentially higher performing compilers: Portland Group, Intel, PathScale
  • Compilers and set-up scripts available on Biowulf:

    compiler Front-ends Environment Setup
    GCC 4.1.2 gcc (C)
    g++ (C++)
    g77 (Fortran77)
    gfortran (Fortran90/95)
    Default
    GCC 4.3.3 gcc (C)
    g++ (C++)
    g77 (Fortran77)
    gfortran (Fortran90/95)
    % source /usr/local/gcc/gccvars.sh
    % source /usr/local/gcc/gccvars.csh
    PGI 9.0 pgcc (C)
    pgCC (C++)
    pgf77 (Fortran77)
    pgf90 (Fortran90)
    pgf95 (Fortran95)
    % source /usr/local/pgi/pgivars.sh
    % source /usr/local/pgi/pgivars.csh
    Intel v10.1 icc (C)
    icpc (C++)
    ifort (Fortran77/90/95)
    % source /usr/local/intel/intelvars.sh
    % source /usr/local/intel/intelvars.csh
    Pathscale 3.1 pathcc (C)
    pathCC (C++)
    pathf90 (Fortran77/90)
    pathf95 (Fortran95)
    % source /usr/local/pathscale/pathvars.sh
    % source /usr/local/pathscale/pathvars.csh

    Debuggers

    • GNU Debugger (gdb)
    • Valgrind (valgrind)
    • Intel Debugger (idb)
    • Portland Group Debugger (pgdbg)
    • Pathscale Debugger (pathdb)

    Compiling MPI parallel programs

  • Source compiler environment file (as above).
  • Set your PATH environment variable.
    PATH=<MPI Home>/bin:$PATH 
    
  • Compile the program.
  • Runtime PATH must be set to the same PATH as above!
  • Notes:
  • For Infinipath/Infiniband, compile on compute node:
    qsub -l nodes=1:ipath -I
    qsub -l nodes=1:ib -I
    
  • Infinipath/Infiniband doesn't require a special PATH
  • Myrinet = 32-bit, Infinipath/Infiniband = 64-bit, ethernet = 32- or 64-bit
  • MPI Home PATHs:

    CompilerEthernet
    (MPICH2)
    Myrinet
    (MPICH1)
    Infinipath
    (MPICH1)
    Infiniband
    (MPICH2)
    Infiniband
    (OpenMPI)
    GNU/usr/local/mpich2
    /usr/local/mpich2-gnu64
    /usr/local/mpich-gm2kDefault PATH/usr/local/openmpi-gcc
    PGI/usr/local/mpich2-pgi
    /usr/local/mpich2-pgi64
    /usr/local/mpich-gm2k-pg/usr/local/mvapich2-pgi/usr/local/openmpi-pgi
    Intel/usr/local/mpich2-intel
    /usr/local/mpich2-intel64
    /usr/local/mpich-gm2k-i/usr/local/mvapich2-intel/usr/local/openmpi-intel
    Pathscale/usr/local/mpich2-pathscale
    /usr/local/mpich2-pathscale64
    /usr/local/mpich-gm2k-psDefault
    PATH
    /usr/local/mvapich2-pathscale/usr/local/openmpi-pathscale

    MPI compiler wrappers:

    cmpicc
    c++mpicxx
    fortranmpif77
    mpif90

    Sample MPI program:

    #include <stdio.h>
    #include "mpi.h"
    int
    main(int argc, char **argv)
    {
      int myrank, n_processes, srcrank, destrank;
      char mbuf[512], name[40];
      MPI_Status mstat;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
      MPI_Comm_size(MPI_COMM_WORLD, &n_processes);
      if (myrank != 0) {
        gethostname(name,39);
        sprintf(mbuf, "Hello, from process %d on node %s!", myrank, name);
        destrank = 0;
        MPI_Send(mbuf, strlen(mbuf)+1, MPI_CHAR, destrank, 90, MPI_COMM_WORLD);
      } else {
        for (srcrank = 1; srcrank < n_processes; srcrank++) {
          MPI_Recv(mbuf, 512, MPI_CHAR, srcrank, 90, MPI_COMM_WORLD, &mstat);
          printf("From process %d: %s\n", srcrank, mbuf);
        }
      }
      MPI_Finalize();
    }
    

    % PATH=/usr/local/mpich/bin:$PATH
    % mpicc -o hello_mpi hello_mpi.c
    

    Running MPI Programs (MPICH2)

    #!/bin/bash
    # This file is hello-mpich2.bat
    #
    #PBS -N Hello
    PATH=/usr/local/mpich2/bin:$PATH; export PATH
    mpdboot -f $PBS_NODEFILE -n `cat $PBS_NODEFILE | wc -l`
    mpiexec -n $np /home/steve/hello/hello-mpich2
    mpdallexit
    

    qsub -v np=16 -l nodes=8 hello-mpich2.sh

    Running MPI Programs (MPICH1)

    #!/bin/bash
    # This file is hello-mpich.bat
    #
    #PBS -N MyJob
    #PBS -m be
    #PBS -k oe
    PATH=/usr/local/mpich/bin:$PATH; export PATH
    mpirun -machinefile $PBS_NODEFILE -np $np hello_mpi
    
    qsub -v np=8 -l nodes=4:myr2k hello-myr.sh
    qsub -v np=16 -l nodes=8:ib hello-ib.sh
    

    Scientific Libraries

    • FFTW
    • Intel Math Kernel Library (MKL)
    • AMD Core Math Library (ACML)
    • Intel Integrated Performance Primitives (IPP)
    • GNU Scientific Library (GSL)

    Contact Biowulf Staff at staff@biowulf.nih.gov