Biowulf at the NIH
RSS Feed
R on Biowulf
Rlogo

R (the R Project) is a language and environment for statistical computing and graphics. R is similar to the award-winning S system, which was developed at Bell Laboratories by John Chambers et al. It provides a wide variety of statistical and graphical techniques (linear and nonlinear modelling, statistical tests, time-series analysis, classification, clustering, ...) and is highly extensible. R provides an open-source alternative to S.

R is designed as a true computer language with control-flow constructs for iteration and alternation, and it allows users to add functionality by defining new functions. For computationally intensive tasks, C, C++ and Fortran code can be linked and called at run time. A list of installed packages is available here. A list of obsolete packages is available here.

All nodes in the Biowulf cluster as well as the Biowulf head node are 64-bit as of June 2010. The default version of R (R or R64) is also 64-bit.

NOTE: Starting with version 2.14, R ships with built-in support for parallel programming. Base R itself, however, is single-threaded, which means that a single R process runs on only 1 processor. Single serial jobs are best run on your desktop machine or on Helix. There are situations in which it is an advantage to run R on Biowulf, as described in the sections below.
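As a minimal sketch of the built-in parallel support, the 'parallel' package (part of base R since 2.14) can fork work across the cores of a single node with mclapply(); the choice of mc.cores = 2 below is arbitrary:

```r
# Minimal sketch of base R's 'parallel' package (R >= 2.14).
# mclapply() forks worker processes on one node; mc.cores = 2 is arbitrary.
library(parallel)
squares <- mclapply(1:4, function(i) i^2, mc.cores = 2)
print(unlist(squares))   # 1 4 9 16
```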

For basic information about setting up an R job, see the R documentation listed at the end of this page. Also see the Batch Queuing System in the Biowulf user guide.

Create a script such as the following:

                   script file /home/username/runR
--------------------------------------------------------------------------
#!/bin/bash
# This file is runR
#
#PBS -N R
#PBS -m be
#PBS -k oe
date

module load R
R --vanilla < /data/username/R/Rtest.r > /data/username/R/Rtest.out
--------------------------------------------------------------------------

Submit the script using the 'qsub' command, e.g.

qsub -l nodes=1 /home/username/runR

The swarm program is a convenient way to submit large numbers of jobs. Create a swarm command file containing a single job on each line, e.g.

                 swarm command file /home/username/Rjobs
--------------------------------------------------------------------------
R --vanilla < /data/username/R/R1 > /data/username/R/R1.out
R --vanilla < /data/username/R/R2 > /data/username/R/R2.out
R --vanilla < /data/username/R/R3 > /data/username/R/R3.out
R --vanilla < /data/username/R/R4 > /data/username/R/R4.out
R --vanilla < /data/username/R/R5 > /data/username/R/R5.out
....
--------------------------------------------------------------------------
If each R process (a single line in the file above) requires less than 1 GB of memory, submit this by typing:
swarm -f /home/username/Rjobs --module R
Swarm will create the PBS batch scripts and submit the jobs to the system. See the Swarm documentation for more information.

The multicore package has been installed on Biowulf. multicore provides functions for parallel execution of R code on machines with multiple cores or CPUs. Unlike other parallel processing methods, all jobs share the full state of R when spawned, so no data or code needs to be initialized. The actual spawning is also very fast, since no new R instance needs to be started.

On the Biowulf cluster, multicore would be used to utilize all the processors on a node for a single R job. Users should be aware that the cluster includes single-core (2 processors per node) and dual-core (4 processors per node) nodes. When using multicore, it is simplest to always assume 4 processors per node and submit only to the dual-core ('dc') nodes.

If you are submitting a swarm of R jobs that each use multicore, each node should run only a single R command, since the multicore parallelization will utilize all the processors on that node. Thus, the swarm command should be:

swarm -t auto -f myswarmfile --module R

Documentation for multicore

Rmpi provides an MPI interface for R [Rmpi documentation].
The package snow (Simple Network of Workstations) implements a simple mechanism for using a workstation cluster for "embarrassingly parallel" computations in R. [snow documentation]

With the implementation of modules, users who wish to use Rmpi or snow should load the openmpi/1.4.4/gnu/eth module on Biowulf.

Sample Rmpi batch script:

#!/bin/bash
#PBS -j oe

cd $PBS_O_WORKDIR

# Get OpenMPI in our PATH.  openmpi_ipath and openmpi_ib
# can also be used if running over those interconnects.

module load R 
module load openmpi/1.4.4/gnu/eth
`which mpirun` -n 1 -machinefile $PBS_NODEFILE R --vanilla > myrmpi.out <<EOF
library(Rmpi)
mpi.spawn.Rslaves(nslaves=$np)
mpi.remote.exec(mpi.get.processor.name())
n <- 3
mpi.remote.exec(double, n)
mpi.close.Rslaves()
mpi.quit()

EOF

Sample batch script using snow:

#!/bin/bash
#PBS -j oe

cd $PBS_O_WORKDIR
# Get OpenMPI in our PATH.  openmpi_ipath and openmpi_ib
# can also be used if running over those interconnects. 

module load R 
module load openmpi/1.4.4/gnu/eth
`which mpirun` -n 1 -machinefile $PBS_NODEFILE R --vanilla > myrmpi.out <<EOF

library(snow)
cl <- makeCluster($np, type = "MPI")
clusterCall(cl, function() Sys.info()[c("nodename","machine")])
clusterCall(cl, runif, $np)
stopCluster(cl)
mpi.quit()
EOF

Either of the above scripts could be submitted with:

qsub -v np=8 -l nodes=2:dc myscript.bat
Note that it is entirely up to the user to run the appropriate number of processes for the nodes requested. In the example above, the $np variable is set to 8 and exported via the qsub command, and this variable is used in the script to run 8 snow processes on 2 dual-core nodes (4 processors each). Note: myrmpi.out contains the results from the finished job.

Production runs should be run with batch as above, but for testing purposes an occasional interactive run may be useful.

Sample interactive session with Rmpi:

[user@biowulf ~]$ module load openmpi/1.4.4/gnu/eth
[user@biowulf ~]$ qsub -I -V -l nodes=2
qsub: waiting for job 136623.biobos to start
qsub: job 136623.biobos ready

[user@p227 ~]$ module load R
[user@p227 ~]$ `which mpirun` -n 1 -machinefile $PBS_NODEFILE R --vanilla

R version 2.10.0 (2009-10-26)
Copyright (C) 2009 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
...
[Previously saved workspace restored]

> library(Rmpi)
> mpi.spawn.Rslaves(nslaves=4)
	4 slaves are spawned successfully. 0 failed.
master (rank 0, comm 1) of size 5 is running on: p167 
slave1 (rank 1, comm 1) of size 5 is running on: p168 
slave2 (rank 2, comm 1) of size 5 is running on: p167 
slave3 (rank 3, comm 1) of size 5 is running on: p168 
slave4 (rank 4, comm 1) of size 5 is running on: p167 

> demo("simplePI")
...
> simple.pi(100000)
[1] 3.141593
> mpi.close.Rslaves()
> mpi.quit()                      #very important
[user@p227 ~]$ exit
logout

qsub: job 136623.biobos completed

Sample interactive session with snow:

[user@biowulf ~]$ module load openmpi/1.4.4/gnu/eth
[user@biowulf ~]$ qsub -I -V -l nodes=2
qsub: waiting for job 136706.biobos to start
qsub: job 136706.biobos ready
[user@p227 ~]$ module load R
[user@p227 ~]$ `which mpirun` -n 1 -machinefile $PBS_NODEFILE R --vanilla

R version 2.10.0 (2009-10-26)
Copyright (C) 2009 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
...
[Previously saved workspace restored]

> library(snow)
> cl <- makeCluster(4, type = "MPI")
Loading required package: Rmpi
        4 slaves are spawned successfully. 0 failed.
> clusterCall(cl, function() Sys.info()[c("nodename","machine")])
[[1]]
nodename  machine 
  "p227" "x86_64" 

[[2]]
nodename  machine 
  "p228" "x86_64" 

[[3]]
nodename  machine 
  "p227" "x86_64" 

[[4]]
nodename  machine 
  "p228" "x86_64" 

> sum(parApply(cl, matrix(1:100,10), 1, sum))
[1] 5050

> clusterCall(cl, runif, 3)
[[1]]
[1] 0.01032138 0.62865716 0.62550058

[[2]]
[1] 0.01032138 0.62865716 0.62550058

[[3]]
[1] 0.01032138 0.62865716 0.62550058

[[4]]
[1] 0.01032138 0.62865716 0.62550058

> stopCluster(cl)
[1] 1
> mpi.quit()

[user@p227 ~]$ exit
logout

qsub: job 136706.biobos completed

'Rswarm' is a utility to create a series of R input files from a single R (master) template file with different output filenames and with unique seeds (for the random number generator). It will simultaneously create a swarm command file that can be used to submit the swarm of R jobs. Rswarm was originally developed by Lori Dodd and Trevor Reeve with modifications by the Biowulf staff. To demonstrate the use of Rswarm, we first provide an example. After the example, you will find a more detailed description about its usage.

Say, for the purposes of this example, that the goal of the simulation study is to evaluate properties of the t-test. The code below is meant to create an example that is sufficiently general, but simple. The function "sim.fun" is a loop that generates random normal data, performs the t-test, and extracts the p-value many times. Suppose the following code is saved in a file called sim.fun.R:

sim.fun <- function(n.samp=100, mu=0, sd=1, n.sim, output1, seed){

#######################################
# n.samp is the number of samples generated within each simulation
# mu is the specified mean
# sd is the standard deviation
# n.sim is the number of simulations, which will be specified as 50
# output1 is the file to which the output table is written
# seed is the seed for set.seed
#######################################

    set.seed(seed)

    for (i in 1:n.sim){
        y <- rnorm(n.samp, mean=mu, sd=sd)
        out.table1 <- t.test(y)$p.value
        APPEND <- (i != 1)
        write.table(out.table1, output1, append=APPEND,
                    row.names=FALSE, col.names=FALSE)
    }
}

Now suppose you create a two-line file called Rfile.R:

source("sim.fun.R")
sim.fun(n.sim=DUMX, output1="DUMY1",seed=DUMZ)

To swarm this code, we need replicates of the Rfile.R file (like 100 of them), each with a different seed and different output file. The Rswarm function will create the specified number of replicates, supply each with a different seed (from an external file containing seed numbers), and create unique output files for each replicate. Note, that we allow for you to specify the number of simulations within each file, in addition to specifying the number of replicates.
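The reason every replicate needs its own seed can be seen in a short R session: reusing a seed reproduces exactly the same "random" data, so two replicates given the same seed would simply duplicate each other.

```r
# Identical seeds reproduce identical random draws; distinct seeds do not.
set.seed(25); a <- rnorm(3)
set.seed(25); b <- rnorm(3)
set.seed(26); c <- rnorm(3)
identical(a, b)   # TRUE  - same seed, same "random" numbers
identical(a, c)   # FALSE - different seed
```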

Typing the following Rswarm command at the Biowulf prompt will create 2 replicate files, each specifying 50 simulations, a different seed from a file entitled, "seedfile.txt," and unique output files.

Rswarm --rfile=Rfile.R --sfile=seedfile.txt --path=//data//user/dir/ 
    --reps=2 --sims=50 --start=0 --ext1=.txt

The first of the two output files will be named Rfile1.R and its contents have been changed to:

source("sim.fun.R")
sim.fun(n.sim=50, output1="//data//user//dir//Rfile1.txt",seed=25)

The second file is similar except that the outputfile is named "Rfile2.txt" and the seed is different. The corresponding swarm file is generated, so that these two files can be submitted simply by typing the following command:

swarm -f Rfile.sw --module R

Below you will find additional details about using Rswarm. Rswarm usage:

Usage: Rswarm [options]
   --rfile=[file]   (required) R program requiring replication
   --sfile=[file]   (required) file with generated seeds, one per line
   --path=[path]    (required) directory for output of all files
   --reps=[i]       (required) number of replicates desired
   --sims=[i]       (required) number of sims per file
   --start=[i]      (required) starting file number
   --ext1=[string]    (optional) file extension for output file 1
   --ext2=[string]    (optional) file extension for output file 2
   --help, -h         print this help text

To use Rswarm, create an R template file containing the desired R commands (see example template below). Within the template file four things must be specified, as described below. Each of these has a specific notation that will be recognized by the Rswarm utility:


Item                                                  Notation in R template file
Number of simulations in each replicate file          DUMX
Output file 1, which captures results                 "DUMY1"
Output file 2 (optional)                              "DUMY2"
Random seed                                           DUMZ

For example, DUMX indicates the number of simulations to be performed. When creating the replicate files, Rswarm will replace occurrences of DUMX with the specified number of simulations. Likewise, Rswarm will replace occurrences of "DUMY1" with the name of the output file, and of DUMZ with a unique seed that is pulled from a random seed file. The R template file might be called 'Rfile.R'. The random seeds file is a text file with a p-by-1 vector of randomly generated numbers to use as seeds. Typing

Rswarm --rfile=Rfile.R --sfile=seedfile.txt --path=//data//user//dir/ 
   --reps=2 --sims=50 --start=0 --ext1=.txt
(all on one line) will produce 2 files, named Rfile1.R and Rfile2.R, each of which performs 50 simulations. The seed for Rfile1.R will be the first element in seedfile.txt, while the seed for Rfile2.R will be the second element in seedfile.txt. The output files will be in /data/user/dir/ with the names Rfile1.txt and Rfile2.txt. Note that if a start number other than "0" is specified, the files will have different numbers. For example, if we type --start=5, the files will be named Rfile6.R and Rfile7.R. The corresponding output files will change accordingly.

This creates two replicate files, and the swarm file. At the Biowulf prompt, typing:

swarm -f Rfile.sw --module R
will execute the swarm command for these files.

The output files Rfile1.txt and Rfile2.txt will be created. After the program has completed, these files can be concatenated into a single file named outRfile.txt with the following command:

cat Rfile*.txt > outRfile.txt
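For intuition, the placeholder substitution that Rswarm performs on each replicate can be sketched with sed. This is an illustration only, not Rswarm's actual implementation; the seed 25, the file number 1, and the path /data/user/dir/ are hypothetical values:

```shell
# Illustrative sketch: substitute the DUMX/DUMY1/DUMZ placeholders for one
# replicate, the way Rswarm does (values here are hypothetical).
sed -e 's/DUMX/50/' \
    -e 's|DUMY1|/data/user/dir/Rfile1.txt|' \
    -e 's/DUMZ/25/' Rfile.R > Rfile1.R
```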

The RSPerl package allows for calling R from Perl and Perl from R. Load the R module:

module load R

Next, source the correct environment variables to use R from Perl:

. /usr/local/R-2.15-64_cluster/lib64/R/library/RSPerl/scripts/RSPerl.bsh

Last, include the specific Perl modules in your script. For example, here is plot.pl:

use R;
use RReferences;

&R::initR("--silent");
&R::library("RSPerl");

$z = &R::call("rnorm",1);
printf "rnorm: $z\n";

&R::call("x11");

@x=1..3;
&R::call("plot", \@x);
&R::call("plot", (1,2));
sleep(4);

Now try it!

$ perl plot.pl

For more information about RSPerl, see http://www.omegahat.org/RSPerl/.