R (the R Project) is a language and environment for statistical computing and graphics. R is similar to the award-winning S system, which was developed at Bell Laboratories by John Chambers et al. It provides a wide variety of statistical and graphical techniques (linear and nonlinear modelling, statistical tests, time series analysis, classification, clustering, ...) and is highly extensible. R provides an open-source alternative to S.
R is designed as a true computer language with control-flow constructs for iteration and alternation, and it allows users to add functionality by defining new functions. For computationally intensive tasks, C, C++ and Fortran code can be linked and called at run time. A list of packages installed is available here. A list of obsolete packages is available here.
All nodes in the Biowulf cluster as well as the Biowulf head node are 64-bit as of June 2010. The default version of R (R or R64) is also 64-bit.
NOTE: Starting with version 2.14, R comes with direct support for parallel programs (the 'parallel' package). A standard R session is nevertheless single-threaded, which means that it runs on only 1 processor unless one of the parallel packages described below is used. Single, serial jobs are best run on your desktop machine or on Helix. There are two situations in which it is an advantage to run R on Biowulf:
- if you have a large number of independent R jobs (e.g. processing many independent datasets), you can submit them as a 'swarm' of jobs which can all run simultaneously.
- the Rmpi, snow and multicore packages can be used to parallelize R computations. Other R packages such as 'hcluster' are also multithreaded and can be set to use all available processors on a node.
For basic information about setting up an R job, see the R documentation listed at the end of this page. Also see the Batch Queuing System in the Biowulf user guide.
Create a script such as the following:
script file /home/username/runR
--------------------------------------------------------------------------
#!/bin/bash
# This file is runR
#
#PBS -N R
#PBS -m be
#PBS -k oe
date
module load R
R --vanilla < /data/username/R/Rtest.r > /data/username/R/Rtest.out
--------------------------------------------------------------------------
Submit the script using the 'qsub' command, e.g.
qsub -V -l nodes=1 /home/username/runR
The swarm program is a convenient way to submit large numbers of jobs. Create a swarm command file containing a single job on each line, e.g.
swarm command file /home/username/Rjobs
--------------------------------------------------------------------------
R --vanilla < /data/username/R/R1 > /data/username/R/R1.out
R --vanilla < /data/username/R/R2 > /data/username/R/R2.out
R --vanilla < /data/username/R/R3 > /data/username/R/R3.out
R --vanilla < /data/username/R/R4 > /data/username/R/R4.out
R --vanilla < /data/username/R/R5 > /data/username/R/R5.out
....
--------------------------------------------------------------------------
Submit the swarm command file with:
swarm -f /home/username/Rjobs --module R
Swarm will create the PBS batch scripts and submit the jobs to the system. See the Swarm documentation for more information.
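A swarm command file like the one above does not have to be typed by hand. The following is a minimal sketch in R that writes one command per line; the job count, the directory, and the output location (a temporary directory here, so the sketch runs anywhere) are illustrative assumptions, not Biowulf requirements.

```r
# Sketch: generate a swarm command file with one R job per line.
# n_jobs and in_dir are hypothetical values chosen for illustration.
n_jobs <- 5
in_dir <- "/data/username/R"

# sprintf() is vectorised, so this builds all command lines at once.
cmds <- sprintf("R --vanilla < %s/R%d > %s/R%d.out",
                in_dir, 1:n_jobs, in_dir, 1:n_jobs)

# Written to a temporary directory here; in practice you would write to
# a path such as /home/username/Rjobs.
swarm_file <- file.path(tempdir(), "Rjobs")
writeLines(cmds, swarm_file)

cat(readLines(swarm_file), sep = "\n")
```

The resulting file can then be passed to swarm with the -f option as described above.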
The multicore package has been installed on Biowulf. Multicore provides functions for parallel execution of R code on machines with multiple cores or CPUs. Unlike other parallel processing methods, all jobs share the full state of R when spawned, so no data or code needs to be initialized. The actual spawning is also very fast, since no new R instance needs to be started.
On the Biowulf cluster, multicore would be used to utilize all the processors on a node for a single R job. Users should be aware that the cluster includes single-core (2 processors per node) and dual-core (4 processors per node) nodes. When using 'multicore', it is simplest to always assume 4 processors per node and always submit to the dual-core ('dc') nodes.
If you are submitting a swarm of R jobs that each use multicore, each node should run only a single R command, since the multicore parallelization will utilize all the processors on that node. Thus, the swarm command should be:
swarm -t auto -f myswarmfile --module R
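As an illustration of what a multicore-style R job looks like, here is a minimal sketch using mclapply(). It assumes the 'parallel' package bundled with R 2.14 and later, which provides the same fork-based mclapply() interface as 'multicore'; the task function and the core count of 4 are hypothetical choices that follow the "4 processors per node" suggestion above.

```r
# Sketch: fork-based parallelism on a single node with mclapply().
# Assumes the 'parallel' package (bundled with R >= 2.14).
library(parallel)

# Hypothetical per-task function: mean of n random normal draws.
one_task <- function(n) mean(rnorm(n))

# Assume 4 processors per node, as suggested for the 'dc' nodes.
n_cores <- 4

# Each input element is processed in a forked child process. Children
# inherit the full state of the parent R session, so no setup is needed.
results <- mclapply(rep(1000, 8), one_task, mc.cores = n_cores)

# mclapply() returns a list with one element per input.
print(length(results))
```

Because forking is used, this approach works on Linux nodes such as those in the cluster but not with more than one core on Windows.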
Rmpi provides an MPI interface for R [Rmpi documentation].
The package snow (Simple Network of Workstations) implements a simple mechanism for using a workstation cluster for "embarrassingly parallel" computations in R [snow documentation].
With the implementation of modules, users who wish to use Rmpi and snow should load the openmpi/1.4.4/gnu/eth module on Biowulf.
Sample Rmpi batch script:
#!/bin/bash
#PBS -j oe
cd $PBS_O_WORKDIR
# Get OpenMPI in our PATH. openmpi_ipath and openmpi_ib
# can also be used if running over those interconnects.
module load R
`which mpirun` -n 1 -machinefile $PBS_NODEFILE R --vanilla > myrmpi.out <<EOF
library(Rmpi)
mpi.spawn.Rslaves(nslaves=$np)
mpi.remote.exec(mpi.get.processor.name())
n <- 3
mpi.remote.exec(double, n)
mpi.close.Rslaves()
mpi.quit()
EOF
Sample batch script using snow:
#!/bin/bash
#PBS -j oe
cd $PBS_O_WORKDIR
# Get OpenMPI in our PATH. openmpi_ipath and openmpi_ib
# can also be used if running over those interconnects.
module load R
`which mpirun` -n 1 -machinefile $PBS_NODEFILE R --vanilla > myrmpi.out <<EOF
library(snow)
cl <- makeCluster($np, type = "MPI")
clusterCall(cl, function() Sys.info()[c("nodename","machine")])
clusterCall(cl, runif, $np)
stopCluster(cl)
mpi.quit()
EOF
Either of the above scripts could be submitted with:
module load openmpi/1.4.4/gnu/eth
qsub -v np=4 -V -l nodes=2 myscript.bat
Note that it is entirely up to the user to run the appropriate number of processes for the nodes requested. In the example above, the $np variable is set to 4 and exported via the qsub command, and this variable is used in the script to run 4 snow processes on 2 dual-cpu nodes. Note: myrmpi.out contains the results from the finished job.
Production runs should be submitted as batch jobs as above, but for testing purposes an occasional interactive run may be useful.
Sample interactive session with Rmpi:
[user@biowulf ~]$ module load openmpi/1.4.4/gnu/eth
[user@biowulf ~]$ qsub -I -V -l nodes=2
qsub: waiting for job 136623.biobos to start
qsub: job 136623.biobos ready
[user@p227 ~]$ module load R
[user@p227 ~]$ `which mpirun` -n 1 -machinefile $PBS_NODEFILE R --vanilla
R version 2.10.0 (2009-10-26)
Copyright (C) 2009 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
...
[Previously saved workspace restored]
> library(Rmpi)
> mpi.spawn.Rslaves(nslaves=4)
4 slaves are spawned successfully. 0 failed.
master (rank 0, comm 1) of size 5 is running on: p167
slave1 (rank 1, comm 1) of size 5 is running on: p168
slave2 (rank 2, comm 1) of size 5 is running on: p167
slave3 (rank 3, comm 1) of size 5 is running on: p168
slave4 (rank 4, comm 1) of size 5 is running on: p167
> demo("simplePI")
...
> simple.pi(100000)
[1] 3.141593
> mpi.close.Rslaves()
> mpi.quit() #very important
[user@p227 ~]$ exit
logout
qsub: job 136623.biobos completed
Sample interactive session with snow:
[user@biowulf ~]$ module load openmpi/1.4.4/gnu/eth
[user@biowulf ~]$ qsub -I -V -l nodes=2
qsub: waiting for job 136706.biobos to start
qsub: job 136706.biobos ready
[user@p227 ~]$ module load R
[user@p227 ~]$ `which mpirun` -n 1 -machinefile $PBS_NODEFILE R --vanilla
R version 2.10.0 (2009-10-26)
Copyright (C) 2009 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
...
[Previously saved workspace restored]
> library(snow)
> cl <- makeCluster(4, type = "MPI")
Loading required package: Rmpi
4 slaves are spawned successfully. 0 failed.
> clusterCall(cl, function() Sys.info()[c("nodename","machine")])
[[1]]
nodename machine
"p227" "x86_64"
[[2]]
nodename machine
"p228" "x86_64"
[[3]]
nodename machine
"p227" "x86_64"
[[4]]
nodename machine
"p228" "x86_64"
> sum(parApply(cl, matrix(1:100,10), 1, sum))
[1] 5050
> clusterCall(cl, runif, 3)
[[1]]
[1] 0.01032138 0.62865716 0.62550058
[[2]]
[1] 0.01032138 0.62865716 0.62550058
[[3]]
[1] 0.01032138 0.62865716 0.62550058
[[4]]
[1] 0.01032138 0.62865716 0.62550058
> stopCluster(cl)
[1] 1
> mpi.quit()
[user@p227 ~]$ exit
logout
qsub: job 136706.biobos completed
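In the session above, clusterCall(cl, runif, 3) returned the same three numbers on every worker, which suggests that the workers started from identical random number generator states. For simulations this is usually not what you want. The following is a minimal sketch of giving each worker an independent stream. It is an assumption-laden stand-in: it uses the 'parallel' package bundled with R 2.14 and later with a local socket cluster of 2 workers rather than the snow/MPI cluster shown above, and the seed value is arbitrary.

```r
# Sketch: independent random number streams on cluster workers.
# Assumption: the 'parallel' package and a local socket cluster stand in
# for the snow/MPI cluster used in the session above.
library(parallel)

cl <- makeCluster(2)                 # two local worker processes

# Give each worker its own L'Ecuyer-CMRG stream derived from one seed.
clusterSetRNGStream(cl, iseed = 42)

draws <- clusterCall(cl, runif, 3)   # list with one vector per worker
stopCluster(cl)

# Check whether the two workers produced the same numbers.
print(identical(draws[[1]], draws[[2]]))
```

The snow package offers clusterSetupRNG() for a similar purpose; see the snow documentation for the options it supports.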
'Rswarm' is a utility to create a series of R input files from a single R (master) template file with different output filenames and with unique seeds (for the random number generator). It will simultaneously create a swarm command file that can be used to submit the swarm of R jobs. Rswarm was originally developed by Lori Dodd and Trevor Reeve with modifications by the Biowulf staff. To demonstrate the use of Rswarm, we first provide an example. After the example, you will find a more detailed description about its usage.
Say, for the purposes of this example, that the goal of the simulation study is to evaluate properties of the t-test. The code below is meant to create an example that is sufficiently general, but simple. The function "sim.fun" is a loop that generates random normal data, performs the t-test, and extracts the p-value many times. Suppose the following code is saved in a file called sim.fun.R:
sim.fun <- function(n.samp=100, mu=0, sd=1, n.sim, output1, seed){
  #######################################
  # n.samp is the number of samples generated within each simulation
  # mu is the specified mean
  # sd is the standard deviation
  # n.sim is the number of simulations, which will be specified as 50
  # output1 is the output file to which the p-values are written
  # seed is the seed for set.seed
  #######################################
  set.seed(seed)
  for (i in 1:n.sim){
    y <- rnorm(n.samp, mean=mu, sd=sd)
    out.table1 <- t.test(y)$p.value
    # Overwrite the file on the first iteration, append afterwards
    APPEND <- (i > 1)
    write.table(out.table1, output1, append=APPEND, row.names=FALSE, col.names=FALSE)
  }
}
Now suppose you create a two-line file called Rfile.R:
source("sim.fun.R")
sim.fun(n.sim=DUMX, output1="DUMY1",seed=DUMZ)
To swarm this code, we need replicates of the Rfile.R file (for example, 100 of them), each with a different seed and a different output file. The Rswarm utility will create the specified number of replicates, supply each with a different seed (from an external file containing seed numbers), and create unique output files for each replicate. Note that you can also specify the number of simulations within each file, in addition to the number of replicates.
Typing the following Rswarm command at the Biowulf prompt will create 2 replicate files, each specifying 50 simulations, a different seed taken from a file named "seedfile.txt", and a unique output file.
Rswarm --rfile=Rfile.R --sfile=seedfile.txt --path=//data//user//dir/ --reps=2 --sims=50 --start=0 --ext1=.txt
The first of the two replicate files will be named Rfile1.R, and its contents will have been changed to:
source("sim.fun.R")
sim.fun(n.sim=50, output1="//data//user//dir//Rfile1.txt",seed=25)
The second file is similar except that the output file is named "Rfile2.txt" and the seed is different. A corresponding swarm file, Rfile.sw, is also generated, so that these two files can be submitted simply by typing the following command:
swarm -f Rfile.sw --module R
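The Rswarm command in this example reads its seeds from seedfile.txt, which must exist beforehand. The following is a minimal sketch of one way to create such a file in R, with one integer seed per line. The number of seeds, the seed range, and the temporary output location are illustrative assumptions.

```r
# Sketch: write a seed file with one integer seed per line.
# n_seeds and the range 1..1e6 are arbitrary choices for illustration.
set.seed(2024)                     # makes the seed file itself reproducible
n_seeds <- 100

# sample.int() draws without replacement, so all seeds are distinct.
seeds <- sample.int(1e6, n_seeds)

# Written to a temporary directory here; in practice you would write
# seedfile.txt to the directory from which Rswarm is run.
seed_file <- file.path(tempdir(), "seedfile.txt")
writeLines(sprintf("%d", seeds), seed_file)

print(head(readLines(seed_file), 3))
```

The file should contain at least as many seeds as the number of replicates requested with --reps.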
Below you will find additional details about using Rswarm. Rswarm usage:
Usage: Rswarm [options]
  --rfile=[file]    (required) R program requiring replication
  --sfile=[file]    (required) file with generated seeds, one per line
  --path=[path]     (required) directory for output of all files
  --reps=[i]        (required) number of replicates desired
  --sims=[i]        (required) number of sims per file
  --start=[i]       (required) starting file number
  --ext1=[string]   (optional) file extension for output file 1
  --ext2=[string]   (optional) file extension for output file 2
  --help, -h        print this help text
To use Rswarm, create an R template file containing the desired R commands (see example template below). Within the template file four things must be specified, as described below. Each of these has a specific notation that will be recognized by the Rswarm utility:
| Item to specify | Notation in R template file |
| Number of simulations to be specified in each replicate file | DUMX |
| Output file 1 which captures results | "DUMY1" |
| Output file 2 (optional) | "DUMY2" |
| Random seed | DUMZ |
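The effect of these placeholders can be pictured as plain text substitution on the template. The sketch below is not Rswarm's implementation, only an illustration of the result; fill_template is a hypothetical helper, and the values passed to it (50 simulations, an output name, seed 25) are taken from or modelled on the example earlier in this section.

```r
# Sketch: illustrate placeholder substitution on the two-line template.
# fill_template() is a hypothetical helper, not part of Rswarm.
template <- c('source("sim.fun.R")',
              'sim.fun(n.sim=DUMX, output1="DUMY1",seed=DUMZ)')

fill_template <- function(lines, sims, out1, seed) {
  # fixed = TRUE treats the placeholders as literal text, not regexes.
  lines <- gsub("DUMX", as.character(sims), lines, fixed = TRUE)
  lines <- gsub("DUMY1", out1, lines, fixed = TRUE)
  gsub("DUMZ", as.character(seed), lines, fixed = TRUE)
}

filled <- fill_template(template, sims = 50, out1 = "Rfile1.txt", seed = 25)
cat(filled, sep = "\n")
```

Note that DUMY1 appears inside quotes in the template, so the substituted file name ends up as a quoted string in the replicate file.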
For example, DUMX indicates the number of simulations to be performed. When creating the replicate files, Rswarm will replace occurrences of DUMX with the specified number of simulations. Likewise, Rswarm will replace occurrences of "DUMY1" with the name of the output file, and of DUMZ with a unique seed that is pulled from the random seeds file. The R template file might be called 'Rfile.R'. The random seeds file is a text file containing a p-by-1 vector of randomly generated numbers (one per line) to use as seeds. Typing the following at the Biowulf prompt creates two replicate files and the swarm file:
Rswarm --rfile=Rfile.R --sfile=seedfile.txt --path=//data//user//dir/ --reps=2 --sims=50 --start=0 --ext1=.txt
The swarm can then be submitted by typing:
swarm -f Rfile.sw --module R
When the jobs run, the output files Rfile1.txt and Rfile2.txt will be created. After the jobs have completed, these files can be concatenated into a single file named outRfile.txt with the following command:
cat Rfile*.txt > outRfile.txt
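Once the results are combined, they can be summarised in R. The following is a minimal sketch that reads a file of p-values (one per line, as written by sim.fun) and computes the proportion below 0.05. So that the sketch runs on its own, it first writes a stand-in file of uniform random numbers to a temporary directory; with real results you would point out_file at outRfile.txt instead. The 0.05 threshold is an illustrative choice.

```r
# Stand-in for outRfile.txt: one p-value per line, as written by sim.fun.
# Uniform draws are used here only so the sketch is self-contained.
set.seed(1)
out_file <- file.path(tempdir(), "outRfile.txt")
write.table(runif(100), out_file, row.names = FALSE, col.names = FALSE)

# Read the single unnamed column of p-values.
p_values <- scan(out_file, quiet = TRUE)

# Empirical rejection rate at the 5% level. When the null hypothesis is
# true, this should be close to 0.05 for a well-behaved test.
rejection_rate <- mean(p_values < 0.05)
print(rejection_rate)
```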
The RSPerl package allows for calling R from Perl and Perl from R. Load the R module:
module load R
Next, source the correct environment variables to use R from Perl:
. /usr/local/R-2.15-64_cluster/lib64/R/library/RSPerl/scripts/RSPerl.bsh
Last, include the specific Perl modules in your script. For example, here is plot.pl:
use R;
use RReferences;
&R::initR("--silent");
&R::library("RSPerl");
$z = &R::call("rnorm",1);
printf "rnorm: $z\n";
&R::call("x11");
@x=1..3;
&R::call("plot", \@x);
&R::call("plot", (1,2));
sleep(4);
Now try it!
$ perl plot.pl
For more information about RSPerl, see http://www.omegahat.org/RSPerl/.
- The R Homepage
- R manuals (web)
- PDF manuals including An Introduction to R.
- The R FAQ
- 'multicore' documentation
- Rmpi documentation
- Snow user guide



