start:development:scripting_languages:r_bio

# R and BioConductor

Much like Pythons virtualenv Renv offers a R version management. As with virtualenv we advise to use Renv, especially if you intend using different versions of R or different versions of R packages.

Generally though, a simple install.packages() will ask for a directory and you can direct it to your home directory.

Please ask the HPC-Team to install bigger libraries or libraries of (putatively) wider interest for you.

To see all installed libraries for a specific (module) version of R type:

print(as.data.frame(installed.packages()[,c(1,3:4)]))

To limit your view to user installed packages, type:

ip <- as.data.frame(installed.packages()[,c(1,3:4)])
rownames(ip) <- NULL
ip <- ip[is.na(ip$Priority),1:2,drop=FALSE] print(ip) R per default writes a lot of messages to stdout. Using the library sink this can be suppressed and with SLURM the output can be redirected to a file. sink is used as follows: library(sink) sink("/dev/null") # suppress everything from hereon ... sink() # to undo prior suppression, back to normal now # Compiling functions Compiling self-declared functions with the compiler library can help to speed up code enormously. This holds, for lengthy functions and numerical code. It does not for already defined functions within R. library(compiler) f <- function(n, x) for (i in 1:n) x = (1 + x)^(-1) g <- cmpfun(f) # this one is not installed on mogon per default library(microbenchmark) compare <- microbenchmark(f(1000, 1), g(1000, 1), times = 1000) library(ggplot2) autoplot(compare) There also is the option of a just in time compiler in this library: library(compiler) enableJIT(1) The numeric argument in enableJIT specifies the “level” of compilation. At level 2 and 3 more functions are compiled. In a few rare cases enableJIT will slow down code, particularly, if most of it is already pre-compiled or written in C, and/or if the code creates functions repeatedly, which then need to be compiled every time. This is more likely to happen with enableJIT(2) or enableJIT(3), though these have the potential to speed up code more, as well. # Bioconductor Bioconductor is a versatile tool package for handling biological data based upon the R statistical programming language. By loading the module R/3.1.2 or newer versions Bioconductor with MPI support becomes available. Missing Bioconductor packages can be installed or updated upon request, albeit not in a rapid fashion. Please ensure that the appropriate packages are loaded, too - before submitting your job. # Submitting R Scripts There are two ways to submit scripts in interpreted languages: The script can be in any other scripting language (e.g. bash) by invoking $ sbatch <jobscript> with a jobscript like:

#!/bin/bash

#SBATCH -J descriptive_job_name
#SBATCH -p smp/parallel  # M2 example
#SBATCH -N 1       # e.g. one full node - do not do this, when your script is not using parallel code!
#SBATCH -t 120     # e.g. 2 hours
#SBATCH ...        # further options

# do not forget to export OMP_NUM_THREADS, if the library you use, supports this
export OMP_NUM_THREADS=1 # the example holds for MOGON I, but most libraries do
# not scale up to 64 threads
srun <srun parameters, e.g. threads per R-instance> R --no-save --slave -f <R-script_name>

Note the –no-save-flag. This prevents name space clashes, if you have used R previously in different scenarios.

A note on OMP_NUM_THREADS: On MOGON R is linked against OpenBlas. When unlimited OpenBlas tries to create as many threads as cores in our environment. If you create 64 (on MOGON I) instances of R, like in the example below, there will be 64*64 threads. This is not meaningful, would slow the computation tremendously and impossible. Therefore, when calculated bigger matrices, there should be fewer instances than cores of R, but OMP_NUM_THREADS can be set to higher values, such that OMP_NUM_THREADS * No. of R-tasks = number of cores.

Here, the problem lies within the way the shell treads the interpreted scripts: programmers have to supply a shebang along with a fully qualified path to the interpreter. So when loading a given R-module the which command gives you the desired path:

$module load lang/R/...$ which R
However, the loaded R-modules are complex and you would loose this environment.

From hereon the submission header is analogous to the standard one in bash.

# Using R in parallel

R offers several packages for parallel or HP-computing. Rmpi and snow probably are the most common packages for that purpose in conjunction with MPI (the snow link contains a bunch of examples).

Rmpi has been installed with various R-modules. Loading the R-module issue a warning that the respective MPI module has to be loaded. Ignore this warning, if you are not going to use MPI.

Some examples using “pure” RMPI can be found here.

In addition to Rmpi and snow (see below) we provide the dompi package. See below for an example.

For the following example we will look at the snow library.

At first we will look into an example confined to a single node. This script will work on many files named *.in and is to be submitted with the shell script example from above.

#!/env/bin R

library(snow)

cl <- makeCluster(64, type="SOCK")

file_worker <- function(fname) {
print(paste0("working on '", fname, "'"))
}

res <- clusterCall(
cl,
# !!!!!!!!!!!!!!!!!!!!!!!!!!!!
#
# The following function is
# the callback function.
#
# !!!!!!!!!!!!!!!!!!!!!!!!!!!!
file_worker, Sys.glob('*.in')
)

# Print results
print(do.call(rbind,res))

# Stop the cluster
stopCluster(cl)

# Using Rmpi & snow (sample script)

The following sample script shows how to use the packages Rmpi and snow. Note that the modules used to compile Rmpi have to be loaded prior to submitting the script or else the script will crash.

The script uses a callback function, within this function you should place your code to be parallized. Such function can, of course, be called multiple times and should be placed outside the actual clusterCall()-function.

This script is placed and edited courtesy of W. Spitz (University of Bonn) (after this web entry).

# Load the R MPI package, if it is not already loaded
# package twice in more complex setups
library("Rmpi")
}

library(snow)

# In case R exits unexpectedly, have it automatically clean up
# resources taken up by Rmpi (slaves, memory, etc...)

.Last <- function(){
if (mpi.comm.size(1) > 0){
print("Please use mpi.close.Rslaves() to close slaves.")
mpi.close.Rslaves()
}
print("Please use mpi.quit() to quit R")
.Call("mpi_finalize")
}
}

# How many nodes are reserved for the job?
nodelist <- strsplit(Sys.getenv("LSB_MCPU_HOSTS"), split = ' ')[[1]]
nodes = c()
for (index in seq(1, length(nodelist), by=2)) {
nodes<-c(nodes, rep(nodelist[index], nodelist[index+1]))
}

print(nodes)

# Create cluster
cl <- makeMPIcluster(length(nodes))

# Run a call back function on each reserved node
res <- clusterCall(
cl,
# !!!!!!!!!!!!!!!!!!!!!!!!!!!!
#
# The following function is
# the callback function. In
# more elaborate settings it
# should be placed elsewhere
#
# !!!!!!!!!!!!!!!!!!!!!!!!!!!!
function() {
Sys.info()["nodename"]
}
)

# Print results
print(do.call(rbind,res))

# Stop the cluster
stopCluster(cl)
mpi.quit()

In order to submit parallel R scripts using Rmpi, invoke $sbatch <jobscript> with a jobscript like: #!/bin/bash #SBATCH -J descriptive_job_name #SBATCH -N 1 # e.g. one full node #SBATCH -t 2:00 # e.g. 2 hours #SBATCH --mem=100G # e.g. with 100 GB of RAM #SBATCH ... # further options # do not forget to export OMP_NUM_THREADS, if the library you use, supports this export OMP_NUM_THREADS=64 # the example holds for Mogon I, but most libraries do # not scale up to 64 threads srun R --no-save --slave -f <R-script_name> Note the –no-save-flag. This prevents name space clashes, if you have used R previously in different scenarios. # dompi This example script shows the simple usage: library(doMPI) library(foreach) cl <- startMPIcluster(verbose=TRUE) registerDoMPI(cl) data(geneData, package = 'Biobase') pair <- combn(1:nrow(geneData), 2, simplify = F) fakeData <- cbind(geneData, geneData, geneData, geneData) pair2 <- sample(pair, 2000) print(system.time( out <- foreach(p = pair2, .packages = 'boot', .combine = 'rbind') %dopar% { mydata <- cbind(fakeData[p[1],], fakeData[p[2], ]) mycor <- function(x, i) cor(x[i,1], x[i,2]) boot.out <- boot(mydata, mycor, 1000) ci <- boot.ci(boot.out, type = 'bca')$bca[4:5]
c(p, ci)
}
))

closeCluster(cl)
mpi.finalize()

It can be started with \$ sbatch jobscript.sh, like in the example above.

# Dynamic Parallelization

Dynamic Parallelization is offered by the package parallelize.dynamic. Apart from the documentation on the project site, a paper described its setup.

• start/development/scripting_languages/r_bio.txt