User Tools

Site Tools


development:scripting_languages:r

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

development:scripting_languages:r [2019/10/12 07:40] (current)
meesters created
Line 1: Line 1:
 +====== R and BioConductor ======
 +
 +===== Installing R Libraries =====
 +
 +Much like [[software:​python|Python]]s [[https://​pypi.python.org/​pypi/​virtualenv|virtualenv]] [[https://​github.com/​viking/​Renv|Renv]] offers a R version management. As with virtualenv we advise to use [[https://​github.com/​viking/​Renv|Renv]],​ especially if you intend using different versions of R or different versions of R packages.
 +
 +Generally though, a simple ''​install.packages()''​ will ask for a directory and you can direct it to your home directory.
 +
 +Please ask the HPC-Team to install //bigger libraries// or libraries of (putatively) //wider interest// for you.
 +
 +==== Listing Installed Libraries ====
 +
 +To see all installed libraries for a specific (module) version of R type:
 +<code Rsplus>
 +print(as.data.frame(installed.packages()[,​c(1,​3:​4)]))
 +</​code>​
 +
 +To limit your view to user installed packages, type:
 +<code Rsplus>
 +ip <- as.data.frame(installed.packages()[,​c(1,​3:​4)])
 +rownames(ip) <- NULL
 +ip <- ip[is.na(ip$Priority),​1:​2,​drop=FALSE]
 +print(ip)
 +</​code>​
 +
 +
 +===== Suppressing output =====
 +
 +R per default writes a lot of messages to stdout. Using the library ''​sink''​ this can be suppressed and with SLURM the output can be redirected to a file.
 +
 +''​sink''​ is used as follows:
 +<code Rsplus>
 +library(sink)
 +sink("/​dev/​null"​) ​   # suppress everything from hereon
 +...
 +sink() ​              # to undo prior suppression,​ back to normal now
 +</​code>​
 +
 +
 +
 +
 +====== Compiling functions ======
 +
 +Compiling self-declared functions with the ''​compiler''​ library can help to speed up code enormously. This holds, for lengthy functions and numerical code. It does not for already defined functions within R.
 +
 +<code Rsplus>
 +library(compiler)
 +f <- function(n, x) for (i in 1:n) x = (1 + x)^(-1)
 +g <- cmpfun(f)
 +
 +# this one is not installed on mogon per default
 +library(microbenchmark)
 +compare <- microbenchmark(f(1000,​ 1), g(1000, 1), times = 1000)
 +
 +library(ggplot2)
 +autoplot(compare)
 +</​code>​
 +
 +There also is the option of a just in time compiler in this library:
 +
 +<code Rsplus>
 +library(compiler)
 +enableJIT(1)
 +</​code>​
 +
 +The numeric argument in enableJIT specifies the “level” of compilation. At level 2 and 3 more functions are compiled. In a few rare cases ''​enableJIT''​ will slow down code, particularly,​ if most of it is already pre-compiled or written in C, and/or if the code creates functions repeatedly, which then need to be compiled every time. This is more likely to happen with ''​enableJIT(2)''​ or ''​enableJIT(3)'',​ though these have the potential to speed up code more, as well.
 +
 +
 +
 +
 +====== Bioconductor ======
 +
 +[[http://​www.bioconductor.org/​|Bioconductor]] is a versatile tool package for handling biological data based upon the R statistical programming language. By loading the module ''​R/​3.1.2''​ or newer versions Bioconductor with MPI support becomes available. Missing Bioconductor packages can be installed or updated upon request, albeit not in a rapid fashion.
 +
 +Please ensure that the appropriate packages are loaded, too - before submitting your job.
 +
 +====== Submitting R Scripts ======
 +
 +There are two ways to submit scripts in interpreted languages:
 +
 +===== Submitting using another script =====
 + 
 +The script can be in any other scripting language (e.g. bash) by invoking ''​$ sbatch <​jobscript>''​ with a jobscript like: 
 +
 +<code bash>
 +#!/bin/bash
 + 
 +#SBATCH -J descriptive_job_name
 +#SBATCH -p (node)short # M1 example
 +#SBATCH -p smp/​parallel ​ # M2 example
 +#SBATCH -A <your slurm account>
 +#SBATCH -N 1       # e.g. one full node - do not do this, when your script is not using parallel code!
 +#SBATCH -t 2:00    # e.g. 2 hours
 +#SBATCH ...        # further options
 + 
 +# load your desired R-module
 +module load lang/R # not loading a specific will load the first in the list
 +
 +# do not forget to export OMP_NUM_THREADS,​ if the library you use, supports this
 +export OMP_NUM_THREADS=1 # the example holds for Mogon I, but most libraries do
 +                          # not scale up to 64 threads
 +srun <srun parameters, e.g. threads per R-instance>​ R --no-save --slave -f <​R-script_name>​
 +</​code>​
 +
 +Note the ''​--no-save''​-flag. This prevents name space clashes, if you have used R previously in different scenarios.
 +
 +<WRAP center round alert 95%>
 +A note on ''​OMP_NUM_THREADS'':​ On Mogon R is linked against OpenBlas. When unlimited OpenBlas tries to create as many threads as cores in our environment. If you create 64 (on Mogon I) instances of R, like in the example below, there will be 64*64 threads. This is not meaningful, would slow the computation tremendously and impossible. Therefore, when calculated bigger matrices, there should be fewer instances than cores of R, but ''​OMP_NUM_THREADS''​ can be set to higher values, such that ''​OMP_NUM_THREADS''​ * No. of R-tasks = number of cores.
 +</​WRAP>​
 +   
 +===== Submitting using R =====
 +
 +Here, the problem lies within the way the shell treads the interpreted scripts: programmers have to supply a [[https://​en.wikipedia.org/​wiki/​Shebang_(Unix)|shebang]] along with a fully qualified path to the interpreter. So when loading a given R-module the ''​which''​ command gives you the desired path:
 +
 +<code bash>
 +$ module load lang/R/...
 +$ which R
 +</​code>​
 +
 +<WRAP center round alert 95%>
 +However, the loaded R-modules are complex and you would loose this environment.
 +</​WRAP>​
 +
 +From hereon the submission header is analogous to the [[:​slurm_submit|standard one in bash]].
 +    ​
 +
 +====== Using R in parallel ======
 +
 +R offers several [[http://​cran.r-project.org/​web/​views/​HighPerformanceComputing.html|packages for parallel or HP-computing]]. [[http://​www.stats.uwo.ca/​faculty/​yu/​Rmpi/​|Rmpi]] and [[http://​www.sfu.ca/​~sblay/​R/​snow.html|snow]] probably are the most common packages for that purpose in conjunction with MPI (the snow link contains a bunch of examples).
 +
 +Rmpi has been installed with various R-modules. Loading the R-module issue a warning that the respective MPI module has to be loaded. Ignore this warning, if you are not going to use MPI.
 +
 +Some examples using "​pure"​ RMPI can be found [[http://​www.stat.berkeley.edu/​scf/​paciorek-distribComp.pdf|here.]] ​
 +
 +In addition to Rmpi and snow (see below) we provide the [[https://​cran.r-project.org/​web/​packages/​doMPI/​index.html|dompi package]]. See below for an example.
 +
 +For the following example we will look at the [[https://​cran.r-project.org/​web/​packages/​snow/​|snow library]]. ​
 +
 +===== A simple socket cluster =====
 +
 +
 +At first we will look into an example confined to a single node. This script will work on many files named ''​*.in''​ and is to be submitted with the shell script example from above.
 +
 +<code Rsplus>
 +#!/env/bin R
 +
 +library(snow)
 +
 +cl <- makeCluster(64,​ type="​SOCK"​)
 +
 +file_worker <- function(fname) {
 +    print(paste0("​working on '",​ fname, "'"​))
 +}
 +
 +res <- clusterCall(
 +        cl,
 +        # !!!!!!!!!!!!!!!!!!!!!!!!!!!!
 +        #
 +        # The following function is
 +        # the callback function.
 +        #
 +        # !!!!!!!!!!!!!!!!!!!!!!!!!!!!
 +        file_worker,​ Sys.glob('​*.in'​)
 +        )
 + 
 +# Print results
 +print(do.call(rbind,​res))
 + 
 +# Stop the cluster
 +stopCluster(cl)
 +</​code>​
 +
 +====== Using Rmpi & snow (sample script) ======
 +
 +The following sample script shows how to use the packages Rmpi and snow. Note that **the modules used to compile Rmpi have to be loaded prior to submitting the script** or else the script will crash.
 +
 +The script uses a //callback function//, within this function you should place your code to be parallized. Such function can, of course, be called multiple times and should be placed outside the actual clusterCall()-function.
 +
 +This script is placed and edited courtesy of [[https://​www3.uni-bonn.de/​imbie/​organisation/​ansprechpartner%20A-Z/​waldemar-spitz|W. Spitz (University of Bonn)]] (after [[http://​permalink.gmane.org/​gmane.comp.lang.r.hpc/​430|this web entry]]).
 +
 +<code Rsplus>
 +# Load the R MPI package, if it is not already loaded
 +# This is not strictly necessary, but may avoid loading this
 +# package twice in more complex setups
 +if (!is.loaded("​mpi_initialize"​)) {
 +   ​library("​Rmpi"​)
 +   }
 +
 +library(snow)
 +
 +# In case R exits unexpectedly,​ have it automatically clean up
 +# resources taken up by Rmpi (slaves, memory, etc...)
 +
 +.Last <- function(){
 +if (is.loaded("​mpi_initialize"​)){
 +      if (mpi.comm.size(1) > 0){
 +          print("​Please use mpi.close.Rslaves() to close slaves."​)
 +          mpi.close.Rslaves()
 +      }
 +      print("​Please use mpi.quit() to quit R")
 +      .Call("​mpi_finalize"​)
 +  }
 +}
 +
 +# How many nodes are reserved for the job?
 +nodelist <- strsplit(Sys.getenv("​LSB_MCPU_HOSTS"​),​ split = ' '​)[[1]]
 +nodes = c()
 +for (index in seq(1, length(nodelist),​ by=2)) {
 +  nodes<​-c(nodes,​ rep(nodelist[index],​ nodelist[index+1]))
 +}
 +
 +print(nodes)
 +
 +# Create cluster
 +cl <- makeMPIcluster(length(nodes))
 +
 +# Run a call back function on each reserved node
 +res <- clusterCall(
 +        cl,
 +        # !!!!!!!!!!!!!!!!!!!!!!!!!!!!
 +        #
 +        # The following function is
 +        # the callback function. In
 +        # more elaborate settings it
 +        # should be placed elsewhere
 +        #
 +        # !!!!!!!!!!!!!!!!!!!!!!!!!!!!
 +        function() {
 +                Sys.info()["​nodename"​]
 +        }
 +        )
 +
 +# Print results
 +print(do.call(rbind,​res))
 +
 +# Stop the cluster
 +stopCluster(cl)
 +mpi.quit()
 +</​code>​
 +
 +In order to submit parallel R scripts using Rmpi, invoke ''​$ sbatch <​jobscript>''​ with a jobscript like:
 +
 +<code bash>
 +#!/bin/bash
 +
 +#SBATCH -J descriptive_job_name
 +#SBATCH -N 1       # e.g. one full node
 +#SBATCH -t 2:00    # e.g. 2 hours
 +#SBATCH --mem=100G # e.g. with 100 GB of RAM
 +#SBATCH ...        # further options
 +
 +# do not forget to export OMP_NUM_THREADS,​ if the library you use, supports this
 +export OMP_NUM_THREADS=64 # the example holds for Mogon I, but most libraries do
 +                          # not scale up to 64 threads
 +srun R --no-save --slave -f <​R-script_name>​
 +</​code>​
 +Note the //​--no-save//​-flag. This prevents name space clashes, if you have used R previously in different scenarios.
 +
 +
 +
 +
 +====== dompi ======
 +
 +This [[http://​wiki.gwdg.de/​index.php/​Parallel_analysis_using_Rmpi|example script]] shows the simple usage:
 +
 +<code Rsplus>
 +library(doMPI)
 +library(foreach)
 +
 +cl <- startMPIcluster(verbose=TRUE)
 +registerDoMPI(cl)
 +
 +
 +data(geneData,​ package = '​Biobase'​)
 +pair <- combn(1:​nrow(geneData),​ 2, simplify = F)
 +fakeData <- cbind(geneData,​ geneData, geneData, geneData)
 +pair2 <- sample(pair,​ 2000)
 +print(system.time(
 +out <- foreach(p = pair2, .packages = '​boot',​ .combine = '​rbind'​) %dopar%
 +  {
 +      mydata <- cbind(fakeData[p[1],​],​ fakeData[p[2],​ ])
 +      mycor <- function(x, i) cor(x[i,1], x[i,2])
 +      boot.out <- boot(mydata,​ mycor, 1000)
 +      ci <- boot.ci(boot.out,​ type = '​bca'​)$bca[4:​5]
 +      c(p, ci)
 +  }
 +))
 +print(head(out))
 +
 +closeCluster(cl)
 +mpi.finalize()
 +
 +</​code>​
 +
 +It can be started with ''​$ sbatch jobscript.sh'',​ like in the example above.
 +====== Dynamic Parallelization ======
 +
 +Dynamic Parallelization is offered by the package [[http://​cran.r-project.org/​web/​packages/​parallelize.dynamic/​index.html|parallelize.dynamic]]. Apart from the documentation on the project site, a [[http://​journal.r-project.org/​archive/​2013-2/​boehringer.pdf|paper]] described its setup.
 +
  
development/scripting_languages/r.txt · Last modified: 2019/10/12 07:40 by meesters