R

Summary and Version Information

Package: R
Description: R statistical analysis package
Categories: Numerical Analysis, Research
Version  Module tag  Availability*                                          GPU Ready  Notes
3.0.3    R/3.0.3     Non-HPC Glue systems, Deepthought2 HPCC (RedHat6)      N
3.1.2    R/3.1.2     Non-HPC Glue systems, Deepthought2 HPCC (RedHat6)      N          DEPRECATED: built with gcc/4.6.1 and openmpi/1.6.5 (Rmpi)
3.2.2    R/3.2.2     Non-HPC Glue systems, Deepthought2 HPCC (RedHat6)      N          DEPRECATED: built with gcc/4.9.3 and openmpi/1.8.6 (Rmpi)
3.3.2    R/3.3.2     Non-HPC Glue systems, Deepthought2 HPCC (RedHat6)      N          built with gcc/4.9.3 and openmpi/1.8.6 (Rmpi)
3.5.1    R/3.5.1     Non-HPC Glue systems, Deepthought2 HPCC (64bit-Linux)  N          built with gcc/4.9.3 and openmpi/1.8.6 (Rmpi)
4.0.1    R/4.0.1     Non-HPC Glue systems (All OSes)                        N

Notes:
*: A package labelled as "available" on an HPC cluster can be used on the compute nodes of that cluster. Even software not listed as available on an HPC cluster is generally available on the login nodes of that cluster (assuming it is available for the appropriate OS version, e.g. RedHat Linux 6 for the two Deepthought clusters). This is because the compute nodes do not use AFS and instead have local copies of the AFS software tree, to which we only install packages as requested. Contact us if you need a version listed as not available on one of the clusters.

In general, you need to prepare your Unix environment to be able to use this software. To do this, either:

  • tap TAPFOO
OR
  • module load MODFOO

where TAPFOO and MODFOO are one of the tags in the tap and module columns above, respectively. The tap command will print a short usage text (use -q to suppress this; that is needed in startup dot files); you can get a similar text with module help MODFOO. See the documentation on the tap and module commands for more information.
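
For example, to set up your environment for R 3.3.2 (one of the module tags in the table above), you would run:

module load R/3.3.2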

For packages which are libraries that other codes are built against, see the section on compiling codes for more help.

Tap/module commands listed with a version of current will set up what we consider the most current stable and tested version of the package installed on the system. The exact version is subject to change with little if any notice, and might be platform dependent. Versions labelled new represent a newer version of the package which is still being tested by users; if stability is not a primary concern, you are encouraged to use it. Versions labelled old set up an older version of the package; you should only use these if the newer versions are causing issues. Old versions may be dropped after a while. Again, the exact versions are subject to change with little if any notice.

In general, you can abbreviate the module tags. If no version is given, the default current version is used. For packages with compiler/MPI/etc. dependencies, if a compiler module or MPI library was previously loaded, the module command will try to load the build of the package matching those dependencies. If you specify the compiler/MPI dependency explicitly, it will attempt to load the corresponding compiler/MPI library for you if needed.
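
For example, either of the following (R/3.5.1 being one of the module tags from the table above):

module load R            #loads the default (current) version of R
module load R/3.5.1      #loads R 3.5.1 specifically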

Installing Modules

R's capabilities can be significantly enhanced through the addition of modules (packages). Code can then enable a module with R's library command. The supported R interpreters on the system have a selection of modules preinstalled. If a module you are interested in is not in that list, you can either install a personal copy of the module for yourself, or request that it be installed system-wide. We will make reasonable efforts to accommodate such requests as staffing resources allow.

Installing modules yourself

Installing R packages is usually fairly straightforward, although obviously not all packages install in the same manner. Most, however, will follow the procedure below:

  1. module load R/X.Y.Z to select the version of R you wish to use
  2. Create the directory to hold your R modules, if you have not already done so. The default is the directory R underneath your home directory, but you might wish to put it elsewhere; subdirectories for the R version and platform will be created beneath it.
  3. Unless you opted for the default directory ~/R, you need to tell R what directory you are using by setting the environmental variable R_LIBS_USER. Multiple directories can be listed; separate the paths with the colon (:) character. This needs to be set whenever you wish to use the modules in R, so you will generally want to set it in your .cshrc.mine or .Renviron file (see the example after this list).
  4. There are two standard methods for installing a package: one from the command line, and one from inside R itself. Assuming you are installing the package foo into ~/myRpkgs, the commands would be:
    • From the command line, you will first need to download a tarball with the source code for the package. Many packages can be found at the Comprehensive R Archive Network (CRAN). Assuming you downloaded foo.tar.gz to the current directory, you could then install it with:
      R CMD INSTALL -l ~/myRpkgs foo.tar.gz
    • From within R, the install.packages function will connect to CRAN and download and install the package all in one step, with:
      install.packages("foo", lib="~/myRpkgs", repos="http://cran.r-project.org")
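
For example, to have R pick up the ~/myRpkgs directory automatically, you could put the following line in a .Renviron file in your home directory (the directory name is just the one used in the example above):

R_LIBS_USER=~/myRpkgs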

If all goes well, the package is now installed in the directory you specified and should be available for use by your R scripts.
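
You can quickly confirm the install worked by loading the module from within R; a minimal check, again using the hypothetical package foo and directory ~/myRpkgs:

#Add the personal library to R's search path
#(not needed if R_LIBS_USER is already set)
.libPaths(c("~/myRpkgs", .libPaths()))
#Load the newly installed module; this raises an error if the install failed
library(foo)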

Of course, not all packages install quite that easily. If you are comfortable building modules, hopefully the error messages will provide reasonable guidance as to how to proceed. Otherwise, you can request that Division of Information Technology staff install it for you, though that might take some time depending on staff availability.

Running R in batch mode

Although R's interactive mode is nice for certain things, when you are doing production runs with tried and true scripts, it is usually easier to use R's batch interface. This is especially useful when submitting jobs to an HPC cluster.

If you have some R code in a file test.R and you wish to run it from the command line (or equivalently, from a shell script), you can simply use the Rscript command. E.g.

Rscript --no-save --no-restore test.R

The --no-save and --no-restore options prevent the saving of the workspace at the end of the session and the restoring of saved objects at startup; these are typically what you want when running in batch mode. Older versions of R used R CMD BATCH instead of the Rscript command; the main difference is that the former optionally takes the name of an output file. Both work with currently installed versions of R.
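
For example, the following is roughly equivalent to the Rscript command above, but writes the session output to test.Rout:

R CMD BATCH --no-save --no-restore test.R test.Rout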

For use on one of the HPC clusters, you will generally need to include the above in a job script, like:

#!/bin/bash
#Request 5 hours
#SBATCH -t 5:00:00
#Request 4 GiB per CPU-core
#SBATCH --mem-per-cpu=4096
#Request 1 core
#SBATCH -n 1

#Get our profile (and define module command)
. ~/.profile

#Load required modules
module load R/3.3.2

cd MY_WORK_DIRECTORY

#Make sure OpenMP is not "on"
OMP_NUM_THREADS=1
export OMP_NUM_THREADS

Rscript --no-save --no-restore my_R_code.R
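
You would then submit the script to the scheduler with sbatch (the script name here is just a placeholder):

sbatch my_R_job.sh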

Using R and MPI

Users of one of the high-performance computing (HPC) clusters will likely be interested in running R code that spans multiple processors, often over multiple nodes. This generally is done using MPI. There are a number of R packages that deal with MPI, including:

  • Rmpi
  • snow
  • doSNOW: provides foreach/%dopar% functionality via snow

Most users seem to prefer the snow package, which is higher level and therefore easier to use than Rmpi. There are assorted guides to using R with the snow package on the web.

Below are a few tips, gleaned from such guides and from local experience, that users at UMD might find helpful.

  1. For best results, use the same versions of the compiler and MPI library as were used for building R and its MPI packages. The MPI libraries and compilers used for the different versions of R are listed in the version table at the top of this page. It is best to module load the compiler first (not needed for gcc/4.6.1) and then the OpenMPI library.
  2. We have also had reports of weird errors occurring when using Rmpi (and the packages depending on it) with Infiniband: segfaults and other seemingly random errors when setting up connections. This appears to be related to complications with the use of pinned memory and forking within the R interpreter (see e.g. the CRMDA blog and the OpenMPI developers mailing list archives regarding this issue). As such, we strongly recommend that R users who wish to use MPI disable Infiniband in their mpirun command by adding the arguments --mca btl tcp,self, as shown in the example below.
  3. When using snow or one of its derivatives (e.g. doSNOW), you should launch your code with something like
    #!/bin/bash
    #Request 5 hours
    #SBATCH -t 5:00:00
    #Request 4 GiB per CPU-core
    #SBATCH --mem-per-cpu=4096
    #Request 40 cores
    #SBATCH -n 40
    
    #Get our profile (and define module command)
    . ~/.profile
    
    #Load required modules
    module load gcc/4.9.3
    module load openmpi/1.8.6
    module load R/3.3.2
    
    cd MY_WORK_DIRECTORY
    
    #Make sure OpenMP is not "on"
    OMP_NUM_THREADS=1
    export OMP_NUM_THREADS
    
    #NOTE THE -np 1 below!!!!
    #The --mca btl tcp,self arguments restrict communications to 
    #tcp instead of infiniband.  We have seen issues with Rmpi and infiniband
    mpirun -np 1 --mca btl tcp,self R CMD BATCH --no-save --no-restore my_R_code.R
    

    NOTE the use of -np 1 in the above. Although it looks suspicious (telling mpirun to start only one MPI task when we asked for 40 cores), it is actually correct for most uses of the snow (and derivative) libraries, because snow typically spawns its own workers. If you request more than one MPI task, or omit the -np 1 altogether (which effectively asks mpirun to launch the number of tasks given in the #SBATCH -n line, 40 in this case), you will end up running e.g. 40 copies of the same code, each of which will try to spawn about 40 workers via snow, resulting in a mess (at best very sluggish performance, and more likely weird errors).

  4. Most snow based R code will at some point invoke the makeCluster function, which takes a parameter specifying the size of the "cluster" to create. Typically, you want this size to be one less than the number of cores requested from Slurm, because the process running the R code which spawns the workers is already consuming one CPU core; if you spawn a number of workers equal to the number of cores requested of Slurm, one core will be oversubscribed, which causes issues. Typically you will see an error about an insufficient number of "slots" being available, and the R script just hangs (doing nothing, but not dying until the job is killed for exceeding its walltime, thereby wasting a lot of SUs). It is better to do something like:
    cl <- makeCluster(mpi.universe.size() - 1, type="MPI")
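
    Putting these pieces together, a minimal sketch of a snow-over-MPI script (the clusterApply call with sqrt is just a placeholder workload, not from the examples above; launch it with the mpirun line from the job script above):

    library(Rmpi)
    library(snow)

    #Spawn one fewer worker than the number of cores Slurm granted;
    #the master R process itself occupies one core
    cl <- makeCluster(mpi.universe.size() - 1, type="MPI")

    #Distribute a trivial placeholder computation across the workers
    results <- clusterApply(cl, 1:100, sqrt)
    print(sum(unlist(results)))

    #Shut down the workers and exit MPI cleanly
    stopCluster(cl)
    mpi.quit()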