R on ARC
General
- Article on parallel computations with R:
- CRAN, HPC and Parallel computing with R: https://cran.r-project.org/web/views/HighPerformanceComputing.html
- How to speed up R code, ArXiv article: https://arxiv.org/pdf/1503.00855.pdf
- EBook, R Programming for Data Science: https://bookdown.org/rdpeng/rprogdatascience/
Text mode interactive shell
When you start R the usual way, you get an interactive R shell where you can type commands and get the results back, like this:
$ module load R/3.6.2
$ R

R version 3.6.2 (2019-12-12) -- "Dark and Stormy Night"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
....
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> Sys.info()
                              sysname                               release
                              "Linux"              "3.10.0-1127.el7.x86_64"
                              version                              nodename
"#1 SMP Tue Mar 31 23:36:51 UTC 2020"                                 "arc"
                              machine                                 login
                             "x86_64"                            "drozmano"
                                 user                        effective_user
                           "drozmano"                            "drozmano"
> quit()
$
Running R scripts from the command line
In order to run R scripts / programs on ARC as jobs, you have to record the commands you want in a text file, for example test.R, and run it as a script non-interactively.
test.R:
cwd = getwd()
cat(" Current Directory: ", cwd, "\n")

t = Sys.time()
cat(" Current time: ", format(t), "\n")

u = Sys.info()["user"]
cat(" User name: ", u, "\n")
There are three ways to run an R script.
From standard input
An R script can be sent to the standard input of the R interactive shell. This is similar to typing the commands in R:
$ R --no-save < test.R

R version 3.6.2 (2019-12-12) -- "Dark and Stormy Night"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> cwd = getwd()
> cat(" Current Directory: ", cwd, "\n")
 Current Directory:  /global/software/src/r/tests 
> 
> t = Sys.time()
> cat(" Current time: ", format(t), "\n")
 Current time:  2020-05-07 15:16:12 
> 
> u = Sys.info()["user"]
> cat(" User name: ", u, "\n")
 User name:  drozmano 
> 
> 
After executing all the commands from the script, R terminates. Note that both the commands and the printed output are shown.
Using the CMD BATCH command
An R script can be passed as an argument to the "R CMD BATCH" command. The output does not go to the screen, but is saved to the .Rout file:
$ R CMD BATCH test.R

$ ls -l
-rw-r--r-- 1 drozmano drozmano  176 May  7 15:03 test.R
-rw-r--r-- 1 drozmano drozmano 1121 May  7 15:19 test.Rout
To see the output use the cat or less commands:
$ cat test.Rout

R version 3.6.2 (2019-12-12) -- "Dark and Stormy Night"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> cwd = getwd()
> cat(" Current Directory: ", cwd, "\n")
 Current Directory:  /global/software/src/r/tests 
> 
> t = Sys.time()
> cat(" Current time: ", format(t), "\n")
 Current time:  2020-05-07 15:19:07 
> 
> u = Sys.info()["user"]
> cat(" User name: ", u, "\n")
 User name:  drozmano 
> 
> 
> proc.time()
   user  system elapsed 
  0.219   0.079   0.369 
The output is very similar to the first way, but contains some additional timing information. Again, both the commands and the output are shown.
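R CMD BATCH also accepts an explicit name for the output file as a second argument, if you want something other than test.Rout (my_output.Rout here is just an example name):

$ R CMD BATCH test.R my_output.Rout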
Using the Rscript version of R
Probably the best non-interactive way to run an R script is to use a special non-interactive version of R, Rscript:
$ Rscript test.R
 Current Directory:  /global/software/src/r/tests 
 Current time:  2020-05-07 15:22:17 
 User name:  drozmano 
In this case R does not print any extra information; only explicitly printed values are shown in the output, and the commands themselves are not printed.
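Rscript is also the natural way to pass command-line arguments to a script: inside R they can be read with the base function commandArgs(). A minimal sketch (the script name and the arguments are made up for illustration):

# args.R
# trailingOnly = TRUE drops R's own options and keeps only the user arguments.
args <- commandArgs(trailingOnly = TRUE)
cat(" Arguments: ", args, "\n")

Running Rscript args.R alpha 10 would then print the two arguments back.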
Using R on ARC
Like other calculations on ARC systems, R scripts and programs are run by submitting an appropriate script for batch scheduling using the sbatch command. For more information about submitting jobs, see the Running jobs article.
R modules
Currently there are several software modules on ARC that provide different versions of R; the versions differ in their release dates.
You can see them using the module command:
$ module avail R

----------- /global/software/Modules/3.2.10/modulefiles ---------
R/3.5.3  R/3.6.2
In addition,

- Module biobuilds/2017.11 provides R v.3.4.2.
- Module bioconda/2018.11 provides R v.3.4.1.
These modules are designed with bioinformatics applications in mind and have a number of specialized R packages preinstalled.
Installed R packages
When installing a new R version, the following packages are typically installed at the same time.
arules purrr xaringan glue covr lintr reprex reticulate utf8 promises
devtools cluster dbscan epiR epitools glasso Hmisc irr mi RSQLite
foreign openxlsx dplyr tidyr stringr stringi lubridate ggplot2 ggvis
rgl htmlwidgets googleVis car lme4 nlme mgcv randomForest multcomp
vcd glmnet survival caret shiny rmarkdown xtable sp maptools maps
ggmap zoo xts quantmod Rcpp data.table XML jsonlite httr
RcppArmadillo manipulate proto dichromat reshape2 mice rpart party
caret randomForest nnet e1071 kernlab neuralnet rnn h2o RSNNS
tensorflow keras infer janitor DataExplorer sparklyr drake DALEX
raster gpclib

# BioConductor
BiocManager GenomicFeatures AnnotationDbi DESeq DESeq2 MAST FEM
DEGseq EBSeq DRIMSeq SGSeq RNASeqR
If you want to use a specific R package with a centrally installed R, you can check whether it has already been installed before attempting to install it:
$ module load R/3.6.2
$ R

R version 3.6.2 (2019-12-12) -- "Dark and Stormy Night"
....
Type 'q()' to quit R.

> is.installed <- function(mypkg)is.element(mypkg, installed.packages()[,1])
> is.installed("FEM")
[1] TRUE
> is.installed("e1071")
[1] TRUE
> is.installed("rgdal")
[1] FALSE
Submitting R jobs
Let us assume that we have an R script we want to run, named test.R. This script is in our home directory, that is /home/drozmano.
Before running the script we want to organize our computations and put each script we run, as a separate job, in its own directory. This separates the jobs and their outputs for easier data management.
We are going to create a directory project1, and inside it we will create subdirectories for each separate job: job01, job02, job03, etc.
It is better to give names that are more descriptive than these, at least for project directories, but that is another topic.
$ ls -l
-rw-r--r-- 1 drozmano drozmano 176 May  7 15:03 test.R

$ mkdir project1
$ mkdir project1/job01
$ mv test.R project1/job01/
$ cd project1/job01

$ ls -l
-rw-r--r-- 1 drozmano drozmano 176 May  7 15:03 test.R
At this point we have organized our job. Now we need to create a job script for it, using your favourite text editor.
$ nano my_test.slurm
Content of my_test.slurm:
#! /bin/bash
#-----------------------------------------
#SBATCH --job-name=my-R-job
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4gb
#SBATCH --time=0-03:00:00
#-----------------------------------------
module load R/3.6.2
Rscript test.R
Once you have finished editing the job script, you can submit it to the job scheduler, SLURM:
$ sbatch my_test.slurm
Submitted batch job 10237139

# You can check the status of the job in the queue using its JobID:
$ squeue -j 10237139
   JOBID PARTITION     NAME     USER ST  TIME  NODES NODELIST(REASON)
10237139   lattice my-R-job drozmano  R  0:10      1 cn269

# or the status of all your jobs in the queue:
$ squeue -u drozmano
   JOBID PARTITION     NAME     USER ST  TIME  NODES NODELIST(REASON)
10237139   lattice my-R-job drozmano  R  0:02      1 cn269

# You can also check the job details and utilization of resources by the job with
$ arc.job-info 10237139
....
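If you realize something is wrong after submitting, the job can be removed from the queue with SLURM's scancel command and the same JobID:

$ scancel 10237139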
Once the job has finished and is no longer in the queue, you can collect the saved output and also check the information that was sent to the screen in the slurm-10237139.out file. The number in the name is the JobID of the job.
$ less slurm-10237139.out
Success.
Doing parallel computations with R
You can take advantage of using multiple CPUs (cores) inside one node to speed up your R computations.
One of the ways to do that is to use the foreach and doParallel R packages.
Here is an example of 2-way parallelization:
....
library(foreach)
library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)
n <- 20
result <- foreach(i = 1:n) %dopar% {
....
# Your computations go here.
....
}
stopCluster(cl)
....
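For concreteness, here is a complete, runnable version of the same pattern; the squaring workload is a made-up stand-in for real computations:

library(foreach)
library(doParallel)

# Start 2 worker processes and register them with foreach.
cl <- makeCluster(2)
registerDoParallel(cl)

n <- 20
# Each iteration may run on a different worker; results come back as a list.
result <- foreach(i = 1:n) %dopar% {
    i^2
}

stopCluster(cl)
print(unlist(result))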
Doing this can potentially double the performance of your code. You also have to change the job script to request 2 CPUs from SLURM, as shown below.
Content of my_parallel_test.slurm:
#! /bin/bash
#-----------------------------------------
#SBATCH --job-name=my-parR-job
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=4gb
#SBATCH --time=0-03:00:00
#-----------------------------------------
module load R/3.6.2
Rscript test.R
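A useful refinement is to let the R script size its cluster from the actual SLURM allocation instead of hard-coding the number 2, so the script and the job script cannot fall out of sync. This sketch relies on the standard SLURM_CPUS_PER_TASK environment variable, which SLURM sets when --cpus-per-task is requested:

# Read the CPU count granted by SLURM; fall back to 1 outside of a job.
ncpus <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "1"))
cl <- makeCluster(ncpus)
registerDoParallel(cl)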
When using parallel computations, it is necessary to measure the benefit of multiprocessing. Does doubling the number of CPUs double the performance of your code? If there is little benefit in increasing the number of CPUs, there is not much sense in using more resources. Remember that it is almost always more efficient to use fewer CPUs for your computation.
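A simple way to measure this is to time the same loop with the serial %do% and the parallel %dopar% operators of foreach. A sketch, with an artificial sleep-based workload standing in for real computations:

library(foreach)
library(doParallel)

work <- function(i) { Sys.sleep(0.1); sqrt(i) }  # artificial workload

# Serial timing.
t_serial <- system.time(foreach(i = 1:40) %do% work(i))["elapsed"]

# Parallel timing on 2 workers.
cl <- makeCluster(2)
registerDoParallel(cl)
t_parallel <- system.time(foreach(i = 1:40) %dopar% work(i))["elapsed"]
stopCluster(cl)

cat("Serial:", t_serial, "s; parallel:", t_parallel, "s; speedup:",
    round(t_serial / t_parallel, 2), "\n")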
Adding R packages
Generally, we cannot support and manage specific packages for every user on our cluster. When a new version of R is installed we try to add some of the more popular packages to the install right away (see above for the list of expected packages), but if you need something outside of that list you will have to install the needed packages yourself.
Fortunately, R is very good at this. When you try to install a package yourself, R will first try to save it to the central location where the R files are; it will then realize that you have no rights to write to that location, automatically create your own personal package library inside your home directory, and save all the needed files there. Very convenient.
You only have to install a package once. Some R scripts and examples include installation commands (install.packages(...)) in the script body, which causes the installation to be done every time the script runs. This is a VERY BAD idea on a cluster. The installation has to be done once, from R's command-line interpreter on the login node; there should not be any installation commands in the R script itself.
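If you want a job to fail with a clear message when a required package is missing, rather than attempting an installation on a compute node, a guard like this sketch can go at the top of the script (the package names are examples only):

# Check that required packages are present; stop with a clear message if not.
for (pkg in c("utf8", "data.table")) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop("Package '", pkg, "' is not installed. ",
         "Install it once from an interactive R session on the login node.")
  }
}
library(utf8)
library(data.table)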
Installing a native R package
We can install the "utf8" package like this:
$ module load R/3.5.3
$ R

R version 3.5.3 (2019-03-11) -- "Great Truth"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
....
Type 'q()' to quit R.

> is.installed <- function(mypkg)is.element(mypkg, installed.packages()[,1])
> is.installed("utf8")
[1] FALSE

> install.packages("utf8")
--- Please select a CRAN mirror for use in this session ---
Secure CRAN mirrors

 1: 0-Cloud [https]
 2: Australia (Canberra) [https]
 3: Australia (Melbourne 1) [https]
.....
60: Uruguay [https]
61: (other mirrors)

Selection: 61
Other CRAN mirrors

 1: Algeria [https]
 2: Argentina (La Plata)
 3: Belgium (Antwerp) [https]
 4: Canada (BC) [https]
 5: Canada (MB) [https]
 6: Canada (NS) [https]
....
35: USA (NC)
36: USA (PA 1)
37: USA (PA 2)

Selection: 4
trying URL 'https://mirror.rcg.sfu.ca/mirror/CRAN/src/contrib/utf8_1.1.4.tar.gz'
Content type 'application/x-gzip' length 218882 bytes (213 KB)
==================================================
downloaded 213 KB

* installing *source* package ‘utf8’ ...
.....
.....
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (utf8)

The downloaded source packages are in
        ‘/tmp/RtmpEbfZKU/downloaded_packages’
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
> 
> library(utf8)
> quit()
Save workspace image? [y/n/c]: n
$
Now we can use the "utf8" package. Success. Next time you use R, the package will already be there.
Some notes about the example above:
- Before installing the package, we first checked whether it was already installed, by defining an is.installed() function and using it.
- If you ask R to install a package that is already installed and available, it will not warn you; it will simply go on and install it again.
- This example uses a CRAN mirror from Canada/BC instead of a mirror from the more popular list.
Installing a wrapper package
The process is similar to a native package install, but when you try to install the package, the installation may fail, like this:
......
> is.installed <- function(mypkg)is.element(mypkg, installed.packages()[,1])
> is.installed("rgdal")
[1] FALSE
> is.installed("rgl")
[1] FALSE

> install.packages("rgdal")
--- Please select a CRAN mirror for use in this session ---
Secure CRAN mirrors
....
Selection: 61
.....
Selection: 4

trying URL 'https://mirror.rcg.sfu.ca/mirror/CRAN/src/contrib/rgdal_1.5-10.tar.gz'
Content type 'application/x-gzip' length 2300923 bytes (2.2 MB)
==================================================
downloaded 2.2 MB

* installing *source* package ‘rgdal’ ...
** package ‘rgdal’ successfully unpacked and MD5 sums checked
configure: R_HOME: /global/software/r/r-3.5.3/lib64/R
configure: CC: gcc -std=gnu99
configure: CXX: g++
configure: CXX11 is: g++, CXX11STD is: -std=gnu++11
configure: CXX is: g++ -std=gnu++11
configure: C++11 support available
configure: rgdal: 1.5-10
checking for /usr/bin/svnversion... yes
configure: svn revision: 1006
checking for gdal-config... no
no
configure: error: gdal-config not found or not executable.
ERROR: configuration failed for package ‘rgdal’
* removing ‘/global/software/r/r-3.5.3/lib64/R/library/rgdal’

The downloaded source packages are in
        ‘/tmp/RtmpGB34Hw/downloaded_packages’
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
Warning message:
In install.packages("rgdal") :
  installation of package ‘rgdal’ had non-zero exit status
> 
Here you have to read the output carefully and find the reason for the failure. The key information is where the ERROR happens for the first time. Here:
configure: svn revision: 1006
checking for gdal-config... no
no
configure: error: gdal-config not found or not executable.
ERROR: configuration failed for package ‘rgdal’
...
The key phrase is:
configure: error: gdal-config not found or not executable.
The rgdal package is not a native package for R: it is not implemented from scratch in R, but instead provides a way to use the completely independent GDAL library from the R environment. Therefore, while being installed, the package checks for the presence of the library in the system and will only install if the library is present.
In this case, the installer could not find a piece of the GDAL library called gdal-config and aborted, assuming that the library is not present.
To install this package, the corresponding library first has to be made available. It is possible that the library is already installed on the cluster, but its module has to be loaded first. Like this:
$ module load R/3.5.3
$ module load osgeo/gdal/3.0.2
$ R
....
> install.packages("rgdal")
.....
.....
configure: svn revision: 1006
checking for gdal-config... /global/software/gdal/gdal-3.0.2/bin/gdal-config
checking gdal-config usability... yes
configure: GDAL: 3.0.2
....
....
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (rgdal)

The downloaded source packages are in
        ‘/tmp/Rtmpd2nm4U/downloaded_packages’
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
> 
This time the installation has been successful. The installer could find the gdal-config piece and could continue.
The package is now ready to be used in your R scripts, but every time you submit a job that is going to use it, you will have to load the corresponding osgeo/gdal/3.0.2 module before running the script.
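In a job script that would look like the following sketch, which mirrors the job scripts shown earlier (my_gdal_script.R is a made-up name standing in for your own script):

#! /bin/bash
#SBATCH --job-name=my-rgdal-job
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4gb
#SBATCH --time=0-03:00:00

# Load R and the GDAL library that rgdal needs at run time.
module load R/3.5.3
module load osgeo/gdal/3.0.2

Rscript my_gdal_script.R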
If there is no module for the library you want to use from R, it has to be installed first. This, however, has nothing to do with R and is a completely different task.
BioConductor
BioConductor contains lots of bioinformatics-related packages for R.
- Installation: https://www.bioconductor.org/install/
The BioConductor package manager should already be installed in the centrally installed R versions.
Adding BioConductor packages is a bit different from installing a standard R package:
# Find packages:
> BiocManager::available()

# Install specific packages:
> BiocManager::install(c("GenomicFeatures", "AnnotationDbi"))
> BiocManager::install(c("DESeq", "DESeq2"))
> BiocManager::install(c("MAST", "FEM", "DEGseq", "EBSeq", "DRIMSeq", "SGSeq", "RNASeqR"))
If you are managing your own version of R in your home directory, you can install the core part of BioConductor as follows:
> install.packages("BiocManager")
> BiocManager::install()
Then follow the instructions given above.
Installing R
Let us imagine that we want to install R v.3.6.2 into our home directory.
Your home directory's full name is contained in the $HOME session variable, so you can always refer to your home directory as $HOME.
Also, you have to go to R's web site and find a URL link to the file containing the source code for the version of R you want to install, in this specific case version 3.6.2.
The site to go to is https://www.r-project.org/; the file has to have a .tar.gz extension.
Once you have the link, follow these steps:
$ mkdir -p $HOME/src/R
$ cd $HOME/src/R
$ mkdir -p $HOME/software/R-3.6.2

$ wget http://cran.rstudio.com/src/base/R-3/R-3.6.2.tar.gz
$ tar -xvf R-3.6.2.tar.gz
$ cd R-3.6.2/

$ ./configure --prefix=$HOME/software/R-3.6.2 --with-x=no --with-pcre1 --disable-java --enable-shared-lib
$ make -j4
$ make install
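Before changing anything else, you can run the freshly built R directly as a quick sanity check; it should report version 3.6.2:

$ $HOME/software/R-3.6.2/bin/R --version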
To activate the new install you have to add its bin directory, the one that contains the program files, to your PATH.
Like this:
$ export PATH=$HOME/software/R-3.6.2/bin:$PATH
You have to do this every time you want to use your personal R.
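To avoid typing it every time, you can append the same line to your ~/.bashrc file (assuming you use the default bash shell), so that it is applied automatically in every new session:

$ echo 'export PATH=$HOME/software/R-3.6.2/bin:$PATH' >> ~/.bashrc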