--- title: "Parallel Evaluations in R" author: Thomas Girke date: "Last update: `r format(Sys.time(), '%d %B, %Y')`" output: html_document: toc: true toc_float: collapsed: true smooth_scroll: true toc_depth: 3 fig_caption: yes code_folding: show number_sections: true fontsize: 14pt bibliography: bibtex.bib weight: 5 type: docs --- ## Overview - R provides a large number of packages for parallel evaluations on multi-core, multi-socket and multi-node systems. The latter are usually referred to as computer clusters. - MPI is also supported - For an overview of parallelization packages available for R see [here](https://cran.r-project.org/web/views/HighPerformanceComputing.html) - One of the most comprehensive parallel computing environments for R is [`batchtools`](https://mllg.github.io/batchtools/articles/batchtools.html#migration). Older versions of this package were released under the name `BatchJobs` [@Bischl2015-rf]. - `batchtools` supports both multi-core and multi-node computations with and without schedulers. By making use of cluster template files, most schedulers and queueing systems are supported (_e.g._ Torque, Sun Grid Engine, Slurm). - The `BiocParallel` package (see [here](https://bioconductor.org/packages/release/bioc/html/BiocParallel.html)) provides similar functionalities as `batchtools`, but is tailored to use Bioconductor objects. ## Reminder: Traditional Job Submission for R This topic is covered in more detail in other tutorials. The following only provides a very brief overview of this submission method. __1.__ Create Slurm submission script, here called [script_name.sh](https://raw.githubusercontent.com/tgirke/GEN242/main/static/custom/slides/R_for_HPC/demo_files/script_name.sh) with: ```bash #!/bin/bash -l #SBATCH --nodes=1 #SBATCH --ntasks=1 #SBATCH --cpus-per-task=1 #SBATCH --mem-per-cpu=1G #SBATCH --time=1-00:15:00 # 1 day and 15 minutes #SBATCH --mail-user=useremail@address.com #SBATCH --mail-type=ALL #SBATCH --job-name="some_test" #SBATCH -p short # Choose queue/partition from: intel, batch, highmem, gpu, short Rscript my_script.R ``` __2.__ Submit R script called [my_script.R](https://raw.githubusercontent.com/tgirke/GEN242/main/static/custom/slides/R_for_HPC/demo_files/my_script.R) by above Slurm script with: ```bash sbatch script_name.sh ``` ## Parallel Evaluations on Clusters with `batchtools` - The following introduces the usage of `batchtools` for a computer cluster using SLURM as scheduler (workload manager). SLURM is the scheduler used by the HPCC at UCR. - Similar instructions are provided in HPCC's manual section covering `batchtools` [here](https://hpcc.ucr.edu/manuals_linux-cluster_parallelR.html) - To simplify the evaluation of the R code on the following slides, the corresponding text version is available for download from [here](https://raw.githubusercontent.com/tgirke/GEN242/main/static/custom/slides/R_for_HPC/demo_files/R_for_HPC_demo.R). ## Hands-on Demo of `batchtools` ### Set up working directory for SLURM First login to your cluster account, open R and execute the following lines. 
## Hands-on Demo of `batchtools`

### Set up working directory for SLURM

First login to your cluster account, open R and execute the following lines. This will create a test directory (here `mytestdir`), redirect R into this directory and then download the required files:

+ [`slurm.tmpl`](https://github.com/tgirke/GEN242/blob/main/content/en/tutorials/rparallel/demo_files/slurm.tmpl)
+ [`.batchtools.conf.R`](https://github.com/tgirke/GEN242/blob/main/content/en/tutorials/rparallel/demo_files/.batchtools.conf.R)

```{r setenvir, eval=FALSE}
dir.create("mytestdir")
setwd("mytestdir")
download.file("https://bit.ly/3Oh9dRO", "slurm.tmpl")
download.file("https://bit.ly/3KPBwou", ".batchtools.conf.R")
```

### Load package and define some custom function

The following code defines a test function (here `myFct`) that will be run on the cluster for demonstration purposes. It subsets the `iris` data frame by rows, and appends the host name and R version of the node where each function call was executed. The R version to be used on each node can be specified in the `slurm.tmpl` file (under `module load`).

```{r custom_fct1, eval=FALSE}
library('RenvModule')
module('load','slurm') # Loads slurm among other modules

library(batchtools)
myFct <- function(x) {
    Sys.sleep(10) # to see job in queue, pause for 10 sec
    result <- cbind(iris[x, 1:4],
                    Node=system("hostname", intern=TRUE),
                    Rversion=paste(R.Version()[6:7], collapse="."))
    return(result)
}
```

### Submit jobs from R to cluster

The following creates a `batchtools` registry, defines the number of jobs and the resource requests, and then submits the jobs to the cluster via SLURM.

```{r submit_jobs, eval=FALSE}
reg <- makeRegistry(file.dir="myregdir", conf.file=".batchtools.conf.R")
Njobs <- 1:4 # Define number of jobs (here 4)
ids <- batchMap(fun=myFct, x=Njobs)
done <- submitJobs(ids, reg=reg, resources=list(partition="short", walltime=120, ntasks=1, ncpus=1, memory=1024))
waitForJobs() # Wait until jobs are completed
```

### Summarize job status

After the jobs have completed, one can inspect their status as follows.

```{r job_status, eval=FALSE}
getStatus() # Summarize job status
showLog(Njobs[1]) # Show log file of first job
# killJobs(Njobs) # Possible from within R, or outside of R with scancel
```

### Access/assemble results

The results are stored as `.rds` files in the registry directory (here `myregdir`). One can access them manually via `readRDS` or use various convenience utilities provided by the `batchtools` package.

```{r result_management, eval=FALSE}
readRDS("myregdir/results/1.rds") # Read first result chunk from its rds file
loadResult(1) # Load first result chunk
lapply(Njobs, loadResult) # Load all result chunks into a list
reduceResults(rbind) # Assemble result chunks into single data.frame
do.call("rbind", lapply(Njobs, loadResult)) # Same as previous line
```

### Remove registry directory from file system

By default, existing registries will not be overwritten. If required, one can explicitly clean and delete them with the following functions.

```{r remove_registry, eval=FALSE}
clearRegistry() # Clear registry in R session
removeRegistry(wait=0, reg=reg) # Delete registry directory
# unlink("myregdir", recursive=TRUE) # Same as previous line
```
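Note that deleting a registry also deletes the result files stored inside it. If the assembled results are still needed afterwards, they can be exported to a regular file before the registry is removed. The following is a minimal sketch; the file name `assembled_results.rds` is an arbitrary choice.

```{r export_results, eval=FALSE}
# Assemble all result chunks and save them outside the registry directory,
# so they survive removeRegistry()/unlink()
assembled <- reduceResults(rbind)
saveRDS(assembled, "assembled_results.rds")
```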
### Load registry into R

Loading a registry can be useful when accessing the results at a later stage or after moving them to a local system.

```{r load_registry, eval=FALSE}
from_file <- loadRegistry("myregdir", conf.file=".batchtools.conf.R")
reduceResults(rbind)
```

## Conclusions

### Advantages of `batchtools`

- supports many parallelization methods: multiple cores on a single machine, as well as multiple CPU sockets and nodes
- most schedulers supported via cluster template files
- takes full advantage of a cluster
- robust job management by organizing results in a file-based registry database
- simplifies submission, monitoring and restarting of jobs
- well supported and maintained package

## Session Info

```{r sessionInfo}
sessionInfo()
```

## References