--- title: "systemPipeR: Workflow Design and Reporting Environment" author: "Author: Daniela Cassol, Le Zhang and Thomas Girke" date: "Last update: `r format(Sys.time(), '%d %B, %Y')`" output: BiocStyle::html_document: toc_float: true code_folding: show package: systemPipeR vignette: | %\VignetteEncoding{UTF-8} %\VignetteIndexEntry{systemPipeR: Workflow design and reporting generation environment} %\VignetteEngine{knitr::rmarkdown} fontsize: 14pt bibliography: bibtex.bib editor_options: chunk_output_type: console weight: 7 type: docs --- ```{css, echo=FALSE} pre code { white-space: pre !important; overflow-x: scroll !important; word-break: keep-all !important; word-wrap: initial !important; } ``` ```{r setup_dir, echo=FALSE, include=FALSE, message=FALSE, warning=FALSE} unlink("rnaseq", recursive = TRUE) systemPipeRdata::genWorkenvir(workflow = "rnaseq") knitr::opts_knit$set(root.dir = 'rnaseq') ``` ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() options(width=80, max.print=1000) knitr::opts_chunk$set( eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")), cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")), tidy.opts=list(width.cutoff=80), tidy=TRUE) ``` ```{r setup, echo=FALSE, message=FALSE, warning=FALSE} suppressPackageStartupMessages({ library(systemPipeR) library(BiocParallel) library(Biostrings) library(Rsamtools) library(GenomicRanges) library(ggplot2) library(GenomicAlignments) library(ShortRead) library(ape) library(batchtools) library(magrittr) }) ```
Source code download:     [ [.Rmd](https://raw.githubusercontent.com/tgirke/GEN242//main/content/en/tutorials/systempiper/systemPipeR.Rmd) ]     [ [.R](https://raw.githubusercontent.com/tgirke/GEN242//main/content/en/tutorials/systempiper/systemPipeR.R) ]
### Introduction

[_`systemPipeR`_](http://www.bioconductor.org/packages/devel/bioc/html/systemPipeR.html) provides flexible utilities for designing, building and running automated end-to-end analysis workflows for a wide range of research applications, including next-generation sequencing (NGS) experiments, such as [RNA-Seq](https://bit.ly/3CbJkOE), [ChIP-Seq](https://bit.ly/3CfsoXi), [VAR-Seq](https://bit.ly/45N1GCY) and [Ribo-Seq](https://bit.ly/3IUptHm) [@H_Backman2016-bt]. Important features include a uniform workflow interface across different data analysis applications, automated report generation, and support for running both R and command-line software, such as NGS aligners or peak/variant callers, on local computers or compute clusters (Figure 1). The latter supports both interactive job submissions and batch submissions to the queuing systems of clusters.

_`systemPipeR`_ has been designed to improve the reproducibility of large-scale data analysis projects while substantially reducing the time it takes to analyze complex omics data sets. It provides a uniform workflow interface and management system that allows users to run selected workflow steps, as well as customize and design entirely new workflows. Additionally, the package takes advantage of central community S4 classes of the Bioconductor ecosystem, and enhances them with command-line software support.

The main motivation and advantages of using _`systemPipeR`_ for complex data analysis tasks are:

1. Design of complex workflows involving multiple R/Bioconductor packages
2. Common workflow interface for different applications
3. User-friendly access to widely used Bioconductor utilities
4. Support of command-line software from within R
5. Reduced complexity of using compute clusters from R
6. Accelerated runtime of workflows via parallelization on computer systems with multiple CPU cores and/or multiple nodes
7. Improved reproducibility by automating the generation of analysis reports
**Figure 1:** Relevant features in _`systemPipeR`_. Workflow design concepts are illustrated under (A). Examples of `systemPipeR's` visualization functionalities are given under (B).
A central concept for designing workflows within the _`systemPipeR`_ environment is the use of workflow management containers. Workflow management containers facilitate the design and execution of complex data analysis steps. For its command-line interface _`systemPipeR`_ adopts the widely used [Common Workflow Language](https://www.commonwl.org/) (CWL) [@Amstutz2016-ka]. The interface to CWL is established by _`systemPipeR's`_ workflow control class called _`SYSargsList`_ (see Figure 2). This design offers many advantages. For instance, workflows can be run either entirely from within R, from various command-line wrappers (e.g., *cwl-runner*) or from other languages (*e.g.*, Bash or Python). Apart from providing support for both command-line and R/Bioconductor software, the package provides utilities for containerization, parallel evaluations on computer clusters and automated generation of interactive analysis reports.
**Figure 2:** Overview of `systemPipeR` workflow management instances. (A) A typical analysis workflow requires multiple software tools (red), as well as descriptions of the input (green) and output files, including analysis reports (purple). (B) `systemPipeR` provides multiple utilities to design and build workflows, allowing multi-instance steps, integration of R code and command-line software, and a simple and efficient annotation system that automatically tracks input and output data, along with multiple resources for managing the entire workflow. (C) Options are provided to execute single or multiple workflow steps, while enabling scalability, checkpoints, and generation of technical and scientific reports.

An important feature of _`systemPipeR's`_ CWL interface is that it provides two options to run command-line tools and workflows based on CWL. First, one can run CWL in its native way via an R-based wrapper utility for *cwl-runner* or *cwl-tools* (CWL-based approach). Second, one can run workflows using CWL's command-line and workflow instructions from within R (R-based approach). In the latter case the same CWL workflow definition files (*e.g.* `*.cwl` and `*.yml`) are used but rendered and executed entirely with R functions defined by _`systemPipeR`_, thus using CWL mainly as a command-line and workflow definition format rather than as execution software. In this regard _`systemPipeR`_ also provides several convenience functions that are useful for designing and debugging workflows, such as a command-line rendering function that retrieves the exact command-line strings for each data set and processing step prior to executing them.

### Workflow Management with _`SYSargsList`_

The `SYSargsList` S4 class is a list-like container that stores the paths to all input and output files along with the corresponding parameters used in each analysis step (see Figure 3). `SYSargsList` instances are constructed from an optional targets file, and two CWL parameter files including `*.cwl` and `*.yml` (for details, see below). When running preconfigured NGS workflows, the only input the user needs to provide is the initial targets file containing the paths to the input files (e.g., FASTQ) and experiment design information, such as sample labels and biological replicates. Subsequent targets instances are created automatically, based on the connectivity established between the workflow steps. _`SYSargsList`_ containers store all information required for one or multiple steps. This establishes central control for running, monitoring and debugging complex workflows from start to finish.
**Figure 3:** Workflow steps with input/output file operations are controlled by the _`SYSargsList`_ container. Each of its components (_`SYSargs2`_) is constructed from an optional *targets* and two *param* files. Alternatively, _`LineWise`_ instances containing pure R code can be used.

## Getting Started

### Installation

The R software for running [_`systemPipeR`_](http://www.bioconductor.org/packages/devel/bioc/html/systemPipeR.html) can be downloaded from [_CRAN_](http://cran.at.r-project.org/). The `systemPipeR` environment can be installed from the R console using the [_`BiocManager::install`_](https://cran.r-project.org/web/packages/BiocManager/index.html) command. The associated data package [_`systemPipeRdata`_](http://www.bioconductor.org/packages/devel/data/experiment/html/systemPipeRdata.html) can be installed the same way. The latter is a helper package for generating, with a single command, `systemPipeR` workflow environments containing all parameter files and sample data required to quickly test and run workflows.

```{r install, eval=FALSE}
if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager")
BiocManager::install("systemPipeR")
BiocManager::install("systemPipeRdata")
```

To use command-line software, the corresponding tool and dependencies need to be installed on a user's system. See [details](#third-party-software-tools).

### Loading package and documentation

```{r documentation, eval=FALSE}
library("systemPipeR") # Loads the package
library(help="systemPipeR") # Lists package info
vignette("systemPipeR") # Opens vignette
```

### Load sample data and workflow templates

The mini sample FASTQ files used by this introduction as well as the associated workflow reporting vignettes can be loaded via the _`systemPipeRdata`_ package as shown below. The chosen data set [`SRP010938`](http://www.ncbi.nlm.nih.gov/sra/?term=SRP010938) contains 18 paired-end (PE) read sets from _Arabidopsis thaliana_ [@Howard2013-fq]. To minimize processing time during testing, each FASTQ file has been subsetted to 90,000-100,000 randomly sampled PE reads that map to the first 100,000 nucleotides of each chromosome of the _A. thaliana_ genome. The corresponding reference genome sequence (FASTA) and its GFF annotation files (provided in the same download) have been truncated accordingly. This way the entire test sample data set requires less than 200MB disk storage space. A PE read set has been chosen for this test data set for flexibility, because it can be used for testing both types of analysis routines requiring either SE (single-end) reads or PE reads.

The following generates a fully populated _`systemPipeR`_ workflow environment (here for RNA-Seq) in the current working directory of an R session. The `systemPipeRdata` package provides preconfigured workflow templates for RNA-Seq, ChIP-Seq, VAR-Seq, and Ribo-Seq. Additional templates are available on the project web site [here](https://systempipe.org/).

```{r generate_workenvir, eval=FALSE}
systemPipeRdata::genWorkenvir(workflow="rnaseq")
setwd("rnaseq")
```

### Project structure

The working environment of the sample data loaded in the previous step contains the following pre-configured directory structure (Figure 4). Directory names are indicated in ***green***. Users can change this structure as needed, but need to adjust the code in their workflows accordingly. A minimal example for inspecting this structure from within R is given after the list below.

* _**workflow/**_ (*e.g.* *rnaseq/*)
    + This is the root directory of the R session running the workflow.
    + Run script (*\*.Rmd*) and sample annotation (*targets.txt*) files are located here.
    + Note, this directory can have any name (*e.g.* _**rnaseq**_, _**varseq**_). Changing its name does not require any modifications in the run script(s).
    + **Important subdirectories**:
        + _**param/**_
            + _**param/cwl/**_: This subdirectory stores all the CWL parameter files. To organize workflows, each can have its own subdirectory, where all `CWL param` and `input.yml` files need to be in the same subdirectory.
        + _**data/**_
            + FASTQ files
            + FASTA file of reference (*e.g.* reference genome)
            + Annotation files
            + etc.
        + _**results/**_
            + Analysis results are usually written to this directory, including: alignment, variant and peak files (BAM, VCF, BED); tabular result files; and image/plot files.
            + Note, the user has the option to organize results files for a given sample and analysis step in a separate subdirectory.
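To inspect the generated structure from R, one can list its content with base R functions. The following is a minimal sketch, assuming the workflow environment was generated with `genWorkenvir(workflow="rnaseq")` and the session's working directory was set to its root as shown above.

```{r list_project_structure, eval=FALSE}
## List top-level files and directories of the workflow environment
list.files(".")
## Preview the CWL parameter files organized in subdirectories under param/cwl
head(list.files("param/cwl", recursive = TRUE), n = 10)
```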
**Figure 4:** *systemPipeR's* preconfigured directory structure.

The following parameter files are included in each workflow template:

1. *`targets.txt`*: initial one provided by user; downstream *`targets_*.txt`* files are generated automatically
2. *`param/cwl/*`*: defines the parameters for input/output file operations, *e.g.*:
    + *`hisat2/hisat2-mapping-se.cwl`*
    + *`hisat2/hisat2-mapping-se.yml`*
3. Configuration files for computer cluster environments (skip on single machines):
    + *`.batchtools.conf.R`*: defines the type of scheduler for *`batchtools`*, pointing to the template file of the cluster; located in the user's home directory
    + *`*.tmpl`*: specifies the parameters of the scheduler used by a system, *e.g.* Torque, SGE, Slurm, etc.

### Structure of _`targets`_ file

The _`targets`_ file defines all input files (_e.g._ FASTQ, BAM, BCF) and sample comparisons of an analysis workflow. The following shows the format of a sample _`targets`_ file included in the package. It can also be viewed and downloaded from _`systemPipeR`_'s GitHub repository [here](https://github.com/tgirke/systemPipeR/blob/master/inst/extdata/targets.txt). In a targets file with a single type of input file, here FASTQ files of single-end (SE) reads, the first three columns are mandatory including their column names, while four mandatory columns are required for FASTQ files of PE reads. All subsequent columns are optional and any number of additional columns can be added as needed. The columns in targets files are expected to be tab separated (TSV format). The `SampleName` column usually contains short labels for referencing samples (here FASTQ files) across many workflow steps (_e.g._ plots and column titles). Importantly, the labels used in the `SampleName` column need to be unique, while technical or biological replicates are indicated by duplicated values under the `Factor` column. For readability and transparency, it is useful to use here a short, consistent and informative syntax for naming samples and replicates. To avoid problems with other packages or external software, it is recommended to use the basic naming rules for R objects and their components as outlined [here](https://girke.bioinformatics.ucr.edu/GEN242/tutorials/rbasics/rbasics/#data-objects). This is important since the values used under the `SampleName` and `Factor` columns are intended to be used as labels for naming columns or plotting features in downstream analysis steps.

Users should note that the usage of targets files is optional when using *systemPipeR's* new CWL interface. They can be replaced by a standard YAML input file used by CWL. However, since targets files are extremely useful and user-friendly for organizing experimental variables, we encourage users to keep using them.

#### Structure of _`targets`_ file for single-end (SE) samples

```{r targetsSE, eval=TRUE, message=FALSE}
library(systemPipeR)
targetspath <- system.file("extdata", "targets.txt", package = "systemPipeR")
showDF(read.delim(targetspath, comment.char = "#"))
```

To work with custom data, users need to generate a _`targets`_ file containing the paths to their own FASTQ files and then provide under _`targetspath`_ the path to the corresponding _`targets`_ file.

#### Structure of _`targets`_ file for paired-end (PE) samples

For paired-end (PE) samples, the structure of the targets file is similar, where users need to provide two FASTQ path columns: *`FileName1`* and *`FileName2`* with the paths to the PE FASTQ files.
```{r targetsPE, eval=TRUE}
targetspath <- system.file("extdata", "targetsPE.txt", package = "systemPipeR")
showDF(read.delim(targetspath, comment.char = "#"))
```

#### Sample comparisons

Sample comparisons are defined in the header lines of the _`targets`_ file starting with '``# ``'.

```{r comment_lines, echo=TRUE}
readLines(targetspath)[1:4]
```

The function _`readComp`_ imports the comparison information and stores it in a _`list`_. Alternatively, _`readComp`_ can obtain the comparison information from the corresponding _`SYSargsList`_ object (see below). Note, these header lines are optional. They are mainly useful for controlling comparative analyses according to certain biological expectations, such as identifying differentially expressed genes in RNA-Seq experiments based on simple pair-wise comparisons.

```{r targetscomp, eval=TRUE}
readComp(file = targetspath, format = "vector", delim = "-")
```

## Project initialization

To create a workflow within _`systemPipeR`_, we can start by defining an empty container and checking the directory structure:

```{r SPRproject1, eval=TRUE}
sal <- SPRproject()
```

Internally, the `SPRproject` function creates by default a hidden directory called `.SPRproject`. This directory stores all log files generated during a workflow run. Initially, the object `sal` is an empty container holding only the basic project information. The project information can be accessed with the `projectInfo` method.

```{r projectInfo, eval=TRUE}
sal
projectInfo(sal)
```

Also, the `length` function returns how many steps a workflow contains. Since no steps have been added yet, it returns zero.

```{r length, eval=TRUE}
length(sal)
```

## Building workflows

_`systemPipeR`_ workflows can be populated with a single command from an R Markdown file or stepwise in interactive mode. This section introduces first how to build a workflow stepwise in interactive mode, and then how to build a workflow with a single command from an R Markdown file.

### Stepwise workflow construction

New workflows are constructed, or existing ones modified, by connecting each step via the `appendStep` method. Each step in a `SYSargsList` instance contains the instructions needed for processing a set of input files with a specific command-line call, along with the paths to the expected outfiles. For constructing R code-based workflow steps, the constructor function `LineWise` is used.

The following demonstrates how to create a command-line workflow step, here using the short read aligner software HISAT2 [@Kim2015-ve] as an example. The constructor function renders the proper command-line strings for each sample and software tool, appending a new step to the `SYSargsList` object (`sal`) created above. In this example, the `SYSargsList` constructor function uses data from three input files:

- CWL command-line specification file (`wf_file` argument)
- Input variables (`input_file` argument)
- Targets file (`targets` argument)

In CWL, files with the extension *`.cwl`* define the parameters of a chosen command-line step or an entire workflow, while files with the extension *`.yml`* define the input variables of command-line steps. Note, input variables provided by a *targets* or *targetsPE* file can be passed on via the *inputvars* argument to the *`.yml`* file and from there to the _`SYSargsList`_ object.
```{r hisat2_mapping, eval=TRUE}
appendStep(sal) <- SYSargsList(
    step_name = "hisat2_mapping",
    dir = TRUE,
    targets = "targetsPE.txt",
    wf_file = "workflow-hisat2/workflow_hisat2-pe.cwl",
    input_file = "workflow-hisat2/workflow_hisat2-pe.yml",
    dir_path = "param/cwl",
    inputvars = c(FileName1 = "_FASTQ_PATH1_", FileName2 = "_FASTQ_PATH2_",
                  SampleName = "_SampleName_")
)
```

An overview of the workflow can be printed as follows.

```{r}
sal
```

In the output above, the workflow status for the new step is *Pending*. This means the workflow object has been constructed but not executed yet.

Several accessor methods are available to explore the _`SYSargsList`_ object. Of particular interest is the *`cmdlist()`* method. It constructs the system commands for running command-line software as specified by a given *`.cwl`* file combined with the paths to the input samples (*e.g.* FASTQ files), here provided by a *`targets`* file. The example below shows the *`cmdlist()`* output for running HISAT2 on the first PE read sample. Evaluating the output of *`cmdlist()`* can be very helpful for designing and debugging *`.cwl`* files of new command-line software or changing the parameter settings of existing ones. The rendered command-line instance for each input sample can be returned as follows.

```{r}
cmdlist(sal, step = "hisat2_mapping", targets = 1)
```

The outfiles components of _`SYSargsList`_ define the expected output files for each step in the workflow, some of which serve as input for the next workflow step (see Figure 3).

```{r output_WF, eval=TRUE}
outfiles(sal)
```

In an 'R-centric' rather than a 'CWL-centric' workflow design, the connectivity among workflow steps is established by creating the downstream targets instances automatically (see Figure 3). Each step uses the outfiles from the previous step as input, thereby establishing connectivity among the steps of the workflow. By chaining several _`SYSargsList`_ steps together one can construct complex workflows involving many sample-level input/output file operations with any combination of command-line or R-based software. Moreover, _`systemPipeR`_ provides features to build these connections automatically, which ensures that all samples will be processed in a reproducible manner. Alternatively, a CWL-centric workflow design can be used that defines all or most workflow steps with CWL workflow and parameter files. Due to time and space restrictions, the CWL-centric approach is not covered by this tutorial.

The following R code step generates a `data.frame` containing important read alignment statistics, such as the total number of reads in the FASTQ files, the number of total alignments, as well as the number of primary alignments in the corresponding BAM files. The constructor function `LineWise` requires the `step_name` and the R code assigned to the `code` argument. The R code should be enclosed by braces (`{}`).

```{r align_stats, eval=TRUE}
appendStep(sal) <- LineWise(
    code = {
        fqpaths <- getColumn(sal, step = "hisat2_mapping", "targetsWF", column = "FileName1")
        bampaths <- getColumn(sal, step = "hisat2_mapping", "outfiles", column = "samtools_sort_bam")
        read_statsDF <- alignStats(args = bampaths, fqpaths = fqpaths, pairEnd = TRUE)
        write.table(read_statsDF, "results/alignStats.xls", row.names=FALSE, quote=FALSE, sep="\t")
    },
    step_name = "align_stats",
    dependency = "hisat2_mapping")
```

To inspect the R code of this step, the `codeLine` method can be used.
```{r codeLine, eval=TRUE}
codeLine(sal, "align_stats")
```

The `getColumn` method allows extracting specific information from a workflow instance, here output files that can be accessed from within R as demonstrated below.

```{r getColumn, eval=TRUE}
getColumn(sal, step = "hisat2_mapping", "outfiles", column = "samtools_sort_bam")
```

```{r cleaning, eval=TRUE, include=FALSE}
if (file.exists(".SPRproject")) unlink(".SPRproject", recursive = TRUE)
## NOTE: Removing previous project created in the quick start section
```

### Loading workflows from an R Markdown

The `importWF` function allows loading an entire workflow from an R Markdown (Rmd) file into a `SYSargsList` object that has been initialized with `SPRproject()` as introduced above. Next, one can run the workflow from start to finish with a single function call using `runWF`. Below, the latter has been commented out in order to first introduce additional details before executing the workflow with `runWF`, as discussed in the following section.

```{r importWF_rmd, eval=TRUE}
sal <- SPRproject()
sal <- importWF(sal, file_path = "systemPipeRNAseq.Rmd")
# sal <- runWF(sal)
```

Several accessor functions are available to inspect the workflow.

```{r importWF_details, eval=FALSE}
stepsWF(sal)
dependency(sal)
codeLine(sal)
targetsWF(sal)
```

#### Third-party software tools {#third-party-software-tools}

Examples of preconfigured CWL templates for third-party software tools are provided in the following table. In addition, the tutorial entitled [*Automate Creation of CWL Instructions*](https://girke.bioinformatics.ucr.edu/GEN242/tutorials/cmdtocwl/cmdtocwl/) explains how to create these CWL templates for essentially any command-line tool, simply by providing the command-line string of a new software tool as input.

```{r table_tools, echo=FALSE, message=TRUE}
library(magrittr)
SPR_software <- system.file("extdata", "SPR_software.csv", package="systemPipeR")
software <- read.delim(SPR_software, sep=",", comment.char = "#")
colors <- colorRampPalette(c("darkseagreen", "indianred1"))(length(unique(software$Category)))
id <- as.numeric(unique(software$Category))
software %>%
    dplyr::mutate(Step = kableExtra::cell_spec(Step, color = "white", bold = TRUE,
                  background = factor(Category, id, colors))) %>%
    dplyr::select(Tool, Description, Step) %>%
    dplyr::arrange(Tool) %>%
    kableExtra::kable(escape = FALSE, align = "c", col.names = c("Tool Name", "Description", "Step")) %>%
    kableExtra::kable_styling(c("striped", "hover", "condensed"), full_width = TRUE) %>%
    kableExtra::scroll_box(width = "100%", height = "500px")
```

Importantly, command-line software needs to be installed on a user's system and available in a user's `PATH`. To check whether this is the case, one can use the `tryCL` function.

```{r test_tool_path, eval=FALSE}
tryCL(command="grep")
```

## How to run workflows?

### Setup and requirements

To execute the code in this tutorial, one needs the following software installed.

* R (version >= 4.1.2)
* systemPipeR package (version >= 2.0.8)
* Hisat2 (version >= 2.1.0)

To build custom workflows with any additional command-line software, one needs the respective software installed and available in the user's `PATH`. To test this, one can use the `tryCL` function.

```{r test_software, eval=FALSE}
tryCL(command = "hisat2")
## "All set up, proceed!"
```

### Run from R

For running a workflow, the `runWF` function can be used. It executes all workflow steps stored in a `SYSargsList` container (below named `sal`).
This assumes the `sal` object has been initialized and populated as outlined above using `SPRproject()` and `importWF(sal, file_path="...")`, respectively.

```{r runWF, eval=FALSE}
sal <- runWF(sal)
```

This essential function allows the user to choose one or multiple steps to be executed using its `steps` argument. However, it is necessary to maintain valid dependencies (also see [workflow graph](https://girke.bioinformatics.ucr.edu/GEN242/tutorials/systempiper/systempiper/#visualize-workflow)). If a selected step depends on previous steps whose output is not available yet, the execution will fail.

```{r runWF_error, eval=FALSE}
sal <- runWF(sal, steps = c(1,3))
```

The `runWF` function can force the execution of chosen workflow steps, even if the status of a given step is `'Success'` and the expected `outfiles` exist already. This is useful for updating certain steps when needed. Another option is to force `runWF` to ignore all warnings and errors. This can be achieved by assigning `FALSE` to the arguments `warning.stop` and `error.stop`, respectively.

```{r runWF_force, eval=FALSE}
sal <- runWF(sal, force = TRUE, warning.stop = FALSE, error.stop = FALSE)
```

While `SYSargsList` (`sal`) objects are autosaved when working with workflows, it can sometimes be safer to explicitly save the object before closing R.

```{r saveWF, eval=FALSE}
sal <- write_SYSargsList(sal)
```

#### Resume and reset workflows

To restart workflows, set `resume=TRUE`.

```{r resumeWF, eval=FALSE}
sal <- SPRproject(resume=TRUE)
```

To delete all steps in a workflow, including the content saved under the `.SPRproject` directory, use `overwrite=TRUE`. This option should be used with caution, since it effectively deletes most previous workflow data.

```{r overwriteWF, eval=FALSE}
sal <- SPRproject(overwrite=TRUE)
```

For additional options and details, users may want to consult the help documentation for: `?SPRproject`, `?runWF` and `?'SYSargsList-class'`.

#### Parallelization on clusters

The computation of time-consuming steps can be greatly accelerated by processing many files in parallel using several compute nodes of a cluster, where a scheduling system is used for load balancing. The following `resources` list provides a `SYSargsList` (`sal`) object with the number of independent parallel cluster processes, defined under the `Njobs` element of the list. The given example will run 18 processes in parallel, using 4 CPU cores for each, thus utilizing a total of 72 CPU cores. Note, `runWF` can be used with most queueing systems as it is based on utilities defined by the `batchtools` package, which supports the use of template files (_`*.tmpl`_) for defining the run parameters of different schedulers. To run the following code, one needs to have both a `conffile` (see _`.batchtools.conf.R`_ samples [here](https://mllg.github.io/batchtools/)) and a `template` file (see _`*.tmpl`_ samples [here](https://github.com/mllg/batchtools/tree/master/inst/templates)) for the queueing system available on a system. The following example uses the sample `conffile` and `template` files for the Slurm scheduler provided by this package.

The `resources` list can be appended when a workflow step is generated. Alternatively, one can append these resource specifications to any step of a pre-generated `SYSargsList` with the `addResources` function.
```{r runWF_cluster, eval=FALSE}
resources <- list(conffile=".batchtools.conf.R",
                  template="batchtools.slurm.tmpl",
                  Njobs=18,
                  walltime=120, # in min
                  ntasks=1,
                  ncpus=4,
                  memory=1024, # in Mb
                  partition = "short"
                  )
sal <- addResources(sal, c("hisat2_mapping"), resources = resources)
sal <- runWF(sal)
```

The above example will submit via `runWF(sal)` the *hisat2_mapping* step to a partition called `short` on a computer cluster. Users need to adjust this and other parameters defined in the `resources` list to their environment.

### Run from Command-Line (without cluster)

Create an R script containing the following (or similar) minimum content.

```sh
#!/usr/bin/env Rscript

library(systemPipeR)
sal <- SPRproject()
sal <- importWF(sal, file_path = "systemPipeRNAseq.Rmd") # adjust name of Rmd file if different
sal <- runWF(sal) # runs entire workflow
sal <- renderReport(sal) # after workflow has completed render Rmd to HTML report
```

Assuming the script is named `wf_run_script.R`, it can be executed from the command-line (not the R console!) as follows. In addition, one can make the script executable to run it like any other script.

```sh
Rscript wf_run_script.R
```

This will run `systemPipeR` workflows on a single machine. In this case a limited amount of parallelization is possible if the corresponding code chunks in the workflow take advantage of multi-core parallelization instructions provided by `BiocParallel`, `batchtools` or related packages. However, this type of parallelization is usually limited to the number of cores available on a single CPU. A much more scalable approach is the use of computer clusters as described above and in the next section.

### Submit workflow from command-line to cluster

In addition to running workflows within interactive R sessions or submitting them from the command-line on a single system (see above), one can submit `systemPipeR` workflows from the command-line to an HPC cluster by including the relevant workflow run instructions in an R script and then submitting it to the cluster via the submission script of a workload manager. The following gives an example for the Slurm workload manager. To understand the details, it is important to inspect the content of the two files (here .R and .sh). Additional details about resource specification under Slurm are given [above](https://girke.bioinformatics.ucr.edu/GEN242/tutorials/systempiper/systempiper/#parallelization-on-clusters).

- R script: [wf_run_script.R](https://raw.githubusercontent.com/tgirke/GEN242/main/static/custom/spWFtemplates/cl_sbatch_run/wf_run_script.R)
- Slurm submission script: [wf_run_script.sh](https://raw.githubusercontent.com/tgirke/GEN242/main/static/custom/spWFtemplates/cl_sbatch_run/wf_run_script.sh)

As a test, users can generate in their user account on a cluster a workflow environment populated with the toy data as outlined [here](https://girke.bioinformatics.ucr.edu/GEN242/tutorials/sprnaseq/sprnaseq/#workflow-environment). After this, it is best to create within the workflow directory a subdirectory, e.g. called `cl_sbatch_run`, and then save the above two files to this subdirectory. Next, the parameters in both files need to be adjusted to match the type of workflow and the required computing resources. This includes the name of the `Rmd` file and scheduler resource settings such as: `partition`, `Njobs`, `walltime`, `memory`, etc.
After all relevant settings have been set correctly, one can execute the workflow with `sbatch` from within the `cl_sbatch_run` directory as follows (note this is a command-line call outside of R):

```sh
sbatch wf_run_script.sh
```

After the submission to the cluster, one usually should check its status and progress with `squeue -u <username>` (under Slurm), as well as by monitoring the content of the `slurm-<jobid>.out` file generated by the scheduler in the same directory. This file contains most of the `STDOUT` and `STDERROR` generated by a cluster job. Once everything is working on the toy data, users can run the workflow on real data the same way.

### Visualize workflow

_`systemPipeR`_ workflow instances can be visualized with the `plotWF` function. The resulting plot includes the following information.

- Workflow topology graph (dependency graphs between different steps)
- Workflow step status, *e.g.* `Success`, `Error`, `Pending`, `Warnings`
- Sample status and statistics
- Workflow timing: run duration time

If no arguments are provided, the basic plot will automatically detect width, height, layout, plot method, branches, *etc*.

```{r, eval=TRUE}
plotWF(sal)
```

### Technical reports

_`systemPipeR`_ compiles all workflow execution logs in one central location, making it easy to check any standard output (`stdout`) or standard error (`stderr`) of any command-line tool used in a workflow. Also, the information is appended to the workflow plot, making it accessible by clicking on the respective steps.

```{r, eval=FALSE}
sal <- renderLogs(sal)
```

### Scientific reports

_`systemPipeR`_ auto-generates scientific analysis reports in HTML format. These reports compile the results of all workflow steps, including text, code, plots and tables.

```{r, eval=FALSE}
sal <- renderReport(sal)
```

Alternatively, scientific reports can be rendered with the `render` function from `rmarkdown`.

```{r, eval=FALSE}
rmarkdown::render("systemPipeRNAseq.Rmd", clean=TRUE, output_format="BiocStyle::html_document")
```

## How to modify workflows?

### Modifying steps

If needed, one can modify existing workflow steps in a pre-populated, and potentially already executed, `SYSargsList` object with the `replaceStep(sal) <-` replacement function. The following gives an example where step number 3 in a `SYSargsList` (`sal`) object will be updated with modified or new code. Note, this is a generalized example where the user needs to insert the code lines between `{` and `}`, and also adjust the values assigned to the arguments: `step_name` and `dependency`.

```{r modify_WF, eval=FALSE}
replaceStep(sal, step=3) <- LineWise(
    code = {
        ...
    },
    step_name = "my_step_name",
    dependency = "my_dependency")
```

Since `step_names` need to be unique, one should avoid using the same `step_name` as before. If the previous name is used, a default name will be assigned. Rerunning the assignment will then allow reuse of the previous name. This behavior is enforced for version tracking. Subsequently, one can view and check the code changes with `codeLine()`, and then rerun the corresponding step (here 3) as follows:

```{r run_WF_steps, eval=FALSE}
codeLine(stepsWF(sal)$my_step_name)
runWF(sal, steps=3)
```

As mentioned above, any step in a workflow can only be run in isolation if its expected input exists (see `dependency`).
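To make this pattern more concrete, the following hypothetical sketch rewrites the `align_stats` step created in the stepwise example above, here assumed to be step number 2 of `sal`. It reuses the same statistics code but writes the results to a new file; the step number, the new `step_name` and the output file name are assumptions that need to be adjusted to the user's workflow.

```{r modify_WF_example, eval=FALSE}
## Hypothetical sketch: update the 'align_stats' step defined earlier, here
## assumed to be step 2. Step number and all names must be adjusted to the
## user's workflow.
replaceStep(sal, step=2) <- LineWise(
    code = {
        fqpaths <- getColumn(sal, step = "hisat2_mapping", "targetsWF", column = "FileName1")
        bampaths <- getColumn(sal, step = "hisat2_mapping", "outfiles", column = "samtools_sort_bam")
        read_statsDF <- alignStats(args = bampaths, fqpaths = fqpaths, pairEnd = TRUE)
        ## Same statistics as before, now written to a new output file
        write.table(read_statsDF, "results/alignStats_v2.xls", row.names=FALSE, quote=FALSE, sep="\t")
    },
    step_name = "align_stats_v2",
    dependency = "hisat2_mapping")
codeLine(sal, "align_stats_v2") # inspect the updated code
runWF(sal, steps=2)             # rerun only the modified step
```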
### Adding steps

New steps can be added to the Rmd file of a workflow by inserting new R Markdown code chunks containing the usual `appendStep<-` syntax, and then creating with `importWF` a new `SYSargsList` instance that contains the new step(s). To add steps to a pre-populated `SYSargsList` object directly, one can use the `after` argument of the `appendStep<-` function. The following example will add a new step after position 3 to the corresponding `sal` object. This can be useful if a workflow with a long runtime has already been completed and one only wants to make some refinements without re-running the entire workflow.

```{r adding_WF_steps, eval=FALSE}
appendStep(sal, after=3) <- ...
```

## Workflow templates

Workflow templates are provided via the affiliated Bioconductor data package named `systemPipeRdata`, as well as by a dedicated GitHub repository. Instances of these workflows can be created with a single command. The following gives several examples.

### RNA-Seq WF template
[ [Tutorial 1](https://bit.ly/3fMB5hm) ] [ [Tutorial 2](https://systempipe.org/SPrnaseq/index.html) ]
Load the RNA-Seq sample workflow into your current working directory.

```{r genRna_workflow_single, eval=FALSE}
library(systemPipeRdata)
genWorkenvir(workflow="rnaseq")
setwd("rnaseq")
```

#### Create the workflow

This template includes the most common analysis steps of `RNAseq` workflows. One can add, remove, or modify workflow steps by operating on the `sal` object.

```{r project_rnaseq, eval=FALSE}
sal <- SPRproject()
sal <- importWF(sal, file_path = "systemPipeRNAseq.Rmd", verbose = FALSE)
```

**Workflow steps:**

1. Read preprocessing
    + Quality filtering (trimming)
    + FASTQ quality report
2. Alignments: _`HISAT2`_ (or any other RNA-Seq aligner)
3. Alignment stats
4. Read counting
5. Sample-wise correlation analysis
6. Analysis of differentially expressed genes (DEGs)
7. GO term enrichment analysis
8. Gene-wise clustering

#### Run workflow

```{r run_rnaseq, eval=FALSE}
sal <- runWF(sal)
```

Workflow visualization

```{r plot_rnaseq, eval=FALSE}
plotWF(sal)
```

Report generation

```{r report_rnaseq, eval=FALSE}
sal <- renderReport(sal)
sal <- renderLogs(sal)
```

### ChIP-Seq WF template
[ [Tutorial 1](https://bit.ly/3oQ14Mf) ] [ [Tutorial 2](https://systempipe.org/SPchipseq/index.html) ]
Load the ChIP-Seq sample workflow into your current working directory.

```{r genChip_workflow_single, eval=FALSE}
library(systemPipeRdata)
genWorkenvir(workflow="chipseq")
setwd("chipseq")
```

**Workflow steps:**

1. Read preprocessing
    + Quality filtering (trimming)
    + FASTQ quality report
2. Alignments: _`Bowtie2`_ or _`rsubread`_
3. Alignment stats
4. Peak calling: _`MACS2`_
5. Peak annotation with genomic context
6. Differential binding analysis
7. GO term enrichment analysis
8. Motif analysis

#### Create the workflow

This template provides some common steps for a `ChIPseq` workflow. One can add, remove, or modify workflow steps by operating on the `sal` object.

```{r project_chipseq, eval=FALSE}
sal <- SPRproject()
sal <- importWF(sal, file_path = "systemPipeChIPseq.Rmd", verbose = FALSE)
```

#### Run workflow

```{r run_chipseq, eval=FALSE}
sal <- runWF(sal)
```

Workflow visualization

```{r plot_chipseq, eval=FALSE}
plotWF(sal)
```

Report generation

```{r report_chipseq, eval=FALSE}
sal <- renderReport(sal)
sal <- renderLogs(sal)
```

### VAR-Seq WF template
[ [Tutorial](https://systempipe.org/SPvarseq/index.html) ]
Load the VAR-Seq sample workflow into your current working directory.

```{r genVar_workflow_single, eval=FALSE}
library(systemPipeRdata)
genWorkenvir(workflow="varseq")
setwd("varseq")
```

**Workflow steps:**

1. Read preprocessing
    + Quality filtering (trimming)
    + FASTQ quality report
2. Alignments: _`gsnap`_, _`bwa`_
3. Variant calling: _`VariantTools`_, _`GATK`_, _`BCFtools`_
4. Variant filtering: _`VariantTools`_ and _`VariantAnnotation`_
5. Variant annotation: _`VariantAnnotation`_
6. Combine results from many samples
7. Summary statistics of samples

#### Create the workflow

This template provides some common steps for a `VARseq` workflow. One can add, remove, or modify workflow steps by operating on the `sal` object.

```{r project_varseq, eval=FALSE}
sal <- SPRproject()
sal <- importWF(sal, file_path = "systemPipeVARseq.Rmd", verbose = FALSE)
```

#### Run workflow

```{r run_varseq, eval=FALSE}
sal <- runWF(sal)
```

Workflow visualization

```{r plot_varseq, eval=FALSE}
plotWF(sal)
```

Report generation

```{r report_varseq, eval=FALSE}
sal <- renderReport(sal)
sal <- renderLogs(sal)
```

### Ribo-Seq WF template
[ [Tutorial](https://systempipe.org/SPriboseq/index.html) ]
Load the Ribo-Seq sample workflow into your current working directory.

```{r genRibo_workflow_single, eval=FALSE}
library(systemPipeRdata)
genWorkenvir(workflow="riboseq")
setwd("riboseq")
```

**Workflow steps:**

1. Read preprocessing
    + Adaptor trimming and quality filtering
    + FASTQ quality report
2. Alignments: _`HISAT2`_ (or any other RNA-Seq aligner)
3. Alignment stats
4. Compute read distribution across genomic features
5. Adding custom features to workflow (e.g. uORFs)
6. Genomic read coverage along transcripts
7. Read counting
8. Sample-wise correlation analysis
9. Analysis of differentially expressed genes (DEGs)
10. GO term enrichment analysis
11. Gene-wise clustering
12. Differential ribosome binding (translational efficiency)

#### Create the workflow

This template provides some common steps for a `RIBOseq` workflow. One can add, remove, or modify workflow steps by operating on the `sal` object.

```{r project_riboseq, eval=FALSE}
sal <- SPRproject()
sal <- importWF(sal, file_path = "systemPipeRIBOseq.Rmd", verbose = FALSE)
```

#### Run workflow

```{r run_riboseq, eval=FALSE}
sal <- runWF(sal)
```

Workflow visualization

```{r plot_riboseq, eval=FALSE}
plotWF(sal, rstudio = TRUE)
```

Report generation

```{r report_riboseq, eval=FALSE}
sal <- renderReport(sal)
sal <- renderLogs(sal)
```

```{r quiting, eval=TRUE, echo=FALSE}
knitr::opts_knit$set(root.dir = '../')
unlink("rnaseq", recursive = TRUE)
```

## Funding

This project is funded by NSF award [ABI-1661152](https://www.nsf.gov/awardsearch/showAward?AWD_ID=1661152).

## Version information

**Note:** the most recent version of this tutorial can be found here.

```{r session_info, eval=TRUE}
sessionInfo()
```

## References