---
title: "systemPipeR: Workflow Design and Reporting Environment"
author: "Author: Daniela Cassol, Le Zhang and Thomas Girke"
date: "Last update: `r format(Sys.time(), '%d %B, %Y')`"
output:
BiocStyle::html_document:
toc_float: true
code_folding: show
package: systemPipeR
vignette: |
%\VignetteEncoding{UTF-8}
%\VignetteIndexEntry{systemPipeR: Workflow design and reporting generation environment}
%\VignetteEngine{knitr::rmarkdown}
fontsize: 14pt
bibliography: bibtex.bib
editor_options:
chunk_output_type: console
weight: 7
type: docs
---
```{css, echo=FALSE}
pre code {
white-space: pre !important;
overflow-x: scroll !important;
word-break: keep-all !important;
word-wrap: initial !important;
}
```
```{r setup_dir, echo=FALSE, include=FALSE, message=FALSE, warning=FALSE}
unlink("rnaseq", recursive = TRUE)
systemPipeRdata::genWorkenvir(workflow = "rnaseq")
knitr::opts_knit$set(root.dir = 'rnaseq')
```
```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
options(width=80, max.print=1000)
knitr::opts_chunk$set(
eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")),
tidy.opts=list(width.cutoff=80), tidy=TRUE)
```
```{r setup, echo=FALSE, message=FALSE, warning=FALSE}
suppressPackageStartupMessages({
library(systemPipeR)
library(BiocParallel)
library(Biostrings)
library(Rsamtools)
library(GenomicRanges)
library(ggplot2)
library(GenomicAlignments)
library(ShortRead)
library(ape)
library(batchtools)
library(magrittr)
})
```
Source code download:
[ [.Rmd](https://raw.githubusercontent.com/tgirke/GEN242//main/content/en/tutorials/systempiper/systemPipeR.Rmd) ]
[ [.R](https://raw.githubusercontent.com/tgirke/GEN242//main/content/en/tutorials/systempiper/systemPipeR.R) ]
### Introduction
[_`systemPipeR`_](http://www.bioconductor.org/packages/devel/bioc/html/systemPipeR.html)
provides flexible utilities for designing, building and running automated end-to-end
analysis workflows for a wide range of research applications, including
next-generation sequencing (NGS) experiments, such as [RNA-Seq](https://bit.ly/3CbJkOE), [ChIP-Seq](https://bit.ly/3CfsoXi),
[VAR-Seq](https://bit.ly/45N1GCY) and [Ribo-Seq](https://bit.ly/3IUptHm) [@H_Backman2016-bt]. Important features include a uniform
workflow interface across different data analysis applications, automated
report generation, and support for running both R and command-line software,
such as NGS aligners or peak/variant callers, on local computers or compute
clusters (Figure 1). The latter supports interactive job submissions and batch
submissions to queuing systems of clusters.
_`systemPipeR`_ has been designed to improve the reproducibility of large-scale data analysis
projects while substantially reducing the time it takes to analyze complex omics
data sets. It provides a uniform workflow interface and management system that allows
the users to run selected workflow steps, as well as customize and design entirely new workflows.
Additionally, the package take advantage of central community S4 classes of the Bioconductor
ecosystem, and enhances them with command-line software support.
The main motivation and advantages of using _`systemPipeR`_ for complex data analysis tasks are:
1. Design of complex workflows involving multiple R/Bioconductor packages
2. Common workflow interface for different applications
3. User-friendly access to widely used Bioconductor utilities
4. Support of command-line software from within R
5. Reduced complexity of using compute clusters from R
6. Accelerated runtime of workflows via parallelization on computer systems with multiple CPU cores and/or multiple nodes
6. Improved reproducibility by automating the generation of analysis reports
**Figure 1:** Relevant features in _`systemPipeR`_.
Workflow design concepts are illustrated under (A). Examples of `systemPipeR's` visualization functionalities are given under (B).
A central concept for designing workflows within the _`systemPipeR`_
environment is the use of workflow management containers. Workflow management
containers facilitate the design and execution of complex data
analysis steps. For its command-line interface _`systemPipeR`_ adopts the
widely used [Common Workflow Language](https://www.commonwl.org/) (CWL)
[@Amstutz2016-ka]. The interface to CWL is established by _`systemPipeR's`_
workflow control class called _`SYSargsList`_ (see Figure 2). This
design offers many advantages such as: (i) options to run workflows either
entirely from within R, from various command-line wrappers (e.g., *cwl-runner*)
or from other languages (*e.g.*, Bash or Python). Apart from providing
support for both command-line and R/Bioconductor software, the package provides
utilities for containerization, parallel evaluations on computer clusters and
automated generation of interactive analysis reports.
**Figure 2:** Overview of `systemPipeR` workflows management instances. (A) A
typical analysis workflow requires multiple software tools (red), as well the
description of the input (green) and output files, including analysis reports
(purple). (B) `systemPipeR` provides multiple utilities to design and
build a workflow, allowing multi-instance, integration of R code and command-line
software, a simple and efficient annotation system, that allows automatic control
of the input and output data, and multiple resources to manage the entire workflow.
(C) Options are provided to execute single or multiple workflow steps, while
enabling scalability, checkpoints, and generation of technical and scientific reports.
An important feature of _`systemPipeR's`_ CWL interface is that it provides two
options to run command-line tools and workflows based on CWL. First, one can
run CWL in its native way via an R-based wrapper utility for *cwl-runner* or
*cwl-tools* (CWL-based approach). Second, one can run workflows using CWL's
command-line and workflow instructions from within R (R-based approach). In the
latter case the same CWL workflow definition files (*e.g.* `*.cwl` and `*.yml`)
are used but rendered and executed entirely with R functions defined by
_`systemPipeR`_, and thus use CWL mainly as a command-line and workflow
definition format rather than execution software to run workflows. In this regard
_`systemPipeR`_ also provides several convenience functions that are useful for
designing and debugging workflows, such as a command-line rendering function to
retrieve the exact command-line strings for each data set and processing step
prior to running a command-line.
### Workflow Management with _`SYSargsList`_
The `SYSargsList` S4 class is a list-like container that stores the paths to
all input and output files along with the corresponding parameters used in each
analysis step (see Figure 3). `SYSargsList` instances are constructed from an
optional targets files, and two CWL parameter files including `*.cwl` and
`*.yml` (for details, see below). When running preconfigured NGS workflows, the
only input the user needs to provide is the initial targets file containing the
paths to the input files (e.g., FASTQ) and experiment design information, such
as sample labels and biological replicates. Subsequent targets instances are
created automatically, based on the connectivity establish between each
workflow step. _`SYSargsList`_ containers store all information required for
one or multiple steps. This establishes central control for running, monitoring
and debugging complex workflows from start to finish.
**Figure 3:** Workflow steps with input/output file operations are controlled by
the _`SYSargsList`_ container. Each of its components (_`SYSargs2`_) are constructed
from an optional *targets* and two *param* files. Alternatively, _`LineWise`_ instances
containing pure R code can be used.
## Getting Started
### Installation
The R software for running
[_`systemPipeR`_](http://www.bioconductor.org/packages/devel/bioc/html/systemPipeR.html)
can be downloaded from [_CRAN_](http://cran.at.r-project.org/). The
`systemPipeR` environment can be installed from the R console using the
[_`BiocManager::install`_](https://cran.r-project.org/web/packages/BiocManager/index.html)
command. The associated data package
[_`systemPipeRdata`_](http://www.bioconductor.org/packages/devel/data/experiment/html/systemPipeRdata.html)
can be installed the same way. The latter is a helper package for generating
`systemPipeR` workflow environments with a single command containing all
parameter files and sample data required to quickly test and run workflows.
```{r install, eval=FALSE}
if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager")
BiocManager::install("systemPipeR")
BiocManager::install("systemPipeRdata")
```
To use command-line software, the corresponding tool and dependencies need to
be installed on a user's system. See [details](#third-party-software-tools).
### Loading package and documentation
```{r documentation, eval=FALSE}
library("systemPipeR") # Loads the package
library(help="systemPipeR") # Lists package info
vignette("systemPipeR") # Opens vignette
```
### Load sample data and workflow templates
The mini sample FASTQ files used by this introduction as well as the
associated workflow reporting vignettes can be loaded via the
_`systemPipeRdata`_ package as shown below. The chosen data set
[`SRP010938`](http://www.ncbi.nlm.nih.gov/sra/?term=SRP010938) contains 18
paired-end (PE) read sets from _Arabidposis thaliana_ [@Howard2013-fq]. To
minimize processing time during testing, each FASTQ file has been subsetted to
90,000-100,000 randomly sampled PE reads that map to the first 100,000
nucleotides of each chromosome of the _A. thalina_ genome. The corresponding
reference genome sequence (FASTA) and its GFF annotation files (provided in the
same download) have been truncated accordingly. This way the entire test sample
data set requires less than 200MB disk storage space. A PE read set has been
chosen for this test data set for flexibility, because it can be used for
testing both types of analysis routines requiring either SE (single-end) reads
or PE reads.
The following generates a fully populated _`systemPipeR`_ workflow environment
(here for RNA-Seq) in the current working directory of an R session. The `systemPipeRdata`
package provides preconfigured workflow templates for RNA-Seq, ChIP-Seq, VAR-Seq, and Ribo-Seq.
Additional, templates are available on the project web site [here](https://systempipe.org/).
```{r generate_workenvir, eval=FALSE}
systemPipeRdata::genWorkenvir(workflow="rnaseq")
setwd("rnaseq")
```
### Project structure
The working environment of the sample data loaded in the previous step contains
the following pre-configured directory structure (Figure 4). Directory names are indicated
in ***green***. Users can change this
structure as needed, but need to adjust the code in their workflows
accordingly.
* _**workflow/**_ (*e.g.* *rnaseq/*)
+ This is the root directory of the R session running the workflow.
+ Run script ( *\*.Rmd*) and sample annotation (*targets.txt*) files are located here.
+ Note, this directory can have any name (*e.g.* _**rnaseq**_, _**varseq**_). Changing its name does not require any modifications in the run script(s).
+ **Important subdirectories**:
+ _**param/**_
+ _**param/cwl/**_: This subdirectory stores all the CWL parameter files. To organize workflows, each can have its own subdirectory, where all `CWL param` and `input.yml` files need to be in the same subdirectory.
+ _**data/**_
+ FASTQ files
+ FASTA file of reference (*e.g.* reference genome)
+ Annotation files
+ etc.
+ _**results/**_
+ Analysis results are usually written to this directory, including: alignment, variant and peak files (BAM, VCF, BED); tabular result files; and image/plot files.
+ Note, the user has the option to organize results files for a given sample and analysis step in a separate subdirectory.
**Figure 4:** *systemPipeR's* preconfigured directory structure.
The following parameter files are included in each workflow template:
1. *`targets.txt`*: initial one provided by user; downstream *`targets_*.txt`* files are generated automatically
2. *`*.param/cwl`*: defines parameter for input/output file operations, *e.g.*:
+ *`hisat2/hisat2-mapping-se.cwl`*
+ *`hisat2/hisat2-mapping-se.yml`*
4. Configuration files for computer cluster environments (skip on single machines):
+ *`.batchtools.conf.R`*: defines the type of scheduler for *`batchtools`* pointing to template file of cluster, and located in user's home directory
+ *`*.tmpl`*: specifies parameters of scheduler used by a system, *e.g.* Torque, SGE, Slurm, etc.
### Structure of _`targets`_ file
The _`targets`_ file defines all input files (_e.g._ FASTQ, BAM, BCF) and
sample comparisons of an analysis workflow. The following shows the format of a
sample _`targets`_ file included in the package. It also can be viewed and
downloaded from _`systemPipeR`_'s GitHub repository
[here](https://github.com/tgirke/systemPipeR/blob/master/inst/extdata/targets.txt).
In a target file with a single type of input files, here FASTQ files of
single-end (SE) reads, the first three columns are mandatory including their
column names, while it is four mandatory columns for FASTQ files of PE reads.
All subsequent columns are optional and any number of additional columns can be
added as needed. The columns in targets files are expected to be tab separated (TSV format).
The `SampleName` column contains usually short labels for
referencing samples (here FASTQ files) across many workflow steps (_e.g._
plots and column titles). Importantly, the labels used in the `SampleName`
column need to be unique, while technical or biological replicates are
indicated by duplicated values under the `Factor` column. For readability
and transparency, it is useful to use here a short, consistent and informative
syntax for naming samples and replicates. To avoid problems with other
packages or external software, it is recommended to use the basic naming rules
for R objects and their components as outlined [here](https://girke.bioinformatics.ucr.edu/GEN242/tutorials/rbasics/rbasics/#data-objects).
This is important since the values used under the `SampleName` and `Factor`
columns are intended to be used as labels for naming columns or plotting features
in downstream analysis steps.
Users should note here, the usage of targets files is optional when using
*systemPipeR's* new CWL interface. They can be replaced by a standard YAML
input file used by CWL. Since for organizing experimental variables targets
files are extremely useful and user-friendly. Thus, we encourage users to keep using
them.
#### Structure of _`targets`_ file for single-end (SE) samples
```{r targetsSE, eval=TRUE, message=FALSE}
library(systemPipeR)
targetspath <- system.file("extdata", "targets.txt", package = "systemPipeR")
showDF(read.delim(targetspath, comment.char = "#"))
```
To work with custom data, users need to generate a _`targets`_ file containing
the paths to their own FASTQ files and then provide under _`targetspath`_ the
path to the corresponding _`targets`_ file.
#### Structure of _`targets`_ file for paired-end (PE) samples
For paired-end (PE) samples, the structure of the targets file is similar, where
users need to provide two FASTQ path columns: *`FileName1`* and *`FileName2`*
with the paths to the PE FASTQ files.
```{r targetsPE, eval=TRUE}
targetspath <- system.file("extdata", "targetsPE.txt", package = "systemPipeR")
showDF(read.delim(targetspath, comment.char = "#"))
```
#### Sample comparisons
Sample comparisons are defined in the header lines of the _`targets`_ file
starting with '``# ``'.
```{r comment_lines, echo=TRUE}
readLines(targetspath)[1:4]
```
The function _`readComp`_ imports the comparison information and stores it in a
_`list`_. Alternatively, _`readComp`_ can obtain the comparison information from
the corresponding _`SYSargsList`_ object (see below). Note, these header lines are
optional. They are mainly useful for controlling comparative analyses according
to certain biological expectations, such as identifying differentially expressed
genes in RNA-Seq experiments based on simple pair-wise comparisons.
```{r targetscomp, eval=TRUE}
readComp(file = targetspath, format = "vector", delim = "-")
```
## Project initialization
To create a workflow within _`systemPipeR`_, we can start by defining an empty
container and checking the directory structure:
```{r SPRproject1, eval=TRUE}
sal <- SPRproject()
```
Internally, `SPRproject` function will create a hidden directory called `.SPRproject`,
by default. This directory will store all log files generated during a workflow
run.
Initially, the object `sal` is a empty container containing only the basic project
information. The project information can be accessed with the `projectInfo` method.
```{r projectInfo, eval=TRUE}
sal
projectInfo(sal)
```
Also, the `length` function will return how many steps a workflow contains. Since
no steps have been added yet it returns zero.
```{r length, eval=TRUE}
length(sal)
```
## Building workflows
_`systemPipeR`_ workflows can be populated with a single command from an R
Markdown file or stepwise in interactive mode. This section introduces first
how to build a workflow stepwise in interactive mode, and then how to
build a workflow with a single command from an R Markdown file.
### Stepwise workflow construction
New workflows are constructed, or existing ones modified, by connecting each
step via the `appendStep` method. Each step in a `SYSargsList` instance contains the
instructions needed for processing a set of input files with a specific
command-line and the paths to the exptected outfiles. For constructing R code-based
workflow steps the constructor function `Linewise` is used.
The following demonstrates how to create a command-line workflow step here using as
example the short read aligner software HISAT2 [@Kim2015-ve].
The constructor function renders the proper command-line strings for each sample
and software tool, appending a new step in the `SYSargsList` object defined in the
previous step. For this, the `SYSargsList` constructor function uses in this example
data from three input files:
- CWL command-line specification file (`wf_file` argument)
- Input variables (`input_file` argument)
- Targets file (`targets` argument)
In CWL files with the extension *`.cwl`* define the parameters of a chosen
command-line step or an entire workflow, while files with the extension *`.yml`* define
the input variables of command-line steps. Note, input variables provided
by a *targets* or *targetsPE* file can be passed on via the *inputvars* argument to
the *`.yml`* file and from there to the _`SYSargsList`_ object.
```{r hisat2_mapping, eval=TRUE}
appendStep(sal) <- SYSargsList(
step_name = "hisat2_mapping",
dir = TRUE,
targets = "targetsPE.txt",
wf_file = "workflow-hisat2/workflow_hisat2-pe.cwl",
input_file = "workflow-hisat2/workflow_hisat2-pe.yml",
dir_path = "param/cwl",
inputvars = c(FileName1 = "_FASTQ_PATH1_",
FileName2 = "_FASTQ_PATH2_",
SampleName = "_SampleName_")
)
```
An overview of the workflow can be printed as follows.
```{r}
sal
```
In the above the workflow status for the new step is *Pending*. This means the
workflow object has been constructed but not executed yet.
Several accessor methods are available to explore the _`SYSargsList`_ object.
Of particular interest is the *`cmdlist()`* method. It constructs the system
commands for running command-line software as specified by a given *`.cwl`*
file combined with the paths to the input samples (*e.g.* FASTQ files), here provided
by a *`targets`* file. The example below shows the *`cmdlist()`* output for
running HISAT2 on the first PE read sample. Evaluating the output of
*`cmdlist()`* can be very helpful for designing and debugging *`.cwl`* files
of new command-line software or changing the parameter settings of existing
ones. The rendered command-line instance for each input sample can be returned as follows.
```{r}
cmdlist(sal, step = "hisat2_mapping", targets = 1)
```
The outfiles components of _`SYSargsList`_ define the expected output files for
each step in the workflow; some of which are the input for the next workflow step
(see Figure 3).
```{r output_WF, eval=TRUE}
outfiles(sal)
```
In an 'R-centric' rather than a 'CWL-centric' workflow design the connectivity
among workflow steps is established by creating the downstream targets instances
automatically (see Figure 3). Each step uses the outfiles from the previous step as
input, thereby establishing connectivity among each step in the workflow. By chaining
several _`SYSargsList`_ steps together one can construct complex workflows involving
many sample-level input/output file operations with any combination of command-line or R-based
software. Also, _`systemPipeR`_ provides features to automatically build these
connections. This provides additional security to make sure that all samples will be
processed in a reproducible manner.
Alternatively, a CWL-centric workflow design can be used that defines
all or most workflow steps with CWL workflow and parameter files. Due to time and
space restrictions, the CWL-centric approach is not covered by this tutorial.
The following R code step generates a `data.frame` containing important read alignment statistics
such as the total number of reads in the FASTQ files, the number of total alignments,
as well as the number of primary alignments in the corresponding BAM files.
The constructor function `LineWise` requires the `step_name` and the R code
assigned to the `code` argument. The R code should be enclosed by braces (`{}`).
```{r align_stats, eval=TRUE}
appendStep(sal) <- LineWise(
code = {
fqpaths <- getColumn(sal, step = "hisat2_mapping", "targetsWF", column = "FileName1")
bampaths <- getColumn(sal, step = "hisat2_mapping", "outfiles", column = "samtools_sort_bam")
read_statsDF <- alignStats(args = bampaths, fqpaths = fqpaths, pairEnd = TRUE)
write.table(read_statsDF, "results/alignStats.xls", row.names=FALSE, quote=FALSE, sep="\t")
},
step_name = "align_stats",
dependency = "hisat2_mapping")
```
To inspect the R code of this step, the `codeLine` method can be used.
```{r codeLine, eval=TRUE}
codeLine(sal, "align_stats")
```
The `getColumn` method allows to extract certain information of a workflow instance, here
output files that can be accessed from within R as demonstrated below.
```{r getColumn, eval=TRUE}
getColumn(sal, step = "hisat2_mapping", "outfiles", column = "samtools_sort_bam")
```
```{r cleaning, eval=TRUE, include=FALSE}
if (file.exists(".SPRproject")) unlink(".SPRproject", recursive = TRUE)
## NOTE: Removing previous project create in the quick starts section
```
### Loading workflows from an R Markdown
The `importWF` function allows to load an entire workflow from an R Markdown (Rmd) file
into an `SYSargsList` object that has been intialized with `SPRproject()` as introduced
above. Next, one can run the workflow from start to finish with a single function call
using `runWF`. Below the latter has been commented out to first introduce additional
details prior to executing the workflow with `runWF`, which will be discussed in the
following section.
```{r importWF_rmd, eval=TRUE}
sal <- SPRproject()
sal <- importWF(sal, file_path = "systemPipeRNAseq.Rmd")
# sal <- runWF(sal)
```
Several accessor functions are available to inspect the workflow.
```{r importWF_details, eval=FALSE}
stepsWF(sal)
dependency(sal)
codeLine(sal)
targetsWF(sal)
```
#### Third-party software tools {#third-party-software-tools}
Examples of preconfigured CWL templates for third-party software
tools are provided in the following table. In addition, the tutorial entitled
[*Automate Creation of CWL
Instructions*](https://girke.bioinformatics.ucr.edu/GEN242/tutorials/cmdtocwl/cmdtocwl/)
explains how to create these CWL templates for essentially any command-line
tool simply by providing the command-line string for a new software as input.
```{r table_tools, echo=FALSE, message=T}
library(magrittr)
SPR_software <- system.file("extdata", "SPR_software.csv", package="systemPipeR")
software <- read.delim(SPR_software, sep=",", comment.char = "#")
colors <- colorRampPalette((c("darkseagreen", "indianred1")))(length(unique(software$Category)))
id <- as.numeric(c((unique(software$Category))))
software %>%
dplyr::mutate(Step = kableExtra::cell_spec(Step, color = "white", bold = TRUE,
background = factor(Category, id, colors))) %>%
dplyr::select(Tool, Description, Step) %>%
dplyr::arrange(Tool) %>%
kableExtra::kable(escape = FALSE, align = "c", col.names = c("Tool Name", "Description", "Step")) %>%
kableExtra::kable_styling(c("striped", "hover", "condensed"), full_width = TRUE) %>%
kableExtra::scroll_box(width = "100%", height = "500px")
```
Importantly, command-line software needs to be installed on a user's system
and available in a user's `PATH`. To check whether this is the case, one can use the
`tryCL` function.
```{r test_tool_path, eval=FALSE}
tryCL(command="grep")
```
## How to run workflows?
### Setup and requirements
To execute the code in this tutorial, one needs the following software installed.
* R (version >=4.1.2)
* systemPipeR package (version >=2.0.8)
* Hisat2 (version >= 2.1.0)
To build custom workflows with any additional command-line software, one needs
the respective software installed and available in the user PATH. To test this,
one can use the `tryCL` function.
```{r test_software, eval=FALSE}
tryCL(command = "hisat2") ## "All set up, proceed!"
```
### Run from R
For running a workflow, the `runWF` function can be used. It executes all workflow steps
stored in a `SYSargsList` container (below named `sal`). This assumes the `sal` object
has been initialized and populated as outlined above using `SPRproject()` and
`importWF(sal, file_path="...")`, respectively.
```{r runWF, eval=FALSE}
sal <- runWF(sal)
```
This essential function allows the user to choose one or multiple steps to be
executed using its `steps` argument. However, it is necessary to maintain valid
dependencies (also see [workflow graph](https://girke.bioinformatics.ucr.edu/GEN242/tutorials/systempiper/systempiper/#visualize-workflow)).
If a selected step depends on a previous step(s), the output of which may not
be available yet, then the execution will fail.
```{r runWF_error, eval=FALSE}
sal <- runWF(sal, steps = c(1,3))
```
The `runWF` function allows to force the execution of chosen workflow steps, even if
the status of a given step is `'Success'` and the expected `outfiles` exist already.
This is useful for updating certain steps when needed. Another option is to
force `runWF` to ignore all warnings and errors. This can be achieved by assigning
`FALSE` to the arguments `warning.stop` and `error.stop`, respectively.
```{r runWF_force, eval=FALSE}
sal <- runWF(sal, force = TRUE, warning.stop = FALSE, error.stop = FALSE)
```
While `SYSargsList` (sal) objects are autosaved when working with workflows, it can
be sometimes safer to explicity save the object before closing R.
```{r saveWF, eval=FALSE}
sal <- write_SYSargsList(sal)
```
#### Resume and reset workflows
To restart workflows, set `resume=TRUE`.
```{r resumeWF, eval=FALSE}
sal <- SPRproject(resume=TRUE)
```
To delete all steps in a workflow including the content saved under the `.SPRproject` directory,
use `overwrite=TRUE`. This option should be used with caution, since it effectively deletes most
previous workflow data.
```{r overwriteWF, eval=FALSE}
sal <- SPRproject(overwrite=TRUE)
```
For additional options and details, users want to consult the help
documentation for: `?SPRproject`, `?runWF` and `?'SYSargsList-class'`.
#### Parallelization on clusters
The computation of time-consuming steps can be greatly accelerated by
processing many files in parallel using several compute nodes of a cluster,
where a scheduling system is used for load balancing.
The following `resources` list provides to a `SYSargsList` (`sal`) object the
number of independent parallel cluster processes defined under the `Njobs`
element in the list. The given example will run 18 processes in parallel using
for each 4 CPU cores, thus utilizing a total of 72 CPU cores. Note, `runWF` can
be used with most queueing systems as it is based on utilities defined by the
`batchtools` package, which supports the use of template files (_`*.tmpl`_) for
defining the run parameters of different schedulers. To run the following code,
one needs to have both a `conffile` (see _`.batchtools.conf.R`_ samples
[here](https://mllg.github.io/batchtools/)) and a `template` file (see
_`*.tmpl`_ samples
[here](https://github.com/mllg/batchtools/tree/master/inst/templates)) for the
queueing system available on a system. The following example uses the sample
`conffile` and `template` files for the Slurm scheduler provided by this
package.
The `resources` list can be appended when a workflow step is generated. Alternatively,
one can append these resource specifications to any step of a pre-generated `SYSarsList`
with the `addResources` function.
```{r runWF_cluster, eval=FALSE}
resources <- list(conffile=".batchtools.conf.R",
template="batchtools.slurm.tmpl",
Njobs=18,
walltime=120, # in min
ntasks=1,
ncpus=4,
memory=1024, # in Mb
partition = "short"
)
sal <- addResources(sal, c("hisat2_mapping"), resources = resources)
sal <- runWF(sal)
```
The above example will submit via `runWF(sal)` the *hisat2_mapping* step
to a partition called `short` on a computer cluster. Users need to adjust this and
other parameters defined in the `resources` list to their environment.
### Run from Command-Line (without cluster)
Create an R script containing the following (or similar) minimum content.
```sh
#!/usr/bin/env Rscript
library(systemPipeR)
sal <- SPRproject()
sal <- importWF(sal, file_path = "systemPipeRNAseq.Rmd") # adjust name of Rmd file if different
sal <- runWF(sal) # runs entire workflow
sal <- renderReport(sal) # after workflow has completed render Rmd to HTML report
```
Assuming the script is named `wf_run_script.R` it can be executed from the command-line (not
R console!) as follows. In addition, one can make the script executable to run it like any other script.
```sh
Rscript wf_run_script.R
```
This will run `systemPipeR` workflows on a single machine. In this case a limited amount of
parallelization is possible if the corresponding code chunks in the workflow take advantage of
multi-core parallelization instructions provided by `BiocParallel`, `batchtools` or
related packages. However, this type of parallelization is usually limited to the
number of cores available on a single CPU. A much more scalable approach is the use
of computer clusters as described above and in the next section.
### Submit workflow from command-line to cluster
In addition to running workflows within interactive R sessions or submitting
them from the command-line on a single system (see above), one can execute
`systemPipeR` workflows from the command-line to an HPC cluster by including
the relevant workflow run instructions in an R script and then submitting it
via a submission script of a workload manager system to a computer cluster. The
following gives an example for the Slurm workload manager. To understand the
details, it is important to inspect the content of the two files (here .R and
.sh). Additional details about resource specification under Slurm are given
[below](https://girke.bioinformatics.ucr.edu/GEN242/tutorials/systempiper/systempiper/#parallelization-on-clusters).
- R script: [wf_run_script.R](https://raw.githubusercontent.com/tgirke/GEN242/main/static/custom/spWFtemplates/cl_sbatch_run/wf_run_script.R)
- Slurm submission script: [wf_run_script.sh](https://raw.githubusercontent.com/tgirke/GEN242/main/static/custom/spWFtemplates/cl_sbatch_run/wf_run_script.sh)
As a test, users can generate in their user account of a cluster a workflow
environment populated with the toy data as outlined
[here](https://girke.bioinformatics.ucr.edu/GEN242/tutorials/sprnaseq/sprnaseq/#workflow-environment).
After this, it is best to create within the workflow directory a subdirectory,
e.g. called `cl_sbatch_run`, and then save the above two files to this
subdirectory. Next, the parameters in both files need to be adjusted to match
the type of workflow and the required computing resources. This includes the
name of the `Rmd` file and scheduler resource settings such as: `partition`,
`Njobs`, `walltime`, `memory`, etc. After all relevant settings have been set
correctly, one can execute the workflow with `sbatch` from within the
`cl_sbatch_run` directory as follows (note this is a command-line call outside
of R):
```sh
sbatch wf_run_script.sh
```
After the submission to the cluster, one usually should check its status and
progress with `squeue -u ` (under Slurm) as well as by monitoring
the content of the `slurm-.out` file generated by the scheduler in the
same directory. This file contains most of the `STDOUT` and `STDERROR`
generated by a cluster job. Once everything is working on the toy data, users
can run the workflow on real data the same way.
### Visualize workflow
_`systemPipeR`_ workflows instances can be visualized with the `plotWF` function.
The resulting plot includes the following information.
- Workflow topology graph (dependency graphs between different steps)
- Workflow step status, *e.g.* `Success`, `Error`, `Pending`, `Warnings`
- Sample status and statistics
- Workflow timing: run duration time
If no argument is provided, the basic plot will automatically detect width,
height, layout, plot method, branches, _etc_.
```{r, eval=TRUE}
plotWF(sal)
```
### Technical reports
_`systemPipeR`_ compiles all workflow execution logs in one central location,
making it easy to check any standard output (`stdout`) or standard error
(`stderr`) for any command-line tool used in a workflow.
Also, the information is appended to the workflow plot making it easy to click on
respective steps.
```{r, eval=FALSE}
sal <- renderLogs(sal)
```
### Scientific reports
_`systemPipeR`_ auto-generates scientific analysis reports in HTML format. These reports
compile the results of all workflow steps including text, code, plots and tables.
```{r, eval=FALSE}
sal <- renderReport(sal)
```
Alternatively, scientific reports can be rendered with the `render` function from `rmarkdown`.
```{r, eval=FALSE}
rmarkdown::render("systemPipeRNAseq.Rmd", clean=TRUE, output_format="BiocStyle::html_document")
```
## How to modify workflows?
### Modifying steps
If needed one can modify existing workflow steps in a pre-populated
`SYSargsList` object, and potentially already executed workflows, with the
`replaceStep(sal) <-` replacement function. The following gives an example
where step number 3 in a `SYSargsList` (sal) object will be updated with
modified or new code. Note, this is a generalized example where the user needs
to insert the code lines between `{` and `}`, and also adjust the values assigned
to the arguments: `step_name` and `dependency`.
```{r modify_WF, eval=FALSE}
replaceStep(sal, step=3) <- LineWise(
code = {
...
},
step_name = "my_step_name",
dependency = "my_dependency")
```
Since `step_names` need to be unique, one should avoid using the same
`step_name` as before. If the previous name is used, a default name will be
assigned. Rerunning the assignment will then allow to assign the previous name. This
behavior is enforced for version tracking. Subsequently, one can view and check
the code changes with `codeLine()`, and then rerun the corresponding step (here
3) as follows:
```{r run_WF_steps, eval=FALSE}
codeLine(stepsWF(sal)$my_step_name)
runWF(sal, steps=3)
```
As mentioned above, any step in a workflow can only be run in isolation if its
expected input exists (see `dependency`).
### Adding steps
New steps can be added to the Rmd file of a workflow by inserting new R
Markdown code blocks (chunks) starting and ending with the usual `appendStep<-`
syntax and then creating a new `SYSargsList` instance with `importWF` that
contain the new step(s). To add steps to a pre-populated `SYSargsList` object,
one can use the `after` argument of the `appendStep<-` function. The following
example will add a new step after position 3 to the corresponding `sal` object.
This can be useful if a workflow with a long runtime has already been completed
and one only wants to make some refinements without re-running the entire
workflow.
```{r adding_WF_steps, eval=FALSE}
appendStep(sal, after=3) <- ...
```
## Workflow templates
Workflow templates are provided via the affiliated Bioconductor data package
named `systemPipeRdata`, as well as by a dedicated GitHub repository. Instances
of these workflows can be created with a single command. The following gives
several examples.
### RNA-Seq WF template
[ [Tutorial 1](https://bit.ly/3fMB5hm) ] [ [Tutorial 2](https://systempipe.org/SPrnaseq/index.html) ]
Load the RNA-Seq sample workflow into your current working directory.
```{r genRna_workflow_single, eval=FALSE}
library(systemPipeRdata)
genWorkenvir(workflow="rnaseq")
setwd("rnaseq")
```
#### Create the workflow
This template includes the most common analysis steps of `RNAseq` workflows. One can add, remove, modify
workflow steps by modifying the `sal` object.
```{r project_rnaseq, eval=FALSE}
sal <- SPRproject()
sal <- importWF(sal, file_path = "systemPipeRNAseq.Rmd", verbose = FALSE)
```
**Workflow steps:**
1. Read preprocessing
+ Quality filtering (trimming)
+ FASTQ quality report
2. Alignments: _`HISAT2`_ (or any other RNA-Seq aligner)
3. Alignment stats
4. Read counting
5. Sample-wise correlation analysis
6. Analysis of differentially expressed genes (DEGs)
7. GO term enrichment analysis
8. Gene-wise clustering
#### Run workflow
```{r run_rnaseq, eval=FALSE}
sal <- runWF(sal)
```
Workflow visualization
```{r plot_rnaseq, eval=FALSE}
plotWF(sal)
```
Report generation
```{r report_rnaseq, eval=FALSE}
sal <- renderReport(sal)
sal <- renderLogs(sal)
```
### ChIP-Seq WF template
[ [Tutorial 1](https://bit.ly/3oQ14Mf) ] [ [Tutorial 2](https://systempipe.org/SPchipseq/index.html) ]
Load the ChIP-Seq sample workflow into your current working directory.
```{r genChip_workflow_single, eval=FALSE}
library(systemPipeRdata)
genWorkenvir(workflow="chipseq")
setwd("chipseq")
```
**Workflow steps:**
1. Read preprocessing
+ Quality filtering (trimming)
+ FASTQ quality report
2. Alignments: _`Bowtie2`_ or _`rsubread`_
3. Alignment stats
4. Peak calling: _`MACS2`_
5. Peak annotation with genomic context
6. Differential binding analysis
7. GO term enrichment analysis
8. Motif analysis
#### Create the workflow
This template provides some common steps for a `ChIPseq` workflow. One can add, remove, modify
workflow steps by operating on the `sal` object.
```{r project_chipseq, eval=FALSE}
sal <- SPRproject()
sal <- importWF(sal, file_path = "systemPipeChIPseq.Rmd", verbose = FALSE)
```
#### Run workflow
```{r run_chipseq, eval=FALSE}
sal <- runWF(sal)
```
Workflow visualization
```{r plot_chipseq, eval=FALSE}
plotWF(sal)
```
Report generation
```{r report_chipseq, eval=FALSE}
sal <- renderReport(sal)
sal <- renderLogs(sal)
```
### VAR-Seq WF template
[ [Tutorial](https://systempipe.org/SPvarseq/index.html) ]
Load the VAR-Seq sample workflow into your current working directory.
```{r genVar_workflow_single, eval=FALSE}
library(systemPipeRdata)
genWorkenvir(workflow="varseq")
setwd("varseq")
```
**Workflow steps:**
1. Read preprocessing
+ Quality filtering (trimming)
+ FASTQ quality report
2. Alignments: _`gsnap`_, _`bwa`_
3. Variant calling: _`VariantTools`_, _`GATK`_, _`BCFtools`_
4. Variant filtering: _`VariantTools`_ and _`VariantAnnotation`_
5. Variant annotation: _`VariantAnnotation`_
6. Combine results from many samples
7. Summary statistics of samples
#### Create the workflow
This template provides some common steps for a `VARseq` workflow. One can add, remove, modify
workflow steps by operating on the `sal` object.
```{r project_varseq, eval=FALSE}
sal <- SPRproject()
sal <- importWF(sal, file_path = "systemPipeVARseq.Rmd", verbose = FALSE)
```
#### Run workflow
```{r run_varseq, eval=FALSE}
sal <- runWF(sal)
```
Workflow visualization
```{r plot_varseq, eval=FALSE}
plotWF(sal)
```
Report generation
```{r report_varseq, eval=FALSE}
sal <- renderReport(sal)
sal <- renderLogs(sal)
```
### Ribo-Seq WF template
[ [Tutorial](https://systempipe.org/SPriboseq/index.html) ]
Load the Ribo-Seq sample workflow into your current working directory.
```{r genRibo_workflow_single, eval=FALSE}
library(systemPipeRdata)
genWorkenvir(workflow="riboseq")
setwd("riboseq")
```
**Workflow steps:**
1. Read preprocessing
+ Adaptor trimming and quality filtering
+ FASTQ quality report
2. Alignments: _`HISAT2`_ (or any other RNA-Seq aligner)
3. Alignment stats
4. Compute read distribution across genomic features
5. Adding custom features to workflow (e.g. uORFs)
6. Genomic read coverage along transcripts
7. Read counting
8. Sample-wise correlation analysis
9. Analysis of differentially expressed genes (DEGs)
10. GO term enrichment analysis
11. Gene-wise clustering
12. Differential ribosome binding (translational efficiency)
#### Create the workflow
This template provides some common steps for a `RIBOseq` workflow. One can add, remove, modify
workflow steps by operating on the `sal` object.
```{r project_riboseq, eval=FALSE}
sal <- SPRproject()
sal <- importWF(sal, file_path = "systemPipeRIBOseq.Rmd", verbose = FALSE)
```
#### Run workflow
```{r run_riboseq, eval=FALSE}
sal <- runWF(sal)
```
Workflow visualization
```{r plot_riboseq, eval=FALSE}
plotWF(sal, rstudio = TRUE)
```
Report generation
```{r report_riboseq, eval=FALSE}
sal <- renderReport(sal)
sal <- renderLogs(sal)
```
```{r quiting, eval=TRUE, echo=FALSE}
knitr::opts_knit$set(root.dir = '../')
unlink("rnaseq", recursive = TRUE)
```
## Funding
This project is funded by NSF award [ABI-1661152](https://www.nsf.gov/awardsearch/showAward?AWD_ID=1661152).
## Version information
**Note:** the most recent version of this tutorial can be found here.
```{r session_info, eval=TRUE}
sessionInfo()
```
## References