---
title: "systemPipeR: Workflow Environment for Data Analysis and Report Generation"
author: "Daniela Cassol, Le Zhang and Thomas Girke"
date: last-modified
sidebar: tutorials
bibliography: bibtex.bib
---
```{r setup_dir}
#| echo: false
#| include: false
#| message: false
#| warning: false
unlink("rnaseq", recursive = TRUE)
systemPipeRdata::genWorkenvir(workflow = "rnaseq")
knitr::opts_knit$set(root.dir = 'rnaseq')
```

```{r setup}
#| echo: false
#| message: false
#| warning: false
suppressPackageStartupMessages({
    library(systemPipeR)
    library(BiocParallel)
    library(Biostrings)
    library(Rsamtools)
    library(GenomicRanges)
    library(ggplot2)
    library(GenomicAlignments)
    library(ShortRead)
    library(ape)
    library(batchtools)
    library(magrittr)
})
```

## Introduction

[_`systemPipeR`_](http://www.bioconductor.org/packages/devel/bioc/html/systemPipeR.html) is a Workflow Management System (WMS) for data analysis that integrates R with command-line (CL) software [@H_Backman2016-bt]. This platform allows scientists to analyze diverse data types on personal or distributed computer systems. It ensures a high level of reproducibility, scalability, and portability (Figure \@ref(fig:utilities)). Central to `systemPipeR` is a CL interface (CLI) that adopts the Common Workflow Language [CWL, @Crusoe2021-iq]. Using this CLI, users can select the optimal R or CL software for each analysis step. The platform supports end-to-end and partial execution of workflows, with built-in restart capabilities. A workflow control container class manages analysis tasks of varying complexity. Standardized processing routines for metadata facilitate the handling of large numbers of input samples and complex experimental designs. As a multipurpose workflow management toolkit, `systemPipeR` enables users to run existing workflows, customize them, or create entirely new ones while leveraging widely adopted data structures within the Bioconductor ecosystem. Another key aspect of `systemPipeR` is its ability to generate reproducible scientific analysis and technical reports. For result interpretation, it offers a range of graphics functionalities.
Additionally, an associated Shiny App provides various interactive features for result exploration, enhancing the user experience.

{width=70% fig-align="center"}

### Workflow control class

A central component of `systemPipeR` is `SYSargsList` (or short `SAL`), a container for workflow management. This S4 class stores all relevant information for running and monitoring each analysis step in workflows. It captures the connectivity between workflow steps, the paths to their input and output data, and pertinent parameter values used in each step (see Figure \@ref(fig:sysargslistImage)). Typically, `SAL` instances are constructed from an initial metadata targets table, R code and CWL parameter files for each R- and CL-based analysis step in workflows (details provided below). For preconfigured workflows, users only need to provide their input data (such as FASTQ files) and the corresponding metadata in a targets file. The latter describes the experimental design, defines sample labels, replicate information, and other relevant information.

{width=100% fig-align="center"}

Figure \@ref(fig:sysargslistImage) illustrates the design of the `systemPipeR` (SPR) WMS. (A) The root directory of an SPR project includes files and directories that contain the input data, metadata and parameters required for running a workflow. This project environment can be autogenerated with the functions given under (E). (B) The workflow instructions are loaded from the project environment into the workflow management class `SAL`. (C) Subsequently, the workflow can be executed and monitored. (D) After completion, or during a run, various reports can be generated, including scientific and technical reports, as well as interactive workflow graphs illustrating the workflow topology along with run and completion statistics. (E) The corresponding commands (1-4) for the initialization, execution and report generation of workflows are listed, which can be run with a single execution command.
Workflow steps and reporting instructions are specified in the Rmd file (A), which is the source file for generating the scientific report (D). Input data required for a workflow run are stored in the `data` directory, and output files generated by a workflow run are written to the `results` directory (A). The input/output relationships and dependencies between steps are automatically generated and managed by `SAL`. Status information is auto-saved to the `SPRproject` directory, allowing for workflow tracking and restarts.

### CL interface (CLI) {#cl-interface}

_`systemPipeR`_ adopts the [Common Workflow Language (CWL)](https://www.commonwl.org/index.html), which is a widely used community standard for describing CL tools and workflows in a declarative, generic, and reproducible manner [@Amstutz2016-ka]. CWL specifications are human-readable [YAML](https://www.commonwl.org/user_guide/topics/yaml-guide.html) files that are straightforward to create and modify. Integrating CWL in `systemPipeR` enhances the shareability, standardization, extensibility and portability of data analysis workflows. Following the CWL specifications, the basic description for executing a CL software tool is defined by two files: a `cwl` step definition file and a `yml` configuration file. Figure \@ref(fig:sprandCWL) illustrates the utility of the two files using "Hello World" as an example. The `cwl` file (A) defines the parameters of a CL tool or workflow (C), and the `yml` file (B) assigns the input variables to the corresponding parameters. For convenience, parameter values in `systemPipeR` can be provided by a targets file (D, see above) and automatically passed on to the corresponding parameters in the `yml` file. The use of a targets file greatly simplifies the operation of the system for users, because a tabular metadata file is intuitive to maintain and eliminates the need to modify the more complex `cwl` and `yml` files directly.
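To make the two-file layout concrete, the following is a minimal sketch of what such a pair might look like for an `echo`-based "Hello World" tool. The file names and field values here are illustrative only, not the exact files shipped with the package:

```yaml
# hello.cwl -- illustrative CL tool definition (CWL v1.2 syntax)
cwlVersion: v1.2
class: CommandLineTool
baseCommand: echo
inputs:
  message:
    type: string
    inputBinding:
      position: 1
outputs: []
```

```yaml
# hello.yml -- illustrative configuration assigning a value to the input
message: Hello World!
```

Keeping the tool definition (`cwl`) separate from the input values (`yml`) is what allows `systemPipeR` to reuse one tool definition across many samples by swapping in values from the targets file.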
The structure of `targets` files is explained in the corresponding section [below](https://girke.bioinformatics.ucr.edu/GEN242/tutorials/systempiper/systempiper/#targets-files). A detailed overview of the CWL syntax is provided in the [CWL syntax](https://girke.bioinformatics.ucr.edu/GEN242/tutorials/systempiper/systempiper/#cwl) section below, and the details for connecting the input information in `targets` with CWL parameters are described [here](https://girke.bioinformatics.ucr.edu/GEN242/tutorials/systempiper/systempiper/#cwl_targets).

{width=60% fig-align="center"}

### Workflow templates

`systemPipeRdata`, a companion package to `systemPipeR`, offers a collection of workflow templates that are ready to use. With a single command, users can easily load these templates onto their systems. Once loaded, users have the flexibility to utilize the templates as they are or modify them as needed. More in-depth information can be found in the main vignette of `systemPipeRdata`, which can be accessed [here](https://www.bioconductor.org/packages/devel/data/experiment/vignettes/systemPipeRdata/inst/doc/systemPipeRdata.html).

### Other functionalities

The package also provides several convenience functions that are useful for designing and testing workflows, such as a [CL rendering function](https://girke.bioinformatics.ucr.edu/GEN242/tutorials/systempiper/systempiper/#cmd-step) that assembles from the parameter files (`cwl`, `yml` and `targets`) the exact CL strings for each step prior to running a CL tool. [Auto-generation of CWL](https://girke.bioinformatics.ucr.edu/GEN242/tutorials/systempiper/systempiper/#cwl-auto-generation) parameter files is also supported. Here, users can simply provide the CL strings for a CL software of interest to a rendering function that generates the corresponding `*.cwl` and `*.yml` files for them.
Auto-conversion of workflows to executable [Bash scripts](https://girke.bioinformatics.ucr.edu/GEN242/tutorials/systempiper/systempiper/#bash-script) is also supported.

## Quick start

### Installation

The [_`systemPipeR`_](http://www.bioconductor.org/packages/devel/bioc/html/systemPipeR.html) package can be installed from the R console using the [_`BiocManager::install`_](https://cran.r-project.org/web/packages/BiocManager/index.html) command. The associated [_`systemPipeRdata`_](http://www.bioconductor.org/packages/devel/data/experiment/html/systemPipeRdata.html) package can be installed the same way. The latter is a data package for generating _`systemPipeR`_ workflow test instances with a single command. These instances contain all parameter files and sample data required to quickly test and run workflows.

```{r install}
#| eval: false
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("systemPipeR")
BiocManager::install("systemPipeRdata")
```

For a workflow to run successfully, all CL tools used by a workflow need to be installed and executable on the user's system where the analysis will be performed (details provided [below](https://girke.bioinformatics.ucr.edu/GEN242/tutorials/systempiper/systempiper/#third-party-software-tools)).

### Five minute tutorial {#five-min}

The following demonstrates how to initialize, run and monitor workflows, and subsequently create analysis reports.

__1. Create workflow environment.__ The chosen example uses the `genWorkenvir` function from the `systemPipeRdata` package to create an RNA-Seq workflow environment that is fully populated with a small test data set, including FASTQ files, reference genome and annotation data. After this, the user's R session needs to be directed into the resulting `rnaseq` directory (here with `setwd`).
A list of available workflow templates is provided in the vignette of the `systemPipeRdata` package [here](https://www.bioconductor.org/packages/devel/data/experiment/vignettes/systemPipeRdata/inst/doc/systemPipeRdata.html#wf-bioc-collection).

```{r eval=FALSE}
systemPipeRdata::genWorkenvir(workflow = "rnaseq")
setwd("rnaseq")
```

__2. Initialize project and import workflow from `Rmd` template.__ New workflow instances are created with the `SPRproject` function. When calling this function, a project directory with the default name `.SPRproject` is created within the workflow directory. Progress information and log files of a workflow run will be stored in this directory. After this, workflow steps can be loaded into `sal` one-by-one, or all at once with the `importWF` function. The latter reads all steps from a workflow Rmd file (here `systemPipeRNAseq.Rmd`) defining the analysis steps.

```{r eval=FALSE}
library(systemPipeR)
# Initialize workflow project
sal <- SPRproject()
## Creating directory '/home/myuser/systemPipeR/rnaseq/.SPRproject'
## Creating file '/home/myuser/systemPipeR/rnaseq/.SPRproject/SYSargsList.yml'
# Import into sal the WF steps defined by the chosen Rmd file
sal <- importWF(sal, file_path = "systemPipeRNAseq.Rmd")
## The following print statements, issued during the import, are shortened for brevity
## Import messages for first 3 of 20 steps total
## Parse chunk code
## Now importing step 'load_SPR'
## Now importing step 'preprocessing'
## Now importing step 'trimming'
## Now importing step '...'
## ...
## Now check if required CL tools are installed
## Messages for 4 of 7 CL tools total
##        step_name         tool in_path
## 1       trimming  trimmomatic    TRUE
## 2   hisat2_index hisat2-build    TRUE
## 3 hisat2_mapping       hisat2    TRUE
## 4 hisat2_mapping     samtools    TRUE
## ...
```

The `importWF` function also checks the availability of the R packages and CL software tools used by a workflow. All dependency CL software needs to be installed and exported to a user's `PATH`.
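Independently of `systemPipeR`, base R's `Sys.which` offers a quick sanity check of whether a given CL executable is visible on the `PATH`. The following is a minimal sketch; `tool_in_path` is a hypothetical helper, not a package function:

```r
# Hypothetical helper (not part of systemPipeR): returns TRUE if the
# executable can be resolved on the current PATH via base R's Sys.which().
tool_in_path <- function(tool) nzchar(Sys.which(tool))

# Returns TRUE only on systems where hisat2 is installed and on the PATH
tool_in_path("hisat2")
```

This mirrors in spirit what `tryCL` and the `in_path` column of the import check report, without requiring a `sal` object.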
In the given example, the CL tools `trimmomatic`, `hisat2-build`, `hisat2`, and `samtools` are listed. If the `in_path` column shows `FALSE` for any of them, then the missing CL software needs to be installed and made available in the user's `PATH` prior to running the workflow. Note, the shown availability table of CL tools can also be returned with `listCmdTools(sal, check_path=TRUE)`, and the availability of individual CL tools can be checked with `tryCL`, _e.g._ for `hisat2` use: `tryCL(command = "hisat2")`.

__3. Status summary.__ An overview of the workflow steps and their status information can be returned by typing `sal`. For space reasons, the following shows only the first 3 of a total of 20 steps of the RNA-Seq workflow. At this stage all workflow steps are listed as pending since none of them have been executed yet.

```{r eval=FALSE}
sal
## Instance of 'SYSargsList':
##     WF Steps:
##        1. load_SPR --> Status: Pending
##        2. preprocessing --> Status: Pending
##            Total Files: 36 | Existing: 0 | Missing: 36
##          2.1. preprocessReads-pe
##              cmdlist: 18 | Pending: 18
##        3. trimming --> Status: Pending
##            Total Files: 72 | Existing: 0 | Missing: 72
##        4. - 20. not shown here for brevity
```

__4. Run workflow.__ Next, one can execute the entire workflow from start to finish. The `steps` argument of `runWF` can be used to run only selected steps. For details, consult the help file with `?runWF`. During the run, detailed status information will be provided for each workflow step.

```{r eval=FALSE}
sal <- runWF(sal)
```

After completing all or only some steps, the status of workflow steps can always be checked with the summary print function. If a workflow step was completed, its status will change from `Pending` to `Success` or `Failed`.

```{r eval=FALSE}
sal
```

{width=80% fig-align="center"}

__5. Workflow topology graph.__ Workflows can be displayed as topology graphs using the `plotWF` function.
The run status information for each step and various other details are embedded in these graphs. Additional details are provided in the [visualize workflow section](https://girke.bioinformatics.ucr.edu/GEN242/tutorials/systempiper/systempiper/#visualize-workflow) below.

```{r eval=FALSE}
plotWF(sal)
```

{width=60% fig-align="center"}

__6. Report generation.__ The `renderReport` and `renderLogs` functions can be used for generating scientific and technical reports, respectively. Alternatively, scientific reports can be generated with the `render` function of the `rmarkdown` package. The latter option with `rmarkdown::render` is often more flexible and preferred by most users, since any modifications to the Rmd file are instantly reflected in the HTML report, eliminating the need to update the `sal` object.

```{r eval=FALSE}
# Scientific report
sal <- renderReport(sal)
rmarkdown::render("systemPipeRNAseq.Rmd", clean = TRUE, output_format = "BiocStyle::html_document")
# Technical (log) report
sal <- renderLogs(sal)
```

## Directory structure {#wf-directories}

The root directory of `systemPipeR` workflows contains by default three user-facing sub-directories: `data`, `results` and `param`. A fourth sub-directory is a hidden log directory with the default name `.SPRproject` that is created when initializing a workflow run with the `SPRproject` function (see above). Users can change the recommended directory structure, but will in some cases need to adjust the code in their workflows. Simply adding directories to the default structure is possible without requiring changes to the workflows. The following directory tree summarizes the expected content in each default directory (names given in ***green***).

* _**workflow/**_
    + This is the root directory of a workflow. It can have any name and includes the following files:
        + Workflow *Rmd* and metadata targets file(s)
        + Optionally, configuration files for computer clusters, such as `.batchtools.conf.R` and `tmpl` files for `batchtools` and `BiocParallel`
        + Additional files can be added as needed
    + Default sub-directories:
        + _**param/**_
            + CWL parameter files are organized by CL tools (under _**cwl/**_), each with its own sub-directory that contains the corresponding `cwl` and `yml` files. Previous versions of parameter files are stored in a separate sub-directory.
        + _**data/**_
            + Raw input and/or assay data (*e.g.* FASTQ files)
            + Reference data, including genome sequences, annotation files, databases, etc.
            + Any number of sub-directories can be added to organize the data under this directory
            + Other input data
        + _**results/**_
            + Analysis results are written to this directory. Examples include tables, plots, or NGS results such as alignment (BAM), variant (VCF) and peak (BED) files.
            + Any number of sub-directories can be created to organize the analysis results under this directory
        + _**.SPRproject/**_
            + Hidden log directory created by the `SPRproject` function at the beginning of a workflow run. It is a hidden directory because its name starts with a dot.
            + Run status information and log files of a workflow run are stored here. The content in this directory is auto-generated and not expected to be modified by users.

## The _`targets`_ file {#targets-files}

A `targets` file defines the input files (_e.g._ FASTQ, BAM, BCF) and sample comparisons used in a data analysis workflow. It can also store any number of additional descriptive information for each sample. How the input information is passed on from a `targets` file to the CWL parameter files is introduced [above](https://girke.bioinformatics.ucr.edu/GEN242/tutorials/systempiper/systempiper/#cl-interface), and additional details are given [below](https://girke.bioinformatics.ucr.edu/GEN242/tutorials/systempiper/systempiper/#cwl).
The following shows the format of two _`targets`_ file examples included in the package. They can also be viewed and downloaded from _`systemPipeR`'s_ GitHub repository [here](https://github.com/tgirke/systemPipeR/blob/master/inst/extdata/targets.txt). As an alternative, `YAML` files can be used instead of targets files. Since organizing experimental variables in tabular files is straightforward, the following sections of this vignette focus on the usage of targets files. Their usage also integrates well with the widely used `SummarizedExperiment` object class. Descendant targets files can be extracted for each step with input/output operations, where the output of the previous step(s) serves as input to the current step, and the output of the current step becomes the input of the next step. This connectivity among input/output operations is automatically tracked throughout workflows. This makes it straightforward to start workflows at different processing stages. For instance, one can initialize an RNA-Seq workflow at the stage of raw sequence files (FASTQ), alignment files (BAM) or a precomputed read count table.

#### Single-end (SE) data

In a `targets` file with a single type of input files, here FASTQ files of single-end (SE) reads, the first three columns are mandatory, including their column names, while `targets` files for FASTQ files of PE reads have four mandatory columns. All subsequent columns are optional, and any number of additional columns can be added as needed. The columns in `targets` files are expected to be tab separated (TSV format). The `SampleName` column usually contains short labels for referencing samples (here FASTQ files) across many workflow steps (_e.g._ plots and column titles). Importantly, the labels used in the `SampleName` column need to be unique, while technical or biological replicates are indicated by the same values under the `Factor` column.
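These naming conventions can be verified programmatically. The following base-R sketch checks that `SampleName` values are unique and tallies replicates by `Factor`; the `check_targets` helper and the example table are hypothetical, not part of `systemPipeR`:

```r
# Hypothetical helper (not a systemPipeR function): basic sanity checks on a
# targets table, e.g. one read with read.delim(..., comment.char = "#").
check_targets <- function(targets) {
    # The three mandatory SE columns must be present with these names
    stopifnot(all(c("FileName", "SampleName", "Factor") %in% colnames(targets)))
    # SampleName labels must be unique across all rows
    stopifnot(!any(duplicated(targets$SampleName)))
    # Replicates are grouped by identical Factor values
    table(targets$Factor)
}

# Illustrative SE targets table with two replicates per condition
targets <- data.frame(
    FileName   = c("./data/M1A.fastq", "./data/M1B.fastq",
                   "./data/A1A.fastq", "./data/A1B.fastq"),
    SampleName = c("M1A", "M1B", "A1A", "A1B"),
    Factor     = c("M1", "M1", "A1", "A1")
)
check_targets(targets)  # returns the replicate counts per Factor level
```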
For readability and transparency, it is useful to use here a short, consistent and informative syntax for naming samples and replicates. This is important since the values provided under the `SampleName` and `Factor` columns are intended to be used as labels for naming the columns or plotting features in downstream analysis steps.

```{r targetsSE}
targetspath <- system.file("extdata", "targets.txt", package = "systemPipeR")
DT::datatable(read.delim(targetspath, comment.char = "#"))
```

To work with custom data, users need to generate a `targets` file containing the paths to their own FASTQ files and then provide under `targetspath` the path to the corresponding `targets` file.

#### Paired-end (PE) data

For paired-end (PE) samples, the structure of the targets file is similar. The main difference is that `targets` files for PE data have two FASTQ path columns (here `FileName1` and `FileName2`), each containing the paths to the corresponding PE FASTQ files.

```{r targetsPE}
targetspath <- system.file("extdata", "targetsPE.txt", package = "systemPipeR")
showDF(read.delim(targetspath, comment.char = "#"))
```

#### Sample comparisons

If needed, sample comparisons of comparative experiments, such as differentially expressed genes (DEGs), can be specified in the header lines of a `targets` file that start with a `#