--- title: "RNA-Seq Workflow Template" author: "First Last" date: last-modified sidebar: tutorials bibliography: bibtex.bib ---
```{r setup_dir} #| echo: false #| include: false #| message: false #| warning: false unlink(".SPRproject/", recursive = TRUE) ``` ```{r setup_libraries} #| echo: false #| message: false #| warning: false suppressPackageStartupMessages({ library(systemPipeR) }) ``` ## Introduction ### Overview This workflow template is for analyzing RNA-Seq data. It is provided by [systemPipeRdata](https://bioconductor.org/packages/devel/data/experiment/html/systemPipeRdata.html), a companion package to [systemPipeR](https://www.bioconductor.org/packages/devel/bioc/html/systemPipeR.html) [@H_Backman2016-bt]. Similar to other `systemPipeR` workflow templates, a single command generates the necessary working environment. This includes the expected directory structure for executing `systemPipeR` workflows and parameter files for running command-line (CL) software utilized in specific analysis steps. For learning and testing purposes, a small sample (toy) data set is also included (mainly FASTQ and reference genome files). This enables users to seamlessly run the numerous analysis steps of this workflow from start to finish without the requirement of providing custom data. After testing the workflow, users have the flexibility to employ the template as is with their own data or modify it to suit their specific needs. For more comprehensive information on designing and executing workflows, users want to refer to the main vignettes of [systemPipeR](https://www.bioconductor.org/packages/devel/bioc/vignettes/systemPipeR/inst/doc/systemPipeR.html) and [systemPipeRdata](https://www.bioconductor.org/packages/devel/data/experiment/vignettes/systemPipeRdata/inst/doc/systemPipeRdata.html). The `Rmd` file (`systemPipeRNAseq.Rmd`) associated with this vignette serves a dual purpose. It acts both as a template for executing the workflow and as a template for generating a reproducible scientific analysis report. Thus, users want to customize the text (and/or code) of this vignette to describe their experimental design and analysis results. This typically involves deleting the instructions how to work with this workflow, and customizing the text describing experimental designs, other metadata and analysis results. ### Experimental design Typically, the user wants to describe here the sources and versions of the reference genome sequence along with the corresponding annotations. The standard directory structure of `systemPipeR` (see [here](https://www.bioconductor.org/packages/devel/bioc/vignettes/systemPipeR/inst/doc/systemPipeR.html#3_Directory_structure)), expects the input data in a subdirectory named `data` and all results will be written to a separate `results` directory. The Rmd source file for executing the workflow and rendering its report (here `systemPipeRNAseq.Rmd`) is expected to be located in the parent directory. The test (toy) data set used by this template ([SRP010938](http://www.ncbi.nlm.nih.gov/sra/?term=SRP010938)) contains 18 paired-end (PE) read sets from *Arabidposis thaliana* [@Howard2013-fq]. To minimize processing time during testing, each FASTQ file has been reduced to 90,000-100,000 randomly sampled PE reads that map to the first 100,000 nucleotides of each chromosome of the *A. thaliana* genome. The corresponding reference genome sequence (FASTA) and its GFF annotation files have been reduced to the same genome regions. This way the entire test sample data set is less than 200MB in storage space. A PE read set has been chosen here for flexibility, because it can be used for testing both types of analysis routines requiring either SE (single end) reads or PE reads. To use their own RNA-Seq and reference genome data, users want to move or link the data to the designated `data` directory and execute the workflow from the parent directory using their customized `Rmd` file. Beginning with this template, users should delete the provided test data and move or link their custom data to the designated locations. Alternatively, users can create an environment skeleton (named `new` [here](https://www.bioconductor.org/packages/devel/data/experiment/vignettes/systemPipeRdata/inst/doc/new.html)) or build one from scratch. To perform an RNA-Seq analysis with new FASTQ files from the same reference genome, users only need to provide the FASTQ files and an experimental design file called 'targets' file that outlines the experimental design. The structure and utility of targets files is described in `systemPipeR's` main vignette [here](https://www.bioconductor.org/packages/devel/bioc/vignettes/systemPipeR/inst/doc/systemPipeR.html#4_The_targets_file). ### Workflow steps The default analysis steps included in this RNA-Seq workflow template are listed below. Users can modify the existing steps, add new ones or remove steps as needed. __Default analysis steps__ 1. Read preprocessing + Quality filtering (trimming) + FASTQ quality report 2. Alignments: _`HISAT2`_ (or any other RNA-Seq aligner) 3. Alignment stats 4. Read counting 5. Sample-wise correlation analysis 6. Analysis of differentially expressed genes (DEGs) 7. GO term enrichment analysis 8. Gene-wise clustering ### Setup of workflow environment NOTE: this section describes how to set up the proper environment (directory structure) for running `systemPipeR` workflows in the GEN242 class. This routine applies to all workflows. After mastering this task the workflow run instructions can be deleted since they are not expected to be included in a final HTML/PDF report of a workflow. 1. If a remote system or cluster is used, then users need to log in to the remote system first. The following applies to an HPC cluster (_e.g._ HPCC cluster). A terminal application or RStudion Server via onDemand needs to be used to log in to a user's cluster account. Next, one can open an interactive session on a computer node with `srun`. More details about argument settings for `srun` are available in this [HPCC manual](http://hpcc.ucr.edu/manuals_linux-cluster_jobs.html#partitions) or the HPCC section of this website [here](../linux/index.qmd#job-submission-with-sbatch). Next, load the R version required for running the workflow with `module load`. Sometimes it may be necessary to first unload an active software version before loading another version, _e.g._ `module unload R`. From command-line ```{bash shell_01} #| eval: false srun --x11 --partition=gen242 --account=gen242 --mem=20gb --cpus-per-task 8 --ntasks 1 --time 20:00:00 --pty bash -l module unload R; module load R/4.4.2 ``` 2. Load a workflow template with the `genWorkenvir` function. This can be done from the command-line or from within R. However, only one of the two options needs to be used. The environment for this RNA-Seq workflow is auto-generated below with the `genWorkenvir` function (selected under `workflow="rnaseq"`). It is fully populated with a small test data set, including FASTQ files, reference genome and annotation data. The name of the resulting workflow directory can be specified under the `mydirname` argument. The default `NULL` uses the name of the chosen workflow. An error is issued if a directory of the same name and path exists already. After this, the user’s R session needs to be directed into the resulting `rnaseq` directory (here with `setwd`). From command-line ```{bash shell_02} #| eval: false Rscript -e "systemPipeRdata::genWorkenvir(workflow='rnaseq')" cd rnaseq ``` From R ```{r gen_workflow_envir} #| eval: false library(systemPipeRdata) genWorkenvir(workflow = "rnaseq", mydirname = "rnaseq") setwd("rnaseq") ``` 3. If the user wishes to use another `Rmd` file than the template instance provided by the `genWorkenvir` function, then it can be copied or downloaded into the root directory of the workflow environment (_e.g._ with `cp`, `download.file` or `wget`). For the first introduction to RNA-Seq analysis in GEN242, we will use the `Rmd` file obtained by `genWorkenvir`. Note, the `Rmd` source file of this tutorial page is linked on the top. ```{bash shell_03} #| eval: false # From command-line wget <*.Rmd> -O <*.Rmd> # From R download.file("<*.Rmd>", "*.Rmd>") ``` 4. Now one can open from the root directory of the workflow the corresponding R Markdown script (_e.g._ systemPipeChIPseq.Rmd) using an R IDE, such as _nvim-r_, _ESS_ or RStudio. Subsequently, the workflow can be run as outlined below. #### Import custom functions Custom functions for the challenge projects can be imported with the source command from a local R script (here [challengeProject_Fct.R](https://raw.githubusercontent.com/tgirke/GEN242/main/content/en/tutorials/spchipseq/challengeProject_Fct.R)). Skip this step if such a script is not available. Alternatively, these functions can be loaded from a custom R package. ```{r load_custom_fct} #| eval: false #| message: false source("challengeProject_Fct.R") ``` #### Input data: `targets` file The `targets` file defines the input files (e.g. FASTQ or BAM) and sample comparisons used in a data analysis workflow. It can also store any number of additional descriptive information for each sample. The following shows the first four lines of the `targets` file used in this workflow template. ```{r load_targets_file} targetspath <- system.file("extdata", "targetsPE.txt", package = "systemPipeR") targets <- read.delim(targetspath, comment.char = "#") targets[1:4,-c(5,6)] ``` To work with custom data, users need to generate a _`targets`_ file containing the paths to their own FASTQ files. [Here](https://www.bioconductor.org/packages/devel/bioc/vignettes/systemPipeR/inst/doc/systemPipeR.html#4_The_targets_file) is a detailed description of the structure and utility of `targets` files. ## Quick start {#quick-start} After a workflow environment has been created with the above `genWorkenvir` function call and the corresponding R session directed into the resulting directory (here `rnaseq`), the `SPRproject` function is used to initialize a new workflow project instance. The latter creates an empty `SAL` workflow container (below `sal`) and at the same time a linked project log directory (default name `.SPRproject`) that acts as a flat-file database of a workflow. Additional details about this process and the SAL workflow control class are provided in `systemPipeR's` main vignette [here](https://www.bioconductor.org/packages/devel/bioc/vignettes/systemPipeR/inst/doc/systemPipeR.html#11_Workflow_control_class) and [here](https://www.bioconductor.org/packages/devel/bioc/vignettes/systemPipeR/inst/doc/systemPipeR.html#5_Detailed_tutorial). Next, the `importWF` function imports all the workflow steps outlined in the source Rmd file of this vignette (here `systemPipeRNAseq.Rmd`) into the `SAL` workflow container. An overview of the workflow steps and their status information can be returned at any stage of the loading or run process by typing `sal`. ```{r project_rnaseq} #| eval: false library(systemPipeR) sal <- SPRproject() # Intializes new workflow. # sal <- SPRproject(resume=TRUE, load.envir=TRUE) # Restarts workflow. # sal <- SPRproject(overwrite=TRUE) # Resets workflow. sal <- importWF(sal, file_path = "systemPipeRNAseq.Rmd", verbose = FALSE) sal ``` After loading the workflow into `sal`, it can be executed from start to finish (or partially) with the `runWF` command. Running the workflow will only be possible if all dependent CL software is installed on a user's system. Their names and availability on a system can be listed with `listCmdTools(sal, check_path=TRUE)`. For more information about the `runWF` command, refer to the help file and the corresponding section in the main vignette [here](https://www.bioconductor.org/packages/devel/bioc/vignettes/systemPipeR/inst/doc/systemPipeR.html#61_Overview). Running workflows in parallel mode on computer clusters is a straightforward process in `systemPipeR`. Users can simply append the resource parameters (such as the number of CPUs) for a cluster run to the `sal` object after importing the workflow steps with `importWF` using the `addResources` function. More information about parallelization can be found in the corresponding section at the end of this vignette [here](#paralellization) and in the main vignette [here](https://www.bioconductor.org/packages/devel/bioc/vignettes/systemPipeR/inst/doc/systemPipeR.html#63_Parallel_evaluation). ```{r run_rnaseq} #| eval: false sal <- runWF(sal) ``` Workflows can be visualized as topology graphs using the `plotWF` function. ```{r plot_rnaseq} #| eval: false plotWF(sal) ``` {width=75% fig-align="center"} Scientific and technical reports can be generated with the `renderReport` and `renderLogs` functions, respectively. Scientific reports can also be generated with the `render` function of the `rmarkdown` package. The latter option with `rmarkdown::render` is often more flexible and preferred for most users, since it provides the advantage that any modifications to the Rmd file are instantly reflected in the HTML report, eliminating the necessity to update the sal object. The technical reports are based on log information that `systemPipeR` collects during workflow runs. ```{r report_rnaseq} #| eval: false # Scientific report sal <- renderReport(sal) rmarkdown::render("systemPipeRNAseq.Rmd", clean = TRUE, output_format = "BiocStyle::html_document") # Technical (log) report sal <- renderLogs(sal) ``` The `statusWF` function returns a status summary for each step in a `SAL` workflow instance. ```{r status_rnaseq} #| eval: false statusWF(sal) ``` ## Workflow steps The data analysis steps of this workflow are defined by the following workflow code chunks. They can be loaded into `SAL` interactively, by executing the code of each step in the R console, or all at once with the `importWF` function used under the Quick start section. R and CL workflow steps are declared in the code chunks of `Rmd` files with the `LineWise` and `SYSargsList` functions, respectively, and then added to the `SAL` workflow container with `appendStep<-`. Their syntax and usage is described [here](https://www.bioconductor.org/packages/devel/bioc/vignettes/systemPipeR/inst/doc/systemPipeR.html#52_Constructing_workflows). ### Load packages The first step loads the `systemPipeR` package. ```{r load_SPR} #| message: false #| eval: false #| spr: true cat(crayon::blue$bold("To use this workflow, the following R packages are required:\n")) cat(c("'GenomicFeatures", "BiocParallel", "DESeq2", "ape", "edgeR", "biomaRt", "pheatmap","ggplot2'\n"), sep = "', '") ###pre-end appendStep(sal) <- LineWise(code = { library(systemPipeR) }, step_name = "load_SPR") ``` ### Read preprocessing #### With `preprocessReads` The `preprocessReads` function allows applying predefined or custom read preprocessing functions to all FASTQ files referenced in a SAL container, such as quality filtering or adapter trimming routines. Internally, `preprocessReads` uses the `FastqStreamer` function from the `ShortRead` package to stream through large FASTQ files in a memory-efficient manner. The following example uses `preprocessReads` to perform adapter trimming with the `trimLRPatterns` function from the `Biostrings` package. In this instance, `preprocessReads` is invoked through a CL interface built on `docopt`, that is executed from R with CWL. The parameters for running `preprocessReads` are specified in the corresponding `cwl/yml` files. It is important to point out that creating and using CL interfaces for defining R-based workflow steps is not essential in `systemPipeR` since [`LineWise`](https://www.bioconductor.org/packages/devel/bioc/vignettes/systemPipeR/inst/doc/systemPipeR.html#5211_Step_1:_R_step) offers similar capabilities while requiring less specialized knowledge from users. ```{r preprocessing} #| message: false #| eval: false #| spr: true appendStep(sal) <- SYSargsList( step_name = "preprocessing", targets = "targetsPE.txt", dir = TRUE, wf_file = "preprocessReads/preprocessReads-pe.cwl", input_file = "preprocessReads/preprocessReads-pe.yml", dir_path = system.file("extdata/cwl", package = "systemPipeR"), inputvars = c( FileName1 = "_FASTQ_PATH1_", FileName2 = "_FASTQ_PATH2_", SampleName = "_SampleName_" ), dependency = c("load_SPR")) ``` The paths to the output files generated by the preprocessing step (here trimmed FASTQ files) are recorded in a new `targets` file that can be used for the next workflow step, _e.g._ running the NGS alignments with the trimmed FASTQ files. The following example demonstrates how to design a custom `preprocessReads` function, as well as how to replace parameters in the `sal` object. To apply the modifications to the workflow, it needs to be saved to a file, here `param/customFCT.RData` which will be loaded during the workflow run by the `preprocessReads.doc.R` script. Please note, this step is included here solely for demonstration purposes, and thus not part of the workflow run. This is achieved by dropping `spr=TRUE` in the header line of the code chunk. ```{r custom_preprocessing_function} #| eval: false appendStep(sal) <- LineWise( code = { filterFct <- function(fq, cutoff = 20, Nexceptions = 0) { qcount <- rowSums(as(quality(fq), "matrix") <= cutoff, na.rm = TRUE) # Retains reads where Phred scores are >= cutoff with N exceptions fq[qcount <= Nexceptions] } save(list = ls(), file = "param/customFCT.RData") }, step_name = "custom_preprocessing_function", dependency = "preprocessing" ) ``` After defining this step, it can be inspected and modified as follows. ```{r editing_preprocessing} #| message: false #| eval: false yamlinput(sal, "preprocessing")$Fct yamlinput(sal, "preprocessing", "Fct") <- "'filterFct(fq, cutoff=20, Nexceptions=0)'" yamlinput(sal, "preprocessing")$Fct ## check the new function cmdlist(sal, "preprocessing", targets = 1) ## check if the command line was updated with success ``` #### With Trimmomatic For demonstration purposes, this workflow uses the [Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic) software as an example of an external CL read trimming tool [@Bolger2014-yr]. Trimmomatic offers a range of practical trimming utilities specifically designed for single- and paired-end Illumina reads. It is important to note that while the Trimmomatic trimming step is included in this workflow, it's not mandatory. Users can opt to use read trimming results generated by the previous `preprocessReads` step if preferred. ```{r trimming} #| eval: false #| spr: true appendStep(sal) <- SYSargsList( step_name = "trimming", targets = "targetsPE.txt", wf_file = "trimmomatic/trimmomatic-pe.cwl", input_file = "trimmomatic/trimmomatic-pe.yml", dir_path = system.file("extdata/cwl", package = "systemPipeR"), inputvars=c(FileName1="_FASTQ_PATH1_", FileName2="_FASTQ_PATH2_", SampleName="_SampleName_"), dependency = "load_SPR", run_step = "optional") ``` #### FASTQ quality report The following `seeFastq` and `seeFastqPlot` functions generate and plot a series of useful quality statistics for a set of FASTQ files, including per cycle quality box plots, base proportions, base-level quality trends, relative k-mer diversity, length, and occurrence distribution of reads, number of reads above quality cutoffs and mean quality distribution. The results can be exported to different graphics formats, such as a PNG file, here named `fastqReport.png`. Detailed information about the usage and visual components in the quality plots can be found in the corresponding help file (see `?seeFastq` or `?seeFastqPlot`). ```{r fastq_report} #| eval: false #| message: false #| spr: true appendStep(sal) <- LineWise(code = { fastq <- getColumn(sal, step = "preprocessing", "targetsWF", column = 1) fqlist <- seeFastq(fastq = fastq, batchsize = 10000, klength = 8) png("./results/fastqReport.png", height = 1500, width = 330 * length(fqlist)) seeFastqPlot(fqlist) dev.off() }, step_name = "fastq_report", dependency = "preprocessing") ``` {width=100% fig-align="center"} ## Short read alignments ### Read mapping with `HISAT2` To use the `HISAT2` short read aligner developed by Kim, Langmead, and Salzberg [-@Kim2015-ve], it is necessary to index the reference genome. `HISAT2` relies on the Burrows-Wheeler index for this process. ```{r hisat2_index} #| eval: false #| spr: true appendStep(sal) <- SYSargsList( step_name = "hisat2_index", dir = FALSE, targets=NULL, wf_file = "hisat2/hisat2-index.cwl", input_file="hisat2/hisat2-index.yml", dir_path="param/cwl", dependency = "load_SPR" ) ``` #### `HISAT2` mapping The parameter settings of the aligner are defined in the `cwl/yml` files used in the following code chunk. The following shows how to construct the alignment step and append it to the `SAL` workflow container. Please note that the input (FASTQ) files used in this step are the output files generated by the preprocessing step (see above: `step_name = "preprocessing"`). ```{r hisat2_mapping} #| eval: false #| spr: true appendStep(sal) <- SYSargsList( step_name = "hisat2_mapping", dir = TRUE, targets ="preprocessing", wf_file = "workflow-hisat2/workflow_hisat2-pe.cwl", input_file = "workflow-hisat2/workflow_hisat2-pe.yml", dir_path = "param/cwl", inputvars = c(preprocessReads_1 = "_FASTQ_PATH1_", preprocessReads_2 = "_FASTQ_PATH2_", SampleName = "_SampleName_"), rm_targets_col = c("FileName1", "FileName2"), dependency = c("preprocessing", "hisat2_index") ) ``` The `cmdlist` functions allows to inspect the exact CL call used for each input file (sample), here for `HISAT2` alignments. Note, this step also includes the conversion of the alignment files to sorted and indexed bam files using functionalities of the `SAMtools` CL suite. ```{r bowtie2_alignment} #| eval: false cmdlist(sal, step="hisat2_mapping", targets=1) ``` ```{r hisat_cl} #| eval: false $hisat2_mapping $hisat2_mapping$M1A $hisat2_mapping$M1A$hisat2 [1] "hisat2 -S ./results/M1A.sam -x ./data/tair10.fasta -k 1 --min-intronlen 30 --max-intronlen 3000 -1 ./results/M1A_1.fastq_trim.gz -2 ./results/M1A_2.fa stq_trim.gz --threads 4" $hisat2_mapping$M1A$`samtools-view` [1] "samtools view -bS -o ./results/M1A.bam ./results/M1A.sam " $hisat2_mapping$M1A$`samtools-sort` [1] "samtools sort -o ./results/M1A.sorted.bam ./results/M1A.bam -@ 4" $hisat2_mapping$M1A$`samtools-index` [1] "samtools index -b results/M1A.sorted.bam results/M1A.sorted.bam.bai ./res ults/M1A.sorted.bam " ``` #### Alignment stats The following computes an alignment summary file (here `alignStats.xls`), which comprises the count of reads in each FASTQ file and the number of reads that align with the reference, presented in both total and percentage values. ```{r align_stats} #| eval: false #| spr: true appendStep(sal) <- LineWise( code = { fqpaths <- getColumn(sal, step = "preprocessing", "targetsWF", column = "FileName1") bampaths <- getColumn(sal, step = "hisat2_mapping", "outfiles", column = "samtools_sort_bam") read_statsDF <- alignStats(args = bampaths, fqpaths = fqpaths, pairEnd = TRUE) write.table(read_statsDF, "results/alignStats.xls", row.names=FALSE, quote=FALSE, sep="\t") }, step_name = "align_stats", dependency = "hisat2_mapping") ``` The resulting `alignStats.xls` file can be included in the report as shown below (here restricted to the first four rows). ```{r align_stats_view} read.table("results/alignStats.xls", header = TRUE)[1:4,] ``` ### Viewing BAM files in IGV The `symLink2bam` function creates symbolic links to view the BAM alignment files in a genome browser such as IGV without moving these large files to a local system. The corresponding URLs are written to a file with a path specified under `urlfile`, here `IGVurl.txt`. Please replace the directory and the user name. The `symLink2bam` function creates symbolic links to view the BAM alignment files in a genome browser such as IGV without moving these large files to a local system. The corresponding URLs are written to a file with a path specified under `urlfile`, here `IGVurl.txt`. To make the following code work, users need to change the directory name (here `