--- title: "Data Visualizations" author: "Author: Daniela Cassol (danielac@ucr.edu) and Thomas Girke (thomas.girke@ucr.edu)" date: "Last update: `r format(Sys.time(), '%d %B, %Y')`" output: BiocStyle::html_document: toc_float: true code_folding: show BiocStyle::pdf_document: default package: systemPipeTools vignette: | %\VignetteEncoding{UTF-8} %\VignetteIndexEntry{systemPipeTools: Data Visualizations} %\VignetteEngine{knitr::rmarkdown} fontsize: 14pt bibliography: bibtex.bib type: docs weight: 1 --- ```{css, echo=FALSE, eval=TRUE} pre code { white-space: pre !important; overflow-x: scroll !important; word-break: keep-all !important; word-wrap: initial !important; } ``` ```{r style, echo = FALSE, results = 'asis', eval=TRUE} BiocStyle::markdown() options(width = 80, max.print = 1000) knitr::opts_chunk$set( eval = as.logical(Sys.getenv("KNITR_EVAL", "TRUE")), cache = as.logical(Sys.getenv("KNITR_CACHE", "TRUE")), tidy.opts = list(width.cutoff = 80), tidy = TRUE ) ``` ```{r setup, echo=FALSE, messages=FALSE, warnings=FALSE, eval=TRUE} suppressPackageStartupMessages({ library(systemPipeTools) library(systemPipeR) }) ``` # Data Visualization with `systemPipeR` *systemPipeTools* package extends the widely used *[systemPipeR](https://systempipe.org/)* (SPR) [@H_Backman2016-bt] workflow environment with enhanced toolkit for data visualization, including utilities to automate the analysis of differentially expressed genes (DEGs). *systemPipeTools* provides functions for data transformation and data exploration via scatterplots, hierarchical clustering heatMaps, principal component analysis, multidimensional scaling, generalized principal components, t-Distributed Stochastic Neighbor embedding (t-SNE), and MA and volcano plots. All these utilities can be integrated with the modular design of the *systemPipeR* environment that allows users to easily substitute any of these features and/or custom with alternatives. ## Metadata and Reads Counting Information The first step is importing the `targets` file and raw reads counting table. - The `targets` file defines all FASTQ files and sample comparisons of the analysis workflow. - The raw reads counting table represents all the reads that map to gene (row) for each sample (columns). ```{r targets_counts, eval=TRUE} ## Targets file targetspath <- system.file("extdata", "targets.txt", package = "systemPipeR") targets <- read.delim(targetspath, comment = "#") cmp <- systemPipeR::readComp(file = targetspath, format = "matrix", delim = "-") ## Count table file countMatrixPath <- system.file("extdata", "countDFeByg.xls", package = "systemPipeR") countMatrix <- read.delim(countMatrixPath, row.names = 1) showDT(countMatrix) ``` ## Data Transformation For gene differential expression, raw counts are required, however for data visualization or clustering, it can be useful to work with transformed count data. `exploreDDS` function is convenience wrapper to transform raw read counts using the [`DESeq2`](@Love2014-sh) package transformations methods. The input file has to contain all the genes, not just differentially expressed ones. Supported methods include variance stabilizing transformation (`vst`) (@Anders2010-tp), and regularized-logarithm transformation or `rlog` (@Love2014-sh). ```{r exploreDDS, eval=TRUE, warning=FALSE} exploredds <- exploreDDS(countMatrix, targets, cmp = cmp[[1]], preFilter = NULL, transformationMethod = "rlog") exploredds ``` Users are strongly encouraged to consult the [`DESeq2`](@Love2014-sh) vignette for more detailed information on this topic and how to properly run `DESeq2` on data sets with more complex experimental designs. ## Scatterplot To decide which transformation to choose, we can visualize the transformation effect comparing two samples or a grid of all samples, as follows: ```{r exploreDDSplot, eval=TRUE, warning=FALSE} exploreDDSplot(countMatrix, targets, cmp = cmp[[1]], preFilter = NULL, samples = c("M12A", "M12A", "A12A", "A12A"), scattermatrix = TRUE) ``` The scatterplots are created using the log2 transform normalized reads count, variance stabilizing transformation (VST) (@Anders2010-tp), and regularized-logarithm transformation or `rlog` (@Love2014-sh). ## Hierarchical Clustering Dendrogram The following computes the sample-wise correlation coefficients using the `stats::cor()` function from the transformed expression values. After transformation to a distance matrix, hierarchical clustering is performed with the `stats::hclust` function and the result is plotted as a dendrogram, as follows: ```{r hclustplot, eval=TRUE} hclustplot(exploredds, method = "spearman") ``` The function provides the utility to save the plot automatically. ## Hierarchical Clustering HeatMap This function performs hierarchical clustering on the transformed expression matrix generated within the `DESeq2` package. It uses, by default, a `Pearson` correlation-based distance measure and complete linkage for cluster join. If `samples` selected in the `clust` argument, it will be applied the `stats::dist()` function to the transformed count matrix to get sample-to-sample distances. Also, it is possible to generate the `pheatmap` or `plotly` plot format. ```{r heatMaplot_samples, eval=TRUE} ## Samples plot heatMaplot(exploredds, clust = "samples", plotly = TRUE) ``` If `ind` selected in the `clust` argument, it is necessary to provide the list of differentially expressed genes for the `exploredds` subset. ```{r heatMaplot_DEG, eval=TRUE, warning=FALSE} ## Individuals genes identified in DEG analysis ### DEG analysis with `systemPipeR` degseqDF <- systemPipeR::run_DESeq2(countDF = countMatrix, targets = targets, cmp = cmp[[1]], independent = FALSE) DEG_list <- systemPipeR::filterDEGs(degDF = degseqDF, filter = c(Fold = 2, FDR = 10)) ### Plot heatMaplot(exploredds, clust = "ind", DEGlist = unique(as.character(unlist(DEG_list[[1]])))) ``` The function provides the utility to save the plot automatically. ## Principal Component Analysis This function plots a Principal Component Analysis (PCA) from transformed expression matrix. This plot shows samples variation based on the expression values and identifies batch effects. ```{r PCAplot, eval=TRUE} PCAplot(exploredds, plotly = FALSE) ``` The function provides the utility to save the plot automatically. ## Multidimensional scaling with `MDSplot` This function computes and plots multidimensional scaling analysis for dimension reduction of count expression matrix. Internally, it is applied the `stats::dist()` function to the transformed count matrix to get sample-to-sample distances. ```{r MDSplot, eval=TRUE} MDSplot(exploredds, plotly = FALSE) ``` The function provides the utility to save the plot automatically. ## Dimension Reduction with `GLMplot` This function computes and plots generalized principal components analysis for dimension reduction of count expression matrix. ```{r GLMplot, eval=TRUE, warning=FALSE} exploredds_r <- exploreDDS(countMatrix, targets, cmp = cmp[[1]], preFilter = NULL, transformationMethod = "raw") GLMplot(exploredds_r, plotly = FALSE) ``` The function provides the utility to save the plot automatically. ## MA plot This function plots log2 fold changes (y-axis) versus the mean of normalized counts (on the x-axis). Statistically significant features are colored. ```{r MAplot, eval=TRUE} MAplot(degseqDF, comparison = "M12-A12", filter = c(Fold = 1, FDR = 20), genes = "ATCG00280") ``` The function provides the utility to save the plot automatically. ## t-Distributed Stochastic Neighbor embedding with `tSNEplot` This function computes and plots t-Distributed Stochastic Neighbor embedding (t-SNE) analysis for unsupervised nonlinear dimensionality reduction of count expression matrix. Internally, it is applied the `Rtsne::Rtsne()` [@Rtsne] function, using the exact t-SNE computing with `theta=0.0`. ```{r tSNEplot, eval=TRUE} tSNEplot(countMatrix, targets, perplexity = 5) ``` ## Volcano plot A simple function that shows statistical significance (`p-value`) versus magnitude of change (`log2 fold change`). ```{r volcanoplot, eval=TRUE} volcanoplot(degseqDF, comparison = "M12-A12", filter = c(Fold = 1, FDR = 20), genes = "ATCG00280") ``` # Version information ```{r sessionInfo, eval=TRUE} sessionInfo() ``` # Funding This project is funded by NSF award [ABI-1661152](https://www.nsf.gov/awardsearch/showAward?AWD_ID=1661152). # References