--- title: "Introduction to Tidy Transcriptomics" author: - Maria Doyle, Peter MacCallum Cancer Centre^[] - Stefano Mangiola, Walter and Eliza Hall Institute^[] output: rmarkdown::html_vignette bibliography: "`r file.path(system.file(package='bioc2021tidytranscriptomics', 'vignettes'), 'tidytranscriptomics.bib')`" vignette: > %\VignetteIndexEntry{Introduction to Tidy Transcriptomics} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{css, echo = FALSE} # Formatting for challenges .challenge { background-color: #fff4d4; } .challenge code { background-color: #fff4d4; } ``` # Workshop introduction ```{r, echo=FALSE, out.width = "100px"} knitr::include_graphics("../inst/vignettes/tidybulk_logo.png") ``` ## Instructors *Dr. Stefano Mangiola* is currently a Postdoctoral researcher in the laboratory of Prof. Tony Papenfuss at the Walter and Eliza Hall Institute in Melbourne, Australia. His background spans from biotechnology to bioinformatics and biostatistics. His research focuses on prostate and breast tumour microenvironment, the development of statistical models for the analysis of RNA sequencing data, and data analysis and visualisation interfaces. *Dr. Maria Doyle* is the Application and Training Specialist for Research Computing at the Peter MacCallum Cancer Centre in Melbourne, Australia. She has a PhD in Molecular Biology and currently works in bioinformatics and data science education and training. She is passionate about supporting researchers, reproducible research, open source and tidy data. ## Overview Recently, [plyranges](https://bioconductor.org/packages/release/bioc/html/plyranges.html) and [tidybulk](https://www.bioconductor.org/packages/release/bioc/html/tidybulk.html) have made efforts toward the harmonization of biological data structures and workflows using the concept of data tidiness, to facilitate modularisation. In this workshop, we present [tidySingleCellExperiment](https://stemangiola.github.io/tidySingleCellExperiment/) and [tidySummarizedExperiment](https://stemangiola.github.io/tidySummarizedExperiment/), two R packages that allow the user to visualise and manipulate SingleCellExperiment and SummarizedExperiment objects in a tidy fashion. Importantly, the tidybulk framework now works natively with SummarizedExperiment objects and, thanks to tidySummarizedExperiment, allows tidy and modular RNA sequencing analyses without renouncing the efficiency of Bioconductor data containers. These tools are part of the [tidytranscriptomics](https://github.com/stemangiola/tidytranscriptomics) R software suite, and represent an effort toward the harmonisation of transcriptional analyses under the tidy umbrella. ### Pre-requisites * Some familiarity with tidyverse syntax * Some familiarity with bulk RNA-seq and single cell RNA-seq Recommended Background Reading [Introduction to R for Biologists](https://melbournebioinformatics.github.io/r-intro-biologists/intro_r_biologists.html) ### Time outline This workshop is a 1.5 hour session. Guide | Activity - Hands on demos with challenges and Q&A | Time | |--------------------------------------------------------------------|------| | Part 1 Bulk RNA-seq with tidySummarizedExperiment and tidybulk | 45 | | Part 2 Single-cell RNA-seq with tidySingleCellExperiment | 45 | | Total | 90m |
**Format**: Hands on demos, challenges plus Q&A **Interact**: Airmeet for questions, Menti quiz for challenges

### Workshop goals and objectives #### Learning goals * To understand how transcriptomic data can be represented and analysed according to the tidy data paradigm with tidySummarizedExperiment, tidybulk and tidySingleCellExperiment. #### Learning objectives * Explore, visualise and analyse bulk RNA-seq count data with tidySummarizedExperiment and tidybulk * Explore and visualise single cell RNA-seq count data with tidySingleCellExperiment ### Google doc with all links [This Google doc](https://docs.google.com/document/d/151UE3tOeYLnRJRPPNZA-UD13E5QWUg470BWbHOJ5CPA/edit?usp=sharing) has all the links you need for this workshop. This document will be made accessible at the start of the workshop as it contains the RStudio in the cloud login instructions and Menti code, which are only for Bioc2021 registered attendees. * Material * Menti for Audience Questions and Challenges Quiz * Feedback survey * Rstudio - instructions for logging into workshop in cloud ### Audience Questions First we have a few questions for you, the audience. Please either open a tab in your browser or use your phone. Go to Menti (www.menti.com) and type the code given in the [Google doc](https://docs.google.com/document/d/151UE3tOeYLnRJRPPNZA-UD13E5QWUg470BWbHOJ5CPA/edit?usp=sharing). Then please rate the following on a scale of 1-5 (1=no/none, 5=yes/a lot) * I will code along and/or try out exercises during this workshop. * I have experience with transcriptomic analyses * I have experience with tidyverse ## What is transcriptomics? “The transcriptome is the set of all RNA transcripts, including coding and non-coding, in an individual or a population of cells” _Wikipedia_ ```{r, echo=FALSE, out.width = "600px"} knitr::include_graphics("../inst/vignettes/transcriptomics.jpg") ``` ## Why use transcriptomics? - Genome (DNA) pretty stable - Proteome (proteins) harder to measure - Transcriptome (RNA) can measure changes in expression of thousands of coding and non-coding transcripts

## Possible experimental design ```{r, echo=FALSE, out.width = "600px"} knitr::include_graphics("../inst/vignettes/ScreenShot2.png") ``` ## How does transcriptomics work? ```{r, echo=FALSE, out.width = "600px"} knitr::include_graphics("../inst/vignettes/ScreenShot3.png") ``` ## Types of transcriptomic analyses * **Differential expression** * **Cell type composition** * Alternative splicing * Novel transcript discovery * Fusions identification * Variant analysis
Topics in bold we will see in this workshop

## Bulk RNA sequencing differential expression workflow ```{r, echo=FALSE, out.width = "600px"} knitr::include_graphics("../inst/vignettes/bulk_RNAseq_pipeline.png") ``` ## Differences between bulk and single-cell RNA sequencing ```{r, echo=FALSE, out.width = "600px"} knitr::include_graphics("../inst/vignettes/bulk_vs_single.jpg") ``` _Shalek and Benson, 2017_ ## Single-cell RNA sequencing analysis workflow ```{r, echo=FALSE, out.width = "600px"} knitr::include_graphics("../inst/vignettes/single_cell_RNAseq_pipeline.png") ``` ## Tidy data and the tidyverse This workshop demonstrates how to perform analysis of RNA sequencing data following the tidy data paradigm [@wickham2014tidy]. The tidy data paradigm provides a standard way to organise data values within a dataset, where each variable is a column, each observation is a row, and data is manipulated using an easy-to-understand vocabulary. Most importantly, the data structure remains consistent across manipulation and analysis functions. For more information, see the R for Data Science chapter on tidy data [here](https://r4ds.had.co.nz/tidy-data.html). ```{r, echo=FALSE, out.width = "500px", fig.cap="[Tidy data for efficiency, reproducibilty and collaboration](https://www.openscapes.org/blog/2020/10/12/tidy-data/)"} knitr::include_graphics("../inst/vignettes/tidydata_1.jpg") ``` The tidyverse is a collection of packages that can be used to tidy, manipulate and visualise data. We'll use many functions from the tidyverse in this workshop, such as `filter`, `select`, `mutate`, `pivot_longer` and `ggplot`. ```{r, echo=FALSE, out.width = "500px", fig.cap="Adapted from [Getting Started with tidyverse in R](https://www.storybench.org/getting-started-with-tidyverse-in-r/)"} knitr::include_graphics("../inst/vignettes/tidyverse.png") ``` ## Getting started ### Cloud Easiest way to run this material. Only available during workshop. * Login to workshop as per the instructions in the [Google doc](https://docs.google.com/document/d/151UE3tOeYLnRJRPPNZA-UD13E5QWUg470BWbHOJ5CPA/edit?usp=sharing). * Open `tidytranscriptomics.Rmd` in `bioc2021_tidytranscriptomcs/vignettes` folder ### Local We will use the Cloud during the Bioc2021 workshop and this method is available if you want to run the material after the workshop. If you want to install on your own computer, see instructions [here](https://stemangiola.github.io/bioc2021_tidytranscriptomics/index.html#workshop-package-installation). Alternatively, you can view the material at the workshop webpage [here](https://stemangiola.github.io/bioc2021_tidytranscriptomics/articles/tidytranscriptomics.html). ## What is tidytranscriptomics? [tidybulk](https://stemangiola.github.io/tidybulk/), [tidySummarizedExperiment](https://stemangiola.github.io/tidySummarizedExperiment/) and [tidySingleCellExperiment](https://stemangiola.github.io/tidySingleCellExperiment/) are part of the tidytranscriptomics suite that introduces a tidy approach to RNA sequencing data representation and analysis. The roadmap for the development of the tidytranscriptomics packages is shown in the figure below. ```{r, echo=FALSE, out.width = "600px"} knitr::include_graphics("../inst/vignettes/roadmap_integration.png") ``` # Part 1 Bulk RNA sequencing with tidySummarizedExperiment and tidybulk ```{r, echo=FALSE, out.width = "100px"} knitr::include_graphics("../inst/vignettes/tidybulk_logo.png") ``` ### Acknowledgements Some of the material in Part 1 was adapted from an R RNA sequencing workshop first run [here](http://combine-australia.github.io/2016-05-11-RNAseq/). The use of the airway dataset was inspired by the [DESeq2 vignette](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html). First, let's load all the packages we will need to analyse the bulk RNA sequencing data. *Note: you should load the tidybulk* and *tidySummarizedExperiment* libraries after the tidyverse core packages for best integration. ```{r message=FALSE, warning=FALSE} # dataset library(airway) # tidyverse core packages library(tibble) library(dplyr) library(tidyr) library(readr) library(stringr) library(ggplot2) library(purrr) # tidyverse-friendly packages library(plotly) library(ggrepel) library(GGally) library(tidybulk) # library(tidySummarizedExperiment) # we'll load this below to show what it can do ``` Plot settings. Set the colours and theme we will use for our plots. ```{r} # Use colourblind-friendly colours friendly_cols <- dittoSeq::dittoColors() # Set theme custom_theme <- list( scale_fill_manual(values = friendly_cols), scale_color_manual(values = friendly_cols), theme_bw() + theme( panel.border = element_blank(), axis.line = element_line(), panel.grid.major = element_line(size = 0.2), panel.grid.minor = element_line(size = 0.1), text = element_text(size = 12), legend.position = "bottom", strip.background = element_blank(), axis.title.x = element_text(margin = margin(t = 10, r = 10, b = 10, l = 10)), axis.title.y = element_text(margin = margin(t = 10, r = 10, b = 10, l = 10)), axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1) ) ) ``` ### tidySummarizedExperiment tidySummarizedExperiment provides a bridge between Bioconductor [SummarizedExperiment](https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html) [@morgan2021summarized] and the tidyverse [@wickham2019welcome]. It enables viewing the Bioconductor *SummarizedExperiment* object as a tidyverse tibble, and provides SummarizedExperiment-compatible *dplyr*, *tidyr*, *ggplot* and *plotly* functions. This allows users to get the best of both Bioconductor and tidyverse worlds. ## Functions/utilities available SummarizedExperiment-compatible Functions | Description ------------ | ------------- `all` | `tidySummarizedExperiment` is a SummarizedExperiment object tidyverse Packages | Description ------------ | ------------- `dplyr` | Almost all `dplyr` APIs like for any tibble `tidyr` | Almost all `tidyr` APIs like for any tibble `ggplot2` | `ggplot` like for any tibble `plotly` | `plot_ly` like for any tibble Utilities | Description ------------ | ------------- `tidy` | Add `tidySummarizedExperiment` invisible layer over a SummarizedExperiment object `as_tibble` | Convert cell-wise information to a `tbl_df` Here we will demonstrate performing a bulk RNA sequencing analysis using *tidySummarizedExperiment* and *tidybulk*. We will use data from the *airway* package, which comes from the paper by [@himes2014rna]. It includes 8 samples from human airway smooth muscle cells, from 4 cell lines. For each cell line treated (with dexamethasone) and untreated (negative control) a sample has undergone RNA sequencing and gene counts have been generated. ```{r} # load airway RNA sequencing data data(airway) # take a look airway ``` The data in the airway package is a Bioconductor *SummarizedExperiment* object. For more information see [here](https://bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html). The `tidySummarizedExperiment` package enables any *SummarizedExperiment* object to be displayed and manipulated according to tidy data principles, without affecting any *SummarizedExperiment* behaviour. If we load the *tidySummarizedExperiment* package and then view the airway data it now displays as a tibble. A tibble is the tidyverse table format. ```{r message=FALSE, warning=FALSE} # load tidySummarizedExperiment package library(tidySummarizedExperiment) # take a look airway ``` Now we can more easily see the data. The airway object contains information about genes and samples, the first column has the Ensembl gene identifier, the second column has the sample identifier and the third column has the gene transcription abundance. The abundance is the number of reads aligning to the gene in each experimental sample. The remaining columns include sample information. The dex column tells us whether the samples are treated or untreated and the cell column tells us what cell line they are from. We can still interact with the tidy SummarizedExperiment object using commands for SummarizedExperiment objects. ```{r} assays(airway) ``` ### Tidyverse commands And now we can also use tidyverse commands, such as `filter`, `select` and `mutate` to explore the tidy SummarizedExperiment object. Some examples are shown below and more can be seen at the tidySummarizedExperiment website [here](https://stemangiola.github.io/tidySummarizedExperiment/articles/introduction.html#tidyverse-commands-1). We can also use the tidyverse pipe `%>%`. This 'pipes' the output from the command on the left into the command on the right/below. Using the pipe is not essential but it reduces the amount of code we need to write when we have multiple steps. It also can make the steps clearer and easier to see. For more details on the pipe see [here](https://r4ds.had.co.nz/pipes.html). We can use `filter` to choose rows, for example, to see just the rows for the treated samples. ```{r} airway %>% filter(dex == "trt") ``` We can use `select` to choose columns, for example, to see the sample, cell line and treatment columns. ```{r} airway %>% select(sample, cell, dex) ``` We can combine group_by and summarise to calculate the total number of rows (transcripts) for each sample. ```{r} airway %>% group_by(sample) %>% summarise(n=n()) ``` We can use `mutate` to create a column. For example, we could create a new sample_name column that contains shorter sample names. We can remove the SRR1039 prefix that's present in all of the samples, as shorter names can fit better in some of the plots we will create. We can use `mutate` together with `str_replace` to remove the SRR1039 string from the sample column. ```{r} airway %>% mutate(sample_name=str_remove(sample, "SRR1039")) %>% # select columns to view select(sample, sample_name) ``` ```{challenge_1 class.source="challenge"} Challenge: Which sample has the largest no. of genes with zero counts? Select one of the multiple choice options in www.menti.com (code in Google doc). ``` ## Setting up the data We'll now prepare the data for a tidy analysis with tidybulk. Note that while here we use tidy SummarizedExperiment input, tidybulk also works with tibble and data.frame input. We'll set up the airway data for our RNA sequencing analysis. We'll create a column with shorter sample names and a column with gene symbols. We can get the gene symbols for these Ensembl gene ids using the Bioconductor annotation package for human, `org.Hs.eg.db`. ```{r} # setup data workflow counts <- airway %>% mutate(sample_name = str_remove(sample, "SRR1039")) %>% mutate(symbol = AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db, keys = feature, keytype = "ENSEMBL", column = "SYMBOL", multiVals = "first" )) # take a look counts ``` ## Pre-processing We filter out lowly expressed genes using tidybulk `keep_abundant` or `identify_abundant`. These functions can use the *edgeR* `filterByExpr` function described in [@law2016rna] to automatically identify the genes with adequate abundance for differential expression testing. We also scale counts to normalise for technical and composition differences between the samples using tidybulk `scale_abundance` Here we will use the default edgeR's trimmed mean of M values (TMM), [@robinson2010scaling]. ```{r} # Filtering counts counts_scaled <- counts %>% keep_abundant(factor_of_interest = dex) %>% scale_abundance() # take a look counts_scaled ``` After we run, we should see some columns have been added at the end. The `counts_scaled` column contains the scaled counts. We can visualise the difference in abundance densities before and after scaling. As tidybulk output is compatible with tidyverse, we can simply pipe it into standard tidyverse functions such as `filter`, `pivot_longer` and `ggplot`. We can also take advantage of ggplot's `facet_wrap` to easily create multiple plots. ```{r out.width = "70%"} counts_scaled %>% # Reshaping pivot_longer(cols = c("counts", "counts_scaled"), names_to = "source", values_to = "abundance") %>% # Plotting ggplot(aes(x = abundance + 1, color = sample_name)) + geom_density() + facet_wrap(~source) + scale_x_log10() + custom_theme ``` In this dataset, the distributions of the counts are not very different to each other before scaling, but scaling does make the distributions more similar. If we saw a sample with a very different distribution, we may need to investigate it. As tidybulk smoothly integrates with ggplot2 and other tidyverse packages it can save on typing and make plots easier to generate. Compare the code for creating density plots with tidybulk versus standard base R below (standard code adapted from [@law2016rna]). **tidybulk** ```{r eval=FALSE} # tidybulk airway %>% keep_abundant(factor_of_interest = dex) %>% scale_abundance() %>% pivot_longer(cols = c("counts", "counts_scaled"), names_to = "source", values_to = "abundance") %>% ggplot(aes(x = abundance + 1, color = sample)) + geom_density() + facet_wrap(~source) + scale_x_log10() + custom_theme ``` **base R using edgeR** ```{r eval=FALSE} # Example code, no need to run # Prepare data set library(edgeR) dgList <- SE2DGEList(airway) group <- factor(dgList$samples$dex) keep.exprs <- filterByExpr(dgList, group = group) dgList <- dgList[keep.exprs, , keep.lib.sizes = FALSE] nsamples <- ncol(dgList) logcounts <- log2(dgList$counts) # Setup graphics col <- RColorBrewer::brewer.pal(nsamples, "Paired") par(mfrow = c(1, 2)) # Plot raw counts plot(density(logcounts[, 1]), col = col[1], lwd = 2, ylim = c(0, 0.26), las = 2, main = "", xlab = "") title(main = "Counts") for (i in 2:nsamples) { den <- density(logcounts[, i]) lines(den$x, den$y, col = col[i], lwd = 2) } legend("topright", legend = dgList$samples$Run, text.col = col, bty = "n") # Plot scaled counts dgList_norm <- calcNormFactors(dgList) lcpm_n <- cpm(dgList_norm, log = TRUE) plot(density(lcpm_n[, 1]), col = col[1], lwd = 2, ylim = c(0, 0.26), las = 2, main = "", xlab = "") title("Counts scaled") for (i in 2:nsamples) { den <- density(lcpm_n[, i]) lines(den$x, den$y, col = col[i], lwd = 2) } legend("topright", legend = dgList_norm$samples$Run, text.col = col, bty = "n") ``` ## Exploratory analyses ### Dimensionality reduction By far, one of the most important plots we make when we analyse RNA sequencing data are principal-component analysis (PCA) or multi-dimensional scaling (MDS) plots. We reduce the dimensions of the data to identify the greatest sources of variation in the data. A principal components analysis is an example of an unsupervised analysis, where we don't need to specify the groups. If your experiment is well controlled and has worked well, what we hope to see is that the greatest sources of variation in the data are the treatments/groups we are interested in. It is also an incredibly useful tool for quality control and checking for outliers. We can use the `reduce_dimensions` function to calculate the dimensions. ```{r} # Get principal components counts_scal_PCA <- counts_scaled %>% reduce_dimensions(method = "PCA") ``` This joins the result to the counts' object. ```{r} # Take a look counts_scal_PCA ``` For plotting, we can select just the sample-wise information with `pivot_sample`. ```{r} # take a look counts_scal_PCA %>% pivot_sample() ``` We can now plot the reduced dimensions. ```{r out.width = "70%"} # PCA plot counts_scal_PCA %>% pivot_sample() %>% ggplot(aes(x = PC1, y = PC2, colour = dex, shape = cell)) + geom_point() + geom_text_repel(aes(label = sample_name), show.legend = FALSE) + custom_theme ``` The samples group by treatment on PC1 which is what we hope to see. PC2 separates the N080611 cell line from the other samples, indicating a greater difference between that cell line and the others. ## Differential expression *tidybulk* integrates several popular methods for differential transcript abundance testing: the edgeR quasi-likelihood [@chen2016reads] (tidybulk default method), edgeR likelihood ratio [@mccarthy2012differential], limma-voom [@law2014voom] and DESeq2 [@love2014moderated]. A common question researchers have is which method to choose. With tidybulk, we can easily run multiple methods and compare. We give `test_differential_abundance` our tidybulk counts object and a formula, specifying the column that contains our groups to be compared. If all our samples were from the same cell line, and there were no additional factors contributing variance, such as batch differences, we could use the formula `~ dex`. However, each treated and untreated sample is from a different cell line, so we add the cell line as an additional factor `~ dex + cell`. ```{r message=FALSE} de_all <- counts_scal_PCA %>% # edgeR QLT test_differential_abundance( ~ dex + cell, method = "edgeR_quasi_likelihood", prefix = "edgerQLT_" ) %>% # edgeR LRT test_differential_abundance( ~ dex + cell, method = "edgeR_likelihood_ratio", prefix = "edgerLR_" ) %>% # limma-voom test_differential_abundance( ~ dex + cell, method = "limma_voom", prefix = "voom_" ) %>% # DESeq2 test_differential_abundance( ~ dex + cell, method = "deseq2", prefix = "deseq2_" ) # take a look de_all ``` This outputs the columns from each method such as log-fold change (logFC), false-discovery rate (FDR) and probability value (p-value). logFC is log2(treated/untreated). ### Comparison of methods We can visually compare the significance for all methods. We will notice that there is some difference between the methods. ```{r message=FALSE} de_all %>% pivot_transcript() %>% select(edgerQLT_PValue, edgerLR_PValue, voom_P.Value, deseq2_pvalue, feature) %>% ggpairs(1:4) ``` ```{challenge_2 class.source="challenge"} Challenge: Which method detects the largest no. of differentially abundant transcripts, p value adjusted for multiple testing < 0.05 (FDR, adj.P.Val, padj)? Select one of the multiple choice options in www.menti.com (code in Google doc). ``` Tidybulk enables a simplified way of performing an RNA sequencing differential expression analysis (with the benefit of smoothly integrating with ggplot2 and other tidyverse packages). Compare the code for a tidybulk edgeR analysis versus standard edgeR below. **standard edgeR** ```{r eval=FALSE} # Example code, no need to run library(edgeR) dgList <- SE2DGEList(airway) group <- factor(dgList$samples$dex) keep.exprs <- filterByExpr(dgList, group = group) dgList <- dgList[keep.exprs, , keep.lib.sizes = FALSE] dgList <- calcNormFactors(dgList) cell <- factor(dgList$samples$cell) design <- model.matrix(~ group + cell) dgList <- estimateDisp(dgList, design) fit <- glmQLFit(dgList, design) qlf <- glmQLFTest(fit, coef=2) ``` ## Volcano plots Volcano plots are a useful genome-wide tool for checking that the analysis looks good. Volcano plots enable us to visualise the significance of change (p-value) versus the fold change (logFC). Highly significant genes are towards the top of the plot. We can also colour significant genes, e.g. genes with false-discovery rate \< 0.05. To decide which genes are differentially expressed, we usually take a cut-off of 0.05 on the FDR (or adjusted P-value), NOT the raw p-value. This is because we are testing many genes (multiple testing), and the chances of finding differentially expressed genes are very high when you do that many tests. Hence we need to control the false discovery rate, the adjusted p-value column in the results table. That is, if 100 genes are significant at a 5% false discovery rate, we are willing to accept that 5 will be false positives. ```{r out.width = "70%"} # run default edgeR method counts_de <- counts_scal_PCA %>% test_differential_abundance(~ dex + cell) # volcano plot, minimal counts_de %>% pivot_transcript() %>% ggplot(aes(x = logFC, y = PValue, colour = FDR < 0.05)) + geom_point() + scale_y_continuous(trans = "log10_reverse") + custom_theme ``` We'll extract the symbols for a few top genes (by P value) to use in a more informative volcano plot, integrating some of the packages in tidyverse. ```{r} topgenes_symbols <- counts_de %>% pivot_transcript() %>% arrange(PValue) %>% head(6) %>% pull(symbol) topgenes_symbols ``` ```{r out.width = "70%", warning=FALSE} counts_de %>% pivot_transcript() %>% # Subset data mutate(significant = FDR < 0.05 & abs(logFC) >= 2) %>% mutate(symbol = ifelse(symbol %in% topgenes_symbols, as.character(symbol), "")) %>% # Plot ggplot(aes(x = logFC, y = PValue, label = symbol)) + geom_point(aes(color = significant, size = significant, alpha = significant)) + geom_text_repel() + # Custom scales custom_theme + scale_y_continuous(trans = "log10_reverse") + scale_color_manual(values = c("black", "#e11f28")) + scale_size_discrete(range = c(0, 2)) ``` ## Automatic bibliography Tidybulk provides a handy function called `get_bibliography` that keeps track of the references for the methods used in your tidybulk workflow. The references are in BibTeX format and can be imported into your reference manager. ```{r} get_bibliography(counts_de) ``` ## Cell type composition analysis If we are sequencing tissue samples, we may want to know what cell types are present and if there are differences in expression between them. `tidybulk` has a `deconvolve_cellularity` function that can help us do this. For this example, we will use a subset of the breast cancer dataset from [The Cancer Genome Atlas (TCGA)](https://www.cancer.gov/tcga). ```{r} BRCA <- bioc2021tidytranscriptomics::BRCA BRCA ``` With tidybulk, we can easily infer the proportions of cell types within a tissue using one of several published methods (Cibersort [@newman2015robust], EPIC [@racle2017simultaneous] and llsr [@abbas2009deconvolution]). Here we will use Cibersort which provides a default signature called LM22 to define the cell types. LM22 contains 547 genes that identify 22 human immune cell types. ```{r} BRCA_cell_type <- BRCA %>% deconvolve_cellularity(prefix="cibersort: ", cores = 1) BRCA_cell_type ``` Cell type proportions are added to the tibble as new columns. The prefix makes it easy to reshape the data frame if needed, for visualisation or further analyses. ```{r} BRCA_cell_type_long <- BRCA_cell_type %>% pivot_sample() %>% # Reshape pivot_longer( contains("cibersort"), names_prefix = "cibersort: ", names_to = "cell_type", values_to = "proportion" ) BRCA_cell_type_long ``` We can plot the proportions of immune cell types for each patient. ```{r} BRCA_cell_type_long %>% # Plot proportions ggplot(aes(x = sample, y = proportion, fill = cell_type)) + geom_bar(stat = "identity") + custom_theme ``` ```{challenge_3 class.source="challenge"} Challenge: What is the most abundant cell type overall in BRCA samples? Select one of the multiple choice options in www.menti.com (code in Google doc). ``` ## Key Points - Bulk RNA sequencing data can be represented and analysed in a 'tidy' way using tidySummarizedExperiment, tidybulk and the tidyverse. - tidySummarizedExperiment enables us to visualise and manipulate a Bioconductor SummarizedExperiment object as if it were in tidy data format. - `tidybulk` allows streamlined multi-method analyses - `tidybulk` allow easy analyses of cell-type composition # Part 2 Single-cell RNA sequencing with tidySingleCellExperiment A typical single-cell RNA sequencing workflow is shown in [Background information](https://stemangiola.github.io/bioc2021_tidytranscriptomics/articles/background.html#differences-between-bulk-and-single-cell-rna-sequencing-1). We don't have time in this workshop to go into depth on each step but you can read more about single-cell RNA sequencing workflows in the online book [Orchestrating Single-Cell Analysis with Bioconductor](http://bioconductor.org/books/release/OSCA/index.html). In Part 1, we showed how we can study the cell-type composition of a biological sample using bulk RNA sequencing. Single-cell sequencing enables a more direct estimation of cell-type composition and gives a greater resolution. For bulk RNA sequencing, we need to infer the cell types using the abundance of transcripts in the whole sample. With single-cell RNA sequencing, we can directly measure the transcripts in each cell and then classify the cells into cell types. ```{r message=FALSE, warning=FALSE} # load packages library(ggplot2) library(purrr) library(scater) library(scran) library(igraph) library(batchelor) library(SingleR) library(scuttle) library(EnsDb.Hsapiens.v86) library(celldex) library(ggbeeswarm) library(tidySingleCellExperiment) ``` ## Introduction to tidySingleCellExperiment The single-cell RNA sequencing data used here is 3000 cells in total, subsetted from 20 samples from 10 peripheral blood mononuclear cell (PBMC) datasets. The datasets are from GSE115189/SRR7244582 [@Freytag2018], SRR11038995 [@Cai2020, SCP345 (singlecell.broadinstitute.org), SCP424 [@Ding2020], SCP591 [@Karagiannis2020] and 10x-derived 6K and 8K datasets (support.10xgenomics.com/). The data is in [SingleCellExperiment](https://www.bioconductor.org/packages/release/bioc/vignettes/SingleCellExperiment/inst/doc/intro.html) format [@Amezquita2019]. SingleCellExperiment is a very popular container of single-cell RNA sequencing data. Similar to what we saw with tidySummarizedExperiment, `tidySingleCellExperiment` package enables any *SingleCellExperiment* object to be displayed and manipulated according to tidy data principles without affecting any *SingleCellExperiment* behaviour. If we load the *tidySingleCellExperiment* package and then view the PBMC data, it displays as a tibble. ```{r} # load pbmc single cell RNA sequencing data pbmc <- bioc2021tidytranscriptomics::pbmc # take a look pbmc ``` It can be interacted with using [SingleCellExperiment commands](https://bioconductor.org/packages/release/bioc/vignettes/SingleCellExperiment/inst/doc/intro.html) such as `assayNames`. ```{r} assayNames(pbmc) ``` We can also interact with our object as we do with any tidyverse tibble. ### Tidyverse commands And now we can also use tidyverse commands, such as `filter`, `select` and `mutate` to explore the tidySingleCellExperiment object. Some examples are shown below and more can be seen at the tidySingleCellExperiment website [here](https://stemangiola.github.io/tidySingleCellExperiment/articles/introduction.html#tidyverse-commands-1). We can also use the tidyverse pipe `%>%`. This 'pipes' the output from the command on the left into the command on the right/below. Using the pipe is not essential but it reduces the amount of code we need to write when we have multiple steps. It also can make the steps clearer and easier to see. For more details on the pipe see [here](https://r4ds.had.co.nz/pipes.html). We can use `filter` to choose rows, for example, to see just the rows for the cells in G1 cell-cycle stage. ```{r} pbmc %>% filter(ident == "G1") ``` We can use `select` to choose columns, for example, to see the sample, cell, total cellular RNA ```{r} pbmc %>% select(cell, nCount_RNA , ident) ``` We can use `mutate` to create a column. For example, we could create a new `ident_l` column that contains a lower-case varsion of `ident`. ```{r} pbmc %>% mutate(ident_l=tolower(ident)) %>% # select columns to view select(ident, ident_l) ``` ## Join datasets We can join datasets as if they were tibbles ```{r} pbmc %>% bind_rows(pbmc) ``` ## Setting up the data In this case, we want to polish an annotation column. We will extract the sample, dataset and group information from the file name column into separate columns. ```{r} # First take a look at the file column pbmc %>% select(file) ``` ```{r} # Create columns for sample, dataset and groups pbmc <- pbmc %>% # Extract sample and group extract(file, "sample", "../data/([a-zA-Z0-9_]+)/outs.+", remove = FALSE) %>% # Extract data source extract(file, c("dataset", "groups"), "../data/([a-zA-Z0-9_]+)_([0-9])/outs.+") # Take a look pbmc %>% select(sample, dataset, groups) ``` ## Quality control A key quality control step performed in single-cell analyses is the assessment of the proportion of mitochondrial transcripts. A high mitochondrial count indicates cell death, and it is useful for filtering cells in a dying state. We'll first show the mitochondrial analysis for one of the 10 datasets. ```{r} one_dataset <- pbmc %>% filter(dataset =="GSE115189") ``` We get the chromosomal location for each gene in the dataset so we can identify the mitochondrial genes. We'll get a warning that some of the ids don't find a match, but it should be just a small proportion. ```{r} location <- mapIds( EnsDb.Hsapiens.v86, keys = rownames(one_dataset), column = "SEQNAME", keytype = "SYMBOL" ) ``` Next we calculate the mitchondrial content for each cell in the dataset. ```{r} mito_info_one_dataset <- perCellQCMetrics(one_dataset, subsets = list(Mito = which(location == "MT"))) mito_info_one_dataset ``` We then label the cells with high mitochondrial content as outliers. ```{r} mito_info_one_dataset <- mito_info_one_dataset %>% # Converting to tibble as_tibble(rownames = "cell") %>% # Label cells with high mitochondrial content mutate(high_mitochondrion = isOutlier(subsets_Mito_percent, type = "higher")) mito_info_one_dataset ``` Finally, we join the mitochondrial information back to the original data so we will be able to filter out the cells with high mitochondrial content. ```{r} mito_info_one_dataset <- left_join(one_dataset, mito_info_one_dataset, by = "cell") mito_info_one_dataset ``` The steps above perform the mitochondrial QC analysis for one dataset. To analyse all 10 datasets, we could repeat the steps for each dataset or create a function and apply it to each dataset. However, a more efficient way is to use nesting. We can nest our data by dataset. ```{r} pbmc = bioc2021tidytranscriptomics::mito_info_all_datasets ``` We can use tidyverse to reshape the data and create [beeswarm plots](http://www.cbs.dtu.dk/~eklund/beeswarm/) to visualise the mitochondrial content. ```{r, warning=FALSE} pbmc %>% # Reshaping pivot_longer(c(detected, sum, subsets_Mito_percent)) %>% ggplot(aes( x = dataset, y = value, color = high_mitochondrion, alpha = high_mitochondrion, size = high_mitochondrion )) + # Plotting geom_quasirandom() + facet_wrap(~name, scale = "free_y") + # Customisation scale_color_manual(values = c("black", "#e11f28")) + scale_size_discrete(range = c(0, 2)) + theme_bw() + theme(axis.text.x = element_text(angle = 50, hjust = 1, vjust = 1)) ``` In the faceted plot, "detected" is the number of genes in each of the 10 datasets, "sum" is the total counts. We then proceed to filter out cells with high mitochondrial content. ```{r} pbmc <- pbmc %>% filter(!high_mitochondrion) ``` ## Scaling and Integrating As we are working with multiple datasets, we need to integrate them and adjust for technical variability between them. Here we'll nest by dataset (batch), normalise within each batch with `multiBatchNorm` and correct for batch effects with `fastMNN`. ```{r preprocess SingleCellExperiment, message=FALSE, warning=FALSE, eval = FALSE} # Scaling within each dataset pbmc <- pbmc %>% # Define batch mutate(batch = dataset) %>% nest(data = -batch) %>% # Normalisation mutate(data = multiBatchNorm(data)) %>% # Integration pull(data) %>% fastMNN() %>% # Join old information left_join(pbmc %>% as_tibble()) ``` ```{r} # Load the integrated dataset pbmc = bioc2021tidytranscriptomics::pbmc_integrated ``` ## Identify clusters We proceed with identifying cell clusters. ```{r cluster} # Assign clusters to the 'colLabels' # of the SingleCellExperiment object colLabels(pbmc) <- # from SingleCellExperiment pbmc %>% buildSNNGraph(use.dimred="corrected") %>% # from scran - shared nearest neighbour cluster_walktrap() %$% # from igraph membership %>% as.factor() # Reorder columns pbmc %>% select(label, everything()) ``` Thanks to tidySingleCellExperiment we can interrogate the object with tidyverse commands and use `count` to count the number of cells in each cluster. ## Reduce dimensions Besides PCA which is a linear dimensionality reduction, we can apply neighbour aware methods such as UMAP, to better define locally similar cells. ```{r umap SingleCellExperiment 2, warning=FALSE} # Calculate UMAP with scater pbmc %>% runUMAP(ncomponents = 2, dimred="corrected") # from scater ``` ```{challenge_4 class.source="challenge"} Challenge: Is the variability of the 1st UMAP dimension when calculating 2 components (ncomponents = 2) equal/more/less than when calculating 3 components? Select one of the multiple choice options in www.menti.com (code in Google doc). ``` **Tip: your can use as_tibble() to convert the tibble abstraction to a simple (and light) cell-wise tibble** We can calculate the first 3 UMAP dimensions using the scater framework. ```{r umap SingleCellExperiment 3, warning=FALSE} # Calculate UMAP with scater pbmc <- pbmc %>% runUMAP(ncomponents = 3, dimred="corrected") # from scater ``` And we can plot the clusters as a 3D plot using plotly. This time we are colouring by estimated cluster labels to visually check the cluster labels. ```{r umap plot 2, eval=FALSE} pbmc %>% plot_ly( x = ~`UMAP1`, y = ~`UMAP2`, z = ~`UMAP3`, colors = friendly_cols[1:10], color = ~label, size=0.5 ) ``` ```{r, echo=FALSE} knitr::include_graphics("../inst/vignettes/plotly_2.png") ``` ## Key Points - Some basic steps of a single-cell RNA sequencing analysis are dimensionality reduction and cluster identification - `tidySingleCellExperiment` is an invisible layer that operates on a `SingleCellExperiment` object and enables us to visualise and manipulate data as if it were a tidy data frame. - `tidySingleCellExperiment` object is a `SingleCellExperiment object` so it can be used with any `SingleCellExperiment` compatible method # Contributing If you want to suggest improvements for this workshop or ask questions, you can do so as described [here](https://github.com/stemangiola/bioc2021_tidytranscriptomics/blob/master/CONTRIBUTING.md). # Reproducibility Record package and version information with `sessionInfo` ```{r} sessionInfo() ``` # References