---
title: "Introduction to Tidy Transcriptomics"
author:
  - Maria Doyle, Peter MacCallum Cancer Centre^[<maria.doyle at petermac.org>]
  - Stefano Mangiola, Walter and Eliza Hall Institute^[<mangiola.s at wehi.edu.au>]
output: rmarkdown::html_vignette
bibliography: "`r file.path(system.file(package='monashMay5tidytranscriptomics', 'vignettes'), 'tidytranscriptomics.bib')`"
vignette: >
  %\VignetteIndexEntry{Introduction to Tidy Transcriptomics}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Talk introduction

This talk will present how to perform analysis of RNA sequencing data following the tidy data paradigm [@wickham2014tidy]. The tidy data paradigm provides a standard way to organise data values within a dataset. Each variable is a column, each observation is a row, and data is manipulated using an easy-to-understand vocabulary. Most importantly, the data structure remains consistent across manipulation and analysis functions. [tidybulk](https://stemangiola.github.io/tidybulk/), [tidySummarizedExperiment](https://stemangiola.github.io/tidySummarizedExperiment/), [tidySingleCellExperiment](https://stemangiola.github.io/tidySingleCellExperiment/) and [tidyseurat](https://stemangiola.github.io/tidyseurat/) packages can achieve this consistency for RNA sequencing data. These packages are part of the tidytranscriptomics suite that introduces a tidy approach to RNA sequencing data representation and analysis. The roadmap for the development of the tidytranscriptomics packages is shown in the figure below.

```{r, echo=FALSE, out.width = "600px"}
knitr::include_graphics("../inst/vignettes/roadmap.png")
```

# Part 1 Bulk RNA sequencing with tidySummarizedExperiment and tidybulk

```{r, echo=FALSE, out.width = "100px"}
knitr::include_graphics("../inst/vignettes/tidybulk_logo.png")
```

**Acknowledgements**

Some of the material in Part 1 was adapted from an R for RNA sequencing talk first run [here](http://combine-australia.github.io/2016-05-11-RNAseq/). The use of the airway dataset was inspired by the [DESeq2 vignette](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html).

## Introduction

While mapping and counting are important and necessary tasks, today, we will be starting from the count data and showing how differential expression analysis can be performed in a friendly way using the Bioconductor package, tidybulk.

First, let's load all the packages we will need to analyse the data.

*Note: you should load the tidybulk* and *tidySummarizedExperiment* libraries after the tidyverse core packages for best integration.

```{r message=FALSE, warning=FALSE}
# dataset
library(airway)

# tidyverse core packages
library(tibble)
library(dplyr)
library(tidyr)
library(readr)
library(stringr)
library(ggplot2)
library(purrr)

# tidyverse-friendly packages
library(plotly)
library(ggrepel)
library(GGally)
library(tidyHeatmap)
library(tidybulk)
# library(tidySummarizedExperiment) # we'll load this below to show what it can do
```

```{r echo=FALSE}
# Use colourblind-friendly colours
friendly_cols <- dittoSeq::dittoColors()

# Set theme
custom_theme <-
  list(
    scale_fill_manual(values = friendly_cols),
    scale_color_manual(values = friendly_cols),
    theme_bw() +
      theme(
        panel.border = element_blank(),
        axis.line = element_line(),
        text = element_text(size = 12),
        legend.position = "bottom",
        strip.background = element_blank()
      )
  )
```

## tidySummarizedExperiment

Here we will perform our analysis using the data from the *airway* package. The airway data comes from the paper by [@himes2014rna], and it includes 8 samples from human airway smooth muscle cells from 4 cell lines. For each cell line treated (with dexamethasone) and untreated (negative control), a sample has undergone RNA sequencing, and gene counts have been generated.

```{r}
# load airway RNA sequencing data
data(airway)

# take a look
airway
```

The data in the airway package is a Bioconductor *SummarizedExperiment* object. For more information, see [here](https://bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html).

The package `tidySummarizedExperiment` enables any *SummarizedExperiment* object to be displayed and manipulated according to tidy data principles without affecting any *SummarizedExperiment* behaviour.

If we load the *tidySummarizedExperiment* package and then view the airway data, it now displays as a tibble. A tibble is the tidyverse table format.

```{r message=FALSE, warning=FALSE}
# load tidySummarizedExperiment package
library(tidySummarizedExperiment)

# take a look
airway
```

Now we can more easily see the data. The airway object contains information about genes and samples. The first column has the Ensembl gene identifier, the second column has the sample identifier, and the third column has the gene transcription abundance. The abundance is the number of reads aligning to the gene in each experimental sample. The remaining columns include sample information. The dex column tells us whether the samples are treated or untreated, and the cell column tells us what cell line they are from.

We can still interact with the tidy SummarizedExperiment object using commands for SummarizedExperiment objects.

```{r}
assays(airway)
```

### Tidyverse commands

And now we can also use tidyverse commands, such as `filter`, `select` and `mutate` to explore the tidy SummarizedExperiment object. Some examples are shown below, and more can be seen at the tidySummarizedExperiment website [here](https://stemangiola.github.io/tidySummarizedExperiment/articles/introduction.html#tidyverse-commands-1). We can also use the tidyverse pipe `%>%`. This 'pipes' the output from the command on the left into the command on the right/below. Using the pipe is not essential, but it reduces the amount of code we need to write when we have multiple steps. It also can make the steps clearer and easier to see. For more details on the pipe, see [here](https://r4ds.had.co.nz/pipes.html).

We can use `filter` to choose rows, for example, to see just the rows for the treated samples.

```{r}
airway %>% filter(dex == "trt")
```

We can use `select` to choose columns, for example, to see the sample, cell line and treatment columns.

```{r}
airway %>% select(sample, cell, dex)
```

We can use `mutate` to create a column. For example, we could create a new sample_name column that contains shorter sample names. We can remove the SRR1039 prefix present in all samples, as shorter names can fit better in some of the plots we will create. We can use `mutate` together with `str_replace` to remove the SRR1039 string from the sample column.

```{r}
airway %>%
  mutate(sample_name=str_remove(sample, "SRR1039")) %>%
  # select columns to view    
  select(sample, sample_name)
```

### Setting up the data

We'll set up the airway data for our RNA sequencing analysis. We'll create a column with shorter sample names and a column with gene symbols. We can get the gene symbols for these Ensembl gene ids using the Bioconductor annotation package for human, `org.Hs.eg.db`.

```{r}
# setup data workflow
counts <-
  airway %>%
  mutate(sample_name = str_remove(sample, "SRR1039")) %>%
  mutate(symbol = AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db,
    keys = feature,
    keytype = "ENSEMBL",
    column = "SYMBOL",
    multiVals = "first"
  ))
# take a look
counts
```

## tidybulk

+-----------------------------------+------------------------------------------------------------------------------+
| Function                          | Description                                                                  |
+===================================+==============================================================================+
| `identify_abundant`               | identify the abundant genes                                                  |
+-----------------------------------+------------------------------------------------------------------------------+
| `aggregate_duplicates`            | Aggregate abundance and annotation of duplicated transcripts in a robust way |
+-----------------------------------+------------------------------------------------------------------------------+
| `scale_abundance`                 | Scale (normalise) abundance for RNA sequencing depth                         |
+-----------------------------------+------------------------------------------------------------------------------+
| `reduce_dimensions`               | Perform dimensionality reduction (PCA, MDS, tSNE)                            |
+-----------------------------------+------------------------------------------------------------------------------+
| `cluster_elements`                | Labels elements with cluster identity (kmeans, SNN)                          |
+-----------------------------------+------------------------------------------------------------------------------+
| `remove_redundancy`               | Filter out elements with highly correlated features                          |
+-----------------------------------+------------------------------------------------------------------------------+
| `adjust_abundance`                | Remove known unwanted variation (Combat)                                     |
+-----------------------------------+------------------------------------------------------------------------------+
| `test_differential_abundance`     | Differential transcript abundance testing (DE)                               |
+-----------------------------------+------------------------------------------------------------------------------+
| `deconvolve_cellularity`          | Estimated tissue composition (Cibersort or llsr)                             |
+-----------------------------------+------------------------------------------------------------------------------+
| `test_differential_cellularity`   | Differential cell-type abundance testing                                     |
+-----------------------------------+------------------------------------------------------------------------------+
| `test_stratification_cellularity` | Estimate Kaplan-Meier survival differences                                   |
+-----------------------------------+------------------------------------------------------------------------------+
| `keep_variable`                   | Filter for top variable features                                             |
+-----------------------------------+------------------------------------------------------------------------------+
| `keep_abundant`                   | Filter out lowly abundant transcripts                                        |
+-----------------------------------+------------------------------------------------------------------------------+
| `test_gene_enrichment`            | Gene enrichment analyses (EGSEA)                                             |
+-----------------------------------+------------------------------------------------------------------------------+
| `test_gene_overrepresentation`    | Gene enrichment on list of transcript names (no rank)                        |
+-----------------------------------+------------------------------------------------------------------------------+
| `test_gene_rank`                  | Gene enrichment on ranked list of transcript names (GSEA)                    |
+-----------------------------------+------------------------------------------------------------------------------+

+----------------------------+-----------------------------------------------------------------+
| Utilities                  | Description                                                     |
+============================+=================================================================+
| `get_bibliography`         | Get the bibliography of your workflow                           |
+----------------------------+-----------------------------------------------------------------+
| `tidybulk`                 | add tidybulk attributes to a tibble object                      |
+----------------------------+-----------------------------------------------------------------+
| `tidybulk_SAM_BAM`         | Convert SAM BAM files into tidybulk tibble                      |
+----------------------------+-----------------------------------------------------------------+
| `pivot_sample`             | Select sample-wise columns/information                          |
+----------------------------+-----------------------------------------------------------------+
| `pivot_transcript`         | Select transcript-wise columns/information                      |
+----------------------------+-----------------------------------------------------------------+
| `rotate_dimensions`        | Rotate two dimensions of a degree                               |
+----------------------------+-----------------------------------------------------------------+
| `ensembl_to_symbol`        | Add gene symbol from ensembl IDs                                |
+----------------------------+-----------------------------------------------------------------+
| `symbol_to_entrez`         | Add entrez ID from gene symbol                                  |
+----------------------------+-----------------------------------------------------------------+
| `describe_transcript`      | Add gene description from gene symbol                           |
+----------------------------+-----------------------------------------------------------------+
| `impute_missing_abundance` | Impute abundance for missing data points using sample groupings |
+----------------------------+-----------------------------------------------------------------+
| `fill_missing_abundance`   | Fill abundance for missing data points using an arbitrary value |
+----------------------------+-----------------------------------------------------------------+

### Filtering lowly transcribed genes

Genes with very low counts across all libraries provide little evidence for differential expression, and they can interfere with some of the statistical approximations that are used later in the pipeline. They also add to the multiple testing burden when estimating false discovery rates, reducing power to detect differentially expressed genes. These genes should be filtered out prior to further analysis.

We can perform the filtering using tidybulk `keep_abundant` or `identify_abundant`. These functions can use the *edgeR* `filterByExpr` function described in [@law2016rna] to automatically identify the genes with adequate abundance for differential expression testing. By default, this will keep genes with \~10 counts in a minimum number of samples, the number of the samples in the smallest group. In this dataset the smallest group size is four (as we have four dex-treated samples versus four untreated). Alternatively, we could use `identify_abundant` to identify which genes are abundant or not (TRUE/FALSE), rather than just keeping the abundant ones.

```{r}
# Filtering counts
counts_filtered <- counts %>% keep_abundant(factor_of_interest = dex)

# take a look
counts_filtered
```

After running `keep_abundant` we have a column called `.abundant` containing TRUE (`identify_abundant` would have TRUE/FALSE).

### Scaling counts to normalise

Scaling of counts, normalisation, is performed to eliminate uninteresting differences between samples due to sequencing depth or composition. A more detailed explanation can be found [here](https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html). In the tidybulk package, the function `scale_abundance` generates scaled counts, with scaling factors calculated on abundant (filtered) transcripts and applied to all transcripts. We can choose from different normalisation methods. Here we will use the default, edgeR's trimmed mean of M values (TMM), [@robinson2010scaling]. TMM normalisation (and most scaling normalisation methods) scale relative to one sample.

```{r}
# Scaling counts
counts_scaled <- counts_filtered %>% scale_abundance()

# take a look
counts_scaled
```

After we run `scale_abundance` we should see some columns have been added at the end. The column contains the scaled counts.

We can visualise the difference in abundance densities before and after scaling. As tidybulk output is compatible with tidyverse, we can simply pipe it into standard tidyverse functions such as `filter`, `pivot_longer` and `ggplot`. We can also take advantage of ggplot's `facet_wrap` to easily create multiple plots.

```{r out.width = "70%"}
counts_scaled %>%

  # Reshaping
  pivot_longer(cols = c("counts", "counts_scaled"), names_to = "source", values_to = "abundance") %>%

  # Plotting
  ggplot(aes(x = abundance + 1, color = sample_name)) +
  geom_density() +
  facet_wrap(~source) +
  scale_x_log10() +
  custom_theme
```

In this dataset, the distributions of the counts are not very different to each other before scaling but scaling does make the distributions more similar. If we saw a sample with a very different distribution, we may need to investigate it.

As tidybulk smoothly integrates with ggplot2 and other tidyverse packages it can save on typing and make plots easier to generate. Compare the code for creating density plots with tidybulk versus standard base R below (standard code adapted from [@law2016rna]).

**tidybulk**

```{r eval=FALSE}
# tidybulk
airway %>%
  keep_abundant(factor_of_interest = dex) %>%
  scale_abundance() %>%
  pivot_longer(cols = c("counts", "counts_scaled"), names_to = "source", values_to = "abundance") %>%
  ggplot(aes(x = abundance + 1, color = sample)) +
  geom_density() +
  facet_wrap(~source) +
  scale_x_log10() +
  custom_theme
```

**base R using edgeR**

```{r eval=FALSE}
# Example code, no need to run

# Prepare data set
library(edgeR)
dgList <- SE2DGEList(airway)
group <- factor(dgList$samples$dex)
keep.exprs <- filterByExpr(dgList, group = group)
dgList <- dgList[keep.exprs, , keep.lib.sizes = FALSE]
nsamples <- ncol(dgList)
logcounts <- log2(dgList$counts)

# Setup graphics
col <- RColorBrewer::brewer.pal(nsamples, "Paired")
par(mfrow = c(1, 2))

# Plot raw counts
plot(density(logcounts[, 1]), col = col[1], lwd = 2, ylim = c(0, 0.26), las = 2, main = "", xlab = "")
title(main = "Counts")
for (i in 2:nsamples) {
  den <- density(logcounts[, i])
  lines(den$x, den$y, col = col[i], lwd = 2)
}
legend("topright", legend = dgList$samples$Run, text.col = col, bty = "n")

# Plot scaled counts
dgList_norm <- calcNormFactors(dgList)
lcpm_n <- cpm(dgList_norm, log = TRUE)
plot(density(lcpm_n[, 1]), col = col[1], lwd = 2, ylim = c(0, 0.26), las = 2, main = "", xlab = "")
title("Counts scaled")
for (i in 2:nsamples) {
  den <- density(lcpm_n[, i])
  lines(den$x, den$y, col = col[i], lwd = 2)
}
legend("topright", legend = dgList_norm$samples$Run, text.col = col, bty = "n")
```

### Dimensionality reduction

By far, one of the most important plots we make when we analyse RNA sequencing data are principal-component analysis (PCA) or multi-dimensional scaling (MDS) plots. We reduce the dimensions of the data to identify the greatest sources of variation in the data. A principal components analysis is an example of an unsupervised analysis, where we don't need to specify the groups. If your experiment is well controlled and has worked well, we hope to see that the greatest sources of variation in the data are the treatments/groups we are interested in. It is also a handy tool for quality control and checking for outliers. We can use the `reduce_dimensions` function to calculate the dimensions.

```{r}
# Get principal components
counts_scal_PCA <-
  counts_scaled %>%
  reduce_dimensions(method = "PCA")
```

This joins the result to the counts object.

```{r}
# Take a look
counts_scal_PCA
```

For plotting, we can select just the sample-wise information with `pivot_sample`.

```{r}
# take a look
counts_scal_PCA %>% pivot_sample()
```

We can now plot the reduced dimensions.

```{r out.width = "70%"}
# PCA plot
counts_scal_PCA %>%
  pivot_sample() %>%
  ggplot(aes(x = PC1, y = PC2, colour = dex, shape = cell)) +
  geom_point() +
  geom_text_repel(aes(label = sample_name), show.legend = FALSE) +
  custom_theme
```

The samples separate by treatment on PC1 which is what we hope to see. PC2 separates the N080611 cell line from the other samples, indicating a greater difference between that cell line and the others.

### Hierarchical clustering with heatmaps

An alternative to principal component analysis for examining relationships between samples is using hierarchical clustering. Heatmaps are a nice visualisation to examine hierarchical clustering of your samples. tidybulk has a simple function we can use, `keep_variable`, to extract the most variable genes which we can then plot with tidyHeatmap.

```{r out.width = "70%"}
counts_scal_PCA %>%

  # extract 500 most variable genes
  keep_variable(.abundance = counts_scaled, top = 500) %>%
  as_tibble() %>%
  
  # create heatmap
  heatmap(
    .column = sample_name,
    .row = feature,
    .value = counts_scaled,
    transform = log1p
  ) %>%
  add_tile(dex) %>%
  add_tile(cell)
```

In the heatmap we can see the samples cluster into two groups, treated and untreated, for three of the cell lines, and the cell line (N080611) again is further away from the others.

Tidybulk enables a simplified way of generating a clustered heatmap of variable genes. Compare the code below for tidybulk versus a base R method.

**base R using edgeR**

```{r eval=FALSE}
# Example code, no need to run

library(edgeR)
dgList <- SE2DGEList(airway)
group <- factor(dgList$samples$dex)
keep.exprs <- filterByExpr(dgList, group = group)
dgList <- dgList[keep.exprs, , keep.lib.sizes = FALSE]
dgList <- calcNormFactors(dgList)
logcounts <- cpm(dgList, log = TRUE)
var_genes <- apply(logcounts, 1, var)
select_var <- names(sort(var_genes, decreasing = TRUE))[1:500]
highly_variable_lcpm <- logcounts[select_var, ]
colours <- c("#440154FF", "#21908CFF", "#fefada")
col.group <- c("red", "grey")[group]
gplots::heatmap.2(highly_variable_lcpm, col = colours, trace = "none", ColSideColors = col.group, scale = "row")
```

### Differential expression

*tidybulk* integrates several popular methods for differential transcript abundance testing: the edgeR quasi-likelihood [@chen2016reads] (tidybulk default method), edgeR likelihood ratio [@mccarthy2012differential], limma-voom [@law2014voom] and DESeq2 [@love2014moderated]. A common question researchers have is which method to choose. With tidybulk, we can easily run multiple methods and compare.

We give `test_differential_abundance` our tidybulk counts object and a formula, specifying the column that contains our groups to be compared. If all our samples were from the same cell line, and there were no additional factors contributing variance such as batch differences, we could use the formula `~ dex`. However, each treated and untreated sample is from a different cell line so we add the cell line as an additional factor `~ dex + cell`.

```{r message=FALSE}
de_all <-

  counts_scal_PCA %>%

  # edgeR QLT
  test_differential_abundance(
    ~ dex + cell,
    method = "edgeR_quasi_likelihood",
    prefix = "edgerQLT_"
  ) %>%

  # edgeR LRT
  test_differential_abundance(
    ~ dex + cell,
    method = "edgeR_likelihood_ratio",
    prefix = "edgerLR_"
  ) %>%

  # limma-voom
  test_differential_abundance(
    ~ dex + cell,
    method = "limma_voom",
    prefix = "voom_"
  ) %>%

  # DESeq2
  test_differential_abundance(
    ~ dex + cell,
    method = "deseq2",
    prefix = "deseq2_"
  )

# take a look

de_all
```

This outputs the columns from each method such as log-fold change (logFC), false-discovery rate (FDR) and probability value (p-value). logFC is log2(treated/untreated).

#### Comparison of methods

We can visually compare the significance for all methods. We will notice that there is some difference between the methods.

```{r message=FALSE}
de_all %>%
  pivot_transcript() %>%
  select(edgerQLT_PValue, edgerLR_PValue, voom_P.Value, deseq2_pvalue, feature) %>%
  ggpairs(1:4)
```

#### Single method

If we just wanted to run one differential testing method we could do that. The default method is edgeR quasi-likelihood.

```{r}
counts_de <- counts_scal_PCA %>%
  test_differential_abundance(~ dex + cell)
```

Tidybulk enables a simplified way of performing an RNA sequencing differential expression analysis (with the benefit of smoothly integrating with ggplot2 and other tidyverse packages). Compare the code for a tidybulk edgeR analysis versus standard edgeR below.

**standard edgeR**

```{r eval=FALSE}
# Example code, no need to run

library(edgeR)
dgList <- SE2DGEList(airway)
group <- factor(dgList$samples$dex)
keep.exprs <- filterByExpr(dgList, group = group)
dgList <- dgList[keep.exprs, , keep.lib.sizes = FALSE]
dgList <- calcNormFactors(dgList)
cell <- factor(dgList$samples$cell)
design <- model.matrix(~ group + cell)
dgList <- estimateDisp(dgList, design)
fit <- glmQLFit(dgList, design)
qlf <- glmQLFTest(fit, coef=2)
```

#### Volcano plots

Volcano plots are a useful genome-wide plot for checking that the analysis looks good. Volcano plots enable us to visualise the significance of change (p-value) versus the fold change (logFC). Highly significant genes are towards the top of the plot. We can also colour significant genes e.g. genes with false-discovery rate \< 0.05. In order to decide which genes are differentially expressed, we usually take a cut-off of 0.05 on the FDR (or adjusted P-value), NOT the raw p-value. This is because we are testing many genes (multiple testing), and the chances of finding differentially expressed genes are very high when you do that many tests. Hence we need to control the false discovery rate, which is the adjusted p-value column in the results table. What this means is that if 100 genes are significant at a 5% false discovery rate, we are willing to accept that 5 will be false positives.

```{r out.width = "70%"}
# volcano plot, minimal
counts_de %>%
  ggplot(aes(x = logFC, y = PValue, colour = FDR < 0.05)) +
  geom_point() +
  scale_y_continuous(trans = "log10_reverse") +
  custom_theme
```

A more informative plot, integrating some of the packages in tidyverse.

We'll extract the symbols for a few top genes (by P value) to use in some of the plots we will make.

```{r}
topgenes_symbols <-
  counts_de %>%
  pivot_transcript() %>%
  arrange(PValue) %>%
  head(6) %>%
  pull(symbol)
```

```{r out.width = "70%", warning=FALSE}
counts_de %>%
  pivot_transcript() %>%

  # Subset data
  mutate(significant = FDR < 0.05 & abs(logFC) >= 2) %>%
  mutate(symbol = ifelse(symbol %in% topgenes_symbols, as.character(symbol), "")) %>%

  # Plot
  ggplot(aes(x = logFC, y = PValue, label = symbol)) +
  geom_point(aes(color = significant, size = significant, alpha = significant)) +
  geom_text_repel() +

  # Custom scales
  custom_theme +
  scale_y_continuous(trans = "log10_reverse") +
  scale_color_manual(values = c("black", "#e11f28")) +
  scale_size_discrete(range = c(0, 2))
```

### Cell type composition analysis

If we are sequencing tissue samples, we may want to know what cell types are present and if there are differences in expression between them. `tidybulk` has a `deconvolve_cellularity` function that can help us do this.

For this example, we will use a subset of the breast cancer dataset from [The Cancer Genome Atlas (TCGA)](https://www.cancer.gov/tcga).

```{r}
BRCA <- monashMay5tidytranscriptomics::BRCA 
BRCA
```

With tidybulk, we can easily infer the proportions of cell types within a tissue using one of several published methods (Cibersort [@newman2015robust], EPIC [@racle2017simultaneous] and llsr [@abbas2009deconvolution]). Here we will use Cibersort, which provides a default signature called LM22 to define the cell types. LM22 contains 547 genes that identify 22 human immune cell types.

```{r}
BRCA_cell_type <-
  BRCA %>%
  deconvolve_cellularity(prefix="cibersort: ", cores = 1)

BRCA_cell_type
```

Cell type proportions are added to the tibble as new columns. The prefix makes it easy to reshape the data frame, if needed, for visualisation or further analyses.

```{r}
BRCA_cell_type_long <-
  BRCA_cell_type %>%
  pivot_sample() %>%
  
  # Reshape
  pivot_longer(
    contains("cibersort"),
    names_prefix = "cibersort: ",
    names_to = "cell_type",
    values_to = "proportion"
  )

BRCA_cell_type_long
```

We can plot the proportions of immune cell types for each patient.

```{r}
BRCA_cell_type_long %>%

  # Plot proportions
  ggplot(aes(x = sample, y = proportion, fill = cell_type)) +
  geom_bar(stat = "identity") +
  custom_theme
```

### Automatic bibliography

Tidybulk provides a handy function called `get_bibliography` that keeps track of the references for the methods used in your tidybulk workflow. The references are in BibTeX format and can be imported into your reference manager.

```{r}
get_bibliography(counts_de)
```

## Key Points

-   Bulk RNA sequencing data can be represented and analysed in a 'tidy' way using tidySummarizedExperiment, tidybulk and the tidyverse.
-   tidySummarizedExperiment enables us to visualise and manipulate a Bioconductor SummarizedExperiment object as if it were in tidy data format.
-   Some of the key steps in an RNA sequencing analysis are filtering lowly abundant transcripts, adjusting for differences in sequencing depth and composition, testing for differential expression, and visualising the data, which can all be performed in a tidy way with tidybulk.
-   `tidybulk` allows streamlined multi-method analyses
-   `tidybulk` allow easy analyses of the cell-type composition

# Part 2 Single-cell RNA sequencing with tidySingleCellExperiment

You can read more about single-cell RNA sequencing workflows in the online book [Orchestrating Single-Cell Analysis with Bioconductor](http://bioconductor.org/books/release/OSCA/index.html).

In Part 1, we showed how we can study the cell-type composition of a biological sample using bulk RNA sequencing. Single-cell sequencing enables a more direct estimation of cell-type composition and gives a greater resolution. For bulk RNA sequencing, we need to infer the cell types using the abundance of transcripts in the whole sample. With single-cell RNA sequencing, we can directly measure the transcripts in each cell and then classify the cells into cell types.

```{r message=FALSE, warning=FALSE}
# load packages
library(ggplot2)
library(purrr)
library(scater)
library(scran)
library(igraph)
library(batchelor)
library(SingleR)
library(scuttle)
library(EnsDb.Hsapiens.v86)
library(celldex)
library(ggbeeswarm)
```

## Introduction to tidySingleCellExperiment

The single-cell RNA sequencing data used here is 3000 cells in total, subsetted from 20 samples from 10 peripheral blood mononuclear cell (PBMC) datasets. The datasets are from GSE115189/SRR7244582 [@Freytag2018], SRR11038995 [@Cai2020, SCP345 (singlecell.broadinstitute.org), SCP424 [@Ding2020], SCP591 [@Karagiannis2020] and 10x-derived 6K and 8K datasets (support.10xgenomics.com/). The data is in [SingleCellExperiment](https://www.bioconductor.org/packages/release/bioc/vignettes/SingleCellExperiment/inst/doc/intro.html) format. SingleCellExperiment is a very popular container of single-cell RNA sequencing data.

Similar to what we saw with tidySummarizedExperiment, `tidySingleCellExperiment` package enables any *SingleCellExperiment* object to be displayed and manipulated according to tidy data principles, without affecting any *SingleCellExperiment* behaviour.

```{r}
library(tidySingleCellExperiment)

# load pbmc single cell RNA sequencing data
pbmc <- monashMay5tidytranscriptomics::pbmc

# take a look
pbmc
```

### Join transcript information

```{r}

pbmc %>%
  join_features(c("CD3G", "CD8A"), shape="wide") %>%
  
  # Select columns for display
  select(cell, CD3G, CD8A, everything())

```

It can be interacted with using [SingleCellExperiment commands](https://bioconductor.org/packages/release/bioc/vignettes/SingleCellExperiment/inst/doc/intro.html) such as `assayNames`.

```{r}
assayNames(pbmc)
```

We can also interact with our object as we do with any tidyverse tibble.

## Setting up the data

In this case we want to polish an annotation column. We will extract the sample, dataset and group information from the file name column into separate columns.

```{r}
# First take a look at the file column
pbmc %>% select(file)
```

```{r}
# Create columns for sample, dataset and groups
pbmc <-
  pbmc %>%

  # Extract sample and group
  extract(file, "sample", "../data/([a-zA-Z0-9_]+)/outs.+", remove = FALSE) %>%

  # Extract data source
  extract(file, c("dataset", "groups"), "../data/([a-zA-Z0-9_]+)_([0-9])/outs.+")

# Take a look
pbmc %>% select(sample, dataset, groups)
```

## Quality control

A key quality control step performed in single-cell analyses is the assessment of the proportion of mitochondrial transcripts. A high mitochondrial count can indicate the cell is dead or dying and is thus poor quality and should be filtered out.

### Nesting

A powerful tool tidySingleCellExperiment enables us to use with our single-cell data is tidyverse's [nest](https://rstudio-education.github.io/tidyverse-cookbook/transform-tables.html#nest-a-data-frame). Nesting allows us to easily perform independent analyses on subsets of the data. For example, our data contains 10 single-cell datasets from different sources. We can nest by dataset and analyse the mitochondrial content for each.

To demonstrate the power of nesting we'll first show the mitochondrial analysis for one of the 10 datasets.

```{r}
one_dataset <- pbmc %>% filter(dataset =="GSE115189")
```

We get the chromosomal location for each gene in the dataset so we can identify the mitochondrial genes. We'll get a warning that some of the ids don't find a match, but it should be just a small proportion.

```{r}
location <- mapIds(
  EnsDb.Hsapiens.v86,
  keys = rownames(one_dataset),
  column = "SEQNAME",
  keytype = "SYMBOL"
)
```

Next we calculate the mitchondrial content for each cell in the dataset.

```{r}
mito_info_one_dataset <- perCellQCMetrics(one_dataset, subsets = list(Mito = which(location == "MT")))

mito_info_one_dataset
```

We then label the cells with high mitochondrial content as outliers.

```{r}
mito_info_one_dataset <- mito_info_one_dataset %>%     
      # Converting to tibble
      as_tibble(rownames = "cell") %>%

      # Label cells with high mitochondrial content
      mutate(high_mitochondrion = isOutlier(subsets_Mito_percent, type = "higher"))

mito_info_one_dataset
```

Finally, we join the mitochondrial information back to the original data so we will be able to filter out the cells with high mitochondrial content.

```{r}
mito_info_one_dataset <- left_join(one_dataset, mito_info_one_dataset, by = "cell")

mito_info_one_dataset
```

The steps above perform the mitochondrial QC analysis for one dataset. To analyse all 10 datasets, we could repeat the steps for each dataset or create a function and apply it to each dataset. However, a more efficient way is to use nesting. We can nest our data by dataset.

```{r, echo=FALSE, out.width = "600px"}
knitr::include_graphics("../inst/vignettes/nesting.png")
```

```{r}
pbmc_nested <- pbmc %>% 
    nest(data = -dataset)

pbmc_nested
```

We will create a table (tibble) with the mitochondrial QC information for each dataset. Here we use `map` which is often used in combination with `nest` to perform commands on nested data.

```{r}
location <- mapIds(
  EnsDb.Hsapiens.v86,
  keys = rownames(pbmc),
  column = "SEQNAME",
  keytype = "SYMBOL"
)

mito_info_all_datasets <- pbmc_nested %>%
    mutate(mitochondrion_info = map(
    data,
    ~ # Calculate mitochondrial statistics
      perCellQCMetrics(.x, subsets = list(Mito = which(location == "MT"))) %>%
      
      # Convert to tibble
      as_tibble(rownames = "cell") %>%

      # Label cells with high mitochondrial content
      mutate(high_mitochondrion = isOutlier(subsets_Mito_percent, type = "higher"))
  ))
  
mito_info_all_datasets
```

As before, we join the mitochondrial information to the original data.

```{r}
  # Join the mitochondrial information to the original SingleCellExperiment data
 mito_info_all_datasets <- mito_info_all_datasets %>%
    mutate(data = map2(
    data, mitochondrion_info,
    ~ left_join(.x, .y, by = "cell")
  )) %>%
  
  # Remove the separate mitochondrial information table
  select(-mitochondrion_info)

mito_info_all_datasets
```

Remove the nesting to get back one table containing all datasets.

```{r}
pbmc <- mito_info_all_datasets %>%
  unnest(data)

pbmc
```

Nesting has allowed us to perform the QC on each dataset in an efficient way.

We can use tidyverse to reshape the data and create [beeswarm plots](http://www.cbs.dtu.dk/~eklund/beeswarm/) to visualise the mitochondrial content.

```{r, warning=FALSE}
pbmc %>%

  # Reshaping
  pivot_longer(c(detected, sum, subsets_Mito_percent)) %>%
  ggplot(aes(
    x = dataset, y = value,
    color = high_mitochondrion,
    alpha = high_mitochondrion,
    size = high_mitochondrion
  )) +

  # Plotting
  geom_quasirandom() +
  facet_wrap(~name, scale = "free_y") +

  # Customisation
  scale_color_manual(values = c("black", "#e11f28")) +
  scale_size_discrete(range = c(0, 2)) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 50, hjust = 1, vjust = 1))
```

In the faceted plot, detected is the number of genes in each of the 10 datasets, the sum is total counts.

We then proceed to filter out cells with high mitochondrial content.

```{r}
pbmc <- pbmc %>% filter(!high_mitochondrion)
```

## Scaling and Integrating

As we are working with multiple datasets, we need to integrate them and adjust for technical variability between them. Here we'll nest by dataset (batch), normalise within each batch with `multiBatchNorm` and correct for batch effects with `fastMNN`.

```{r preprocess SingleCellExperiment, message=FALSE, warning=FALSE}
# Scaling within each dataset
pbmc <-
  pbmc %>%

  # Define batch
  mutate(batch = dataset) %>%
  nest(data = -batch) %>%

  # Normalisation   
  mutate(data = multiBatchNorm(data)) %>%

  # Integration
  pull(data) %>%
  fastMNN() 
```

## Identify clusters

We proceed with identifying cell clusters.

```{r cluster}
# Assign clusters to the 'colLabels'
# of the SingleCellExperiment object
colLabels(pbmc) <- # from SingleCellExperiment
  pbmc %>%
  buildSNNGraph(use.dimred="corrected") %>% # from scran - shared nearest neighbour
  cluster_walktrap() %$% # from igraph
  membership %>%
  as.factor()

# Reorder columns
pbmc %>% select(label, everything())
```

Thanks to tidySingleCellExperiment we can interrogate the object with tidyverse commands and use `count` to count the number of cells in each cluster.

```{r cluster count SingleCellExperiment}
pbmc %>% count(label)
```

## Reduce dimensions

Besides PCA which is a linear dimensionality reduction, we can apply neighbour aware methods such as UMAP, to better define locally similar cells. We can calculate the first 3 UMAP dimensions using the scater framework.

```{r umap SingleCellExperiment, warning=FALSE}
# Calculate UMAP with scater
pbmc <-
  pbmc %>%
  runUMAP(ncomponents = 3, dimred="corrected") # from scater
```

And we can plot the clusters as a 3D plot using plotly. This time we are colouring by estimated cluster labels to visually check the cluster labels.

```{r umap plot 2, eval=FALSE}
pbmc %>%
  plot_ly(
    x = ~`UMAP1`,
    y = ~`UMAP2`,
    z = ~`UMAP3`,
    colors = friendly_cols[1:10],
    color = ~label,
    size=0.5 
  )
```

```{r, echo=FALSE}
knitr::include_graphics("../inst/vignettes/plotly_2.png")
```

## Join datasets

We can join datasets as if theywere tibbles

```{r}
pbmc %>% bind_rows(pbmc)
```
## Pseudobulk analyses

It is sometimes useful to aggregate cell-wise transcript abundance into pseudobulk samples. It is possible to explore data and perform hypothesis testing with tools and data-source that we are more familiar with. For example, we can use edgeR in tidybulk to perform differential expression testing. For more details on pseudobulk analysis, see [here](https://hbctraining.github.io/scRNA%20sequencing/lessons/pseudobulk_DESeq2_scrnaseq.html).

### Data exploration using pseudobulk samples

To do this, we will load a helper function called `aggregate_cells` from [GitHub Gist](https://gist.github.com/stemangiola/ebce5a3f931b298611b56680d39c073c) and use it to create a group for each sample.

```{r warning=FALSE, message=FALSE}
# Load aggregate function
devtools::source_gist("889e4089a0e5ca2649d3c9164e3b542e")
# Aggregate
pbmc_bulk = 
  monashMay5tidytranscriptomics::pbmc %>% 
  aggregate_cells(file) 
pbmc_bulk
```

```{r}
pbmc_bulk %>%
  
  # Tidybulk operations
  tidybulk::identify_abundant() %>%
  tidybulk::scale_abundance()
```

## tidyseurat

```{r}
library(SeuratObject)

data(pbmc_small)

pbmc_small
```

With `tidyseurat` we obtain the consistent tidy abstraction, effectively unifying Bioconductor- and Seurat-based containers for display and manipulation.

```{r, message=FALSE, warning=FALSE}

library(tidyseurat)
```

```{r}
pbmc_small
```

## Key Points

-   Some basic steps of a single-cell RNA sequencing analysis are dimensionality reduction, cluster identification and cell-type classification
-   `tidySingleCellExperiment` is an invisible layer that operates on a `SingleCellExperiment` object and enables us to visualise and manipulate data as if it were a tidy data frame.
-   `tidySingleCellExperiment` object is a `SingleCellExperiment object` so it can be used with any `SingleCellExperiment` compatible method
- `tidyseurat` behaves consistently with `tidySingleCellExperiment`

# Contributing

If you want to suggest improvements for this talk or ask questions, you can do so as described [here](https://github.com/stemangiola/monashMay5_tidytranscriptomics/blob/master/CONTRIBUTING.md).

# Reproducibility

Record package and version information with `sessionInfo`

```{r}
sessionInfo()
```

# References