---
title: "Tidy genomic and transcriptomic single-cell analyses"
author:
  - Maria Doyle, Peter MacCallum Cancer Centre^[<maria.doyle at petermac.org>]
  - Stefano Mangiola, Walter and Eliza Hall Institute^[<mangiola.s at wehi.edu.au>]
  - Michael Love, UNC-Chapel Hill^[<michaelisaiahlove at gmail.com>]
output: rmarkdown::html_vignette
bibliography: "`r file.path(system.file(package='tidyomicsWorkshopBioc2023', 'vignettes'), 'tidyomics.bib')`"
vignette: >
  %\VignetteIndexEntry{Tidy genomic and transcriptomic single-cell analyses}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, cache = FALSE)
```

## Instructors

*Dr. Stefano Mangiola* is currently a Postdoctoral researcher in the
laboratory of Prof. Tony Papenfuss at the Walter and Eliza Hall
Institute in Melbourne, Australia. His background spans from
biotechnology to bioinformatics and biostatistics. His research
focuses on prostate and breast tumour microenvironment, the
development of statistical models for the analysis of RNA sequencing
data, and data analysis and visualisation interfaces. 

*Dr. Michael Love* is an Associate Professor at UNC-Chapel Hill in the
Department of Genetics and the Department of Biostatistics. His
background is in Statistics and Computational Biology. His research
is highly collaborative, working with Geneticists, Molecular
Biologists, Epidemiologists, and Computer Scientists on methods for
transcriptomics, epigenetics, GWAS, and high-throughput functional
screens.

## Workshop goals and objectives

### What you will learn

-   Basic `tidy` operations possible with `tidySingleCellExperiment`
    and `GRanges`
-   How to interface `SingleCellExperiment` with tidy manipulation and
    visualisation 
-   A real-world case study that will showcase the power of `tidy`
    single-cell methods compared with base/ad-hoc methods
-   Examples of how to integrate genomic and transcriptomic data
    (ChIP-seq and RNA-seq)

### What you will *not* learn

-   The molecular technology of single-cell sequencing
-   The fundamentals of single-cell data analysis
-   The fundamentals of tidy data analysis
-   Detailed data integration methods (multi-view or multi-omics
    algorithms)

## Getting started

### Local

We will use the Cloud during the workshop and this method is available
if you want to run the material after the workshop. If you want to
install on your own computer, see instructions 
[here](https://tidyomics.github.io/tidyomicsWorkshopBioc2023/index.html#workshop-package-installation).

Alternatively, you can view the material at the workshop webpage 
[here](https://tidyomics.github.io/tidyomicsWorkshopBioc2023/articles/main.html).

## Introduction slides

We provide two links to introductory material (to consult outside of
the workshop slot):

1. [Introduction to tidytranscriptomics](https://docs.google.com/gview?url=https://raw.githubusercontent.com/tidyomics/tidyomicsWorkshopBioc2023/master/inst/tidytranscriptomics_slides.pdf)
2. [Introduction to tidy genomics](https://github.com/tidyomics/tidy-genomics-talk/blob/main/tidy-genomics-talk.pdf)

# Part 1 Introduction to tidySingleCellExperiment

```{r message = FALSE}
# Load packages
library(SingleCellExperiment)
library(ggplot2)
library(plotly)
library(dplyr)
library(colorspace)
library(dittoSeq)
```

SingleCellExperiment is a popular data object in the Bioconductor
ecosystem for representing single-cell datasets, which works
seamlessly with numerous Bioconductor packages and methods written by
different groups [@Amezquita2020]. It can easily be converted to and
from other formats such as SeuratObject and scverse's AnnData.

Here we load single-cell data in SingleCellExperiment object
format. This data is peripheral blood mononuclear cells (PBMCs) from
metastatic breast cancer patients.

```{r}
# load single cell RNA sequencing data
sce_obj <- tidyomicsWorkshopBioc2023::sce_obj

# take a look
sce_obj
```

tidySingleCellExperiment provides a bridge between the
SingleCellExperiment single-cell package and the tidyverse
[@wickham2019welcome]. It creates an invisible layer that enables
viewing the SingleCellExperiment object as a tidyverse tibble, and
provides SingleCellExperiment-compatible *dplyr*, *tidyr*, *ggplot*
and *plotly* functions.

For related work for bulk RNA-seq, see the 
[tidybulk](https://stemangiola.github.io/tidybulk/) package [@Mangiola2021].

If we load the *tidySingleCellExperiment* package and then view the
single cell data, it now displays as a tibble. 

```{r message = FALSE}
library(tidySingleCellExperiment)

sce_obj
```

If we want to revert to the standard SingleCellExperiment view we can do that.

```{r}
options("restore_SingleCellExperiment_show" = TRUE)
sce_obj
```

If we want to revert back to tidy SingleCellExperiment view we can.

```{r}
options("restore_SingleCellExperiment_show" = FALSE)
sce_obj
```

It can be interacted with using 
[SingleCellExperiment commands](https://bioconductor.org/packages/devel/bioc/vignettes/SingleCellExperiment/inst/doc/intro.html) 
such as `assays`.

```{r}
assays(sce_obj)
```

We can also interact with our object as we do with any tidyverse tibble.

## Tidyverse commands

We can use tidyverse commands, such as `filter`, `select` and `mutate`
to explore the tidySingleCellExperiment object. Some examples are
shown below and more can be seen at the tidySingleCellExperiment
website
[here](https://stemangiola.github.io/tidySingleCellExperiment/articles/introduction.html#tidyverse-commands-1).

We can use `filter` to choose rows, for example, to see just the rows for the cells in G1 cell-cycle stage.

```{r}
sce_obj |> filter(Phase == "G1")
```

:::: {.note}
Note that *rows* in this context refers to rows of the abstraction,
not *rows* of the SingleCellExperiment which correspond to
genes. *tidySingleCellExperiment* prioritizes cells as the units of
observation in the abstraction, while the full dataset, including
measurements of expression of all genes, is still available "in the
background".
::::

We can use `select` to view columns, for example, to see the filename,
total cellular RNA abundance and cell phase. 

If we use `select` we will also get any view-only columns returned,
such as the UMAP columns generated during the preprocessing.

```{r}
sce_obj |> select(.cell, file, nCount_RNA, Phase)
```

We can use `mutate` to create a column. For example, we could create a
new `Phase_l` column that contains a lower-case version of `Phase`. 

```{r message=FALSE}
sce_obj |>
  mutate(Phase_l = tolower(Phase)) |>
  select(.cell, Phase, Phase_l)
```

We can use tidyverse commands to polish an annotation column. We will
extract the sample, and group information from the file name column
into separate columns. 

```{r message=FALSE}
# First take a look at the file column
sce_obj |> select(.cell, file)
```

```{r}
# Create column for sample
sce_obj <- sce_obj |>
  # Extract sample
  extract(file, "sample", "../data/.*/([a-zA-Z0-9_-]+)/outs.+", remove = FALSE)

# Take a look
sce_obj |> select(.cell, sample, everything())
```

We could use tidyverse `unite` to combine columns, for example to
create a new column for sample id combining the sample and patient id
(BCB) columns.

```{r message=FALSE}
sce_obj <- sce_obj |> unite("sample_id", sample, BCB, remove = FALSE)

# Take a look
sce_obj |> select(.cell, sample_id, sample, BCB)
```


# Part 2 Signature visualisation

## Data pre-processing

The object `sce_obj` we've been using was created as part of a study
on breast cancer systemic immune response. Peripheral blood
mononuclear cells have been sequenced for RNA at the single-cell
level. The steps used to generate the object are summarised below. 

-   `scran`, `scater`, and `DropletsUtils` packages have been used to
    eliminate empty droplets and dead cells. Samples were individually
    quality checked and cells were filtered for good gene coverage .

-   Variable features were identified using `modelGeneVar`.

-   Read counts were scaled and normalised using logNormCounts from `scuttle`.

-   Data integration was performed using `fastMNN` with default parameters.

-   PCA performed to reduce feature dimensionality.

-   Nearest-neighbor cell networks were calculated using 30 principal components.

-   2 UMAP dimensions were calculated using 30 principal components.

-   Cells with similar transcriptome profiles were grouped into clusters using Louvain clustering from `scran`.

## Analyse custom signature

The researcher analysing this dataset wanted to identify gamma delta T
cells using a gene signature from a published paper
[@Pizzolato2019]. We'll show how that can be done here. 

With tidySingleCellExperiment's `join_features` we can view the counts
for genes in the signature as columns joined to our single cell
tibble. 

```{r}
sce_obj |>
  join_features(c("CD3D", "TRDC", "TRGC1", "TRGC2", "CD8A", "CD8B"), shape = "wide")
```

We can use tidyverse `mutate` to create a column containing the
signature score. To generate the score, we scale the sum of the 4
genes, CD3D, TRDC, TRGC1, TRGC2, and subtract the scaled sum of the 2
genes, CD8A and CD8B. `mutate` is powerful in enabling us to perform
complex arithmetic operations easily. 

```{r}
sce_scored <- sce_obj |>
    
  join_features(c("CD3D", "TRDC", "TRGC1", "TRGC2", "CD8A", "CD8B"), shape = "wide") |>
    
  mutate(
    signature_score =
      scales::rescale(CD3D + TRDC + TRGC1 + TRGC2, to = c(0, 1)) -
        scales::rescale(CD8A + CD8B, to = c(0, 1))
  )

sce_scored |>
  
  select(.cell, signature_score, everything())
```

The gamma delta T cells could then be visualised by the signature
score using Bioconductor's visualisation functions. 

```{r}
sce_scored |>
    
  scater::plotUMAP(colour_by = "signature_score")
```

The cells could also be visualised using the popular and powerful
`ggplot2` package, enabling the researcher to use ggplot functions
they were familiar with, and to customise the plot with great
flexibility. 

```{r}
sce_scored |>
    
  # plot cells with high score last so they're not obscured by other cells
  arrange(signature_score) |>
    
  ggplot(aes(UMAP_1, UMAP_2, color = signature_score)) +
  geom_point() +
  scale_color_distiller(palette = "Spectral") +
  tidyomicsWorkshopBioc2023::theme_multipanel
```

For exploratory analyses, we can select the gamma delta T cells, the
red cluster on the left with high signature score. We'll filter for
cells with a signature score > 0.7. 

```{r}
sce_obj_gamma_delta <-
    
  sce_scored |>
    
	# Proper cluster selection should be used instead (see supplementary material)
  filter(signature_score > 0.7)
```

For comparison, we show the alternative using base R and
SingleCellExperiment. Note that the code contains more redundancy and
intermediate objects. 

```{r eval=FALSE}
counts_positive <-
  assay(sce_obj, "logcounts")[c("CD3D", "TRDC", "TRGC1", "TRGC2"), ] |>
  colSums() |>
  scales::rescale(to = c(0, 1))

counts_negative <-
  assay(sce_obj, "logcounts")[c("CD8A", "CD8B"), ] |>
  colSums() |>
  scales::rescale(to = c(0, 1))

sce_obj$signature_score <- counts_positive - counts_negative

sce_obj_gamma_delta <- sce_obj[, sce_obj$signature_score > 0.7]
```

We can then focus on just these gamma delta T cells and chain
Bioconductor and tidyverse commands together to analyse.

```{r warning=FALSE, message=FALSE}
library(batchelor)
library(scater)

sce_obj_gamma_delta <-
    
  sce_obj_gamma_delta |>
    
  # Integrate - using batchelor.
  multiBatchNorm(batch = colData(sce_obj_gamma_delta)$sample) |>
  fastMNN(batch = colData(sce_obj_gamma_delta)$sample) |>
    
  # Join metadata removed by fastMNN - using tidyverse
  left_join(as_tibble(sce_obj_gamma_delta)) |>
    
  # Dimension reduction - using scater
  runUMAP(ncomponents = 2, dimred = "corrected")
```

Visualise gamma delta T cells. As we have used rough threshold we are
left with only few cells. Proper cluster selection should be used
instead (see supplementary material). 

```{r}	
sce_obj_gamma_delta |> plotUMAP()	
```


It is also possible to visualise the cells as a 3D plot using plotly.
The example data used here only contains a few genes, for the sake of
time and size in this demonstration, but below is how you could
generate the 3 dimensions needed for 3D plot with a full dataset.

```{r eval = FALSE}
single_cell_object |>
  RunUMAP(dims = 1:30, n.components = 3L, spread = 0.5, min.dist = 0.01, n.neighbors = 10L)
```

We'll demonstrate creating a 3D plot using some data that has 3 UMAP
dimensions. This is a fantastic way to visualise both reduced
dimensions and metadata in the same representation.

```{r umap plot 2, message = FALSE, warning = FALSE}
pbmc <- tidyomicsWorkshopBioc2023::sce_obj_UMAP3

pbmc |>
  plot_ly(
    x = ~`UMAP_1`,
    y = ~`UMAP_2`,
    z = ~`UMAP_3`,
    color = ~cell_type,
    colors = dittoSeq::dittoColors()
  ) %>%
  add_markers(size = I(1))
```

## Exercises

Using the `sce_obj`:

1. What proportion of all cells are gamma-delta T cells? Use
   signature_score > 0.7 to identify gamma-delta T cells. 

2. There is a cluster of cells characterised by a low RNA output
   (nCount_RNA < 100). Identify the cell composition (cell_type) of
   that cluster. 

# Part 3 Genomic and transcriptomic data integration

So far we have examined expression of genes across our cells, but we
haven't considered the genomic location of the genes. In this section
we will operate on genomic location information using the *plyranges*
[@Lee2019] package, and tie this to the single cell transcriptomics
data we have been examining.

*plyranges* provides tidyverse-style functionality to genomic ranges
(GRanges objects [@Lawrence2013]) analogous to how
*tidySingleCellExperiment* provides functionality for
SingleCellExperiment objects. For an example of a workflow using
*plyranges* as part of a bulk RNA-seq analysis, see @Lee2020.

:::: {.note}
In some pipelines (e.g. using *tximeta* [@Love2020] to import bulk or
single-cell quantification data) our SingleCellExperiment or
SummarizedExperiment would already have genomic range information
attached to the rows of the object. That is, we would have information
about the starts and ends of the genes, their strand information, even
the lengths of the chromosomes and the genome build, etc.
::::

Before we start our integrative analysis, we will first take some
steps to add hg38 range information on our genes, matching by the gene
symbol provided on the rows of `sce_obj`. We add genes from the
*ensembldb* package [@Rainer2019] (to see how to add a particular
version of Ensembl, see the package vignette).

```{r message=FALSE}
# what our rownames look like: gene symbols
sce_obj |> rownames() |> head()

# we recommend Ensembl or GENCODE gene annotations
edb <- EnsDb.Hsapiens.v86::EnsDb.Hsapiens.v86 # hg38
g <- ensembldb::genes(edb)
head(genome(g))
```

We can examine the first three genes and their metadata:

```{r message=FALSE}
library(plyranges)
g |> slice(1:3)
```

Our first task is to subset `g`, the human genes, to the ones in our
SingleCellExperiment. While Ensembl IDs are unique, gene symbols are
not, so we will have to also remove any duplicate gene symbols 
(recommend working with Ensembl IDs as feature identifiers when
possible).

We subset the columns of `g` to just the `symbol` column, using the
`select()` function. *plyranges* enables us to use familiar tidy verbs
on GRanges objects, e.g. to `filter` rows or to `select` columns of
metadata attached to the ranges.

```{r}
g <- g |>

  select(symbol)

g |> slice(1:3)
```

Now we filter the genes of our SingleCellExperiment to only those
present as rownames of `sce_obj`, and remove duplicate IDs. We then
sort by genomic location, and use the gene symbols as names of the
ranges.

```{r}
gene_names <- sce_obj |> rownames()

g <- g |>
  
  filter(symbol %in% gene_names) |>
  filter(!duplicated(symbol)) |>
  sort() # genomic position sorting

names(g) <- g$symbol
```

Now we can add our gene information to the rows of the
`sce_obj`. 

:::: {.note}
Again, these following steps would not be necessary using a
pipeline that imports quantification data to ranged objects, but it
isn't too hard to add manually. If adding ranges manually, make note
of the provenance of where you downloaded the information (Ensembl,
GENCODE or otherwise), and assign a genome build if you plan on
sharing the final object (e.g. with `genome(x) <- "..."`).
::::

```{r}
all(names(g) %in% rownames(sce_obj)) 
sce_sub <- sce_obj[ names(g), ] # put SCE in order of `g`
rowRanges(sce_sub) <- g # assign ranges `g` to the SCE
sce_sub |> rowRanges() # peek at the ranges
```

Now let's do something interesting with the gene ranges: let's see if
genes near peaks of active chromatin marks (H3K4me3 measured with
ChIP-seq) in another experiment involving PBMC have a difference in
their expression level compared to other genes.

There are many sources of epigenetic tracks, ENCODE, Roadmap,
etc. Here we will use some ENCODE [@ENCODE2012] data available in
*AnnotationHub* on Bioconductor.

:::: {.note}
We had to do a bit of book-keeping first. These peaks were in
the hg19 genome build, so we have ran the following commands (here
un-evaluated) to 'lift" the peaks to the hg38 genome build, and
provide proper sequence information.
::::

```{r eval=FALSE}
### un-evaluated code chunk ###
# AnnotationHub contains many useful genome tracks
library(AnnotationHub)
ah <- AnnotationHub()
# can be queried with keywords
query(ah, c("Peripheral_Blood","h3k4me3"))
# downloading a particular resource
peaks <- ah[["AH44823"]]
library(rtracklayer)
# lifting hg19 to hg38
new_peaks <- unlist(
  liftOver(
    peaks,
    chain = import.chain("~/Downloads/hg19ToHg38.over.chain")
  )
)
# Ensembl-style chrom names
seqlevelsStyle(new_peaks) <- "NCBI"
# bring over any missing chroms
seqlevels(new_peaks) <- seqlevels(sce_sub)
# bring over chrom lengths
seqinfo(new_peaks) <- seqinfo(sce_sub)
# subset to a few columns
pbmc_h3k4me3_hg38 <- new_peaks |>
  select(signalValue, qValue, peak)
# save for reloading below
save(pbmc_h3k4me3_hg38, file="data/pbmc_h3k4me3_hg38.rda", compress="xz")
```

Loading those peaks.

```{r}
# loading those H3K4me3 peaks for PBMCs
peaks <- tidyomicsWorkshopBioc2023::pbmc_h3k4me3_hg38
plot(peaks$qValue, type="l", ylab="q-value", main="H3K4me3 peaks")
abline(v=5000, lty=2)
```

We will arbitrarily chose to take the top 5,000 peaks by p-value. How
to chose the number of epigenetic peaks and genes to consider in a
multi-omics analysis is a more involved question, and is worth
considering the impact of such a choice for enrichment analysis.

```{r}
peaks <- peaks |>
  
  slice(1:5000) |> # ordered already by p-value (most to least sig)
  sort() # genomic position sorting
```

It is easy to compute the distance of the genes in our
SingleCellExperiment to the nearest H3K4me3 peak, using *plyranges*
convenience functions.

See the 
[plyranges package website](https://sa-lee.github.io/plyranges/reference/index.html) 
for a full listing of such functions.

```{r}
dists <- rowRanges(sce_sub) |>

  anchor_5p() |> # anchor 5 prime end
  mutate(width=1) |> # shrink range to 1bp
  add_nearest_distance(peaks)

hist(log10(dists$distance + 1), breaks=40,
     main="gene-to-peak distance", xlab="log10 distance")
```

Let's put the genes into 4 different bins by their distance to a
H3K4me3 peak: 1) overlapping [distance of 0], 2) 1bp-10kb, 
3) 10kb-100kb, or 4) farther than 100kb.

```{r}
bins <- c(0,1,1e4,1e5,Inf)

dists <- dists |>
  
  select(symbol, distance, .drop_ranges = TRUE) |> # remove chr/pos/strand etc.
  as_tibble() |>
  mutate(distance_bin = cut(distance, bins, include.lowest = TRUE)) |>
  rename(feature = symbol) # this will help us add information to the SCE
```

We can see how many genes are in which category:

```{r}
dists |>
  
  ggplot(aes(distance_bin)) +
  geom_bar()
```

We will now take the SingleCellExperiment data and compress it down to
a SummarizedExperiment using the `aggregate_cells` function from
*tidySingleCellExperiment*. After this function, every column of the
SummarizedExperiment is a cell type.

We immediately pipe the SummarizedExperiment into a `left_join` where
we add the distance-to-peak information, and then into a `nest`
command where we create a nested (tidy)SummarizedExperiment object,
where we have grouped by the distance bin that the genes fall into.


```{r message=FALSE}
library(tidySummarizedExperiment)
library(purrr)
nested <- sce_sub |>
  
  aggregate_cells(cell_type) |>
  left_join(dists, by="feature") |>
  nest(se = -distance_bin)

nested
```

We can now operate on the SummarizedExperiment objects that are within
`nested`. For example, we can extract the column means of the counts
(cell-type-specific means of counts over genes).

We save this as a new tibble called `smry` as it contains summary
information.

```{r}
smry <- nested |>

  mutate(summary = map(se, \(se) {
    tibble(count_mean = colMeans(assay(se)),                             
           cell_type = se$cell_type)
  })
  ) |>
  select(-se) |>
  unnest(summary) |>
  mutate(cell = substr(cell_type, 1, 13)) # abbreviate cell_type for plot
```

As a final plot, we can look at the mean count over the genes in a
particular distance bin (x-axis), across the cell types (y-axis). The
different lines show how the genes' proximity to H3K4me3 peaks affects
the average count.

```{r}
library(forcats)
smry |>
  
  mutate(cell = fct_reorder(cell, count_mean, .fun=median)) |>
  ggplot(aes(count_mean, cell, color=distance_bin, group=distance_bin)) +
  geom_point() +
  geom_line(orientation="y") +
  xlab("mean of count over genes in bin") +
  ylab("cell type")
```

We finish this section with a re-cap of the steps we took:

1. calculated distances from genes to H3K4me3 peaks
2. added distance information onto the SingleCellExperiment
3. condensed the dataset via pseudobulking
4. computed the mean count (over genes) within 4 bins of genes
5. visualized these mean counts over cell types

Let's consider a number of issues with this first-pass analysis and
visualization:

1. we arbitrarily thresholded the peaks at the beginning of the
   analysis to take the top 5,000
2. we just computed the mean count, not taking into account total
   reads per cell type (sequencing depth per cell x number of cells) 
3. we are comparing cell-type-specific expression to aggregate
   ChIP-seq peaks in PBMC
4. the two experiments, while both PBMC, come from different projects,
   different labs, and one is from cancer patients, while the other is
   from the ENCODE project
5. we only plot the mean count over genes, perhaps it would be better
   to show more information about the distribution

:::: {.note}
As an alternative to (1), we could have counted the number of peaks
that were near each genes, or even computed on the metadata of the
peaks (signal value, etc.). An example follows of how we could have
counted overlaps instead (how many peaks within 20kb).
::::

```{r}
overlaps <- rowRanges(sce_sub) |>
 
  anchor_5p() |>
  mutate(width=1) %>%
  mutate(num_overlaps = count_overlaps(., peaks, maxgap=2e4))

table(overlaps$num_overlaps)
```

:::: {.note}
Another question that often arises is whether the distances or
overlaps we observe are more or less than we would expect if there
were no relationship between the positions of genes and peaks.
To answer questions like these, we have developed the *nullranges*
package, which allows for either bootstrapping of ranges throughout
the genome [@Mu2023], or selection of a covariate-matched background
set [@Davis2023], to  compute probabilities under the null hypothesis.
A brief example follows of bootstrapping some of the peaks in a window
of chr1:20Mb-30Mb (see *nullranges* website for more comprehensive
examples).
::::

```{r}
library(nullranges)

# make a small range for demonstration of bootstrapping
seg <- data.frame(seqnames=1, start=20e6, width=10e6) |>
  as_granges()

# generate a genome segmentation from this one range
seg <- oneRegionSegment(seg, seqlength=seqlengths(peaks)[[1]])
seg

# bootstrap peaks within this segment (just for demo)
peaks |>
  filter_by_overlaps(seg[2]) |>
  bootRanges(blockLength=1e6, seg=seg, R=10)

# these bootstrapped ranges could then be used for computing
# generic test statistics, with the `iter` column used for
# building a null distribution
```

**Session Information**

```{r}
sessionInfo()
```

**References**

```{css echo=FALSE}
.note {
  margin: 30px;
  padding: 1em;
  background: #FFF8F0;
  border: 1px solid #EFE8E0;
  border-radius: 10px;
}
```