--- title: "Tidy genomic and transcriptomic single-cell analyses" author: - Maria Doyle, Peter MacCallum Cancer Centre^[] - Stefano Mangiola, Walter and Eliza Hall Institute^[] - Michael Love, UNC-Chapel Hill^[] output: rmarkdown::html_vignette bibliography: "`r file.path(system.file(package='tidyomicsWorkshopBioc2023', 'vignettes'), 'tidyomics.bib')`" vignette: > %\VignetteIndexEntry{Tidy genomic and transcriptomic single-cell analyses} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, cache = FALSE) ``` ## Instructors *Dr. Stefano Mangiola* is currently a Postdoctoral researcher in the laboratory of Prof. Tony Papenfuss at the Walter and Eliza Hall Institute in Melbourne, Australia. His background spans from biotechnology to bioinformatics and biostatistics. His research focuses on prostate and breast tumour microenvironment, the development of statistical models for the analysis of RNA sequencing data, and data analysis and visualisation interfaces. *Dr. Michael Love* is an Associate Professor at UNC-Chapel Hill in the Department of Genetics and the Department of Biostatistics. His background is in Statistics and Computational Biology. His research is highly collaborative, working with Geneticists, Molecular Biologists, Epidemiologists, and Computer Scientists on methods for transcriptomics, epigenetics, GWAS, and high-throughput functional screens. ## Workshop goals and objectives ### What you will learn - Basic `tidy` operations possible with `tidySingleCellExperiment` and `GRanges` - How to interface `SingleCellExperiment` with tidy manipulation and visualisation - A real-world case study that will showcase the power of `tidy` single-cell methods compared with base/ad-hoc methods - Examples of how to integrate genomic and transcriptomic data (ChIP-seq and RNA-seq) ### What you will *not* learn - The molecular technology of single-cell sequencing - The fundamentals of single-cell data analysis - The fundamentals of tidy data analysis - Detailed data integration methods (multi-view or multi-omics algorithms) ## Getting started ### Local We will use the Cloud during the workshop and this method is available if you want to run the material after the workshop. If you want to install on your own computer, see instructions [here](https://tidyomics.github.io/tidyomicsWorkshopBioc2023/index.html#workshop-package-installation). Alternatively, you can view the material at the workshop webpage [here](https://tidyomics.github.io/tidyomicsWorkshopBioc2023/articles/main.html). ## Introduction slides We provide two links to introductory material (to consult outside of the workshop slot): 1. [Introduction to tidytranscriptomics](https://docs.google.com/gview?url=https://raw.githubusercontent.com/tidyomics/tidyomicsWorkshopBioc2023/master/inst/tidytranscriptomics_slides.pdf) 2. [Introduction to tidy genomics](https://github.com/tidyomics/tidy-genomics-talk/blob/main/tidy-genomics-talk.pdf) # Part 1 Introduction to tidySingleCellExperiment ```{r message = FALSE} # Load packages library(SingleCellExperiment) library(ggplot2) library(plotly) library(dplyr) library(colorspace) library(dittoSeq) ``` SingleCellExperiment is a popular data object in the Bioconductor ecosystem for representing single-cell datasets, which works seamlessly with numerous Bioconductor packages and methods written by different groups [@Amezquita2020]. It can easily be converted to and from other formats such as SeuratObject and scverse's AnnData. Here we load single-cell data in SingleCellExperiment object format. This data is peripheral blood mononuclear cells (PBMCs) from metastatic breast cancer patients. ```{r} # load single cell RNA sequencing data sce_obj <- tidyomicsWorkshopBioc2023::sce_obj # take a look sce_obj ``` tidySingleCellExperiment provides a bridge between the SingleCellExperiment single-cell package and the tidyverse [@wickham2019welcome]. It creates an invisible layer that enables viewing the SingleCellExperiment object as a tidyverse tibble, and provides SingleCellExperiment-compatible *dplyr*, *tidyr*, *ggplot* and *plotly* functions. For related work for bulk RNA-seq, see the [tidybulk](https://stemangiola.github.io/tidybulk/) package [@Mangiola2021]. If we load the *tidySingleCellExperiment* package and then view the single cell data, it now displays as a tibble. ```{r message = FALSE} library(tidySingleCellExperiment) sce_obj ``` If we want to revert to the standard SingleCellExperiment view we can do that. ```{r} options("restore_SingleCellExperiment_show" = TRUE) sce_obj ``` If we want to revert back to tidy SingleCellExperiment view we can. ```{r} options("restore_SingleCellExperiment_show" = FALSE) sce_obj ``` It can be interacted with using [SingleCellExperiment commands](https://bioconductor.org/packages/devel/bioc/vignettes/SingleCellExperiment/inst/doc/intro.html) such as `assays`. ```{r} assays(sce_obj) ``` We can also interact with our object as we do with any tidyverse tibble. ## Tidyverse commands We can use tidyverse commands, such as `filter`, `select` and `mutate` to explore the tidySingleCellExperiment object. Some examples are shown below and more can be seen at the tidySingleCellExperiment website [here](https://stemangiola.github.io/tidySingleCellExperiment/articles/introduction.html#tidyverse-commands-1). We can use `filter` to choose rows, for example, to see just the rows for the cells in G1 cell-cycle stage. ```{r} sce_obj |> filter(Phase == "G1") ``` :::: {.note} Note that *rows* in this context refers to rows of the abstraction, not *rows* of the SingleCellExperiment which correspond to genes. *tidySingleCellExperiment* prioritizes cells as the units of observation in the abstraction, while the full dataset, including measurements of expression of all genes, is still available "in the background". :::: We can use `select` to view columns, for example, to see the filename, total cellular RNA abundance and cell phase. If we use `select` we will also get any view-only columns returned, such as the UMAP columns generated during the preprocessing. ```{r} sce_obj |> select(.cell, file, nCount_RNA, Phase) ``` We can use `mutate` to create a column. For example, we could create a new `Phase_l` column that contains a lower-case version of `Phase`. ```{r message=FALSE} sce_obj |> mutate(Phase_l = tolower(Phase)) |> select(.cell, Phase, Phase_l) ``` We can use tidyverse commands to polish an annotation column. We will extract the sample, and group information from the file name column into separate columns. ```{r message=FALSE} # First take a look at the file column sce_obj |> select(.cell, file) ``` ```{r} # Create column for sample sce_obj <- sce_obj |> # Extract sample extract(file, "sample", "../data/.*/([a-zA-Z0-9_-]+)/outs.+", remove = FALSE) # Take a look sce_obj |> select(.cell, sample, everything()) ``` We could use tidyverse `unite` to combine columns, for example to create a new column for sample id combining the sample and patient id (BCB) columns. ```{r message=FALSE} sce_obj <- sce_obj |> unite("sample_id", sample, BCB, remove = FALSE) # Take a look sce_obj |> select(.cell, sample_id, sample, BCB) ``` # Part 2 Signature visualisation ## Data pre-processing The object `sce_obj` we've been using was created as part of a study on breast cancer systemic immune response. Peripheral blood mononuclear cells have been sequenced for RNA at the single-cell level. The steps used to generate the object are summarised below. - `scran`, `scater`, and `DropletsUtils` packages have been used to eliminate empty droplets and dead cells. Samples were individually quality checked and cells were filtered for good gene coverage . - Variable features were identified using `modelGeneVar`. - Read counts were scaled and normalised using logNormCounts from `scuttle`. - Data integration was performed using `fastMNN` with default parameters. - PCA performed to reduce feature dimensionality. - Nearest-neighbor cell networks were calculated using 30 principal components. - 2 UMAP dimensions were calculated using 30 principal components. - Cells with similar transcriptome profiles were grouped into clusters using Louvain clustering from `scran`. ## Analyse custom signature The researcher analysing this dataset wanted to identify gamma delta T cells using a gene signature from a published paper [@Pizzolato2019]. We'll show how that can be done here. With tidySingleCellExperiment's `join_features` we can view the counts for genes in the signature as columns joined to our single cell tibble. ```{r} sce_obj |> join_features(c("CD3D", "TRDC", "TRGC1", "TRGC2", "CD8A", "CD8B"), shape = "wide") ``` We can use tidyverse `mutate` to create a column containing the signature score. To generate the score, we scale the sum of the 4 genes, CD3D, TRDC, TRGC1, TRGC2, and subtract the scaled sum of the 2 genes, CD8A and CD8B. `mutate` is powerful in enabling us to perform complex arithmetic operations easily. ```{r} sce_scored <- sce_obj |> join_features(c("CD3D", "TRDC", "TRGC1", "TRGC2", "CD8A", "CD8B"), shape = "wide") |> mutate( signature_score = scales::rescale(CD3D + TRDC + TRGC1 + TRGC2, to = c(0, 1)) - scales::rescale(CD8A + CD8B, to = c(0, 1)) ) sce_scored |> select(.cell, signature_score, everything()) ``` The gamma delta T cells could then be visualised by the signature score using Bioconductor's visualisation functions. ```{r} sce_scored |> scater::plotUMAP(colour_by = "signature_score") ``` The cells could also be visualised using the popular and powerful `ggplot2` package, enabling the researcher to use ggplot functions they were familiar with, and to customise the plot with great flexibility. ```{r} sce_scored |> # plot cells with high score last so they're not obscured by other cells arrange(signature_score) |> ggplot(aes(UMAP_1, UMAP_2, color = signature_score)) + geom_point() + scale_color_distiller(palette = "Spectral") + tidyomicsWorkshopBioc2023::theme_multipanel ``` For exploratory analyses, we can select the gamma delta T cells, the red cluster on the left with high signature score. We'll filter for cells with a signature score > 0.7. ```{r} sce_obj_gamma_delta <- sce_scored |> # Proper cluster selection should be used instead (see supplementary material) filter(signature_score > 0.7) ``` For comparison, we show the alternative using base R and SingleCellExperiment. Note that the code contains more redundancy and intermediate objects. ```{r eval=FALSE} counts_positive <- assay(sce_obj, "logcounts")[c("CD3D", "TRDC", "TRGC1", "TRGC2"), ] |> colSums() |> scales::rescale(to = c(0, 1)) counts_negative <- assay(sce_obj, "logcounts")[c("CD8A", "CD8B"), ] |> colSums() |> scales::rescale(to = c(0, 1)) sce_obj$signature_score <- counts_positive - counts_negative sce_obj_gamma_delta <- sce_obj[, sce_obj$signature_score > 0.7] ``` We can then focus on just these gamma delta T cells and chain Bioconductor and tidyverse commands together to analyse. ```{r warning=FALSE, message=FALSE} library(batchelor) library(scater) sce_obj_gamma_delta <- sce_obj_gamma_delta |> # Integrate - using batchelor. multiBatchNorm(batch = colData(sce_obj_gamma_delta)$sample) |> fastMNN(batch = colData(sce_obj_gamma_delta)$sample) |> # Join metadata removed by fastMNN - using tidyverse left_join(as_tibble(sce_obj_gamma_delta)) |> # Dimension reduction - using scater runUMAP(ncomponents = 2, dimred = "corrected") ``` Visualise gamma delta T cells. As we have used rough threshold we are left with only few cells. Proper cluster selection should be used instead (see supplementary material). ```{r} sce_obj_gamma_delta |> plotUMAP() ``` It is also possible to visualise the cells as a 3D plot using plotly. The example data used here only contains a few genes, for the sake of time and size in this demonstration, but below is how you could generate the 3 dimensions needed for 3D plot with a full dataset. ```{r eval = FALSE} single_cell_object |> RunUMAP(dims = 1:30, n.components = 3L, spread = 0.5, min.dist = 0.01, n.neighbors = 10L) ``` We'll demonstrate creating a 3D plot using some data that has 3 UMAP dimensions. This is a fantastic way to visualise both reduced dimensions and metadata in the same representation. ```{r umap plot 2, message = FALSE, warning = FALSE} pbmc <- tidyomicsWorkshopBioc2023::sce_obj_UMAP3 pbmc |> plot_ly( x = ~`UMAP_1`, y = ~`UMAP_2`, z = ~`UMAP_3`, color = ~cell_type, colors = dittoSeq::dittoColors() ) %>% add_markers(size = I(1)) ``` ## Exercises Using the `sce_obj`: 1. What proportion of all cells are gamma-delta T cells? Use signature_score > 0.7 to identify gamma-delta T cells. 2. There is a cluster of cells characterised by a low RNA output (nCount_RNA < 100). Identify the cell composition (cell_type) of that cluster. # Part 3 Genomic and transcriptomic data integration So far we have examined expression of genes across our cells, but we haven't considered the genomic location of the genes. In this section we will operate on genomic location information using the *plyranges* [@Lee2019] package, and tie this to the single cell transcriptomics data we have been examining. *plyranges* provides tidyverse-style functionality to genomic ranges (GRanges objects [@Lawrence2013]) analogous to how *tidySingleCellExperiment* provides functionality for SingleCellExperiment objects. For an example of a workflow using *plyranges* as part of a bulk RNA-seq analysis, see @Lee2020. :::: {.note} In some pipelines (e.g. using *tximeta* [@Love2020] to import bulk or single-cell quantification data) our SingleCellExperiment or SummarizedExperiment would already have genomic range information attached to the rows of the object. That is, we would have information about the starts and ends of the genes, their strand information, even the lengths of the chromosomes and the genome build, etc. :::: Before we start our integrative analysis, we will first take some steps to add hg38 range information on our genes, matching by the gene symbol provided on the rows of `sce_obj`. We add genes from the *ensembldb* package [@Rainer2019] (to see how to add a particular version of Ensembl, see the package vignette). ```{r message=FALSE} # what our rownames look like: gene symbols sce_obj |> rownames() |> head() # we recommend Ensembl or GENCODE gene annotations edb <- EnsDb.Hsapiens.v86::EnsDb.Hsapiens.v86 # hg38 g <- ensembldb::genes(edb) head(genome(g)) ``` We can examine the first three genes and their metadata: ```{r message=FALSE} library(plyranges) g |> slice(1:3) ``` Our first task is to subset `g`, the human genes, to the ones in our SingleCellExperiment. While Ensembl IDs are unique, gene symbols are not, so we will have to also remove any duplicate gene symbols (recommend working with Ensembl IDs as feature identifiers when possible). We subset the columns of `g` to just the `symbol` column, using the `select()` function. *plyranges* enables us to use familiar tidy verbs on GRanges objects, e.g. to `filter` rows or to `select` columns of metadata attached to the ranges. ```{r} g <- g |> select(symbol) g |> slice(1:3) ``` Now we filter the genes of our SingleCellExperiment to only those present as rownames of `sce_obj`, and remove duplicate IDs. We then sort by genomic location, and use the gene symbols as names of the ranges. ```{r} gene_names <- sce_obj |> rownames() g <- g |> filter(symbol %in% gene_names) |> filter(!duplicated(symbol)) |> sort() # genomic position sorting names(g) <- g$symbol ``` Now we can add our gene information to the rows of the `sce_obj`. :::: {.note} Again, these following steps would not be necessary using a pipeline that imports quantification data to ranged objects, but it isn't too hard to add manually. If adding ranges manually, make note of the provenance of where you downloaded the information (Ensembl, GENCODE or otherwise), and assign a genome build if you plan on sharing the final object (e.g. with `genome(x) <- "..."`). :::: ```{r} all(names(g) %in% rownames(sce_obj)) sce_sub <- sce_obj[ names(g), ] # put SCE in order of `g` rowRanges(sce_sub) <- g # assign ranges `g` to the SCE sce_sub |> rowRanges() # peek at the ranges ``` Now let's do something interesting with the gene ranges: let's see if genes near peaks of active chromatin marks (H3K4me3 measured with ChIP-seq) in another experiment involving PBMC have a difference in their expression level compared to other genes. There are many sources of epigenetic tracks, ENCODE, Roadmap, etc. Here we will use some ENCODE [@ENCODE2012] data available in *AnnotationHub* on Bioconductor. :::: {.note} We had to do a bit of book-keeping first. These peaks were in the hg19 genome build, so we have ran the following commands (here un-evaluated) to 'lift" the peaks to the hg38 genome build, and provide proper sequence information. :::: ```{r eval=FALSE} ### un-evaluated code chunk ### # AnnotationHub contains many useful genome tracks library(AnnotationHub) ah <- AnnotationHub() # can be queried with keywords query(ah, c("Peripheral_Blood","h3k4me3")) # downloading a particular resource peaks <- ah[["AH44823"]] library(rtracklayer) # lifting hg19 to hg38 new_peaks <- unlist( liftOver( peaks, chain = import.chain("~/Downloads/hg19ToHg38.over.chain") ) ) # Ensembl-style chrom names seqlevelsStyle(new_peaks) <- "NCBI" # bring over any missing chroms seqlevels(new_peaks) <- seqlevels(sce_sub) # bring over chrom lengths seqinfo(new_peaks) <- seqinfo(sce_sub) # subset to a few columns pbmc_h3k4me3_hg38 <- new_peaks |> select(signalValue, qValue, peak) # save for reloading below save(pbmc_h3k4me3_hg38, file="data/pbmc_h3k4me3_hg38.rda", compress="xz") ``` Loading those peaks. ```{r} # loading those H3K4me3 peaks for PBMCs peaks <- tidyomicsWorkshopBioc2023::pbmc_h3k4me3_hg38 plot(peaks$qValue, type="l", ylab="q-value", main="H3K4me3 peaks") abline(v=5000, lty=2) ``` We will arbitrarily chose to take the top 5,000 peaks by p-value. How to chose the number of epigenetic peaks and genes to consider in a multi-omics analysis is a more involved question, and is worth considering the impact of such a choice for enrichment analysis. ```{r} peaks <- peaks |> slice(1:5000) |> # ordered already by p-value (most to least sig) sort() # genomic position sorting ``` It is easy to compute the distance of the genes in our SingleCellExperiment to the nearest H3K4me3 peak, using *plyranges* convenience functions. See the [plyranges package website](https://sa-lee.github.io/plyranges/reference/index.html) for a full listing of such functions. ```{r} dists <- rowRanges(sce_sub) |> anchor_5p() |> # anchor 5 prime end mutate(width=1) |> # shrink range to 1bp add_nearest_distance(peaks) hist(log10(dists$distance + 1), breaks=40, main="gene-to-peak distance", xlab="log10 distance") ``` Let's put the genes into 4 different bins by their distance to a H3K4me3 peak: 1) overlapping [distance of 0], 2) 1bp-10kb, 3) 10kb-100kb, or 4) farther than 100kb. ```{r} bins <- c(0,1,1e4,1e5,Inf) dists <- dists |> select(symbol, distance, .drop_ranges = TRUE) |> # remove chr/pos/strand etc. as_tibble() |> mutate(distance_bin = cut(distance, bins, include.lowest = TRUE)) |> rename(feature = symbol) # this will help us add information to the SCE ``` We can see how many genes are in which category: ```{r} dists |> ggplot(aes(distance_bin)) + geom_bar() ``` We will now take the SingleCellExperiment data and compress it down to a SummarizedExperiment using the `aggregate_cells` function from *tidySingleCellExperiment*. After this function, every column of the SummarizedExperiment is a cell type. We immediately pipe the SummarizedExperiment into a `left_join` where we add the distance-to-peak information, and then into a `nest` command where we create a nested (tidy)SummarizedExperiment object, where we have grouped by the distance bin that the genes fall into. ```{r message=FALSE} library(tidySummarizedExperiment) library(purrr) nested <- sce_sub |> aggregate_cells(cell_type) |> left_join(dists, by="feature") |> nest(se = -distance_bin) nested ``` We can now operate on the SummarizedExperiment objects that are within `nested`. For example, we can extract the column means of the counts (cell-type-specific means of counts over genes). We save this as a new tibble called `smry` as it contains summary information. ```{r} smry <- nested |> mutate(summary = map(se, \(se) { tibble(count_mean = colMeans(assay(se)), cell_type = se$cell_type) }) ) |> select(-se) |> unnest(summary) |> mutate(cell = substr(cell_type, 1, 13)) # abbreviate cell_type for plot ``` As a final plot, we can look at the mean count over the genes in a particular distance bin (x-axis), across the cell types (y-axis). The different lines show how the genes' proximity to H3K4me3 peaks affects the average count. ```{r} library(forcats) smry |> mutate(cell = fct_reorder(cell, count_mean, .fun=median)) |> ggplot(aes(count_mean, cell, color=distance_bin, group=distance_bin)) + geom_point() + geom_line(orientation="y") + xlab("mean of count over genes in bin") + ylab("cell type") ``` We finish this section with a re-cap of the steps we took: 1. calculated distances from genes to H3K4me3 peaks 2. added distance information onto the SingleCellExperiment 3. condensed the dataset via pseudobulking 4. computed the mean count (over genes) within 4 bins of genes 5. visualized these mean counts over cell types Let's consider a number of issues with this first-pass analysis and visualization: 1. we arbitrarily thresholded the peaks at the beginning of the analysis to take the top 5,000 2. we just computed the mean count, not taking into account total reads per cell type (sequencing depth per cell x number of cells) 3. we are comparing cell-type-specific expression to aggregate ChIP-seq peaks in PBMC 4. the two experiments, while both PBMC, come from different projects, different labs, and one is from cancer patients, while the other is from the ENCODE project 5. we only plot the mean count over genes, perhaps it would be better to show more information about the distribution :::: {.note} As an alternative to (1), we could have counted the number of peaks that were near each genes, or even computed on the metadata of the peaks (signal value, etc.). An example follows of how we could have counted overlaps instead (how many peaks within 20kb). :::: ```{r} overlaps <- rowRanges(sce_sub) |> anchor_5p() |> mutate(width=1) %>% mutate(num_overlaps = count_overlaps(., peaks, maxgap=2e4)) table(overlaps$num_overlaps) ``` :::: {.note} Another question that often arises is whether the distances or overlaps we observe are more or less than we would expect if there were no relationship between the positions of genes and peaks. To answer questions like these, we have developed the *nullranges* package, which allows for either bootstrapping of ranges throughout the genome [@Mu2023], or selection of a covariate-matched background set [@Davis2023], to compute probabilities under the null hypothesis. A brief example follows of bootstrapping some of the peaks in a window of chr1:20Mb-30Mb (see *nullranges* website for more comprehensive examples). :::: ```{r} library(nullranges) # make a small range for demonstration of bootstrapping seg <- data.frame(seqnames=1, start=20e6, width=10e6) |> as_granges() # generate a genome segmentation from this one range seg <- oneRegionSegment(seg, seqlength=seqlengths(peaks)[[1]]) seg # bootstrap peaks within this segment (just for demo) peaks |> filter_by_overlaps(seg[2]) |> bootRanges(blockLength=1e6, seg=seg, R=10) # these bootstrapped ranges could then be used for computing # generic test statistics, with the `iter` column used for # building a null distribution ``` **Session Information** ```{r} sessionInfo() ``` **References** ```{css echo=FALSE} .note { margin: 30px; padding: 1em; background: #FFF8F0; border: 1px solid #EFE8E0; border-radius: 10px; } ```