---
description: Combining and harmonizing samples or datasets from
different batches such as experiments or conditions to enable
meaningful cross-sample comparisons.
subtitle: Bioconductor Toolkit
title: Data Integration
---
> **Note**
>
> Code chunks run R commands unless otherwise specified.
In this tutorial we will look at different ways of integrating multiple
single cell RNA-seq datasets. We will explore a few different methods to
correct for batch effects across datasets. Seurat uses the data
integration method presented in Comprehensive Integration of Single Cell
Data, while Scran and Scanpy use a mutual Nearest neighbour method
(MNN). Below you can find a list of some methods for single data
integration:
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Markdown Language Library Ref
----------------- ----------------- ----------------- -----------------------------------------------------------------------------------------------------------------------------------
CCA R Seurat [Cell](https://www.sciencedirect.com/science/article/pii/S0092867419305598?via%3Dihub)
MNN R/Python Scater/Scanpy [Nat. Biotech.](https://www.nature.com/articles/nbt.4091)
Conos R conos [Nat.
Methods](https://www.nature.com/articles/s41592-019-0466-z?error=cookies_not_supported&code=5680289b-6edb-40ad-9934-415dac4fdb2f)
Scanorama Python scanorama [Nat. Biotech.](https://www.nature.com/articles/s41587-019-0113-3)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## Data preparation
Let's first load necessary libraries and the data saved in the previous
lab.
``` {r}
#| label: libraries
suppressPackageStartupMessages({
library(scater)
library(scran)
library(patchwork)
library(ggplot2)
library(batchelor)
library(harmony)
library(basilisk)
})
# path to conda env for python environment with scanorama.
condapath = "/usr/local/conda/envs/seurat"
```
``` {r}
#| label: fetch-data
# download pre-computed data if missing or long compute
fetch_data <- TRUE
# url for source and intermediate data
path_data <- "https://nextcloud.dc.scilifelab.se/public.php/webdav"
curl_upass <- "-u zbC5fr2LbEZ9rSE:scRNAseq2025"
path_file <- "data/covid/results/bioc_covid_qc_dr.rds"
if (!dir.exists(dirname(path_file))) dir.create(dirname(path_file), recursive = TRUE)
if (fetch_data && !file.exists(path_file)) download.file(url = file.path(path_data, "covid/results_bioc/bioc_covid_qc_dr.rds"), destfile = path_file, method = "curl", extra = curl_upass)
sce <- readRDS(path_file)
print(reducedDims(sce))
```
We split the combined object into a list, with each dataset as an
element. We perform standard preprocessing (log-normalization), and
identify variable features individually for each dataset based on a
variance stabilizing transformation (**vst**).
If you recall from the dimensionality reduction exercise, we can run
variable genes detection with a blocking parameter to avoid including
batch effect genes. Here we will explore the genesets we get with and
without the blocking parameter and also the variable genes per dataset.
``` {r}
#| label: hvg
var.out <- modelGeneVar(sce, block = sce$sample)
hvgs <- getTopHVGs(var.out, n = 2000)
var.out.nobatch <- modelGeneVar(sce)
hvgs.nobatch <- getTopHVGs(var.out.nobatch, n = 2000)
# the var out with block has a data frame of data frames in column 7.
# one per dataset.
hvgs_per_dataset <- lapply(var.out[[7]], getTopHVGs, n=2000)
hvgs_per_dataset$all = hvgs
hvgs_per_dataset$all.nobatch = hvgs.nobatch
temp <- unique(unlist(hvgs_per_dataset))
overlap <- sapply(hvgs_per_dataset, function(x) {
temp %in% x
})
```
``` {r}
#| label: hvg-overlap
#| fig-height: 4
#| fig-width: 8
pheatmap::pheatmap(t(overlap * 1), cluster_rows = F, color = c("grey90", "grey20")) ## MNN
```
As you can see, there are a lot of genes that are variable in just one
dataset. There are also some genes in the gene set that was selected
using all the data without blocking samples, that are not variable in
any of the individual datasets. These are most likely genes driven by
batch effects.
The best way to select features for integration is to combine the
information on variable genes across the dataset. This is what we have
in the `all` section where the information on variable features in the
different datasets is combined.
> **Discuss**
>
> Did you understand the difference between running variable gene
> selection per dataset and combining them vs running it on all samples
> together. Can you think of any situation where it would be best to run
> it on all samples and a situation where it should be done by batch?
For all downstream integration we will use this set of genes so that it
is comparable across the methods. We already used that set of genes in
the dimensionality reduction exercise to run scaling and pca.
We also store the variable gene information in the object for use
furhter down the line.
``` {r}
#| label: hvg2
metadata(sce)$hvgs = hvgs
```
## fastMNN
The mutual nearest neighbors (MNN) approach within the scran package
utilizes a novel approach to adjust for batch effects. The `fastMNN()`
function returns a representation of the data with reduced
dimensionality, which can be used in a similar fashion to other
lower-dimensional representations such as PCA. In particular, this
representation can be used for downstream methods such as clustering.
The BNPARAM can be used to specify the specific nearest neighbors method
to use from the BiocNeighbors package. Here we make use of the [Annoy
library](https://github.com/spotify/annoy) via the
`BiocNeighbors::AnnoyParam()` argument. We save the reduced-dimension
MNN representation into the reducedDims slot of our sce object.
``` {r}
#| label: mnn
mnn_out <- batchelor::fastMNN(sce, subset.row = hvgs, batch = factor(sce$sample), k = 20, d = 50)
```
> **Caution**
>
> `fastMNN()` does not produce a batch-corrected expression matrix.
We will take the reduced dimension in the new `mnn_out` object and add
it into the original `sce` object.
``` {r}
#| label: dimred-mnn
mnn_dim <- reducedDim(mnn_out, "corrected")
reducedDim(sce, "MNN") <- mnn_dim
```
We can observe that a new assay slot is now created under the name
`MNN`.
``` {r}
#| label: dimred-list
reducedDims(sce)
```
Thus, the result from `fastMNN()` should solely be treated as a reduced
dimensionality representation, suitable for direct plotting, TSNE/UMAP,
clustering, and trajectory analysis that relies on such results.
``` {r}
#| label: proc-mnn
set.seed(42)
sce <- runTSNE(sce, dimred = "MNN", n_dimred = 50, perplexity = 30, name = "tSNE_on_MNN")
sce <- runUMAP(sce, dimred = "MNN", n_dimred = 50, ncomponents = 2, name = "UMAP_on_MNN")
```
We can now plot the unintegrated and the integrated space reduced
dimensions.
``` {r}
#| label: plot-mnn
#| fig-height: 6
#| fig-width: 12
wrap_plots(
plotReducedDim(sce, dimred = "PCA", colour_by = "sample", point_size = 0.6) + ggplot2::ggtitle(label = "PCA"),
plotReducedDim(sce, dimred = "tSNE_on_PCA", colour_by = "sample", point_size = 0.6) + ggplot2::ggtitle(label = "tSNE_on_PCA"),
plotReducedDim(sce, dimred = "UMAP_on_PCA", colour_by = "sample", point_size = 0.6) + ggplot2::ggtitle(label = "UMAP_on_PCA"),
plotReducedDim(sce, dimred = "MNN", colour_by = "sample", point_size = 0.6) + ggplot2::ggtitle(label = "MNN"),
plotReducedDim(sce, dimred = "tSNE_on_MNN", colour_by = "sample", point_size = 0.6) + ggplot2::ggtitle(label = "tSNE_on_MNN"),
plotReducedDim(sce, dimred = "UMAP_on_MNN", colour_by = "sample", point_size = 0.6) + ggplot2::ggtitle(label = "UMAP_on_MNN"),
ncol = 3
) + plot_layout(guides = "collect")
```
Let's plot some marker genes for different cell types onto the
embedding.
Markers Cell Type
-------------------------- -------------------
CD3E T cells
CD3E CD4 CD4+ T cells
CD3E CD8A CD8+ T cells
GNLY, NKG7 NK cells
MS4A1 B cells
CD14, LYZ, CST3, MS4A7 CD14+ Monocytes
FCGR3A, LYZ, CST3, MS4A7 FCGR3A+ Monocytes
FCER1A, CST3 DCs
``` {r}
#| label: plot-markers
#| fig-height: 16
#| fig-width: 13
plotlist <- list()
for (i in c("CD3E", "CD4", "CD8A", "NKG7", "GNLY", "MS4A1", "CD14", "LYZ", "MS4A7", "FCGR3A", "CST3", "FCER1A")) {
plotlist[[i]] <- plotReducedDim(sce, dimred = "UMAP_on_MNN", colour_by = i, by_exprs_values = "logcounts", point_size = 0.6) +
scale_fill_gradientn(colours = colorRampPalette(c("grey90", "orange3", "firebrick", "firebrick", "red", "red"))(10)) +
ggtitle(label = i) + theme(plot.title = element_text(size = 20))
}
wrap_plots(plotlist = plotlist, ncol = 3)
```
## Harmony
An alternative method for integration is Harmony, for more details on
the method, please se their paper [Nat.
Methods](https://www.nature.com/articles/s41592-019-0619-0). This method
runs the integration on a dimensionality reduction, in most applications
the PCA. So first, we prefer to have scaling and PCA with the same set
of genes that were used for the CCA integration, which we ran earlier.
``` {r}
#| label: harmony
#| fig-height: 5
#| fig-width: 14
library(harmony)
reducedDimNames(sce)
sce <- RunHarmony(
sce,
group.by.vars = "sample",
reduction.save = "harmony",
reduction = "PCA",
dims.use = 1:50
)
# Here we use all PCs computed from Harmony for UMAP calculation
sce <- runUMAP(sce, dimred = "harmony", n_dimred = 50, ncomponents = 2, name = "UMAP_on_Harmony")
plotReducedDim(sce, dimred = "UMAP_on_Harmony", colour_by = "sample", point_size = 0.6) + ggplot2::ggtitle(label = "UMAP_on_Harmony")
```
## Scanorama
Another integration method is Scanorama (see [Nat.
Biotech.](https://www.nature.com/articles/s41587-019-0113-3)). This
method is implemented in python, but we can run it through the
Reticulate package.
We will run it with the same set of variable genes, but first we have to
create a list of all the objects per sample.
``` {r}
#| label: prep-scanorama
#| fig-height: 5
#| fig-width: 15
scelist <- lapply(unique(sce$sample), function(x) {
x <- t(as.matrix(assay(sce, "logcounts")[hvgs,sce$sample == x]))
})
genelist = rep(list(hvgs),length(scelist))
lapply(scelist, dim)
```
Scanorama is implemented in python, but through reticulate we can load
python packages and run python functions. In this case we also use the
`basilisk` package for a more clean activation of python environment.
At the top of this script, we set the variable `condapath` to point to
the conda environment where scanorama is included.
``` {r}
#| label: scanorama
#| fig-height: 5
#| fig-width: 15
# run scanorama via basilisk with scelist and genelist as input.
integrated.data = basiliskRun(env=condapath, fun=function(datas, genes) {
scanorama <- reticulate::import("scanorama")
output <- scanorama$integrate(datasets_full = datas,
genes_list = genes )
return(output)
}, datas = scelist, genes = genelist, testload="scanorama")
intdimred <- do.call(rbind, integrated.data[[1]])
colnames(intdimred) <- paste0("PC_", 1:100)
rownames(intdimred) <- colnames(logcounts(sce))
# Add standard deviations in order to draw Elbow Plots
stdevs <- apply(intdimred, MARGIN = 2, FUN = sd)
attr(intdimred, "varExplained") <- stdevs
reducedDim(sce, "Scanorama") <- intdimred
# Here we use all PCs computed from Scanorama for UMAP calculation
sce <- runUMAP(sce, dimred = "Scanorama", n_dimred = 50, ncomponents = 2, name = "UMAP_on_Scanorama")
plotReducedDim(sce, dimred = "UMAP_on_Scanorama", colour_by = "sample", point_size = 0.6) + ggplot2::ggtitle(label = "UMAP_on_Scanorama")
```
## Overview all methods
Now we will plot UMAPS with all three integration methods side by side.
``` {r}
#| label: plot-all
#| fig-height: 8
#| fig-width: 10
p1 <- plotReducedDim(sce, dimred = "UMAP_on_PCA", colour_by = "sample", point_size = 0.6) + ggplot2::ggtitle(label = "UMAP_on_PCA")
p2 <- plotReducedDim(sce, dimred = "UMAP_on_MNN", colour_by = "sample", point_size = 0.6) + ggplot2::ggtitle(label = "UMAP_on_MNN")
p3 <- plotReducedDim(sce, dimred = "UMAP_on_Harmony", colour_by = "sample", point_size = 0.6) + ggplot2::ggtitle(label = "UMAP_on_Harmony")
p4 <- plotReducedDim(sce, dimred = "UMAP_on_Scanorama", colour_by = "sample", point_size = 0.6) + ggplot2::ggtitle(label = "UMAP_on_Scanorama")
wrap_plots(p1, p2, p3, p4, nrow = 2) +
plot_layout(guides = "collect")
```
> **Discuss**
>
> Look at the different integration results, which one do you think
> looks the best? How would you motivate selecting one method over the
> other? How do you think you could best evaluate if the integration
> worked well?
Let's save the integrated data for further analysis.
``` {r}
#| label: save
saveRDS(sce, "data/covid/results/bioc_covid_qc_dr_int.rds")
```
## Session info
```{=html}
```
```{=html}
```
Click here
```{=html}
```
``` {r}
#| label: session
sessionInfo()
```
```{=html}
```