---
title: "Rmagic EMT Tutorial"
output:
  html_document:
    df_print: paged
    toc: yes
    toc_depth: '3'
---

<!-- emt_tutorial.md is generated from emt_tutorial.Rmd. Please edit that file -->

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## MAGIC (Markov Affinity-Based Graph Imputation of Cells)

* MAGIC imputes missing data values on sparse data sets, restoring the structure of the data
* It also proves dimensionality reduction and gene expression visualizations
* MAGIC can be performed on a variety of datasets
* Here, we show the effectiveness of MAGIC on epithelial-to-mesenchymal transition (EMT) data

Markov Affinity-based Graph Imputation of Cells (MAGIC) is an algorithm for denoising and transcript recover of single cells applied to single-cell RNA sequencing data, as described in Van Dijk D *et al.* (2018), *Recovering Gene Interactions from Single-Cell Data Using Data Diffusion*, Cell <https://www.cell.com/cell/abstract/S0092-8674(18)30724-4>.

### Installation

If you haven't yet installed MAGIC, you can find installation instructions in our [GitHub README](https://github.com/KrishnaswamyLab/MAGIC/tree/master/Rmagic).

We'll install a couple more tools for this tutorial.

```{r install_extras, eval=FALSE}
if (!require(viridis)) install.packages("viridis")
if (!require(ggplot2)) install.packages("ggplot2")
if (!require(readr)) install.packages("readr")
if (!require(phateR)) install.packages("phateR")
```

If you have never used PHATE, you should also install PHATE from the command line as follows:

```{bash install_python_phate, eval=FALSE}
pip install --user phate
```

### Loading packages

We load the Rmagic package and a few others for convenience functions.

```{r load_packages}
library(Rmagic)
library(readr)
library(ggplot2)
library(viridis)
library(phateR)
```

### Loading data

In this tutorial, we will analyze single-cell RNA sequencing of the epithelial to mesenchymal transition. The example data is located in the MAGIC Github repository. You can run this tutorial with your own data by downloading <https://raw.githubusercontent.com/KrishnaswamyLab/MAGIC/master/Rmagic/inst/examples/emt_tutorial.Rmd> and opening it in RStudio.

```{r load_data}
# load data
data <- read_csv("../../../data/HMLE_TGFb_day_8_10.csv.gz")
data[1:5, 1:10]
```

### Filtering data

First, we need to remove lowly expressed genes.

```{r remove_rare_genes}
# keep genes expressed in at least 10 cells
keep_cols <- colSums(data > 0) > 10
data <- data[, keep_cols]
```

Ordinarily, we would remove cells with small library sizes. In this dataset, it has already been done; however, if you wanted to do that, you could do it with the code below.

```{r libsize_histogram}
# look at the distribution of library sizes
ggplot() +
  geom_histogram(aes(x = rowSums(data)), bins = 50) +
  geom_vline(xintercept = 1000, color = "red")
```

```{r filter_libsize}
if (FALSE) {
  # keep cells with at least 1000 UMIs and at most 15000
  keep_rows <- rowSums(data) > 1000 & rowSums(data) < 15000
  data <- data[keep_rows, ]
}
```

### Normalizing data

We should library size normalize the data prior to MAGIC. Often we also transform the data with either log or square root. The log transform is commonly used, which requires adding a "pseudocount" to avoid log(0). We normally square root instead, which has a similar form but doesn't suffer from instabilities at zero. For this dataset, though, it is not necessary as the distribution of gene expression is not too extreme.

```{r normalize}
data <- library.size.normalize(data)
if (FALSE) {
  data <- sqrt(data)
}
```

### Running MAGIC

Running MAGIC is as simple as running the `magic` function. Because this dataset is rather small, we can decrease `knn` from the default of 5 down to 3.

```{r run_magic}
# run MAGIC
data_MAGIC <- magic(data, knn = 3, genes = c("VIM", "CDH1", "ZEB1"))
```

We can plot the data before and after MAGIC to visualize the results.

```{r plot_raw}
ggplot(data) +
  geom_point(aes(VIM, CDH1, color = ZEB1)) +
  scale_color_viridis(option = "B")
ggsave("EMT_data_R_before_magic.png", width = 5, height = 5)
```

```{r plot_magic}
ggplot(data_MAGIC) +
  geom_point(aes(VIM, CDH1, color = ZEB1)) +
  scale_color_viridis(option = "B")
ggsave("EMT_data_R_after_magic.png", width = 5, height = 5)
```

As you can see, the gene-gene relationships are much clearer after MAGIC.

### Visualizing MAGIC values on PCA

We can visualize the results of MAGIC on PCA with `genes="pca_only"`.

```{r run_pca}
data_MAGIC_PCA <- magic(data,
  genes = "pca_only",
  knn = 15, init = data_MAGIC
)
ggplot(data_MAGIC_PCA) +
  geom_point(aes(x = PC1, y = PC2, color = data_MAGIC$result$VIM)) +
  scale_color_viridis(option = "B") +
  labs(color = "VIM")
ggsave("EMT_data_R_pca_colored_by_magic.png", width = 5, height = 5)
```

### Using MAGIC for downstream analysis

We can look at the entire smoothed matrix with `genes='all_genes'`, passing the original result to the argument `init` to avoid recomputing intermediate steps. Note that this matrix may be large and could take up a lot of memory.

```{r run_magic_full_matrix}
data_MAGIC <- magic(data,
  genes = "all_genes",
  knn = 15, init = data_MAGIC
)
as.data.frame(data_MAGIC)[1:5, 1:10]
```

### Help

If you have any questions or require assistance using MAGIC, please contact us at <https://krishnaswamylab.org/get-help>.