---
title: "Batch correction and pseudotime"
output: html_document
---

This script walks though the steps to correct batch effects and find pseudotemporal relationships between single cells.


```{r setup, include=FALSE}
knitr::opts_chunk$set(
  message = F, 
  warning = F,
  comment = ""
)
```


```{r "Set variables and theme"}
# Load packages
library(Seurat)
library(cowplot)
library(knitr)
library(tidyverse)
library(LaCroixColoR)
library(viridis)
library(RColorBrewer)
library(harmony)
library(slingshot)
library(ggridges)
library(tradeSeq)
library(pheatmap)
library(here)

# Set theme
ggplot2::theme_set(ggplot2::theme_classic(base_size = 10))

base_dir <- here()

```


# Description of data
We will use two datasets today. The first is associated with the publication "Single-cell RNA sequencing of murine islets shows high cellular complexity at all stages of autoimmune diabetes". In this publication, the authors were trying to characterize the immune cell response during onset of diabetes in nonobese diabetic (NOD) mice. NOD mice are an autoimmune model for Type 1 diabetes where T cells lead the destruction of insulin producing Beta cells. To characterize the development of diabetes, they performed a timecourse experiment looking at immune cells in the pancreatic islet at 4 weeks, 8 weeks, and 15 weeks of age.

> 4 wk is about the first time that the first infiltrating T cells can be identified; 8 wk represents a time when leukocyte infiltration is prominent in most islets, still with no evidence of dysglycemia; and 15 wk is just before the time when clinical diabetes becomes evident. 

To further complicate their design, samples were processed in two batches, the first batch contained one 4 week, one 8 week, and one 15 week sample. The second batch contained one 4 week and one 8 week sample.

We will start with just one of the 4 week samples from this dataset. The others we will include when we look at pseudotime.

The second dataset is from the paper "Single-cell transcriptome analysis defines heterogeneity of the murine pancreatic ductal tree". They isolated all islet and ductal cells, so some of the cells overlap with the first dataset and some do not. We will use all cells for this analysis.


## Download the files (Just an example, don't run today)
All of the timecourse samples were downloaded here. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE141784. This group uploaded all of the filtered csv files from cellranger for each sample. These will be downloaded if you download the GSE141784_RAW.tar file. I have alread downloaded these files for you and ran basic filtering and aligning with the provided metadata using the code below. You do not need to run anything in this section, it is just provided so you can see all of the steps to process this data.


```{r, eval = FALSE}
################################## DO NOT RUN ################################## 
# This function will create a seurat object, filter to the cells used in the 
# publication and add in the provided meta data.
make_seurat <- function(sample_list){
  print(sample_list[["sample_full"]])
  sample_counts <- Read10X(file.path(data_dir, "raw_data",
                               sample_list["sample_full"]))

  sample_object <- CreateSeuratObject(counts = sample_counts,
                                      project = sample_list["sample_full"],
                                      min.cells = 5)

  # Subset meta data to only the current sample
  meta_data_short <- meta_data %>%
    filter(Sample == sample_list["sample"] & Batch == sample_list["batch"])
  
  # Change the barcode to match each sample pre-merging
  meta_data_short$Barcode <- sub(pattern = "-[0-9]", replacement = "-1",
                                 meta_data_short$Barcode)

  # Subset to only cells in meta data
  sample_object <- subset(sample_object, cells = meta_data_short$Barcode)

  # Check if identical
  print(identical(meta_data_short$Barcode, colnames(sample_object)))

  rownames(meta_data_short) <- meta_data_short$Barcode

  sample_object <- AddMetaData(sample_object, metadata = meta_data_short)

  # Add mitochondrial percent
  sample_object[["percent.mt"]] <- PercentageFeatureSet(sample_object,
                                                    pattern = "^mt-")
  
  sample_object <- CellCycleScoring(sample_object,
                                  s.features = s.genes,
                                  g2m.features = g2m.genes,
                                  set.ident = FALSE)
  

  # Normalize
  sample_object <- NormalizeData(sample_object) %>% 
    FindVariableFeatures() %>%
    ScaleData()
  return(sample_object)
}

# Set directories
base_dir <- here()

data_dir <- file.path(base_dir, "data")

# Read in meta data
meta_data <- read.table(file.path(data_dir,
                                  "GSE141784_Annotation_meta_data.txt"),
                        header = T)
# Set sample information
sample_list <- list(
  c(sample_full = "NOD_15w_2734", sample = "NOD_15w", batch = 2734),
  c(sample_full = "NOD_4w_2734", sample = "NOD_4w", batch = 2734),
  c(sample_full = "NOD_4w_2849", sample = "NOD_4w", batch = 2849),
  c(sample_full = "NOD_8w_2734", sample = "NOD_8w", batch = 2734),
  c(sample_full = "NOD_8w_2849", sample = "NOD_8w", batch = 2849))

# Cell cycle genes (taken from seurat cc.genes and translated to mouse genes)
s.genes <- c("Exo1", "Mcm4", "Msh2", "Gmnn", "Chaf1b", "Mcm2", "Rrm2",
             "Rad51ap1", "Gins2", "Hells", "Cdc6", "Ubr7", "Cdc45", "Fen1",
             "Rpa2", "Slbp", "Uhrf1", "Ung", "Mcm5", "Dtl", "Casp8ap2",
             "Wdr76", "Nasp", "Prim1", "Cdca7", "Clspn", "Pola1", "Mcm6",
             "Blm", "Dscc1", "Usp1", "Tipin", "Rfc2", "Brip1", "Rrm1",
             "Rad51", "Tyms", "Ccne2", "E2f8", "Pcna")    
g2m.genes <- c("Ctcf", "Smc4", "Dlgap5", "Cdc25c", "Gtse1", "Kif20b", "Ncapd2",
               "Ttk", "G2e3", "Lbr",  "Cks1brt", "Cdca2", "Tacc3", "Anp32e",
               "Cdca3", "Ckap2", "Cks2", "Hmgb2", "Top2a", "Tpx2", "Kif23",
               "Rangap1", "Psrc1", "Cks1b", "Aurkb", "Hmmr", "Cenpf", "Birc5",
               "Cdca8", "Ckap5", "Kif2c", "Kif11", "Hjurp", "Cenpe", "Nuf2",
               "Ndc80", "Nek2", "Cdc20", "Ect2", "Anln", "Tubb4b", "Bub1",
               "Aurka", "Ckap2l", "Ccnb2", "Nusap1", "Mki67", "Ube2c", "Cenpa")

# Make a list of seurat objects
seurat_list <- lapply(sample_list, make_seurat)

```


The files associated with the second can be downloaded here https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM4826923. This group only uploaded their output files from 10x genomics, so I remade the object, performed the initial steps of analysis and identified celltypes by looking for clusters that shared the most marker genes with each of their identified cell types.


## Download the files
I have already created processed `Seurat` objects that we will use today. This processing included finding mitochondrial percents, creating a cell cycle score, normalizing the data, finding variable genes, and scaling the data. We will skip this today, but all steps are shown in the code segment above. We can use the code below to create a "data" directory (folder) and download the files into it.


```{r make-directory}
## Make a data directory only if it doesn't already exist
ifelse(!dir.exists(file.path(base_dir, "data")),
       dir.create(file.path(base_dir, "data")), FALSE)
```


```{r download-files-batch}
## DOWNLOAD FILES if they haven't already been downloaded

ifelse(!file.exists(file.path(base_dir, "data", "NOD_4w_2734.rda")),
       download.file(
  url = "http://amc-sandbox.ucdenver.edu/User60/single_cell_workshop/NOD_4w_2734.rda",
              destfile = file.path(base_dir, "data", "NOD_4w_2734.rda")),
  FALSE)


ifelse(!file.exists(file.path(base_dir, "data", "hendley_seurat.rda")),
       download.file(
  url = "http://amc-sandbox.ucdenver.edu/User60/single_cell_workshop/hendley_seurat.rda",
              destfile = file.path(base_dir, "data", "hendley_seurat.rda")),
  FALSE)
```


## Read in the files


```{r read-in-files-batch}
sample_name <- "NOD_4w_2734"

sample_two_name <- "hendley_seurat"

seurat_object <- readRDS(file.path(base_dir, "data", paste0(sample_name, ".rda")))  

seurat_two <- readRDS(file.path(base_dir, "data", paste0(sample_two_name, ".rda")))
```


# Integration of samples
Before we integrate samples, we first need to decide if there is a batch effect that we need to worry about. To do this, I first merge all of the samples together and use a few quick plots.


## Merge the files
We can take this list and pass it to the `merge` function from `Seurat`. This function takes one seurat object as `x` and a list of `Seurat` objects for `y`. This function automatically merges the data slots, so we won't need to repeat normalization. We can see other options by typing `?merge`


```{r merge-batch}
# merge objects
seurat_merge <- merge(x = seurat_object,
                      y = seurat_two)
```


Let's now look at this new object


```{r}
print(seurat_merge)
```


You can see that we no longer have variable features, so we will need to find new variable features and rescale the data


```{r process-1}
# Remeber if you want to regress out any variables, you can do that using ScaleData
seurat_merge <- FindVariableFeatures(seurat_merge) %>%
  ScaleData()
```


We also will need to repeat PCA and UMAP 


```{r process-2}
# It's best to set a seed before running UMAP
set.seed(0)
seurat_merge <- RunPCA(seurat_merge) %>%
  FindNeighbors(dims = 1:30) %>%
  FindClusters(resolution = 0.8) %>%
  RunUMAP(dims = 1:30, assay = "RNA")
```


## Check for batch
Now we should check if any sort of batch correction is necessary. Let's check by dataset. The dataset is stored as `orig.ident` in the object


```{r}
print(head(seurat_merge[[]]))
```


We can now plot to see if these look different, which they do. In samples with bad batch effects, like these two samples, no cells will overlap. You can see that the samples almost are a mirror image of each other.


```{r set-colors-batch}
batch_colors <- LaCroixColoR::lacroix_palette("Pamplemousse", 2)
names(batch_colors) <- unique(seurat_merge$orig.ident)

DimPlot(seurat_merge, group.by = "orig.ident", cols = batch_colors)
```


These two datasets are interesting because they contain some, but not all, of the same celltypes


```{r}
print(table(seurat_object$Cells))
```


```{r}
print(table(seurat_two$Cells))
```


```{r cells-batch, fig.width = 20, fig.height=10}
DimPlot(seurat_merge, group.by = "Cells", split.by = "orig.ident",
        reduction = "umap")
```


## Correct the batch
I personally first use `harmony` as my batch correction tool. There are some really nice benchmarking papers for batch correction that can give you more information on the best tool to use.

* https://doi.org/10.1186/s13059-019-1850-9
* https://doi.org/10.1101/2020.05.22.111161

Harmony works by weakly clustering cells and then calculating - and iteratively refining - a dataset correction factor for each cluster. Note that Harmony only computes a new corrected dimensionality reduction - it does not calculate corrected expression values (the `raw.data`, `data` and `scale.data` slots are unmodified). This works well for most datasets, but can be be insufficient for downstream differential expression analyses in extremely divergent samples.

Actually running `harmony` with a `seurat` object is quite simple, you just use a function called `RunHarmony`. We can see the inputs to this function by using `?RunHarmony`

We will use `plot_convergence = TRUE` so that we can ensure each iteration improves


```{r harmony}
# Here we use "orig.ident", but we can use anything from the meta data
seurat_merge <- RunHarmony(object = seurat_merge,
                           group.by.vars = "orig.ident",
                           plot_convergence = TRUE)
```


The objective function should be continuously improving. In this case, it's not perfect, but probably good enough. I recomend changing the theta parameter if the function gets severely worse in later itearation.


The returned object now has a new dimensional reduction called harmony. 


```{r}
print(seurat_merge)
```


We can use this to rerun our UMAP. I like to use a reduction key for my umap so I remember that it was made with harmony


```{r processing-harmony}
set.seed(0)
seurat_merge <- FindNeighbors(seurat_merge,
                              dims = 1:30, reduction = "harmony") %>%
  FindClusters(resolution = 0.8) %>%
  RunUMAP(dims = 1:30, assay = "RNA", reduction = "harmony",
          reduction.key = "harmony.UMAP_", reduction.name = "harmony.umap")
```


We can check with some plots. Now we need to specify the reduction. Below we can see that in many places cells overlap and in some places they don't. If `harmony` worked, the common cell types should overlap while the unique celltypes should not


```{r}
print(seurat_merge)
```


```{r ident-harmony}
DimPlot(seurat_merge, group.by = "orig.ident", cols = batch_colors,
        reduction = "harmony.umap")
```


Now we can check celltypes


```{r cells-harmony, fig.width=15, fig.height=10}
DimPlot(seurat_merge, group.by = "Cells",
        reduction = "harmony.umap")
```


Overall, this looks great. The ductal cells should only be in one dataset which is true, same with the Mesenchymal cells. The other cell types mix nicely.


```{r cells-split-harmony, fig.width = 20, fig.height=10}
DimPlot(seurat_merge, group.by = "Cells", split.by = "orig.ident",
        reduction = "harmony.umap")
```


Just to visualize more clearly, we can visualize specific celltypes in each dataset.


```{r cells-split-indiv-harmony, fig.width=10, fig.height=8}
Idents(seurat_merge) <- "Cells"
plots <- lapply(c("Bcell", "CD4", "CD8", "Mac"), function(x){
  return(DimPlot(seurat_merge, group.by = "Cells", split.by = "orig.ident",
                 cells.highlight = CellsByIdentities(object = seurat_merge,
                                                     idents = x),
                 reduction = "harmony.umap"))
})

plot_grid(plotlist = plots,
          nrow = 2, ncol = 2)
```


## Marker identification
We can now identify markers based on this integrated dataset. We can identify markers of celltypes that are conserved in both datasets using the funciton `FindConservedMarkers`. First, we can set the identity of the cells to the celltypes. The below function takes several minutes to run, so we will skip it here. It also requires a couple of additional packages


```{r, eval = FALSE}
################################## DO NOT RUN ################################## 
BiocManager::install('multtest')
install.packages('metap')
Idents(seurat_merge) <- "Cells"

macrophage_markers <- FindConservedMarkers(seurat_merge,
                                           ident.1 = "Mac",
                                           grouping.var = "orig.ident")

print(macrophage_markers[1:20,])
```


We can also identify any differences between the macrophage populations in the two samples. This would be more interesting if one was a treatment and one was a control.

First, we need to create a new column in the metadata that contains both sample information and cell type information


```{r}
seurat_merge$sample_celltype <- paste0(seurat_merge$orig.ident, "_",
                                       seurat_merge$Cells)
```


Now we can find markers between two populations


```{r markers}
Idents(seurat_merge) <- "sample_celltype"
mac_sample_markers <- FindMarkers(seurat_merge, ident.1 = "NOD_4w_2734_Mac",
                                  ident.2 = "GSM4826923_C57BL6J_Mac")

print(mac_sample_markers[1:20,])
```


One thing to remember while looking for differences between samples is that we had to perform a batch correction. Batch correction with `harmony` only corrects the dimensional reduction, not the gene counts so here we are likely also picking up on batch differences.

To save memory, we can remove the seurat objects from our enverionment


```{r}
rm(list = c("seurat_merge", "seurat_object", "seurat_two"))
gc()
```


## Batch correction conclusion
While I personally prefer `harmony` as my batch correction, it only helps identify shared celltypes or clusters while not actually correcting the sequencing counts. This is fine if you want to find differences between shared celltypes within each sample, but it may not be correct if you want to compare expression differences between samples. There are other batch correction tools that do allow you to also correct the expression values. One that has received high marks recently is `scVI`, which is implemented in `python` so we won't go into it here. Unfortunately, the creators of `scVI` don't seem comfortable with these corrected values being used to perform differential expression analysis between samples. `Seurat` also performs an integration that corrects expression values, but those creators also suggest using the uncorrected `RNA` matrix for differential expression testing.

An alternative option if you have replicates is to use `muscat` which creates pseudobulk profiles for each cell type in each sample and then runs traditional differential expression tests that can account for variation within replicates when running differential expression.


# Pseudotime
We can also perform puseoditme analysis on integrated samples.


## Download provided files
I have already created processed `Seurat` objects that we will use today. This processing included finding mitochondrial percents, creating a cell cycle score, normalizing the data, finding variable genes, and scaling the data. We will skip this today, but all steps are shown in the code segment above. 


```{r download-pseudotime}
## DOWNLOAD FILES if they haven't already been downloaded

ifelse(!file.exists(file.path(base_dir, "data", "NOD_4w_2849.rda")),
       download.file(
  url = "http://amc-sandbox.ucdenver.edu/User60/single_cell_workshop/NOD_4w_2849.rda",
              destfile = file.path(base_dir, "data", "NOD_4w_2849.rda")),
  FALSE)


ifelse(!file.exists(file.path(base_dir, "data", "NOD_8w_2734.rda")),
       download.file(
  url = "http://amc-sandbox.ucdenver.edu/User60/single_cell_workshop/NOD_8w_2734.rda",
              destfile = file.path(base_dir, "data", "NOD_8w_2734.rda")),
  FALSE)

ifelse(!file.exists(file.path(base_dir, "data", "NOD_8w_2849.rda")),
       download.file(
  url = "http://amc-sandbox.ucdenver.edu/User60/single_cell_workshop/NOD_8w_2849.rda",
              destfile = file.path(base_dir, "data", "NOD_8w_2849.rda")),
  FALSE)

ifelse(!file.exists(file.path(base_dir, "data", "NOD_15w_2734.rda")),
       download.file(
  url = "http://amc-sandbox.ucdenver.edu/User60/single_cell_workshop/NOD_15w_2734.rda",
              destfile = file.path(base_dir, "data", "NOD_15w_2734.rda")),
  FALSE)

ifelse(!file.exists(file.path(base_dir, "data", "meta_data.rda")),
       download.file(
  url = "http://amc-sandbox.ucdenver.edu/User60/single_cell_workshop/meta_data.rda",
              destfile = file.path(base_dir, "data", "meta_data.rda")),
  FALSE)

ifelse(!file.exists(file.path(base_dir, "data", "mac_sce.rda")),
       download.file(
  url = "http://amc-sandbox.ucdenver.edu/User60/single_cell_workshop/mac_sce.rda",
              destfile = file.path(base_dir, "data", "mac_sce.rda")),
  FALSE)
```


## Read in provided files
First, we need to read in all of the files associated with the first dataset. Seurat wants files as a list to merge, so let's read them in that way using lapply


```{r read-files-pseudotime}
sample_names <- c("NOD_4w_2734", "NOD_4w_2849",
                  "NOD_8w_2734", "NOD_8w_2849",
                  "NOD_15w_2734")

# I hate the default R colors, so I will also make my own
colors <- RColorBrewer::brewer.pal(5, "Set1")
names(colors) <- sample_names

seurat_list <- lapply(sample_names, function(x){
  seurat_object <- readRDS(file.path(base_dir, "data", paste0(x, ".rda")))  
})

```


## Merge the files
We can take this list and pass it to the `merge` function from `Seurat`.


```{r merge-pseudotime}
# merge objects
seurat_merge <- merge(x = seurat_list[[1]],
                      y = seurat_list[2:length(seurat_list)])

```


Clean up memory


```{r}
# Remove the list
rm(list = "seurat_list")
gc()
```


### Prepare data
One important thing to remember about running pseudotime is that connections will be found between cells that are related or not. Because we know that the macrophages should not become B cells, we should first subset to only the macrophage population.


```{r subset}
Idents(seurat_merge) <- "Cells"
seurat_mac <- subset(seurat_merge, idents = "Mac")
```


```{r}
rm(list = "seurat_merge")
gc()
```


After subestting, we need to repeat some of the processing steps. We need to find variable genes, scale the data, run pca, and run umap


```{r process-3}
set.seed(0)
seurat_mac <- FindVariableFeatures(seurat_mac) %>%
  ScaleData() %>%
  RunPCA(npcs = 30) %>%
  FindNeighbors(dims = 1:15, reduction = "pca") %>%
  FindClusters(resolution = 0.4) %>%
  RunUMAP(dims = 1:15, assay = "RNA", reduction = "pca")
```


## Check for batch
Now we should check if any sort of batch correction is necessary. This data was processed in two batches. The cells within these two batches should only differ because of technical effects. To view these potential effects, we can first plot by batch.


```{r}
head(seurat_mac[[]])
```


```{r pseudotime-batch}
DimPlot(seurat_mac, group.by = "Batch", cols = "Set1")
```


While this isn't as bad as our previous example, we should still try to remove the batch effect. Here we can include both the batch information as variable.

*Note we can also include the original identity in our batch correction. We can include as many correction factors as we want here, but I hesitate to correct beyond batch because I don't want to remove our biological effect. In a real analysis, I would try removing just with batch and with batch and orig.ident*


```{r}
seurat_mac <- RunHarmony(seurat_mac, group.by.vars = c("Batch"))
```


We can now repeat clustering and performing UMAP dimensional reduction on this new batch corrected data.


```{r processing-harmony-2}  
seurat_mac <- FindNeighbors(seurat_mac, dims = 1:15, reduction = "harmony") %>%
  FindClusters(resolution = 0.4) %>%
  RunUMAP(dims = 1:15, assay = "RNA", reduction = "harmony")

DimPlot(seurat_mac, group.by = "Batch", cols = "Set1")
```


This looks much better. We can also check the sample and original identity. *If we had also used orig.ident in the correction, we may be concerned that we over-corrected the timecourse. It is always good to keep what you corrected in mind as you continue your analysis.*


```{r pseudotime-batch-plot}
samples <- unique(seurat_mac$Sample)
sample_colors <- LaCroixColoR::lacroix_palette("Coconut", 3)
names(sample_colors) <- samples
DimPlot(seurat_mac, group.by = "Sample", cols = sample_colors)
```


```{r pseudotime-batch-ident}
DimPlot(seurat_mac, group.by = "orig.ident", cols = "Set1")
```


```{r pseudotime-batch-cluster}
DimPlot(seurat_mac, group.by = "seurat_clusters", cols = "Set1")
```


*Note your UMAP and clusters may look slightly different than mine. That is okay. There are random seeds used to generate UMAPs and to run other functions that are influenced by that packages and versions of packages on your system. For now, we can just run it and acknowledge that all of our output may be a bit different*


## Run pseudotime
As with batch correction, there are many tools for pseudotime. `RNA velocity` or `velocyto` are interesting methods that rely on intron retention to predict future cell state. Other options just use transcriptome similarity to identify cells that are likely related. With this second type, we cannot use the tool to identify the direction of cell development, but we can use known biology to infer this relationship. There are also good benchmarking papers that compare pseudotime methods. 

* https://doi.org/10.1038/s41587-019-0071-9

`slingshot` tends to do well in these benchmarking studies, so I generally start with that. But I find it is a good idea to compare a couple of methods to make sure your conclusions are robust.

Now we can  run `slingshot`. Slingshot runs on your dimensionality reduction. Some tutorials show this being run on the UMAP dimensions, but it is best to run this either on the PCA (or harmony) reduction. The UMAP is generally generated based on the PCA (or harmony) dimensional reduction and is only 2 dimensions. PCA or harmony can be many dimensions (ex. 50) and is a better representation of your data.

First we need to pull the new harmony coordinates and cluster information from the `seurat` object. Make sure we use the same number of coordiantes we've used for other analysis


```{r}
pca_coords <- Embeddings(object = seurat_mac, reduction = "harmony")[ , 1:15]
clusters <- seurat_mac$seurat_clusters
```


We can then input these into slingshot


```{r run-slingshot}
mac_slingshot <- slingshot(data = pca_coords, clusterLabels = clusters)
```


Looking at this object, we can see that 4 lineages were found.


```{r}
mac_slingshot
```


So cluster 5, 1 and 0 start all lineages.


Pseudotime algorithims identify cells with similar transcriptional profiles to create a likely order of cell relationships. Unfortunately, there is no way to do this with a definite direction. For example, the curves above could either start or end at cluster 0, we don't know.

One nice thing about slingshot is that you can include known biology to improve the ability to identify correct lineages that go in the correct direction. For example, without any input, slingshot defined cluster 0 as the first cluster, but we know that the cells at the latest timepoint are in cluster 0 while the earliest is in cluster 2. We can repeat with cluster 3 as the starting cluster

We can then input these into slingshot


```{r}
mac_slingshot_2 <- slingshot(data = pca_coords, clusterLabels = clusters,
                             start.clus = 3)
```


We still find 3 lineages, but this time, all lineages start with cluster 3.


```{r}
print(mac_slingshot_2)
```


We can extract pseudotime values using `slingPseudotime` and add this to the metadata. This is the way that you would do it if you were running your own analysis...


```{r, eval = FALSE}
################################## DO NOT RUN ################################## 
pseudotime <- data.frame(slingPseudotime(mac_slingshot_2))
seurat_mac <- AddMetaData(seurat_mac, metadata = pseudotime)
```


... but to keep our analysis consistent, you will add in my values here


```{r}
seurat_meta_data <- readRDS(file.path(base_dir, "data", "meta_data.rda"))
seurat_mac <- AddMetaData(seurat_mac, metadata = seurat_meta_data)
```


And now we can visualize each curve


```{r curve1}
FeaturePlot(seurat_mac, features = "curve1", cols = viridis::magma(20))
```


```{r curve2}
FeaturePlot(seurat_mac, features = "curve2", cols = viridis::magma(20))
```

```{r curve3}
FeaturePlot(seurat_mac, features = "curve3", cols = viridis::magma(20))
```


We can now visualize the psuedotime across our timepoints. Below you can see that the cells nicely transition between timepoints along pseudotime.


```{r ridge-plot}
meta_data <- seurat_mac[[]]
meta_data$Sample <- factor(meta_data$Sample,
                           levels =c("NOD_4w", "NOD_8w", "NOD_15w"))

# Subest to only cells with value for curve 1
meta_data <- meta_data %>%
  dplyr::filter(!is.na(curve1))

density_plot <- ggplot2::ggplot(data = meta_data,
                                ggplot2::aes(x = curve1,
                                             y = Sample,
                                             fill = Sample)) +
  ggridges::geom_density_ridges() +
  ggplot2::scale_fill_manual(values = sample_colors)

density_plot
```


As discussed before, this could be completley due to batch effect, but we can plot each batch separetly to see if there is much difference in terms of pseudotime. As you can see, the batches look nearly identical indicating that the changes we see are likely due to the timecourse.


```{r ridge-plot-batch}
meta_data <- seurat_mac[[]]
meta_data$orig.ident <- factor(meta_data$orig.ident,
                           levels =c("NOD_4w_2734", "NOD_4w_2849",
                                     "NOD_8w_2734", "NOD_8w_2849",
                                     "NOD_15w_2734"))

# Subest to only cells with value for curve 1
meta_data <- meta_data %>%
  dplyr::filter(!is.na(curve1))

density_plot <- ggplot2::ggplot(data = meta_data,
                                ggplot2::aes(x = curve1,
                                             y = orig.ident,
                                             fill = orig.ident)) +
  ggridges::geom_density_ridges() +
  ggplot2::scale_fill_manual(values = colors)

density_plot
```


### Genes that correlate with pseudotime
We can also identify genes with expression patterns that correlate with pseudotime. 

For each gene, we will fit a general additive model (GAM) using a negative binomial noise distribution to model the (potentially nonlinear) relationship between gene expression and pseudotime. We will then test for significant associations between expression and pseudotime using the `associationTest` This requires the package `tradeSeq`

This function takes a long time to run (about 30 minutes) so I have already run it for you. The steps are below. I only ran it on the top 2000 variable genes as these are the most likely genes to also correlate with pseudotime.


```{r, eval = FALSE}
################################## DO NOT RUN ################################## 
variable_features_pattern <- paste0("^", VariableFeatures(seurat_mac), "$")
genes_use <- grep(paste0(variable_features_pattern, collapse="|"),
                  rownames(seurat_mac))
# fit negative binomial GAM
mac_sce <- fitGAM(counts = GetAssayData(object = seurat_mac,
                                                slot = "counts"),
                          sds = mac_slingshot_2,
                          genes = genes_use)

saveRDS(mac_sce, file.path(base_dir, "data", "mac_sce.rda"))
```


We can load in this saved object and run an association test to identify what genes correlate best with pseudotime. The output of this assocaition test is a data frame that includes p values and wald statistics for each lineage.


```{r}
mac_sce <- readRDS(file.path(base_dir, "data", "mac_sce.rda"))
pseudotime_genes <- associationTest(mac_sce, lineages = TRUE)
head(pseudotime_genes)
```


Let's first pull out the top genes associated with the first curve.


```{r}
# We care about lineage one, so we use pvalue_1 to rank the genes.
topgenes <- rownames(pseudotime_genes[order(pseudotime_genes$pvalue_1), ])[1:100]
```


We can now plot these genes in a heatmap to show the expression over time.


```{r heatmap, fig.height = 15, fig.width=10}
# Get the information for curve 1 so we can find what cells to keep
cell_info <- seurat_mac[["curve1"]]

cell_info <- cell_info %>%
  dplyr::filter(!is.na(curve1))

# Get the data for all cells
heatdata <- GetAssayData(object = seurat_mac, slot = "data")

# Subset to only genes and cells we want
heatdata <- heatdata[rownames(heatdata) %in% topgenes,
                     colnames(heatdata) %in% rownames(cell_info)]

# Order the data based on the pseudotime ordering of the cells
heatdata <- heatdata[ , order(cell_info$curve1)]

## Color the clusters and samples ##
# pull out the sample information and make cell order the same as the heatmap
# data
sample_info <- seurat_mac[[c("Sample", "seurat_clusters")]]
sample_info <- sample_info[colnames(heatdata) , ]

# Set colors
cluster_colors <- RColorBrewer::brewer.pal(8, "Set1")
names(cluster_colors) <- c(0:7)

samples <- unique(seurat_mac$Sample)
sample_colors <- LaCroixColoR::lacroix_palette("Coconut", 3)
names(sample_colors) <- samples

# We make a list of the colors, make sure the names match the sample info we
# created
color_list <- list(Sample = sample_colors, seurat_clusters = cluster_colors)

## Prepare heatmap values ##
# Scale the heatmap values
heatmap_scale <- t(scale(t(as.matrix(heatdata)), scale = TRUE))

# Colors for heatmap (from the ArchR package)
blueYellow <- c("#352A86", "#343DAE", "#0262E0", "#1389D2", "#2DB7A3",
                "#A5BE6A", "#F8BA43", "#F6DA23", "#F8FA0D")

# Add cutoffs for visualization. I actually stole this line of code from the
# Seurat heatmap functions. Without it you can only see some genes
heatmap_scale <- ifelse(heatmap_scale > 2.5, 2.5, heatmap_scale)
heatmap_scale <- ifelse(heatmap_scale < -2.5, -2.5, heatmap_scale)

# Make the heatmap
pheatmap(heatmap_scale, cluster_rows = TRUE,
         cluster_cols = FALSE,
         show_rownames = TRUE,
         show_colnames = FALSE,
         annotation_col = sample_info,
         annotation_colors = color_list, color = blueYellow,
         border_color = NA, clustering_method = "complete")

```


We can also visualize the expression of certain genes across pseudotime. I am going to write a function for this because we will do it several times. This is a quick function that could be easily improved to also be able to color by a continuous variable --> feel free to take this and modify as you please.


```{r function}
plot_pseudotime <- function(seurat_object, pseudotime_name,
                            gene_name, col_by = "seurat_clusters",
                            colors = NULL){
  plot_info <- FetchData(seurat_object, c(pseudotime_name, gene_name,
                                          col_by))
  
  colnames(plot_info) <- c("pseudotime", "gene", "color")
  
  # Set colors if not set already
  if(is.null(colors)){
    colors <- RColorBrewer::brewer.pal(length(unique(plot_info$color)))
    names(colors) <- unique(plot_info$color)
  }
  
  plot <- ggplot2::ggplot(plot_info, ggplot2::aes(x = pseudotime,
                                                  y = gene,
                                                  color = color)) +
    ggplot2::geom_point() +
    ggplot2::scale_color_manual(values = colors, name = col_by) + 
    ggplot2::ylab(paste0(gene_name, " log expression")) +
    ggplot2::geom_smooth(se = FALSE, color = "black")
  
  return(plot)
}
```


Let's run this with a few genes

```{r pseudotime-ptma}
plot_pseudotime(seurat_mac, pseudotime_name = "curve1", gene_name = "Ptma",
                col_by = "seurat_clusters", color = cluster_colors)
```


```{r pseudotime-pmepa1}
plot_pseudotime(seurat_mac, pseudotime_name = "curve1", gene_name = "Pmepa1",
                col_by = "seurat_clusters", color = cluster_colors)
```


```{r pseudotime-cd72}
plot_pseudotime(seurat_mac, pseudotime_name = "curve1", gene_name = "Cd72",
                col_by = "seurat_clusters", color = cluster_colors)
```


```{r pseudotime-ptpn1}
plot_pseudotime(seurat_mac, pseudotime_name = "curve1", gene_name = "Ptpn1",
                col_by = "seurat_clusters", color = cluster_colors)
```


```{r pseudotime-slamf8}
plot_pseudotime(seurat_mac, pseudotime_name = "curve1", gene_name = "Slamf8",
                col_by = "seurat_clusters", color = cluster_colors)
```


With this function, we can also see expression by sample


```{r pseudotime-cluster}
plot_pseudotime(seurat_mac, pseudotime_name = "curve1", gene_name = "Slamf8",
                col_by = "Sample", color = sample_colors)
```


But maybe you are more interested in groups of genes that change. We can find clusters of genes that change using the `tradeSeq` package as well. If you could not install clusterExperiment, don't run this secton (it's our last section)


```{r gene-groups}
library(clusterExperiment)
topgenes <- rownames(pseudotime_genes[order(pseudotime_genes$pvalue_1), ])[1:250]

# This clusters genes based on their expression across pseudotime
cluster_patterns <- clusterExpressionPatterns(mac_sce, nPoints = 20,
                                              genes = topgenes)

# We can pull out the labels and map genes to labels using the yhatScaled output
cluster_labels <- primaryCluster(cluster_patterns$rsec)

names(cluster_labels) <- rownames(cluster_patterns$yhatScaled)

# The -1 means the gene was not clustered
cluster_labels <- cluster_labels[cluster_labels != -1]

# Now let's make lists of genes based on the clusters

cluster_genes <- lapply(unique(cluster_labels), function(x){
  names(cluster_labels[cluster_labels == x])
})

# We can now make module scores with all of these lists
seurat_mac <- AddModuleScore(seurat_mac, features = cluster_genes)
```


We now have module scores for each gene list


```{r}
print(head(seurat_mac[[]]))
```


We can plot these the same way we plotted the genes


```{r cluster1}
plot_pseudotime(seurat_mac, pseudotime_name = "curve1", gene_name = "Cluster1",
                col_by = "seurat_clusters", color = cluster_colors)
```


```{r cluster-2}
plot_pseudotime(seurat_mac, pseudotime_name = "curve1", gene_name = "Cluster2",
                col_by = "seurat_clusters", color = cluster_colors)
```


```{r cluster-3}
plot_pseudotime(seurat_mac, pseudotime_name = "curve1", gene_name = "Cluster10",
                col_by = "seurat_clusters", color = cluster_colors)
```


```{r cluster-10-sample}
plot_pseudotime(seurat_mac, pseudotime_name = "curve1", gene_name = "Cluster10",
                col_by = "Sample", color = sample_colors)
```