---
title: "Project-2: SCEPTRE"
site: workflowr::wflow_site
output:
  workflowr::wflow_html:
    toc: true
editor_options:
  chunk_output_type: console
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## 1. Understand SCEPTRE

### a. Read the paper

[Barry *et al.* 2021: SCEPTRE](https://chenh19.github.io/rotation2/refs/Barry_2021.pdf)

**Some key ideas:**  

- A key confounder for perturb-seq is the read depth: ```total gRNA per cell``` seems to show a non-linear increase as ```total UMI per cell``` increases.

![](https://raw.githubusercontent.com/chenh19/rotation2/main/output/SCEPTRE/dispersion.png)  

- To overcome this, Barry *et al.* adopted two main strategies:
  - Include depth as a covariate when doing negative binomial (NB) regression for gRNA and expression.
  - Generate an empirical null distribution by randomization test.

Pipeline:

  - Negative binomial regression to get z-values for each gRNA-Expression pairs.
  - Logistic regression to get gRNA detection probabilities for each gRNA in each cell.
  - Conditional randomization test to get the empirical null distribution (when doing randomization, use the gRNA detection probabilities as the conditions).
  - Fit a skew-t distribution to the null distribution and then calculate p-values from z-values.

### b. Check the package

- [Katsevich-Lab/sceptre](https://github.com/Katsevich-Lab/sceptre)
- [Katsevich-Lab/sceptre-manuscript](https://github.com/Katsevich-Lab/sceptre-manuscript)

### c. Install the package

**Code:** [```sceptre.sh```](https://raw.githubusercontent.com/chenh19/rotation2/main/code/0_prepare/sceptre.sh)  

### d. QQ plot function

**Code:** [```qqunif.plot.R```](https://raw.githubusercontent.com/chenh19/rotation2/main/code/0_prepare/qqunif.plot.R)  
**Ref:** [Code Sample: Generating QQ Plots in R](https://genome.sph.umich.edu/wiki/Code_Sample:_Generating_QQ_Plots_in_R)  


## 2. Manual regression

**Aim:** To better understand SCEPTRE, I'll first perform naive negative binomial regression with and without total UMI as a covariate. This is mainly to check whether the negative controls have inflated p-values.

### a. Naive NB regression

- No UMI covariate.
- No empirical null distribution.

**Code:** [```naive_NB.R```](https://raw.githubusercontent.com/chenh19/rotation2/main/code/6_naive_NB/naive_NB.R)  

![](https://raw.githubusercontent.com/chenh19/rotation2/main/output/SCEPTRE/naive_NB_qq.png)  

### b. NB regression w/ total UMI cov

- Total UMI counts for each cell in gene_matrix was used as a covariate.

**Code:** [```NB_with_covariate.R```](https://raw.githubusercontent.com/chenh19/rotation2/main/code/7_total_UMI_covariate/NB_with_covariate.R)  

![](https://raw.githubusercontent.com/chenh19/rotation2/main/output/SCEPTRE/NB_with_covarite_qq.png)  

### c. Compare

**Data:** [p_value.zip](https://raw.githubusercontent.com/chenh19/rotation2/main/output/SCEPTRE/p_value.zip)

#### QQ plot

**Code:** [```negative_control_qq.R```](https://raw.githubusercontent.com/chenh19/rotation2/main/code/7_total_UMI_covariate/negative_control_qq.R)  

**Observation:**  

- After introducing total UMI as a covariate, the inflation for negative controls was significantly reduced.
- However, inflation is still observed. SCEPTRE's empirical null distribution for p-value calculation is expected to further reduce the inflation.

![](https://raw.githubusercontent.com/chenh19/rotation2/main/output/SCEPTRE/negative_controls_qq.png)  

#### Smallest p-values

**Code:** [```head_pvalue.R```](https://raw.githubusercontent.com/chenh19/rotation2/main/code/7_total_UMI_covariate/head_pvalue.R)  

**Observation:**  

- For negative controls, the p-values increases after introducing the covariate in NB regression (less significant).
- For positive controls and candidates, the p_values decreases after introducing the covariate in NB regression (more significant).

```
[1] "The smallest p-values in [negative_control_pvalues.csv] :"
     Gene   gRNA      p_value
1   TIMP1  NTC-2 1.893138e-14
2   DAAM1 NTC-10 5.380489e-12
3 ANAPC11 NTC-12 5.970695e-10
4    OST4  NTC-2 7.803250e-10
5  NDUFS5 NTC-12 2.727219e-09
6   MYDGF NTC-12 5.158344e-09

[1] "The smallest p-values in [negative_control_pvalues_with_covariate.csv] :"
     Gene   gRNA      p_value
1   DAAM1 NTC-10 6.169608e-08
2   HSPA8  NTC-3 2.177909e-07
3    TPT1  NTC-3 1.876122e-06
4   TAF10 NTC-10 4.655719e-06
5 ZFAND2A  NTC-3 4.809740e-06
6  VKORC1  NTC-3 7.248893e-06
```

```
[1] "The smallest p-values in [positive_control_pvalues.csv] :"
   Gene    gRNA      p_value
1  CD46  CD46-2 1.404819e-32
2 HSPA8 HSPA8-1 3.658421e-15
3  CD52  CD52-1 5.918623e-07
4 HSPA8 HSPA8-2 7.758467e-02
5   NMU   NMU-1 1.077502e-01
6  PPIA  PPIA-1 2.090641e-01

[1] "The smallest p-values in [positive_control_pvalues_with_covariate.csv] :"
   Gene    gRNA      p_value
1 HSPA8 HSPA8-1 7.940576e-42
2  CD46  CD46-2 4.646680e-36
3 HSPA8 HSPA8-2 3.540640e-19
4  CD52  CD52-1 2.594172e-08
5   NMU   NMU-1 5.694585e-02
6  PPIA  PPIA-1 9.576647e-01
```

```
[1] "The smallest p-values in [candidate_pvalues.csv] :"
    Gene     gRNA      p_value
1   CR1L SNP-20-2 6.805935e-33
2  PTPRC SNP-14-2 1.010292e-24
3   CTU2 SNP-35-1 2.420409e-20
4   ANK1 SNP-61-1 9.305671e-17
5 KDELR2 SNP-85-1 1.871454e-15
6   PAXX SNP-29-2 6.924731e-14

[1] "The smallest p-values in [candidate_pvalues_with_covariate.csv] :"
    Gene     gRNA      p_value
1   CR1L SNP-20-2 3.905370e-43
2   ANK1 SNP-61-1 2.579855e-31
3  PTPRC SNP-14-2 1.613814e-30
4  GLRX5 SNP-58-2 2.276511e-17
5 PDLIM1 SNP-77-2 1.042396e-12
6  NUDT4 SNP-62-2 1.123020e-12
```


## 3. SCEPTRE regression

- **SCEPTRE [tutorial](https://katsevich-lab.github.io/sceptre/articles/using_sceptre_v2.html)**
- **SCEPTRE [functions](https://katsevich-lab.github.io/sceptre/reference/index.html)**

```{r,eval=FALSE}
#install.packages("devtools")
#devtools::install_github("katsevich-lab/sceptre")
#install.packages("tidyverse")
#install.packages("cowplot")

# SCEPTRE example
library(tidyverse)
library(cowplot)
library(Matrix)
library(sceptre)

data(gene_matrix)
data(gRNA_matrix)
data(covariate_matrix)
perturbation_matrix <- threshold_gRNA_matrix(gRNA_matrix)
data("gRNA_groups_table")
combined_perturbation_matrix <- combine_perturbations(perturbation_matrix = perturbation_matrix,
                                                      gRNA_groups_table = gRNA_groups_table)

data(gene_gRNA_group_pairs)
side <- "left"
result <- run_sceptre_high_moi(gene_matrix = gene_matrix,
                               combined_perturbation_matrix = combined_perturbation_matrix,
                               covariate_matrix = covariate_matrix,
                               gene_gRNA_group_pairs = gene_gRNA_group_pairs,
                               side = side)
```

```{r,eval=FAlSE}
# read gene and gRNA matrices
Expression <- readRDS("../Morris_2021/filter/Expression.rds", refhook = NULL)
gRNA <- readRDS("../Morris_2021/filter/gRNA.rds", refhook = NULL)
gene_matrix2 <- as(Expression@assays$RNA@counts, "dgTMatrix")
gRNA_matrix2 <- as(gRNA@assays$RNA@counts, "dgTMatrix")
rm("Expression","gRNA")

# generate covariate matrix
covariate_matrix2 <- as.data.frame(colnames(gene_matrix2))
rownames(covariate_matrix2) <- colnames(gene_matrix2)
covariate_matrix2$lg_gRNA_lib_size <- 10
covariate_matrix2$lg_gene_lib_size <- 10
covariate_matrix2$p_mito <- 0.001
covariate_matrix2$batch <- "prep_batch_1"
covariate_matrix2 <- covariate_matrix2[c("lg_gRNA_lib_size","lg_gene_lib_size",
                                         "p_mito","batch")]

# subset the expression matrix by gene
genes <- lapply(paste0("./output/SCEPTRE/gene-gRNA_pairs/",
                list.files(path = "./output/SCEPTRE/gene-gRNA_pairs/", pattern='*_pairs.csv')),
                read.csv)
genes <- do.call(rbind.data.frame, genes)[,"Gene"]
Sys.sleep(3)
gene_matrix2 <- gene_matrix2[(rownames(gene_matrix2) %in% genes), ]
rm("genes")

# generate perturbation matrix
perturbation_matrix2 <- threshold_gRNA_matrix(gRNA_matrix2)

# gRNA group table
gRNA_groups_table2 <- as_tibble(read.csv("./output/SCEPTRE/gene-gRNA_pairs/gRNA_groups_table2.csv", header = TRUE))









# combine perturbation
combined_perturbation_matrix2 <- combine_perturbations(perturbation_matrix = perturbation_matrix2,
                                                      gRNA_groups_table = gRNA_groups_table2)











# gene gRNA group pairs
gene_gRNA_group_pairs2 <- as.data.frame(read.csv("./output/SCEPTRE/gene-gRNA_pairs/gene_gRNA_group_pairs2.csv", header = TRUE)[,1:3])

gene_gRNA_group_pairs2$check <- paste0(gene_gRNA_group_pairs2$gene_id, gene_gRNA_group_pairs2$gRNA_group, gene_gRNA_group_pairs2$pair_type)
gene_gRNA_group_pairs2 <- gene_gRNA_group_pairs2[!duplicated(gene_gRNA_group_pairs2[,"check"]),]

genes <- rownames(gene_matrix2)

rownames(gene_gRNA_group_pairs2) <- 
gene_gRNA_group_pairs2 <- gene_gRNA_group_pairs2[(gene_gRNA_group_pairs2$gene_id %in% rownames(gene_matrix2)), ]


gene_gRNA_group_pairs2 <- as_tibble(gene_gRNA_group_pairs2)[,1:3]

result2 <- run_sceptre_high_moi(gene_matrix = gene_matrix2,
                               combined_perturbation_matrix = combined_perturbation_matrix2,
                               covariate_matrix = covariate_matrix2,
                               gene_gRNA_group_pairs = gene_gRNA_group_pairs2,
                               side = side)


```


```{r,eval=FALSE}
write.table(perturbation_matrix2, file="./analysis/cache/perturbation_matrix2.csv", sep=",", row.names = TRUE)
write.table(gRNA_groups_table2, file="./analysis/cache/gRNA_groups_table2.csv", sep=",", row.names = FALSE)

gene_names2 <- rownames(gene_matrix2)
gRNA_names2 <- rownames(gRNA_matrix2)

#write.table(gene_names2, file="./analysis/cache/gene_names.csv", sep=",", row.names = FALSE)
#write.table(gRNA_names2, file="./analysis/cache/gRNA_names.csv", sep=",", row.names = FALSE)

Expression <- Seurat::ReadMtx(mtx = "../Morris_2021/GSM5225857_cDNA.mtx",
                                features = "../Morris_2021/GSM5225857_cDNA.features.txt",
                                cells = "../Morris_2021/GSM5225857_cDNA.barcodes.txt",
                                cell.column=1,
                                feature.column=1,
                                mtx.transpose=T)
gene_name_ori <- rownames(as(Expression, "dgTMatrix"))
rm("Expression")
```
