This function is the core function of the sceptre package. The function applies SCEPTRE to test for association between a set of gRNA groups and genes while controlling for technical confounders. The function returns a p-value for each pairwise test of association.

run_sceptre_high_moi(
  gene_matrix,
  combined_perturbation_matrix,
  covariate_matrix,
  gene_gRNA_group_pairs,
  side = "both",
  storage_dir = tempdir(),
  regularization_amount = 0.1,
  B = 1000,
  full_output = FALSE,
  parallel = TRUE,
  seed = 4
)

Arguments

gene_matrix

a gene-by-cell expression matrix; the rows (i.e., gene IDs) and columns (i.e., cell barcodes) should be named

combined_perturbation_matrix

a binary matrix of perturbations (i.e., gRNA group-to-cell assignments); the rows (i.e., gRNA groups) and columns (i.e., cell barcodes) should be named.

covariate_matrix

the cell-specific matrix of technical factors, ideally containing the following covariates: log-transformed gene library size (numeric), log-transformed gRNA library size (numeric), percent mitochondrial reads (numeric), and batch (factor). The rows (i.e., cell barcodes) should be named

gene_gRNA_group_pairs

a data frame specifying the gene-gRNA group pairs to test for association; the data frame should contain columns named gene_id and gRNA_group.

side

sidedness of the test; one of "both," "left," and "right"

storage_dir

directory in which to store the intermediate computations

regularization_amount

non-negative number specifying the amount of regularization to apply to the negative binomial dispersion parameter estimates

B

number of resamples to draw for the conditional randomization test

full_output

return the full output (TRUE) or a simplified, reduced output (FALSE; default)?

parallel

parallelize execution?

seed

seed to the random number generator

Value

the gene_gRNA_group_pairs data frame with new columns p_value and z_value appended. See "details" for a description of the output when full_output is set to TRUE.

Details

Details are arranged from most to least important.

  • gene_matrix should be a raw (i.e., un-normalized) matrix of UMI (unique molecular identifier) counts.

  • combined_perturbation_matrix should be a "combined perturbation matrix", which can be obtained by applying the functions threshold_gRNA_matrix and combine_perturbations (in that order) to a raw gRNA count matrix. combined_perturbation_matrix optionally can be a raw gRNA expression matrix or an uncombined perturbation matrix, in which case each gRNA is treated as its own group of one. See the tutorial for more details.

  • The gene IDs (respectively, gRNA groups) within gene_gRNA_group_pairs must be a subset of the row names of gene_matrix (respectively, combined_perturbation_matrix).

  • The side parameter controls the sidedness of the test. The arguments "left" and "right" are appropriate when testing for a decrease and increase in gene expression, respectively. The default argument -- "both" -- is appropriate when testing for an increase or decrease in gene expression.

  • The default value of regularization_amount is 0.1, meaning that a small amount of regularization is applied to the estimated negative binomial size parameters, which helps protect against overfitting. When the number of genes is < 50, however, the default value of regularization_amount is set to 0 (i.e., no regularization), as regularization is known to be ineffective when there are few genes.

  • When full_output is set to TRUE (as opposed to FALSE, the default), the output is a data frame with the following columns: gene_id, gRNA_id, p_value, skew_t_fit_success (if TRUE, p-value based on tail probability of fitted skew-t distribution returned; if FALSE, empirical p-value returned), xi, omega, alpha, nu (fitted parameters of the skew-t distribution; NA if fit failed), z_value (z-value obtained on "ground truth" data), and z_null_1, ..., z_null_B (z-values obtained from resampled datasets).

Examples

if (FALSE) {
library(dplyr)
library(magrittr)
# 1. load the data
data(gene_matrix) # i. gene expression matrix
data(gRNA_matrix) # ii. gRNA expression matrix
data(covariate_matrix) # iii. covariate matrix
data(gRNA_groups_table) # iv. gRNAs grouped by target site
data(gene_gRNA_group_pairs) # v. gene-gRNA group pairs to analyze

# 2. threshold and combine gRNA matrix
combined_perturbation_matrix <- threshold_gRNA_matrix(gRNA_matrix) %>%
combine_perturbations(gRNA_groups_table)

# 3. select the gene-gRNA group pairs to analyze
set.seed(4)
gene_gRNA_group_pairs <- gene_gRNA_group_pairs %>% sample_n(25)
# 3. run method (takes ~40s on an 8-core Macbook Pro)
result <- run_sceptre_high_moi(gene_matrix = gene_matrix,
combined_perturbation_matrix = combined_perturbation_matrix,
covariate_matrix = covariate_matrix,
gene_gRNA_group_pairs = gene_gRNA_group_pairs,
side = "left")
}