sceptre tutorial
Updated March 2022
This vignette illustrates application of sceptre to an
example high-multiplicity-of-infection (MOI) single-cell CRISPR screen
dataset. We begin by installing and loading all required packages,
including the latest version of sceptre.
install.packages("devtools")
devtools::install_github("katsevich-lab/sceptre")
install.packages("tidyverse")
install.packages("cowplot")
library(tidyverse)
library(cowplot)
library(Matrix)
library(sceptre)
The first step is to prepare the data to pass to
sceptre. We must prepare three separate data objects: the
gene-by-cell expression matrix, the gRNA-by-cell expression matrix, and
the cell-specific matrix of covariates.
We load the example gene-by-cell and gRNA-by-cell expression matrices
that are included in the sceptre package.
data(gene_matrix)
data(gRNA_matrix)
Briefly, gene_matrix (respectively,
gRNA_matrix) is a 20 x 106,660 (respectively, 50 x 106,660)
matrix of gene (respectively, gRNA) unique molecular identifier (UMI)
counts. The data are taken from the paper
“A genome-wide framework for mapping gene regulation via cellular
genetic screens” by Gasperini et al., 2019. The authors used a
CRISPRi-based assay to target putative enhancers in a population of K562
cells. Genes, gRNAs, and cells are downsampled to reduce the size of the
data. One can use the commands ?gene_matrix and
?gRNA_matrix to read more about these data.
The row names of gene_matrix and
gRNA_matrix are the gene IDs and gRNA IDs, respectively.
The column names, meanwhile, are the cell barcodes. In general, the gRNA
IDs must be strings that uniquely identify each gRNA (for example, the
gRNA sequence itself); in this case the gRNA IDs are gRNA-specific
oligonucleotide barcodes.
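Before passing our own data to sceptre, it can help to check the naming requirements described above. The following is a minimal sketch on toy matrices; all gene IDs, gRNA IDs, and cell barcodes are invented. The check that both matrices list the same cells in the same order is our own assumption about a sensible sanity check, not a documented sceptre requirement.

```r
library(Matrix)

# Toy stand-ins for gene_matrix and gRNA_matrix (3 cells); all IDs are invented.
cells <- c("cell_1", "cell_2", "cell_3")
toy_gene_matrix <- Matrix(c(9, 0, 1,
                            8, 5, 2),
                          nrow = 2, byrow = TRUE, sparse = TRUE,
                          dimnames = list(c("gene_A", "gene_B"), cells))
toy_gRNA_matrix <- Matrix(c(0, 3, 0,
                            0, 0, 7),
                          nrow = 2, byrow = TRUE, sparse = TRUE,
                          dimnames = list(c("ACGT", "TTGA"), cells))

# gRNA IDs (row names) must uniquely identify each gRNA.
gRNA_ids_unique <- anyDuplicated(rownames(toy_gRNA_matrix)) == 0

# Assumed sanity check: both matrices describe the same cells, in the same order.
cells_match <- identical(colnames(toy_gene_matrix), colnames(toy_gRNA_matrix))

stopifnot(gRNA_ids_unique, cells_match)
```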
gene_matrix and gRNA_matrix are sparse
matrices (as implemented by the Matrix package); in general
these matrices can be either sparse matrices or standard (dense) R
matrices. We print a few rows and columns of gene_matrix
and gRNA_matrix to get a sense of what the data look
like.
options(width = 300)
# gene_matrix; rows are genes, columns are cells
gene_matrix[1:10, 17:19]
#> 10 x 3 sparse Matrix of class "dgTMatrix"
#> TCTGAGAGTGACGCCT-1_1B_3 TGGACGCGTAAAGTCA-1_1B_2 GTTAAGCGTTAAAGTG-1_2A_8
#> ENSG00000171530 9 8 3
#> ENSG00000116750 . 5 .
#> ENSG00000061794 1 2 1
#> ENSG00000178127 2 5 8
#> ENSG00000136521 4 4 5
#> ENSG00000065183 . 2 .
#> ENSG00000074211 . . .
#> ENSG00000119714 1 . .
#> ENSG00000182256 . . .
#> ENSG00000185222 1 2 .
# gRNA matrix; rows are gRNAs, columns are cells
gRNA_matrix[1:10, 17:19]
#> 10 x 3 sparse Matrix of class "dgTMatrix"
#> TCTGAGAGTGACGCCT-1_1B_3 TGGACGCGTAAAGTCA-1_1B_2 GTTAAGCGTTAAAGTG-1_2A_8
#> GAATCGGGTGGGATTCCCAG . . .
#> ACAGAAAGTGAGATAGCAGG . . .
#> CCTGCCATTGGGTCACCATG . . .
#> ACTTCCTCATGGTGACCCAA . . .
#> TTATAGGAGGAGGATGCAGG . . .
#> CCAGGCACTTGTGAGAACAA . . .
#> TTACTGCGTGACCCTAGAGA . . .
#> TTATCTGCACCACAACCGTG . . .
#> GCAGCTGCACAGGTTCTCCG . . .
#> GGAAGAACCCAGAAACAGAG . . .
We also plot histograms of the counts of an arbitrarily selected gene (“ENSG00000171530”) and gRNA (“GAATCGGGTGGGATTCCCAG”) to visualize the data.
example_gene <- gene_matrix["ENSG00000171530",]
example_gRNA <- gRNA_matrix["GAATCGGGTGGGATTCCCAG",]
hist_gene <- ggplot(data = example_gene %>% tibble(count = .) %>% filter(count >= 1, count <= 40), mapping = aes(x = count)) +
  geom_histogram(binwidth = 3, col = "black", fill = "dodgerblue3", alpha = 0.7) +
  scale_y_continuous(trans = "log10", expand = c(0, NA)) + xlab("Gene expressions") + ylab("") + theme_bw(base_size = 10)
hist_gRNA <- ggplot(data = example_gRNA %>% tibble(count = .) %>% filter(count >= 1, count <= 40), mapping = aes(x = count)) +
  geom_histogram(binwidth = 3, col = "black", fill = "orchid4", alpha = 0.7) +
  scale_y_continuous(trans = "log10", expand = c(0, NA)) + xlab("gRNA expressions") + ylab("") + theme_bw(base_size = 10)
plot_grid(hist_gene, hist_gRNA, labels = c("a", "b"))
As expected, the data are highly discrete counts. Note that we do not normalize either the gene or gRNA expression matrices, opting instead to work directly with the raw counts.
Next, we load the cell-wise covariate matrix,
covariate_matrix.
data(covariate_matrix)
head(covariate_matrix)
#> lg_gRNA_lib_size lg_gene_lib_size p_mito batch
#> CCCAATCTCACCTTAT-1_2B_1 6.276643 9.485089 0.03493317 prep_batch_2
#> CCACTACAGATCCGAG-1_2A_6 6.746412 10.083515 0.05295248 prep_batch_2
#> CCGGGATCATTAGCCA-1_2B_5 7.203406 10.195224 0.04732756 prep_batch_2
#> TACTTACGTGTGGCTC-1_2A_3 5.783825 9.631285 0.04816905 prep_batch_2
#> GACCTGGCAAAGGAAG-1_2A_1 7.012115 9.973760 0.03327275 prep_batch_2
#> ACGCCAGCAATGGATA-1_1A_7 6.410175 10.336698 0.02616053 prep_batch_1
covariate_matrix is a 106,660 x 4 data frame of
“technical factors,” or covariates. The row names of this data frame are
the cell barcodes. The covariates are as follows:
- Log-transformed gene library size (lg_gene_lib_size); this can be computed via log(colSums(gene_matrix))
- Log-transformed gRNA library size (lg_gRNA_lib_size); this can be computed via log(colSums(gRNA_matrix))
- Sequencing batch (batch)
- Fraction of gene transcripts that map to mitochondrial genes (p_mito)
We strongly recommend that users include the same four covariates
(i.e., lg_gene_lib_size, lg_gRNA_lib_size,
batch, and p_mito) in their own cell-wise
covariate matrix.
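To make the recipe concrete, here is a minimal sketch that assembles such a covariate data frame from toy count matrices. The gene names, cell barcodes, and batch labels are invented, and identifying mitochondrial genes by an "MT-" name prefix is an assumption made for illustration only; with Ensembl IDs, as in gene_matrix, one would need a separate list of mitochondrial genes.

```r
library(Matrix)

# Toy count matrices; gene names, barcodes, and batch labels are invented.
cells <- c("cell_1", "cell_2", "cell_3")
toy_gene_matrix <- Matrix(c(9, 8, 3,
                            0, 5, 0,
                            1, 2, 1),
                          nrow = 3, byrow = TRUE, sparse = TRUE,
                          dimnames = list(c("gene_A", "gene_B", "MT-gene_C"), cells))
toy_gRNA_matrix <- Matrix(c(4, 0, 2,
                            0, 6, 1),
                          nrow = 2, byrow = TRUE, sparse = TRUE,
                          dimnames = list(c("gRNA_1", "gRNA_2"), cells))

# Assumption: mitochondrial genes are flagged by an "MT-" prefix in the gene name.
is_mito <- startsWith(rownames(toy_gene_matrix), "MT-")

# One row per cell, with the four covariates described above.
toy_covariate_matrix <- data.frame(
  lg_gRNA_lib_size = log(colSums(toy_gRNA_matrix)),
  lg_gene_lib_size = log(colSums(toy_gene_matrix)),
  p_mito = colSums(toy_gene_matrix[is_mito, , drop = FALSE]) / colSums(toy_gene_matrix),
  batch = factor(c("prep_batch_1", "prep_batch_1", "prep_batch_2")),
  row.names = cells
)
```

Note that a cell with a library size of zero would yield a log-transformed value of -Inf, so such cells would need to be filtered out beforehand.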
The second step is to convert the gRNA count matrix into a “perturbation matrix.” This step consists of (i) imputing perturbation assignments onto the cells by thresholding the gRNA counts and (ii) combining gRNAs that target the same chromosomal site into a single “combined” gRNA.
We begin by loading the gRNA_groups_table data frame,
which contains additional information about each individual gRNA.
data("gRNA_groups_table")
head(gRNA_groups_table, 10)
#> # A tibble: 10 × 2
#> gRNA_id gRNA_group
#> <chr> <chr>
#> 1 GAATCGGGTGGGATTCCCAG chr11.3768_top_two
#> 2 ACAGAAAGTGAGATAGCAGG chr11.3768_top_two
#> 3 CCTGCCATTGGGTCACCATG chr12.3152_top_two
#> 4 ACTTCCTCATGGTGACCCAA chr12.3152_top_two
#> 5 TTATAGGAGGAGGATGCAGG chr14.1937_top_two
#> 6 CCAGGCACTTGTGAGAACAA chr14.1937_top_two
#> 7 TTACTGCGTGACCCTAGAGA chr14.2168_top_two
#> 8 TTATCTGCACCACAACCGTG chr14.2168_top_two
#> 9 GCAGCTGCACAGGTTCTCCG chr15.149_top_two
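#> 10 GGAAGAACCCAGAAACAGAG chr15.149_top_two
We then pass gRNA_matrix and gRNA_groups_table to create_perturbation_matrix, which returns the perturbation matrix.
perturbation_matrix <- create_perturbation_matrix(gRNA_matrix = gRNA_matrix,
                                                  gRNA_groups_table = gRNA_groups_table)
perturbation_matrix[1:5, 1:3]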
#> 5 x 3 sparse Matrix of class "dgCMatrix"
#> CCCAATCTCACCTTAT-1_2B_1 CCACTACAGATCCGAG-1_2A_6 CCGGGATCATTAGCCA-1_2B_5
#> chr11.3768_top_two . . .
#> chr12.3152_top_two . . .
#> chr14.1937_top_two . . .
#> chr14.2168_top_two . . .
#> chr15.149_top_two 1 . .
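To illustrate the two operations behind this step (thresholding the gRNA counts, then combining gRNAs that target the same site), here is a minimal base R sketch on toy data. It is not the implementation used by create_perturbation_matrix: the gRNA IDs, group names, and counts are invented, and the threshold of 3 UMIs is a hypothetical value chosen only for illustration.

```r
# Toy gRNA count matrix (gRNAs x cells); IDs and counts are invented.
toy_gRNA_matrix <- matrix(c(0, 7, 0, 1,
                            0, 0, 0, 9,
                            6, 0, 2, 0),
                          nrow = 3, byrow = TRUE,
                          dimnames = list(c("gRNA_1", "gRNA_2", "gRNA_3"),
                                          paste0("cell_", 1:4)))
toy_groups <- data.frame(gRNA_id = c("gRNA_1", "gRNA_2", "gRNA_3"),
                         gRNA_group = c("site_A", "site_A", "site_B"))

# Hypothetical threshold: a gRNA is assigned to a cell if it has >= 3 UMIs there.
threshold <- 3

# (i) Threshold the counts to obtain a binary gRNA-by-cell assignment matrix.
assigned <- (toy_gRNA_matrix >= threshold) * 1L

# (ii) Combine gRNAs within a group: a cell is perturbed at a site if at least
# one of the gRNAs targeting that site was assigned to it.
group_of <- toy_groups$gRNA_group[match(rownames(assigned), toy_groups$gRNA_id)]
group_sums <- rowsum(assigned, group = group_of)
toy_perturbation_matrix <- (group_sums > 0) * 1L
```

Here toy_perturbation_matrix has one row per gRNA group and one column per cell, mirroring the layout of the perturbation matrix printed above.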