Basic sceptre tutorial

Updated March 2022

This vignette illustrates how to apply sceptre to an example high-multiplicity-of-infection (MOI) single-cell CRISPR screen dataset. We begin by installing and loading all required packages, including the latest version of sceptre from GitHub.

install.packages("devtools")
devtools::install_github("katsevich-lab/sceptre")
install.packages("tidyverse")
install.packages("cowplot")
library(tidyverse)
library(cowplot)
library(Matrix)
library(sceptre)

Step 1: Prepare the data

The first step is to prepare the data to pass to sceptre. We must prepare three separate data objects: the gene-by-cell expression matrix, the gRNA-by-cell expression matrix, and the cell-specific matrix of covariates.

Gene and gRNA expression matrices

We load the example gene-by-cell and gRNA-by-cell expression matrices that are included in the sceptre package.

data(gene_matrix)
data(gRNA_matrix)

Briefly, gene_matrix (respectively, gRNA_matrix) is a 20 x 106,660 (respectively, 50 x 106,660) matrix of gene (respectively, gRNA) unique molecular identifier (UMI) counts. The data are taken from the paper “A genome-wide framework for mapping gene regulation via cellular genetic screens” by Gasperini et al., 2019. The authors used a CRISPRi-based assay to target putative enhancers in a population of K562 cells. Genes, gRNAs, and cells are downsampled to reduce the size of the data. One can use the commands ?gene_matrix and ?gRNA_matrix to read more about these data.

The row names of gene_matrix and gRNA_matrix are the gene IDs and gRNA IDs, respectively. The column names, meanwhile, are the cell barcodes. In general the gRNA IDs must be strings that uniquely identify each gRNA (for example, the gRNA sequence itself); in this case the gRNA IDs are gRNA-specific oligonucleotide barcodes.

gene_matrix and gRNA_matrix are sparse matrices (as implemented by the Matrix package); in general these matrices can be either sparse matrices or standard (dense) R matrices. We print a few rows and columns of gene_matrix and gRNA_matrix to get a sense of what the data look like.
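Before going further, it can help to confirm that the two matrices are labeled consistently. The sketch below uses small toy matrices rather than the sceptre example data, and check_inputs is a hypothetical helper, not a sceptre function. It assumes that both matrices should list the same cell barcodes in the same column order, as the printed examples in this vignette suggest.

```r
# Toy stand-ins for the two count matrices (NOT the sceptre example data).
cells <- c("cell_1", "cell_2", "cell_3", "cell_4")
toy_gene_matrix <- matrix(c(9, 0, 1, 8, 5, 2, 3, 0, 1, 0, 2, 4), nrow = 3,
                          dimnames = list(c("gene_A", "gene_B", "gene_C"), cells))
toy_gRNA_matrix <- matrix(c(0, 7, 0, 0, 0, 5, 6, 0), nrow = 2,
                          dimnames = list(c("gRNA_1", "gRNA_2"), cells))

# Hypothetical helper: stops with an error unless the row IDs are unique
# and both matrices have identical column names (same cells, same order).
check_inputs <- function(gene_mat, gRNA_mat) {
  stopifnot(
    !anyDuplicated(rownames(gene_mat)),
    !anyDuplicated(rownames(gRNA_mat)),
    identical(colnames(gene_mat), colnames(gRNA_mat))
  )
  invisible(TRUE)
}

check_inputs(toy_gene_matrix, toy_gRNA_matrix)
```

The same checks apply unchanged to sparse matrices, since rownames and colnames work on both dense and sparse inputs.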

options(width = 300)
# gene_matrix; rows are genes, columns are cells
gene_matrix[1:10, 17:19]
#> 10 x 3 sparse Matrix of class "dgTMatrix"
#>                 TCTGAGAGTGACGCCT-1_1B_3 TGGACGCGTAAAGTCA-1_1B_2 GTTAAGCGTTAAAGTG-1_2A_8
#> ENSG00000171530                       9                       8                       3
#> ENSG00000116750                       .                       5                       .
#> ENSG00000061794                       1                       2                       1
#> ENSG00000178127                       2                       5                       8
#> ENSG00000136521                       4                       4                       5
#> ENSG00000065183                       .                       2                       .
#> ENSG00000074211                       .                       .                       .
#> ENSG00000119714                       1                       .                       .
#> ENSG00000182256                       .                       .                       .
#> ENSG00000185222                       1                       2                       .

# gRNA matrix; rows are gRNAs, columns are cells
gRNA_matrix[1:10, 17:19]
#> 10 x 3 sparse Matrix of class "dgTMatrix"
#>                      TCTGAGAGTGACGCCT-1_1B_3 TGGACGCGTAAAGTCA-1_1B_2 GTTAAGCGTTAAAGTG-1_2A_8
#> GAATCGGGTGGGATTCCCAG                       .                       .                       .
#> ACAGAAAGTGAGATAGCAGG                       .                       .                       .
#> CCTGCCATTGGGTCACCATG                       .                       .                       .
#> ACTTCCTCATGGTGACCCAA                       .                       .                       .
#> TTATAGGAGGAGGATGCAGG                       .                       .                       .
#> CCAGGCACTTGTGAGAACAA                       .                       .                       .
#> TTACTGCGTGACCCTAGAGA                       .                       .                       .
#> TTATCTGCACCACAACCGTG                       .                       .                       .
#> GCAGCTGCACAGGTTCTCCG                       .                       .                       .
#> GGAAGAACCCAGAAACAGAG                       .                       .                       .

We also plot a histogram of the counts of an arbitrarily selected gene (“ENSG00000171530”) and gRNA (“GAATCGGGTGGGATTCCCAG”) to visualize the data.

example_gene <- gene_matrix["ENSG00000171530",]
example_gRNA <- gRNA_matrix["GAATCGGGTGGGATTCCCAG",]

hist_gene <- ggplot(data = example_gene %>% tibble(count = .) %>% filter(count >= 1, count <= 40), mapping = aes(x = count)) +
  geom_histogram(binwidth = 3, col = "black", fill = "dodgerblue3", alpha = 0.7) +
  scale_y_continuous(trans='log10', expand = c(0, NA)) + xlab("Gene expressions") + ylab("") + theme_bw(base_size = 10)

hist_gRNA <- ggplot(data = example_gRNA %>% tibble(count = .) %>% filter(count >= 1, count <= 40), mapping = aes(x = count)) +
  geom_histogram(binwidth = 3, col = "black", fill = "orchid4", alpha = 0.7) +
  scale_y_continuous(trans='log10', expand = c(0, NA)) + xlab("gRNA expressions") + ylab("") + theme_bw(base_size = 10)

plot_grid(hist_gene, hist_gRNA, labels = c("a", "b"))

As expected, the data are highly discrete counts. Note that we do not normalize either the gene or gRNA expression matrices, opting instead to work directly with the raw counts.

Cell-wise covariate matrix

Next, we load the cell-wise covariate matrix, covariate_matrix.

data(covariate_matrix)
head(covariate_matrix)
#>                         lg_gRNA_lib_size lg_gene_lib_size     p_mito        batch
#> CCCAATCTCACCTTAT-1_2B_1         6.276643         9.485089 0.03493317 prep_batch_2
#> CCACTACAGATCCGAG-1_2A_6         6.746412        10.083515 0.05295248 prep_batch_2
#> CCGGGATCATTAGCCA-1_2B_5         7.203406        10.195224 0.04732756 prep_batch_2
#> TACTTACGTGTGGCTC-1_2A_3         5.783825         9.631285 0.04816905 prep_batch_2
#> GACCTGGCAAAGGAAG-1_2A_1         7.012115         9.973760 0.03327275 prep_batch_2
#> ACGCCAGCAATGGATA-1_1A_7         6.410175        10.336698 0.02616053 prep_batch_1

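covariate_matrix is a 106,660 x 4 data frame of “technical factors,” or covariates. The row names of this data frame are the cell barcodes. The covariates are as follows:

lg_gene_lib_size: log-transformed gene library size (the total number of gene UMIs in the cell)
lg_gRNA_lib_size: log-transformed gRNA library size (the total number of gRNA UMIs in the cell)
batch: the batch in which the cell was prepared
p_mito: the fraction of the cell's gene UMIs that map to mitochondrial genes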

We strongly recommend that users include the same four covariates (i.e., lg_gene_lib_size, lg_gRNA_lib_size, batch, and p_mito) in their own cell-wise covariate matrix.
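For users assembling their own covariate matrix, the following is a minimal sketch of how these four columns might be derived from raw counts. It rests on several assumptions that this vignette does not state: that library size means the per-cell UMI total, that the log is the natural log, and that p_mito is the mitochondrial share of gene UMIs. The toy counts, the is_mito annotation, and the batch labels are all made up for illustration.

```r
# Toy raw UMI counts (NOT the sceptre example data); rows are features, columns are cells.
cells <- c("cell_1", "cell_2", "cell_3")
toy_gene_counts <- matrix(c(90, 5, 5, 140, 50, 10, 30, 15, 5), nrow = 3,
                          dimnames = list(c("gene_A", "gene_B", "gene_C"), cells))
toy_gRNA_counts <- matrix(c(7, 0, 0, 12, 3, 3), nrow = 2,
                          dimnames = list(c("gRNA_1", "gRNA_2"), cells))

# Hypothetical annotations: which genes are mitochondrial, and each cell's batch.
is_mito <- c(FALSE, FALSE, TRUE)
batch <- c("prep_batch_1", "prep_batch_1", "prep_batch_2")

# Assumption: library size = total UMIs per cell; natural log is used here.
gene_lib_size <- colSums(toy_gene_counts)
toy_covariate_matrix <- data.frame(
  lg_gRNA_lib_size = log(colSums(toy_gRNA_counts)),
  lg_gene_lib_size = log(gene_lib_size),
  p_mito = colSums(toy_gene_counts[is_mito, , drop = FALSE]) / gene_lib_size,
  batch = factor(batch),
  row.names = cells
)
toy_covariate_matrix
```

Note that a cell with zero gRNA or gene UMIs would produce -Inf or NaN under this sketch, so such cells would need to be handled before computing the covariates.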

Step 2: Create the “perturbation matrix”

The second step is to convert the gRNA count matrix into a “perturbation matrix.” This step consists of (i) imputing perturbation assignments onto the cells by thresholding the gRNA counts and (ii) combining gRNAs that target the same chromosomal site into a single “combined” gRNA.
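To make the two sub-steps concrete, here is a rough base R sketch on toy data. It is not sceptre's implementation: the cutoff toy_threshold, the use of >= for the comparison, and the rule that a site counts as perturbed when any of its gRNAs passes the cutoff are all assumptions made for illustration.

```r
# Toy gRNA-by-cell UMI counts (NOT the sceptre example data).
toy_gRNA_matrix <- matrix(
  c(0, 6, 0, 0,
    1, 0, 0, 0,
    0, 0, 9, 0,
    0, 0, 2, 4),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("gRNA_1", "gRNA_2", "gRNA_3", "gRNA_4"),
                  c("cell_1", "cell_2", "cell_3", "cell_4"))
)
# Toy table mapping each gRNA to the site it targets.
toy_groups <- data.frame(
  gRNA_id = c("gRNA_1", "gRNA_2", "gRNA_3", "gRNA_4"),
  gRNA_group = c("site_A", "site_A", "site_B", "site_B")
)
toy_threshold <- 3  # hypothetical cutoff chosen for this sketch

# (i) Threshold: call a gRNA present in a cell when its count reaches the cutoff.
assigned <- toy_gRNA_matrix >= toy_threshold

# (ii) Combine: look up each gRNA's group, sum the indicators within each group
# with rowsum(), and mark a site as perturbed when at least one gRNA is present.
group_labels <- toy_groups$gRNA_group[match(rownames(assigned), toy_groups$gRNA_id)]
toy_perturbation_matrix <- (rowsum(assigned + 0L, group = group_labels) > 0) + 0L
toy_perturbation_matrix
```

In this toy example, site_A is marked as perturbed only in cell_2, and site_B in cell_3 and cell_4; the count of 2 for gRNA_4 in cell_3 falls below the cutoff and does not contribute.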

We begin by loading the gRNA_groups_table data frame, which contains additional information about each individual gRNA.

data("gRNA_groups_table")
head(gRNA_groups_table, 10)
#> # A tibble: 10 × 2
#>    gRNA_id              gRNA_group        
#>    <chr>                <chr>             
#>  1 GAATCGGGTGGGATTCCCAG chr11.3768_top_two
#>  2 ACAGAAAGTGAGATAGCAGG chr11.3768_top_two
#>  3 CCTGCCATTGGGTCACCATG chr12.3152_top_two
#>  4 ACTTCCTCATGGTGACCCAA chr12.3152_top_two
#>  5 TTATAGGAGGAGGATGCAGG chr14.1937_top_two
#>  6 CCAGGCACTTGTGAGAACAA chr14.1937_top_two
#>  7 TTACTGCGTGACCCTAGAGA chr14.2168_top_two
#>  8 TTATCTGCACCACAACCGTG chr14.2168_top_two
#>  9 GCAGCTGCACAGGTTCTCCG chr15.149_top_two 
#> 10 GGAAGAACCCAGAAACAGAG chr15.149_top_two
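
Next, we pass gRNA_matrix and gRNA_groups_table to create_perturbation_matrix and print a few entries of the resulting perturbation matrix.

perturbation_matrix <- create_perturbation_matrix(gRNA_matrix = gRNA_matrix,
                                                  gRNA_groups_table = gRNA_groups_table)
perturbation_matrix[1:5,1:3]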
#> 5 x 3 sparse Matrix of class "dgCMatrix"
#>                    CCCAATCTCACCTTAT-1_2B_1 CCACTACAGATCCGAG-1_2A_6 CCGGGATCATTAGCCA-1_2B_5
#> chr11.3768_top_two                       .                       .                       .
#> chr12.3152_top_two                       .                       .                       .
#> chr14.1937_top_two                       .                       .                       .
#> chr14.2168_top_two                       .                       .                       .
#> chr15.149_top_two                        1                       .                       .