# Somatic mutation analysis in single cells { #SNV_analysis } In the following sections, we will show two example data sets to illustrating: - SMART-seq2 for SNV and mtSNV analysis - 10x Genomic for CNV analysis (and possibly mtSNV analysis if lucky) ## SNV analysis Here, we will use a SMART-seq2 dataset, the joxm dataset (77 QC cells) generated from [McCarthy et al, 2020](https://www.nature.com/articles/s41592-020-0766-3)), as an example of SNV analsyis in single cells (Fig. 4.3 below for this donor). The processed data sets are directly available from its according [cardelino](https://github.com/single-cell-genetics/cardelino) R package. We also prepared the raw data (in `.bam` format) and scripts to generate them in [Chapter 6.2](https://statbiomed.github.io/SingleCell-Workshop-2021/preprocess.html#preprocess_mutations) for you to practice the preprocessing steps with tools [cellsnp-lit](https://cellsnp-lite.readthedocs.io) and [MQuad](https://github.com/single-cell-genetics/MQuad). ### Call somatic variants and genotype single cells Nuclear genome is large and scRNA-seq is generally noisy, hence making it extremely difficult to call somatic SNVs from scRNA-seq data directly. In this dataset, probably in a general situation, we could detect somatic SNVs using bulk whole genome sequencing with tumor and matched normal samples first, e.g., with the commonly used [MuTect](https://software.broadinstitute.org/cancer/cga/mutect). In this joxm dataset, we have called ~650 SNVs from bulk whole exome-seq. Given this list of called somatic variants, we can genotype each cell with genotyping tool. Here, we used our [cellsnp-lit](https://cellsnp-lite.readthedocs.io) (with high recommendation). Finally, we obtained 112 SNVs are expressed in at least one of the 77 cells. The output file in VCF is available in the cardelino package [cellSNP.cells.vcf.gz](https://github.com/single-cell-genetics/cardelino/blob/lite-depends/inst/extdata/cellSNP.cells.vcf.gz) (12KB). ### Visualize variat allele fequency Now, let's load the data and visualise the allele frequency here: ```{r load-pkg} library(ggplot2) library(cardelino) ``` ```{r read-vcf-data} vcf_file <- system.file("extdata", "cellSNP.cells.vcf.gz", package = "cardelino") input_data <- load_cellSNP_vcf(vcf_file) ``` ```{r allele-frequency} AF <- as.matrix(input_data$A / input_data$D) p = pheatmap::pheatmap(AF, cluster_rows=FALSE, cluster_cols=FALSE, show_rownames = TRUE, show_colnames = TRUE, labels_row='77 cells', labels_col='112 SNVs', angle_col=0, angle_row=0) ``` As expected, the majority of entries are missing (in grey) due to the high sparsity in scRNA-seq data. For the same reason, even for the non-missing entries, the estimate of allele frequency is of high uncertainty. Therefore, it is crucial to probabilistic clustering with accounting the uncertainty, ideally with guide clonal tree structure from external data. ### Use extra information on clonal tree As mentioned, the variant calling is based on bulk WES, which can also be used to infer a clonal tree, even though it is of high uncertainty by only using bulk data. Here, we obtain a maximum-a-posteriori (MAP) tree from [Canopy](https://github.com/yuchaojiang/Canopy). The clonal tree inferred by Canopy for this donor consists of three clones, including a "base" clone ("clone1") that has no clonal somatic mutations present. ```{r read-canopy-data} canopy_res <- readRDS(system.file("extdata", "canopy_results.coveraged.rds", package = "cardelino")) plot_tree(canopy_res$tree, orient = "v") ``` ### Run cardelino for inferring clonal structure Now, we can leverage the inferred clonal tree as an guidance to cluster cells based on their mutation profiles. Before starting, be careful to ensure that the same variant IDs are used in both data sources. ```{r correct-variant-ids} Config <- canopy_res$tree$Z rownames(Config) <- gsub(":", "_", rownames(Config)) ``` The run `clone_id` the main function in `cardelino` package: ```{r cell-assign} set.seed(7) assignments <- clone_id(input_data$A, input_data$D, Config = Config, min_iter = 800, max_iter = 1200) names(assignments) ``` We can visualize the cell-clone assignment probabilities as a heatmap. ```{r prob-heatmap} prob_heatmap(assignments$prob) ``` We recommend assigning a cell to the highest-probability clone if the highest posterior probability is greater than 0.5 and leaving cells "unassigned" if they do not reach this threshold. The `assign_cells_to_clones` function conveniently assigns cells to clones based on a threshold and returns a data.frame with the cell-clone assignments. ```{r assign-cell-clone-easy} df <- assign_cells_to_clones(assignments$prob) head(df) table(df$clone) ``` Also, Cardelino will update the guide clonal tree Config matrix (as a prior) and return a posterior estimate. In the figure below, negative value means the probability of a certain variant presents in a certain clone is reduced in posterior compared to prior (i.e., the input Config). Vice verse for the positive values. ```{r Config-update} heat_matrix(t(assignments$Config_prob - Config)) + scale_fill_gradient2() + ggtitle('Changes of clonal Config') + labs(x='Clones', y='112 SNVs') + theme(axis.text.y = element_blank(), legend.position = "right") ``` Finally, we can visualize the results cell assignment and updated mutations clonal configuration at the raw allele frequency matrix: ```{r results-plot} AF <- as.matrix(input_data$A / input_data$D) cardelino::vc_heatmap(AF, assignments$prob, Config, show_legend=TRUE) ``` ## mtSNV analysis Recently, mitochondrial variants have also been shown powerful makeups for lineage tracing [Ludwig et al, 2019](https://doi.org/10.1016/j.cell.2019.01.022). The much higher copies of mtDNA not only give much higher sequencing coverage, but generally highly mutation rate (probably also some other biological reasons). ### Visualise allele frequency Here, we illustrate the mtDNA on the joxm dataset again. During the preprocssing (see [Chapter 7.2](https://statbiomed.github.io/SingleCell-Workshop-2021/preprocess.html#preprocess_mutations)), we first genotype the whole mito genome directly with `cellsnp-lit` and then call the clonal informed variants by `MQuad`. Let's first load the processed dataset from `cardelino` package and visualize the allele frequency. ```{r read-mtx-files} AD_file <- system.file("extdata", "passed_ad.mtx", package = "cardelino") DP_file <- system.file("extdata", "passed_dp.mtx", package = "cardelino") id_file <- system.file("extdata", "passed_variant_names.txt", package = "cardelino") AD <- Matrix::readMM(AD_file) DP <- Matrix::readMM(DP_file) var_ids <- read.table(id_file, ) rownames(AD) <- rownames(DP) <- var_ids[, 1] colnames(AD) <- colnames(DP) <- paste0('Cell', seq(ncol(DP))) ``` ```{r show-pheatmap} AF_mt <- as.matrix(AD / DP) pheatmap::pheatmap(AF_mt) ``` As expected, the coverage of mtDNA is a lot higher than the nuclear genome above (much fewer missing values). ### Infer clonal structure with mtDNA variants Now, we can run cardelino on the mitochondrial variations. Note, as there is no prior clonal tree, the model is easier to return a local optima. Generally, we recommend running it multiple time (with different random seed or initialization) and pick the one with highest DIC. ```{r denovo-mtDNA} set.seed(7) assign_mtClones <- clone_id(AD, DP, Config=NULL, n_clone = 3, keep_base_clone=FALSE) ``` Then visualize allele frequency along with the clustering of cells and variants: ```{r vc_heatmap-mtDNA} Config_mt <- assign_mtClones$Config_prob Config_mt[Config_mt >= 0.5] = 1 Config_mt[Config_mt < 0.5] = 0 cardelino::vc_heatmap(AF_mt, assign_mtClones$prob, Config_mt, show_legend=TRUE) ``` ## Last notes As shown, `cardelino` allows us to cluster cells based on somatic mutations with accounting for the high uncertainty caused by technical variability. Last, we'd like to highlight a few potential drawbacks of `cardelino`: 1. It is computationally expensive due to the nature of Gibbs sampling that scarifies speed for accuracy. For large data sets, e.g., >1000 cells, cardelino may be too slow to handle. If speed is a major concertain, please consider our more efficient tool [Vireo](https://vireosnp.readthedocs.io/en/latest/vireoSNP_clones.html) as a sister tool. 2. It may not always return a converged results. You could consider running it longer or with more chains or initialization. 3. It is generally difficult to select the best number of clone. You may get some clue by trial and error or visualise it from hierarchical clustering first.