# Somatic mutation analysis in single cells { #SNV_analysis }


In the following sections, we will show two example data sets to illustrating:

- SMART-seq2 for SNV and mtSNV analysis
- 10x Genomic for CNV analysis (and possibly mtSNV analysis if lucky)


## SNV analysis
Here, we will use a SMART-seq2 dataset, the joxm dataset (77 QC cells) generated
from [McCarthy et al, 2020](https://www.nature.com/articles/s41592-020-0766-3)),
as an example of SNV analsyis in single cells (Fig. 4.3 below for this donor).
The processed data sets are directly available from its according
[cardelino](https://github.com/single-cell-genetics/cardelino) R package. We 
also prepared the raw data (in `.bam` format) and scripts to generate them in 
[Chapter 6.2](https://statbiomed.github.io/SingleCell-Workshop-2021/preprocess.html#preprocess_mutations) for you to practice the preprocessing steps with 
tools [cellsnp-lit](https://cellsnp-lite.readthedocs.io) and 
[MQuad](https://github.com/single-cell-genetics/MQuad).


<!-- ![Figure 4.3. SNVs analysis with Joxm dataset. Fig adapted from McCarthy et al, 2020](images/cellGen-fig4.png) -->

### Call somatic variants and genotype single cells
Nuclear genome is large and scRNA-seq is generally noisy, hence making it 
extremely difficult to call somatic SNVs from scRNA-seq data directly. In this
dataset, probably in a general situation, we could detect somatic SNVs using 
bulk whole genome sequencing with tumor and matched normal samples first, e.g.,
with the commonly used 
[MuTect](https://software.broadinstitute.org/cancer/cga/mutect).

In this joxm dataset, we have called ~650 SNVs from bulk whole exome-seq. Given
this list of called somatic variants, we can genotype each cell with genotyping
tool. Here, we used our [cellsnp-lit](https://cellsnp-lite.readthedocs.io) (with
high recommendation).

Finally, we obtained 112 SNVs are expressed in at least one of the 77 cells. The
output file in VCF is available in the cardelino package 
[cellSNP.cells.vcf.gz](https://github.com/single-cell-genetics/cardelino/blob/lite-depends/inst/extdata/cellSNP.cells.vcf.gz) (12KB).


### Visualize variat allele fequency
Now, let's load the data and visualise the allele frequency here:
```{r load-pkg}
library(ggplot2)
library(cardelino)
```

```{r read-vcf-data}
vcf_file <- system.file("extdata", "cellSNP.cells.vcf.gz", package = "cardelino")
input_data <- load_cellSNP_vcf(vcf_file)
```

```{r allele-frequency}
AF <- as.matrix(input_data$A / input_data$D)

p = pheatmap::pheatmap(AF, cluster_rows=FALSE, cluster_cols=FALSE,
                   show_rownames = TRUE, show_colnames = TRUE,
                   labels_row='77 cells', labels_col='112 SNVs',
                   angle_col=0, angle_row=0)
```
<!-- ![Figure 4.4. Allele frequency of somatic SNVs in Joxm dataset](images/cellGen-fig5.png){ width=75% } -->

As expected, the
majority of entries are missing (in grey) due to the high sparsity in scRNA-seq 
data. For the same reason, even for the non-missing entries, the estimate of 
allele frequency is of high uncertainty. Therefore, it is crucial to 
probabilistic clustering with accounting the uncertainty, ideally with guide 
clonal tree structure from external data. 

### Use extra information on clonal tree
As mentioned, the variant calling is based on bulk WES, which can also be used
to infer a clonal tree, even though it is of high uncertainty by only using bulk
data. Here, we obtain a maximum-a-posteriori (MAP) tree from 
[Canopy](https://github.com/yuchaojiang/Canopy).
The clonal tree inferred by Canopy for this donor consists of three clones, 
including a "base" clone ("clone1") that has no clonal somatic mutations 
present.

```{r read-canopy-data}
canopy_res <- readRDS(system.file("extdata", "canopy_results.coveraged.rds", 
                                  package = "cardelino"))
plot_tree(canopy_res$tree, orient = "v")
```
<!-- ![Figure 4.5. Clonal tree inferred from bulk WES data](images/cellGen-fig6.png){ width=40% } -->

### Run cardelino for inferring clonal structure
Now, we can leverage the inferred clonal tree as an guidance to cluster cells
based on their mutation profiles.

Before starting, be careful to ensure that the same variant IDs are used in both 
data sources.

```{r correct-variant-ids}
Config <- canopy_res$tree$Z
rownames(Config) <- gsub(":", "_", rownames(Config))
```

The run `clone_id` the main function in `cardelino` package:

```{r cell-assign}
set.seed(7)
assignments <- clone_id(input_data$A, input_data$D, Config = Config,
                        min_iter = 800, max_iter = 1200)
names(assignments)
```

We can visualize the cell-clone assignment probabilities as a heatmap.

```{r prob-heatmap}
prob_heatmap(assignments$prob)
```
<!-- ![Figure 4.6. Assignment probability of cells to clones](images/cellGen-fig7.png){ width=40% } -->

We recommend assigning a cell to the highest-probability clone if the highest 
posterior probability is greater than 0.5 and leaving cells "unassigned" if they
do not reach this threshold. The `assign_cells_to_clones` function conveniently
assigns cells to clones based on a threshold and returns a data.frame with the
cell-clone assignments.

```{r assign-cell-clone-easy}
df <- assign_cells_to_clones(assignments$prob)
head(df)
table(df$clone)
```


Also, Cardelino will update the guide clonal tree Config matrix (as a prior)
and return a posterior estimate. In the figure below, negative value means the
probability of a certain variant presents in a certain clone is reduced in 
posterior compared to prior (i.e., the input Config). Vice verse for the 
positive values.
```{r Config-update}
heat_matrix(t(assignments$Config_prob - Config)) + 
  scale_fill_gradient2() +
  ggtitle('Changes of clonal Config') + 
  labs(x='Clones', y='112 SNVs') +
  theme(axis.text.y = element_blank(), legend.position = "right")
```
<!-- ![Figure 4.7. Difference between updated and input clonal configuration](images/cellGen-fig8.png){ width=40% } -->

Finally, we can visualize the results cell assignment and updated mutations 
clonal configuration at the raw allele frequency matrix:

```{r results-plot}
AF <- as.matrix(input_data$A / input_data$D)
cardelino::vc_heatmap(AF, assignments$prob, Config, show_legend=TRUE)
```
<!-- ![Figure 4.8. Allele frequency of clustered cells and variants](images/cellGen-fig9.png){ width=75% } -->

## mtSNV analysis
Recently, mitochondrial variants have also been shown powerful makeups for 
lineage tracing [Ludwig et al, 2019](https://doi.org/10.1016/j.cell.2019.01.022).
The much higher copies of mtDNA not only give much higher sequencing coverage,
but generally highly mutation rate (probably also some other biological reasons).

<!-- ![Figure 4.9. Mitochondrial variants for lineage tracing. Fig. adapted from Ludwig et al, 2019](images/cellGen-fig10.png) -->

### Visualise allele frequency
Here, we illustrate the mtDNA on the joxm dataset again. During the 
preprocssing 
(see [Chapter 7.2](https://statbiomed.github.io/SingleCell-Workshop-2021/preprocess.html#preprocess_mutations)),
we first genotype the whole mito genome directly with `cellsnp-lit` and then
call the clonal informed variants by `MQuad`.

Let's first load the processed dataset from `cardelino` package and visualize 
the allele frequency.

```{r read-mtx-files}
AD_file <- system.file("extdata", "passed_ad.mtx", package = "cardelino")
DP_file <- system.file("extdata", "passed_dp.mtx", package = "cardelino")
id_file <- system.file("extdata", "passed_variant_names.txt", package = "cardelino")

AD <- Matrix::readMM(AD_file)
DP <- Matrix::readMM(DP_file)
var_ids <- read.table(id_file, )
rownames(AD) <- rownames(DP) <- var_ids[, 1]
colnames(AD) <- colnames(DP) <- paste0('Cell', seq(ncol(DP)))
```

```{r show-pheatmap}
AF_mt <- as.matrix(AD / DP)
pheatmap::pheatmap(AF_mt)
```

<!-- ![Figure 4.10. Allele frequency of mtDNA variants](images/cellGen-fig11.png){ width=75% } -->

As expected, the coverage of mtDNA is a lot higher than the nuclear genome above
(much fewer missing values).

### Infer clonal structure with mtDNA variants
Now, we can run cardelino on the mitochondrial variations. Note, as there is 
no prior clonal tree, the model is easier to return a local optima. Generally,
we recommend running it multiple time (with different random seed or 
initialization) and pick the one with highest DIC.

```{r denovo-mtDNA}
set.seed(7)
assign_mtClones <- clone_id(AD, DP, Config=NULL, n_clone = 3, keep_base_clone=FALSE)
```

Then visualize allele frequency along with the clustering of cells and variants:
```{r vc_heatmap-mtDNA}
Config_mt <- assign_mtClones$Config_prob
Config_mt[Config_mt >= 0.5] = 1
Config_mt[Config_mt <  0.5] = 0
cardelino::vc_heatmap(AF_mt, assign_mtClones$prob, Config_mt, show_legend=TRUE)
```

<!-- ![Figure 4.11. Allele frequency of mtSNVs with clustered cells](images/cellGen-fig12.png){ width=75% } -->

## Last notes
As shown, `cardelino` allows us to cluster cells based on somatic mutations with
accounting for the high uncertainty caused by technical variability. Last, we'd
like to highlight a few potential drawbacks of `cardelino`:

1. It is computationally expensive due to the nature of Gibbs sampling that 
scarifies speed for accuracy. For large data sets, e.g., >1000 cells, cardelino
may be too slow to handle. If speed is a major concertain, please consider our
more efficient tool 
[Vireo](https://vireosnp.readthedocs.io/en/latest/vireoSNP_clones.html) as a 
sister tool.

2. It may not always return a converged results. You could consider running it 
longer or with more chains or initialization.

3. It is generally difficult to select the best number of clone. You may get 
some clue by trial and error or visualise it from hierarchical clustering first.