---
title: "Fitting and extracting mutational signatures with sigfit"
author: "Kevin Gori and Adrian Baez-Ortega (2021)"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Fitting and extracting mutational signatures with sigfit}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
suppressPackageStartupMessages(library(sigfit))
par(mar = c(6, 4, 6, 4))
```


__sigfit__ is used to estimate signatures of mutational processes and their degree of activity in a collection of cancer (or normal) samples. Starting from a set of single-nucleotide variants (SNVs), it allows both estimation of the exposure of samples to predefined mutational signatures (including whether the signatures are present at all), and identification of signatures _de novo_ from the mutation counts. These two procedures are often called __signature fitting__ and __signature extraction__, respectively. In addition, sigfit implements a novel methodology that combines signature fitting and extraction in a single inferential process. Moreover, the signature analysis methods in sigfit can be seamlessly applied to mutational profiles beyond SNV data, including insertion/deletion (indel) or rearrangement count data, or even to other types of biological data, such as DNA methylation profiles. The package also provides a range of functions to generate publication-quality graphics of the corresponding mutational catalogues, signatures and exposures.

###

__Vignette contents__

[Installation](#installation)

[Example 1: Fitting mutational signatures to a single simulated sample](#example-1-fitting-mutational-signatures-to-a-single-simulated-sample)

* [Fitting mutational signatures](#fitting-mutational-signatures)

* [Retrieving signature exposures](#retrieving-signature-exposures)

* [Visualisation](#visualisation)

[Example 2: Extracting signatures from multiple breast cancer samples](#example-2-extracting-signatures-from-multiple-breast-cancer-samples)

* [Building mutational catalogues](#building-mutational-catalogues)

* [Extracting mutational signatures _de novo_](#extracting-mutational-signatures-de-novo)

* [Retrieving, plotting and re-fitting extracted signatures](#retrieving-plotting-and-re-fitting-extracted-signatures)

[Other functionalities](#other-functionalities)

* [Alternative signature models](#alternative-signature-models)

* [Alternative mutation type definitions](#alternative-mutation-type-definitions)

* [Using mutational opportunities and signature/exposure priors](#using-mutational-opportunities-and-signatureexposure-priors)

* [Converting between genome- and exome-relative representations of signatures](#converting-between-genome--and-exome-relative-representations-of-signatures)

* [Using Fit-Ext models to discover rare or novel signatures](#using-fit-ext-models-to-discover-rare-or-novel-signatures)

* [Signature extraction using multiple chains](#signature-extraction-using-multiple-chains)

* [Using the Dirichlet process prior to determine the number of signatures](#using-the-dirichlet-process-prior-to-determine-the-number-of-signatures)

* [List of available data sets](#list-of-available-data-sets)

###


## Installation

sigfit is an __R__ package. As it is in early development it is not yet on CRAN, but can be installed from GitHub using the [devtools](https://cran.r-project.org/web/packages/devtools/index.html) package.

```{r devtools_instructions, eval=FALSE}
devtools::install_github("kgori/sigfit", build_opts = c("--no-resave-data", "--no-manual"))
```

Solutions to some of the problems that may arise during installation are listed in the [README](https://github.com/kgori/sigfit/blob/master/README.md) file for the package.


## Example 1: Fitting mutational signatures to a single simulated sample

This example will use the mutational signatures from [COSMIC](http://cancer.sanger.ac.uk/cosmic/signatures) to generate simulated mutation counts, and then use sigfit to __fit signatures__ to these simulated data.

First of all, we need some mutational signatures to fit to our data. The line below loads the 30 mutational signatures published in COSMIC (v2).

```{r fetch}
library(sigfit)
data("cosmic_signatures_v2")
```

Let's use these signatures to __simulate__ some mutation data. This code will generate 20,000 mutations from a 4:3:2:1 mixture of signatures 1, 3, 7 and 11.

```{r sim}
set.seed(1)
probs <- c(0.4, 0.3, 0.2, 0.1) %*% as.matrix(cosmic_signatures_v2[c(1, 3, 7, 11), ])
mutations <- matrix(rmultinom(1, 20000, probs), nrow = 1)
colnames(mutations) <- colnames(cosmic_signatures_v2)
```

Here is what the __mutational spectrum__ of our simulated counts looks like.

```{r plotsim, fig.width=20, fig.height=7, out.width="100%", echo=-1}
par(mar = c(4.5,5.5,6.5,1))
plot_spectrum(mutations, name = "Simulated counts")
```

### Fitting mutational signatures

Next, we can use the `fit_signatures` function in sigfit to __fit__ the COSMIC signatures, i.e. to estimate the __exposure__ to each signature (pretending we ignore that it was generated from signatures 1, 3, 7 and 11). This function uses the [Stan](http://mc-stan.org/) framework to produces Markov chain Monte Carlo (MCMC) samples from a Bayesian model of signatures. 

Arguments to the `rstan::sampling` function, such as `iter`, `warmup`, `chains`, `seed`, etc., can be passed through the `fit_signatures` function.

* The arguments `warmup` and `iter` specify the number of __‘warm-up’__ (or ‘burn-in’) iterations and the total number of __sampling iterations__ (including warm-up iterations), respectively. We recommend that the value of `warmup` be between one-third and half the value of `iter`. The behaviour of the MCMC sampler (which ultimately affects the quality of the analysis) depends on parameters set during the warm-up, so it is important to run enough warm-up iterations. By default, `rstan::sampling` uses `iter = 2000` and `warmup = floor(iter/2)`; going below these values is generally not advised.

> __In general, one should run as many MCMC iterations (`iter`) as one’s computer and patience allow, with time and memory being the major constraints.__

* The `chains` argument can be used to parallelise the sampling process by using multiple __Markov chains__, the number of which should never exceed the number of processors (the `rstan::sampling` function uses `chains = 4` by default). To run multiple chains in parallel, the number of processors first needs to be set using `options(mc.cores = parallel::detectCores())`.

* The `seed` argument can be used to make the MCMC samples __reproducible__ over different runs.

* For further sampling options, type `?rstan::sampling` to read the documentation.

The `fit_signatures` function can be used to fit the COSMIC signatures to the simulated counts as follows.

```{r fitting, warning=FALSE}
mcmc_samples_fit <- fit_signatures(counts = mutations, 
                                   signatures = cosmic_signatures_v2,
                                   iter = 2000, 
                                   warmup = 1000, 
                                   chains = 1, 
                                   seed = 1756)
```

For details about further arguments of `fit_signatures`, including alternative signature models, mutational opportunities and exposure priors, see the _Other functionalities_ section at the end of this vignette, or type `?fit_signatures` to read the documentation.

### Retrieving signature exposures

Once we have the result of the MCMC sampling in `mcmc_samples_fit`, we can __retrieve__ the estimated exposures from it using the `retrieve_pars` function. This returns a named list with three matrices, one containing the __mean__ exposures, and the others containing the values corresponding to the lower and upper limits of the highest posterior density (__HPD__) interval for each signature exposure in each sample (HPD intervals are the Bayesian alternative to classical confidence intervals). The `prob` argument can be used to indicate the target probability content of the HPD interval; by default, 95% HPD intervals are returned.

Since we are fitting known signatures instead of extracting them _de novo_, the exposures will be automatically labelled with the signature names. If the signatures have no names, or if they have been extracted instead of fitted, they will be labelled as ‘Signature A’, ‘Signature B’, etc.

```{r retrieve_exp}
exposures <- retrieve_pars(mcmc_samples_fit, 
                           par = "exposures", 
                           hpd_prob = 0.90)
names(exposures)
exposures$mean
```

In this case, because there is only one sample, the exposures are returned as a numeric vector. For cases with multiple `samples`, `exposures$mean` is a matrix with one row per sample and one column per signature.

The entire __posterior distribution__ of the signature exposures and other model parameters in the `mcmc_samples_fit` object can be further explored by means of the functions provided by the [rstan](http://mc-stan.org/rstan) package. In addition, [ShinyStan](http://mc-stan.org/users/interfaces/shinystan) can be easily used in R for visual exploration of the MCMC samples.

### Visualisation

sigfit provides several easy-to-use __plotting functions__. As seen in the previous section, the `plot_spectrum` function allows visualisation of both __mutational catalogues__ and __mutational signatures__. These plots can be produced even for catalogues and signatures with arbitrary mutation types (e.g. indel or rearrangement signatures; see the _Other functionalities_ section below).

> __Note that the plotting functions in sigfit are designed with a preference for plotting directly to a PDF file__.

The __path__ to an output PDF file can be specified using the `pdf_path` argument (however, we will avoid this option here, so that the plots are shown in the vignette). When plotting to a PDF, each function will automatically define appropriate __graphical parameters__, such as plot sizes and margins. However, some parameters can be customised via additional arguments; for instance, custom plot sizes can be specified using the `pdf_width` and `pdf_height` arguments. For more details, see the documentation for each function (e.g. `?plot_spectrum`).

The `plot_exposures` function produces barplots of the estimated __signature exposures__ across the sample set; because there is only one sample in this case, a single barplot will be produced. The function needs to be supplied with either the object resulting from MCMC sampling (`mcmc_samples` argument) or the exposures themselves (`exposures` argument), the latter being either a matrix, or a list like the one returned by the `retrieve_pars` function (see above). In the present case, since we have the stanfit object generated by `fit_signatures`, we will make use of the `mcmc_samples` argument.

```{r plot_exp, fig.width=17, fig.height=7, out.width='100%', fig.align="center", echo=-1}
par(mar=c(8,5,3.5,0))
plot_exposures(mcmc_samples = mcmc_samples_fit)
```

The bars in this plot are coloured __blue__ if the estimated exposure value is ‘sufficiently non-zero’, and __grey__ otherwise. It is difficult for the model to make hard assignments of which signatures are present or absent due to the non-negative constraint on the estimate, which means that the range of values in the sample will not normally include zero. In practice, ‘sufficiently non-zero’ means that the lower end of the Bayesian HPD interval (see the previous section) is above a threshold value close to zero (by default 0.01, and adjustable via the `thresh` argument). In this example, sigfit has identified the 4 signatures used to construct the sample.

Next, we would recommend re-running `fit_signatures`, this time to fit only those signatures (i.e. those rows of the signatures matrix) which have been highlighted as ‘sufficiently non-zero’ in the plot above, in order to obtain more-accurate estimates of exposures. This is done as follows.

```{r refitting, eval=FALSE}
selected_signatures <- c(1, 3, 7, 11)
mcmc_samples_fit_2 <- fit_signatures(mutations, 
                                     cosmic_signatures_v2[selected_signatures, ],
                                     iter = 2000, 
                                     warmup = 1000, 
                                     chains = 1, 
                                     seed = 1756)
```

We can also examine how effectively the estimated signatures and/or exposures can reconstruct the original mutational catalogues, by producing __spectrum reconstruction__ plots with the `plot_reconstruction` function.

```{r reconstruct, fig.width=25, fig.height=18.5, out.width='100%', warning=FALSE, results="hide", echo=-1}
par(mar=c(5,6,6.5,1))
plot_reconstruction(mcmc_samples = mcmc_samples_fit)
```

Finally, the `plot_all` function provides a simpler way of __simultaneously__ calling the `plot_spectrum`, `plot_exposures` and `plot_reconstructions` functions for a single set of results. This function plots exclusively to PDF files, and the `out_path` argument is used to indicate the path of the directory where the files should be produced (if the directory does not exist, it will be automatically created). The `prefix` argument applies to the output file names, and can be used to distinguish different sets of plots from each other. An example is shown below.

```{r plot_all, eval=FALSE}
plot_all(mcmc_samples = mcmc_samples_fit, 
         out_path = "your/output/dir/here",
         prefix = "Fitting")
```


## Example 2: Extracting signatures from multiple breast cancer samples

### Building mutational catalogues

In this second example, we will use single-nucleotide variant (SNV) data from the set of 21 breast cancer samples originally analysed by [Nik-Zainal _et al._ (2012)](http://dx.doi.org/10.1016/j.cell.2012.04.024). These data can be accessed using the `data` function as follows.

```{r load_mutations}
library(sigfit)
data("variants_21breast")
head(variants_21breast)
```

This __variant table__ illustrates the structure of the SNV data that can be used as input for the package (unless you already have mutational catalogues for your samples). It is a matrix or data frame with one row per variant, and four (or five) columns:

* __Sample ID__ (character, e.g. `"Sample 1"`).

* __Reference base__ (character: `"A"`, `"C"`, `"G"`, or `"T"`).

* __Mutated base__ (character: `"A"`, `"C"`, `"G"`, or `"T"`).

* __Trinucleotide context__ of the variant (character; reference sequence between the positions immediately before (-1) and after (+1) the variant, e.g. `"TCA"`). This can be obtained from the reference genome that was used to call the variants, using an R package like [BSgenome](https://doi.org/doi:10.18129/B9.bioc.BSgenome) or [Biostrings](https://doi.org/doi:10.18129/B9.bioc.Biostrings); however, sequence context information is sometimes provided by variant callers within the ‘INFO’ field of the VCF file.

* __Optionally__: for variants in protein-coding regions, if information is available about the __transcriptional strand__ where each variant occurred, this can be incorporated as a fifth column taking character values `"1"` / `"U"` (for the untranscribed strand) or `"-1"` / `"T"` (for the transcribed strand). If the table is a data frame, integer values `1` and `-1` are also admitted. If this fifth column is present in the table, all the estimation and plotting functions will automatically incorporate transcriptional strand information. The interpretation of `1` and `-1` as the untranscribed and transcribed strands, respectively, allows direct use of the ‘Strand’ field that is normally provided by variant annotation tools, such as the [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) (Annotation tools use ‘Strand’ values of `1` and `-1` to indicate that the corresponding gene is located on the forward or reverse strand, respectively. Since variants are always called on the forward strand, this value informs about the transcriptional strand of variants.)

Importantly, since a variant can only have a single sample ID, variants which are found in __more than one sample__ need to be included multiple times in the table, using different sample IDs. The order in which the samples are found in this table is the order in which they will be displayed thereafter. In this case, the samples are already sorted alphabetically:

```{r show_samples}
unique(variants_21breast[, 1])
```

The first step is to transform these variants into __mutational catalogues__, which is done by the `build_catalogues` function. (You can skip this step if you already have mutational catalogues for your samples.)

```{r build_catalogues}
counts_21breast <- build_catalogues(variants_21breast)
dim(counts_21breast)
counts_21breast[1:5, 1:8]
```

The mutational catalogues are stored as a matrix of __mutation counts__, where each row refers to a sample and each column corresponds to a trinucleotide mutation type. (This particular set of 21 catalogues is also available via `data("counts_21breast")`.)

Note that the `build_catalogues` function only admits mutational catalogues defined over the typical 96 trinucleotide mutation types (or 192 types if using transcriptional strand information). Mutational catalogues defined over other sets of mutation types need to be generated by the user.

We can plot the __spectrum__ of each mutational catalogue using `plot_spectrum`, as in the previous example. For tables containing multiple catalogues, this function will produce __one plot per catalogue__, making the use of an output PDF file (`pdf_path` argument) more convenient. Here, however, we will plot all the catalogues on a grid.

```{r plot_spectra, fig.width=23, fig.height=22, out.width='100%', fig.align="center", echo=-1}
par(mar = c(5,6,7,2))
par(mfrow = c(7, 3))
plot_spectrum(counts_21breast)
```

### Extracting mutational signatures _de novo_

To __extract signatures__ from this set of catalogues, we will use the `extract_signatures` function, specifying the number of signatures to extract via the `nsignatures` argument; this can be a single integer value or a range of integers (e.g. `3:6`). The recommended approach is to first run the function for a small number of iterations and a reasonably wide range of numbers of signatures (e.g. `nsignatures = 2:8`). When `nsignatures` is a range of values, sigfit will automatically determine the __most plausible number of signatures__ present in the data (which is done by assessing goodness-of-fit through the `calculate_gof` and `plot_gof` functions).

Note that, for the correct determination of the number of signatures, `nsignatures` should include at least __four values__, and should extend beyond the ‘reasonable’ range for the expected number of signatures (e.g. if the number of signatures is expected to be between 3 and 8, `nsignatures` could take values `2:10`). Also, note that for ranges of `nsignatures` the output will be a __list__, in which element `[[N]]` corresponds to the extraction results for `nsignatures = N`.

> __The number of Markov chains in `extract_signatures` has a default value of `chains = 1`. We do not recommend the use of multiple chains for signature extraction, as this can result in a problem known as ‘label switching’ which can invalidate the inferences.__

With the exception of `chains`, the `extract_signatures` function recognises the same MCMC sampling options as `fit_signatures`  (`iter`, `warmup`, `seed`, etc.; see _Example 1_ above).

Extraction of signatures _de novo_ from the 21 mutational catalogues is done for a range of 2–7 signatures as follows.

```{r extraction, eval=FALSE}
mcmc_samples_extr <- extract_signatures(counts = counts_21breast,
                                        nsignatures = 2:7,
                                        iter = 1000, 
                                        seed = 1756)
plot_gof(mcmc_samples_extr)
```

```{r plot_gof_silent, echo=FALSE, fig.width=9, fig.height=6, out.width="100%"}
## Plot precalculated GOF in order to avoid running the model
data("sigfit_vignette_data", package = "sigfit")
plot(nS, gof, type = "o", lty = 3, pch = 16, col = "dodgerblue4",
     main = paste0("Goodness-of-fit (", stat, ")\nmodel: multinomial"),
     xlab = "Number of signatures", 
     ylab = paste0("Goodness-of-fit (", stat, ")"))
points(nS[best], gof[best], pch = 16, col = "orangered", cex = 1.1)
cat("Estimated best number of signatures:", nS[best], "\n")
```

The plot above shows that the most plausible number of signatures is four, based on the evolution of the goodness-of-fit (which by default is reconstruction accuracy measured via cosine similarity).

Next, it is recommended to __re-run__ the `extract_signatures` function, this time with `nsignatures = 4` and a much larger number of iterations, to obtain __more-accurate estimates__. We will skip this step in the present example.

For details about further arguments of `extract_signatures`, including alternative signature models, mutational opportunities and signature/exposure priors, see the _Other functionalities_ section at the end of this vignette, or type `?extract_signatures` to read the documentation.

### Retrieving, plotting and re-fitting extracted signatures

As in the case of signature fitting (see above), the extracted signatures and exposures can be __retrieved__ using the `retrieve_pars` function with `par = "signatures"` or `par = "exposures"`. However, because the output is a list containing the results for each value in `nsignatures`, __we need to specify which number of signatures we are interested in, using the `[[ ]]` operator__. This is not necessary if a single value of `nsignatures` was used for extraction.

```{r extr_names, eval=FALSE}
names(mcmc_samples_extr)
```
```{r extr_names_silent, echo=FALSE}
print(c("nsignatures=1", "nsignatures=2", "nsignatures=3", "nsignatures=4", "nsignatures=5", "nsignatures=6", "nsignatures=7", "best"))
signatures <- extr_signatures
```
```{r retrieve_sigs, eval=FALSE}
## Note: mcmc_samples_extr[[N]] contains the extraction results for N signatures
signatures <- retrieve_pars(mcmc_samples_extr[[4]],
                            par = "signatures")
```
```{r show_signames}
rownames(signatures$mean)
```

__Plotting__ can be done through the functions seen in _Example 1_, with the difference that, when plotting directly from MCMC results (via the `mcmc_samples` argument), we need to select the relevant element of the results list using `mcmc_samples_extr[[N]]`, as seen above. This is not necessary if a single value of `nsignatures` was used for extraction.

Below we plot the signatures extracted from these 21 catalogues.

```{r plot_sigs, warning=FALSE, fig.width=25, fig.height=10, out.width='100%', fig.align="center", echo=-1}
par(mar = c(6,7,6,1))
par(mfrow = c(2, 2))
plot_spectrum(signatures)
```

The extracted signatures seem to be a combination of COSMIC signatures 1, 2, 3, 5 and 13. The signatures published in [COSMIC](http://cancer.sanger.ac.uk/cosmic/signatures) were obtained using a collection of hundreds of catalogues across many cancer types, which offered much higher statistical power than the 21 breast cancer catalogues employed here. Furthermore, the signatures obtained by sigfit show high similarity to those originally reported by [Nik-Zainal _et al._ (2012)](http://dx.doi.org/10.1016/j.cell.2012.04.024) (Fig. 2A). Note that signatures C and D in Nik-Zainal _et al._, which are very similar, have been identified by sigfit as a single signature (Signature B in the plot above).

The extracted signatures can be __matched__ against a set of known signatures, such as the COSMIC signatures, to identify the best match for each signature, using the `match_signatures` function. This returns, for each signature in the first set, the index of the __best match__ in the second signature set.

```{r match_sigs}
data("cosmic_signatures_v2")
match_signatures(signatures, cosmic_signatures_v2)
```

In this case, the closest matches of the extracted signatures A, B, C and D are, respectively, COSMIC signatures 2, 3, 13 and 1.

As a last remark, once that more-accurate signatures have been extracted by re-running `extract_signatures` for a single value of `nsignatures` (as recommended above), it might be useful to __re-fit__ these signatures back to the original catalogues, as this sometimes results in exposure estimates that are __more accurate__ than the ones obtained by signature extraction. Assuming that the MCMC samples from the definitive signature extraction are stored in an object named `mcmc_samples_extr_2`, re-fitting is done as follows.

```{r refitting2, eval=FALSE}
signatures <- retrieve_pars(mcmc_samples_extr_2, "signatures")
mcmc_samples_refit <- fit_signatures(counts = counts_21breast,
                                     signatures = signatures,
                                     iter = 2000,
                                     warmup = 1000)
exposures <- retrieve_pars(mcmc_samples_refit, "exposures")
```

## Other functionalities

### Alternative signature models

By default, the functions `fit_signatures` and `extract_signatures` make use of a __multinomial model__ of signatures, which is statistically equivalent to the non-negative matrix factorisation (NMF) approach pioneered by [Alexandrov _et al._ (2013)](https://www.nature.com/articles/nature12477). However, sigfit provides three additional signature models which can be selected using the `model` argument.

* The __Poisson model__ (`model = "poisson"`) is equivalent to the probabilistic model introduced by [Fischer _et al._ (2013)](https://doi.org/10.1186/gb-2013-14-4-r39), and its utility is comparable to that of the multinomial model.

* The __negative binomial model__ (`model = "negbin"`) is a more noise-robust version of the Poisson model, which makes use of the overdispersed negative binomial distribution to model additional variation in the observed mutation counts. Although this model provides increased precision for __fitting__ signatures to __sparse or noisy__ mutational catalogues, its use is not recommended for cleaner catalogues or for signature extraction, due to its low efficiency.

* The __normal model__ (`model = "normal"`) makes use of a normal distribution in order to model catalogues that are composed of __continuous numeric values__, instead of discrete count values. This allows the analysis of patterns in catalogues describing continuous variables, such as DNA methylation levels. This model is intended for cases where it is not possible to __round__ the values in the catalogues without losing information (e.g. if the values are real numbers in [0, 1]). If the values are large enough that the information loss caused by rounding would be negligible, then one of the other models should be specified, in which case the catalogues will be automatically rounded (with a warning message).

For further details on these signature models, see the sigfit [paper](https://doi.org/10.1101/372896) or type `?extract_signatures` to read the documentation.

### Alternative mutation type definitions

sigfit allows the analysis of mutational patterns beyond the typical trinucleotide-context single-base substitution (SBS) spectrum. Additional supported mutation types include SBS with transcriptional strand information (strand-wise spectrum), small insertions and deletions (indel spectrum), doublet base substitutions (DBS spectrum) and arbitrary mutation types defined by the user (generic spectrum).

As explained in _Example 2_ (see above), the inclusion of transcriptional strand information for each variant results in __transcriptional strand-wise__ representations of mutational catalogues and signatures. These also have their own graphical representation via strand-wise mutational spectra, as shown in the example below.

```{r strandwise, fig.width=22, fig.height=7, out.width="100%", echo=-1}
par(mar = c(4.5,5.5,6.5,1))
# Load and plot a strand-wise catalogue
data("counts_88liver_strand")
plot_spectrum(counts_88liver_strand[1, ],
              name = rownames(counts_88liver_strand)[1])
```

sigfit can also process and plot mutational catalogues and signatures for __indels__ and __doublet base substitutions__, defined over the same sets of mutation categories as the COSMIC ID and DBS signatures (v3.2). However, note that the function `build_catalogues` is __not__ currently capable of producing mutational catalogues for these mutation types. Catalogues of indels and DBS can be produced using the [__SigProfilerMatrixGeneratorR__](https://github.com/AlexandrovLab/SigProfilerMatrixGeneratorR) package (for human, mouse, rat and dog genomes). Indel catalogues can also be produced via the `indel.spectrum` function in the [__Indelwald__](https://github.com/MaximilianStammnitz/Indelwald/blob/master/Scripts/3_Indelwald_spectrum.R) tool (for any reference genome).

```{r indel, fig.width=22, fig.height=7, out.width="100%", echo=-1}
par(mar = c(4.5,5.5,6.5,1))
# Load and plot an indel signature
data("cosmic_signatures_indel_v3.2")
plot_spectrum(cosmic_signatures_indel_v3.2[4, ],
              name = rownames(cosmic_signatures_indel_v3.2)[4])
```

```{r dbs, fig.width=22, fig.height=7, out.width="100%", echo=-1}
par(mar = c(4.5,5.5,6.5,1))
# Load and plot a doublet signature
data("cosmic_signatures_doublet_v3.2")
plot_spectrum(cosmic_signatures_doublet_v3.2[9, ],
              name = rownames(cosmic_signatures_doublet_v3.2)[9])
```

Furthermore, mutational catalogues and signatures can be defined using any set of __arbitrary mutation categories__. In this case, a ‘generic’ mutational spectrum is produced by sigfit, as shown in the example below. The colours of the spectrum bars can be customised using the `colors` argument in `plot_spectrum`.

```{r generic, fig.width=22, fig.height=8, out.width="100%", echo=-1}
par(mar = c(6.5,5.5,6.5,1)); set.seed(0xC0FFEE)
# Create and plot an arbitrary catalogue
counts <- as.numeric(rmultinom(1, 5000, runif(100)))
plot_spectrum(counts, name = "Arbitrary catalogue")
```

### Using mutational opportunities and signature/exposure priors

The signature models in sigfit are able to account for variation in __mutational opportunities__, which are the opportunities for each mutation type to occur in each sample. For catalogues composed of 96 trinucleotide mutation types, these opportunities are given by the frequency of each trinucleotide in the target sequence (usually a genome or exome). Mutational opportunities can be specified using the `opportunities` argument in the functions `fit_signatures` and `extract_signatures`. Opportunities can be provided as a numeric matrix with one row per sample and one column per mutation type. Alternatively, the mutational opportunities of the reference human genome or exome (for SNV catalogues with 96 or 192 mutation types) can be selected using the options `opportunities = "human-genome"` or `opportunities = "human-exome"`.

Importantly, using mutational opportunities during signature extraction will result in signatures which are already __normalised__ by the given opportunities (that is, their mutation probabilities are not relative to the composition of the target sequence). In contrast, signatures extracted without using opportunities are __relative__ to the sequence composition; this is not a problem unless the signatures are to be subsequently applied to other types of sequences. (See below for details on how to convert signatures between different sets of opportunities.)

In addition, `fit_signatures` and `extract_signatures` admit the specification of __priors for signatures and exposures__ via the arguments `sig_prior` and `exp_prior`. These priors allow the incorporation of _a priori_ knowledge about these parameters; for instance, using certain signatures from COSMIC as our signature priors reflects our belief that the signatures found in the data should be very similar to these. Given their potential for introducing subjective biases into the analysis, the use of priors should be strictly limited to situations where there is both __strong evidence__ and __good reason__ to use them. Priors must be numeric matrices with the same dimensions as the signature and exposure matrices to be inferred: signature priors must have one row per signature and one column per mutation type, while exposure priors must have one row per sample and one column per signature.

### Converting between genome- and exome-relative representations of signatures

In some analyses, signatures that have been inferred from whole-genome mutation data need to be fitted to mutation data from whole-exome sequencing samples. For instance, if the cohort of samples includes both whole genomes and exomes, it is normally advisable to extract signatures from the whole-genome samples (provided that there are enough of them) and re-fit the resulting signatures to the whole-exome samples. (An alternative approach would be to provide a matrix of sample-specific mutational opportunities that reflects the origin of each catalogue, as explained in the previous section.)

However, it is important to note that, unless appropriate mutational opportunities were used during signature extraction, the probability values of signatures that have been extracted from whole genomes (including the standard COSMIC signatures) are __relative__ to the specific mutational opportunities of that genome, and therefore should not be directly fitted to mutation counts obtained from exomes or from other genomes (such as genomes from different species). Instead, the signatures first need to be __converted__ so that they are relative to the new sequence.

The `convert_signatures` function can be used to convert a matrix of (human) genome-derived mutational signatures into their exome-relative representations as follows.

```{r convert_sigs, eval=F}
# (Following from the signature extraction example presented above)
# Retrieve genome-derived signatures
genome_signatures <- retrieve_pars(mcmc_samples_extr[[4]],
                                   par = "signatures")

# Apply exome mutational opportunities
exome_signatures <- convert_signatures(genome_signatures, 
                                       opportunities_from = "human-genome",
                                       opportunities_to = "human-exome")

par(mfrow = c(2, 1))
plot_spectrum(genome_signatures$mean[4, ],
              name = "Signature D, Genome-relative probabilities")
plot_spectrum(exome_signatures[4, ],
              name = "Signature D, Exome-relative probabilities")
```

```{r convert_sigs_silent, echo=F, fig.width=22, fig.height=14, out.width="100%"}
exome_signatures <- convert_signatures(signatures, opportunities_from = "human-genome", opportunities_to = "human-exome")
par(mfrow = c(2, 1), mar = c(5,5.5,6.5,1))
plot_spectrum(signatures$mean[4,], name="Signature D, Genome-relative probabilities")
plot_spectrum(exome_signatures[4,], name="Signature D, Exome-relative probabilities")
```

Signatures can be similarly converted from exome-relative to genome-relative representations, or between any two given __vectors of opportunities__ (with one element per mutation type). In addition, if the `opportunities_to` argument is omitted, the signatures will only be __normalised__ by the original opportunities (given by `opportunities_from`), such that they will not be relative to any particular reference sequence. (Another way of obtaining sequence-independent signatures is to specify the mutational opportunities when running `extract_signatures`.)

Conversely, if the `opportunities_from` argument is omitted, the signatures will be interpreted as being already normalised, and the desired opportunities (given by `opportunities_to`) will be directly __imposed__ on the signatures. For further details on signature conversion, type `?convert_signatures` to read the documentation.

### Using Fit-Ext models to discover rare or novel signatures

One novelty in sigfit is the introduction of __Fit-Ext models__, which are able to extract mutational signatures _de novo_ while fitting a set of predefined signatures that are already known to be present in the data. Such models are useful for the discovery of rare signatures for which there is some qualitative evidence, but insufficient support as to enable deconvolution via conventional methods, or in cases where only signature fitting is possible, yet the data clearly display a mutational pattern which is not captured by any of the available signatures.

The Fit-Ext models can be accessed via the `fit_extract_signatures` function. This is used similarly to `extract_signatures`, with the exception that a matrix of __known signatures__ to be fitted needs to be provided via the `signatures` argument (as in `fit_signatures`), and that the __number of additional signatures__ to extract is specified using the `num_extra_sigs` argument. Unlike the `nsignatures` argument in `extract_signatures`, `num_extra_sigs` currently admits only single integer values (e.g. `num_extra_sigs = 2`), and not ranges.

For further details on the Fit-Ext models, see the sigfit [paper](https://doi.org/10.1101/372896) or type `?fit_extract_signatures` to read the documentation.

### Signature extraction using multiple chains

__NOTE: This is an experimental feature and may not work properly.__

As explained above, the use of multiple Markov chains for signature extraction can give rise to a problem known as ‘label switching’ which can invalidate the inferences (see _Example 2_). In many cases, this problem can be mitigated by initialising the sampler with parameter values obtained from a preliminary sampling run. To do this, we can run the `extract_signatures` functions on a single chain for a short number of iterations, and then use the `get_initializer_list` function to pass initial parameter values obtained from this short run to a longer, multi-chain extraction run, via the `init` argument. An example is given below.

```{r multichain, eval=F}
# Load mutation counts
data("counts_21breast")

# Set number of cores for multi-chain sampling
ncores <- parallel::detectCores()
options(mc.cores = ncores)

# Run a short single chain to find initial parameter values
ext_init <- extract_signatures(counts_21breast,
                               nsignatures = 4,
                               iter = 1000,
                               chains = 1)

# Obtain a list of initial parameter values
inits <- get_initializer_list(ext_init, chains = ncores)

# Run multiple chains in parallel, initialised at the values found in the short run
# (make sure to use the same number of chains as in `get_initializer_list`)
ext_multichain <- extract_signatures(counts_21breast,
                                     nsignatures = 4,
                                     iter = 5000,
                                     chains = ncores,
                                     init = inits)
```

### Using the Dirichlet process prior to determine the number of signatures

__NOTE: This is an experimental feature and may not work properly.__

The Dirichlet process prior (DPP) allows the automatic determination of the number of signatures present in the data, without the need to perform signature extraction for different numbers of signatures. We recommend using the standard goodness-of-fit procedure for the determination of the number of signatures, as shown in _Example 2_ (above).

The DPP can be activated in the functions `extract_signatures` and `fit_extract_signatures` using `dpp = TRUE`. The number of signatures to extract should be larger than the maximum number of signatures expected (e.g. `nsignatures = 20`). Those signatures whose presence is not supported by the data should collapse into random noise, such that the number of signatures is given by the signatures which did not collapse. The `dpp_conc` argument allows to modify the value of the DPP concentration parameter.

### List of available data sets

sigfit includes the following data sets, which can be loaded using the `data` function.

* Three collections of __COSMIC mutational signatures__ for __single base substitutions__: original set of 30 signatures (v2, 2015) and updated sets of 67 signatures (v3.0, 2019) and 78 signatures (v3.2, 2021). The v3.0 signature set is also available in transcriptional strand-wise version (192 mutation types). These data can be loaded using:

```{r}
    data("cosmic_signatures_v3.2")
    data("cosmic_signatures_v3")
    data("cosmic_signatures_strand_v3")
    data("cosmic_signatures_v2")
```

* Collections of __COSMIC mutational signatures__ for __indels__ and __doublet base substitutions__: sets of 18 ID signatures and 11 DBS signatures (v3.2, 2021). These data can be loaded using:

```{r}
    data("cosmic_signatures_indel_v3.2")
    data("cosmic_signatures_doublet_v3.2")
```

* Mutation table and mutational catalogues (96 mutation types) for the set of __21 breast cancer genomes__ used in [Nik-Zainal _et al._ (2012)](http://dx.doi.org/10.1016/j.cell.2012.04.024). These data can be loaded using:

```{r}
    data("variants_21breast")
    data("counts_21breast")
```

* Mutation tables and mutational catalogues (96 and 192 mutation types) for the set of __119 breast cancer genomes__ used in [Alexandrov _et al._ (2013)](https://doi.org/10.1038/nature12477). These data can be loaded using:

```{r}
    data("variants_119breast")
    data("variants_119breast_strand")
    data("counts_119breast")
    data("counts_119breast_strand")
```

* Mutation tables and mutational catalogues (96 and 192 mutation types) for the set of __88 liver cancer genomes__ used in [Alexandrov _et al._ (2013)](https://doi.org/10.1038/nature12477). These data can be loaded using:

```{r}
    data("variants_88liver")
    data("variants_88liver_strand")
    data("counts_88liver")
    data("counts_88liver_strand")
```

* Catalogues composed of __methylation beta values__ for 50 breast cancers and 27 normal solid tissue samples, obtained from the [TCGA repository](https://portal.gdc.cancer.gov/repository) (project BRCA). These catalogues contain 99 'mutation types', corresponding to beta values for 99 CpG sites in the promoters of genes associated with breast cancer (see the sigfit [paper](https://doi.org/10.1101/372896) for details). These data can be loaded using:

```{r}
    data("methylation_50breast")
    data("methylation_27normal")
```

___

sigfit is an R package developed by the [Transmissible Cancer Group](http://www.tcg.vet.cam.ac.uk/) in the University of Cambridge Department of Veterinary Medicine.