--- title: "SNPmatch" output: html_document --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` ## Motivation Given a SNP profile for a sample of unknown identity, what is the most likely identity based on its match with known samples that have been genotyped? [SNPmatch](https://github.com/Gregor-Mendel-Institute/SNPmatch) can match a SNP profile of a sample to a database of samples and calculates probabilities and likelihoods of the match/es. ## Background SNPmatch takes as input a SNP profile of an unknown sample and performs the [following calculations](https://www.ncbi.nlm.nih.gov/pubmed/29257129) to identify the most likely match from a database. 1. For each database strain "a", SNPmatch calculates the probability ($p_a$) of a sample matching to it. This is calculated from the genotype probability scores (PL) from GATK if present. $$ p_a = \sum_{j} p(g_j | a_j == g_j) $$ where, $j$ is the genomic position of a marker, $a_j$ is the genotype of database strain "a" at position $j$ and $g_j$ is the genotype of a sample at position $j$. 2. Likelihoods are calculated for each strain based on the binomial distribution. $$ \mathcal{L}_{a} = (p_a \times n_a \times ln(p_a / (1 - p))) + ((1 - p_a) \times n_a \times ln((1 - p_a)/p)) $$ where, $n_a$ is the number of informative sites between strain "a" and the sample, $p$ is the error rate (0.001). 3. Likelihood ratios (LR) are calculated for each strain with the top likely strain. $$ LR_{a} = \mathcal{L}_a / max(\{\mathcal{L}_i : i = 1, \dots, n\}) $$ under the assumption that LR is a chi-squared distribution, a threshold of 3.841 gives a list of strains that are not significantly different from the top hit (at a 95% confidence level). ## Installation First install `SNPmatch` if you haven't already. I will use [Docker](https://github.com/davetang/learning_docker) and [docker-miniconda](https://hub.docker.com/r/continuumio/miniconda/) to install `SNPmatch`. ```{bash eval=FALSE} docker pull continuumio/miniconda docker run -it -v /Users/dtang/github/learning_vcf_file:/learning_vcf_file continuumio/miniconda /bin/bash apt-get install build-essential conda install -c bioconda bcftools ``` Use `pip` to install `SNPmatch`. ```{bash eval=FALSE} pip install SNPmatch ``` Or use the `Dockerfile` in `src`. ```{bash eval=FALSE} # once inside src docker build -t snpmatch . docker run -it -v /Users/dtang/github/learning_vcf_file:/learning_vcf_file snpmatch /bin/bash ``` ## Testing To get started, we will create a random reference. ```{bash} ../script/generate_random_seq.pl 100000 1984 > raw/ref.fa ``` Next we will create a VCF file with 100 random SNVs. ```{bash} ../script/create_snv.pl raw/ref.fa 2000 1984 > raw/snv.tsv ../script/create_vcf.pl raw/ref.fa raw/snv.tsv > raw/snv.vcf rm raw/snv.tsv ``` Now we add 10 samples with random genotypes to the VCF file. ```{bash} ../script/vcf_add_sample.pl raw/snv.vcf 10 1984 > raw/sample.vcf rm raw/snv.vcf ``` Finally, we create a SNPmatch database. ```{bash eval=FALSE} snpmatch makedb -i raw/sample.vcf -o db ``` Now we're ready to do some testing! ### Example 1 SNPmatch can take as input either a VCF or BED file. However their "BED" specification is not the same as the [UCSC Genome Browser standard](https://genome.ucsc.edu/FAQ/FAQformat.html#format1). Their format is a three column tab-delimited file, where the columns are for the chromosome, position, and genotype for each SNV. ``` 1 125 0/0 1 284 0/0 1 336 0/0 ``` In our first example we will create a sample with an identical SNV profile as sample 1. (I have to separate the examples into their own directories because `snpmatch` will complain if it sees another `*.npz` file in the same directory.) ```{bash eval=FALSE} rm -rf example/eg1 mkdir example/eg1 cat raw/sample.vcf | grep -v "^#" | cut -f1,2,10 > example/eg1/eg1.bed ``` We will use `snpmatch` to perform our query to the SNV database. ```{bash eval=FALSE} snpmatch inbred -v -i example/eg1/eg1.bed -d db.hdf5 -e db.acc.hdf5 -o example/eg1/eg1 ``` Examine the `scores` output. ```{bash} cat example/eg1/eg1.scores.txt ``` The columns for the `scores` file are: 1. Sample ID 2. Number of matched SNPs 3. Total informative SNPs 4. Probability of match 5. Likelihood 6. Likelihood ratio against best hit 7. Number of SNPs 8. Average depth of SNPs We can see that our sample's SNVs matched all of sample 1's SNVs: 2000 / 2000 giving a probability and likelihood of 1. The script that created the sample genotypes was based on a "coin flip", hence the probability of two samples matching is around 50%. ### Example 2 We'll use half the total number of SNVs this time but again with an identical profile to sample 1. ```{bash eval=FALSE} rm -rf example/eg2 mkdir example/eg2 cat raw/sample.vcf | grep -v "^#" | cut -f1,2,10 | perl -nle "if ($. % 2 == 0){ print }" > example/eg2/eg2.bed ``` Run `snpmatch`. ```{bash eval=FALSE} snpmatch inbred -v -i example/eg2/eg2.bed -d db.hdf5 -e db.acc.hdf5 -o example/eg2/eg2 ``` Let's check out the results. ```{bash} cat example/eg2/eg2.scores.txt ``` ### Example 3 However, if I don't pipe STDOUT to `perl`, the results are as expected. ```{bash eval=FALSE} rm -rf example/eg3 mkdir example/eg3 cat raw/sample.vcf | grep -v "^#" | head -1000 | cut -f1,2,10 > example/eg3/eg3.bed ``` Run `snpmatch`. ```{bash eval=FALSE} snpmatch inbred -v -i example/eg3/eg3.bed -d db.hdf5 -e db.acc.hdf5 -o example/eg3/eg3 ``` As expected. ```{bash} cat example/eg3/eg3.scores.txt ``` ### Example 4 VCF input. ```{bash eval=FALSE} rm -rf example/eg4 mkdir example/eg4 cat raw/sample.vcf | grep "^#" > header cat raw/sample.vcf | grep -v "^#" | head -50 | cut -f 1-10 > tmp cat header tmp > example/eg4/eg4.vcf rm header tmp ``` Run `snpmatch`. ```{bash eval=FALSE} snpmatch inbred -v -i example/eg4/eg4.vcf -d db.hdf5 -e db.acc.hdf5 -o example/eg4/eg4 ``` As expected. ```{bash} cat example/eg4/eg4.scores.txt ``` ## Notes Make sure there are [no duplicate entries](https://github.com/Gregor-Mendel-Institute/SNPmatch/issues/13) in your input file or you will get unexpected results.