{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ipyrad-analysis toolkit: vcf_to_hdf5\n", "\n", "[View as notebook](https://nbviewer.jupyter.org/github/dereneaton/ipyrad/blob/master/newdocs/API-analysis/cookbook-vcf2hdf5.ipynb)\n", "\n", "Many genome assembly tools will write variant SNP calls to the VCF format (variant call format). This is a plain text file that stores variant calls relative to a reference genome in tabular format. It includes a lot of additional information about the quality of SNP calls, etc., but is not very easy to read or efficient to parse. To make analyses run a bit faster ipyrad uses a simplified format to store this information in the form of an HDF5 database. You can easily convert any VCF file to this HDF5 format using the `ipa.vcf_to_hdf5()` tool. \n", "\n", "This tool includes an added benefit of allowing you to enter an (optional) `ld_block_size` argument when creating the file which will store information that can be used downstream by many other tools to subsample SNPs and perform bootstrap resampling in a way that reduces the effects of linkage among SNPs. If your data are assembled RAD data then the ld_block_size is not required, since we can simply use RAD loci as the linkage blocks. But if you want to combine reference-mapped RAD loci located nearby in the genome as being on the same linkage block then you can enter a value such as 50,000 to create 50Kb linkage block that will join many RAD loci together and sample only 1 SNP per block in each bootstrap replicate. If your data are not RAD data, e.g., whole genome data, then the ld_block_size argument will be required in order to encode linkage information as discrete blocks into your database. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Required software\n", "If you are converting a VCF file assembled from some other tool (e.g., GATK, freebayes, etc.) then you will need to install the `htslib` and `bcftools` software and use them as described below. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# conda install ipyrad -c bioconda \n", "# conda install htslib -c bioconda\n", "# conda install bcftools -c bioconda" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import ipyrad.analysis as ipa\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pre-filter data from other programs (e.g., FreeBayes, GATK)\n", "\n", "You can use the program `bcftools` to pre-filter your data to exclude indels and low quality SNPs. If you ran the `conda install` commands above then you will have all of the required tools installed. To achieve the format that ipyrad expects you will need to exclude indel containing SNPs (this may change in the future). Further quality filtering is optional. \n", "\n", "The example below reduced the size of a VCF data file from 29Gb to 80Mb! VCF contains a lot of information that you do not need to retain through all of your analyses. We will keep only the final genotype calls. \n", "\n", "Note that the code below is bash script. You can run this from a terminal, or in a jupyter notebook by appending the (%%bash) header like below. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "\n", "# compress the VCF file if not already done (creates .vcf.gz)\n", "bgzip data.vcf\n", "\n", "# tabix index the compressed VCF (creates .vcf.gz.tbi)\n", "tabix data.vcf.gz\n", "\n", "# remove multi-allelic SNPs and INDELs and PIPE to next command\n", "bcftools view -m2 -M2 -i'CIGAR=\"1X\" & QUAL>30' data.vcf.gz -Ou | \n", "\n", " # remove extra annotations/formatting info and save to new .vcf\n", " bcftools annotate -x FORMAT,INFO > data.cleaned.vcf\n", " \n", "# recompress the final file (create .vcf.gz)\n", "bgzip data.cleaned.vcf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A peek at the cleaned VCF file" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NC_018152.251273.GA280.482..1..2GT0/0...0/0.90/0.100/0.110/0.120/0.130/0.140/0.150/0.160/0.170/1.1
0NC_018152.251292.AG16750.300..GT1/1...1/1.1/11/11/11/10/01/11/11/1
1NC_018152.251349.AG628.563..GT0/0...0/00/00/00/00/00/00/00/00/00/0
2NC_018152.251351.CT943.353..GT0/0...0/00/00/00/00/00/00/00/00/00/0
3NC_018152.251352.GA607.681..GT0/0...0/00/00/00/00/00/00/00/00/00/0
4NC_018152.251398.CT510.120..GT0/0...0/00/00/00/00/00/00/00/00/00/0
\n", "

5 rows × 29 columns

\n", "
" ], "text/plain": [ " NC_018152.2 51273 . G A 280.482 ..1 ..2 GT 0/0 ... 0/0.9 0/0.10 \\\n", "0 NC_018152.2 51292 . A G 16750.300 . . GT 1/1 ... 1/1 . \n", "1 NC_018152.2 51349 . A G 628.563 . . GT 0/0 ... 0/0 0/0 \n", "2 NC_018152.2 51351 . C T 943.353 . . GT 0/0 ... 0/0 0/0 \n", "3 NC_018152.2 51352 . G A 607.681 . . GT 0/0 ... 0/0 0/0 \n", "4 NC_018152.2 51398 . C T 510.120 . . GT 0/0 ... 0/0 0/0 \n", "\n", " 0/0.11 0/0.12 0/0.13 0/0.14 0/0.15 0/0.16 0/0.17 0/1.1 \n", "0 1/1 1/1 1/1 1/1 0/0 1/1 1/1 1/1 \n", "1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 \n", "2 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 \n", "3 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 \n", "4 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 \n", "\n", "[5 rows x 29 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# load the VCF as an datafram\n", "dfchunks = pd.read_csv(\n", " \"/home/deren/Documents/ipyrad/sandbox/Macaque-Chr1.clean.vcf.gz\",\n", " sep=\"\\t\", \n", " skiprows=1000, \n", " chunksize=1000,\n", ")\n", "\n", "# show first few rows of first dataframe chunk\n", "next(dfchunks).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Converting clean VCF to HDF5 \n", "Here I using a VCF file from whole geome data for 20 monkey's from an unpublished study (in progress). It contains >6M SNPs all from chromosome 1. Because many SNPs are close together and thus tightly linked we will likely wish to take linkage into account in our downstream analyses.\n", "\n", "The ipyrad analysis tools can do this by encoding linkage block information into the HDF5 file. Here we encode `ld_block_size` of 20K bp. This breaks the 1 scaffold (chromosome) into about 10K linkage blocks. See the example below of this information being used in an ipyrad PCA analysis. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Indexing VCF to HDF5 database file\n", "VCF: 6094152 SNPs; 1 scaffolds\n", "[####################] 100% 0:02:22 | converting VCF to HDF5 \n", "HDF5: 6094152 SNPs; 10845 linkage group\n", "SNP database written to ./analysis-vcf2hdf5/Macaque_LD20K.snps.hdf5\n" ] } ], "source": [ "# init a conversion tool\n", "converter = ipa.vcf_to_hdf5(\n", " name=\"Macaque_LD20K\",\n", " data=\"/home/deren/Documents/ipyrad/sandbox/Macaque-Chr1.clean.vcf.gz\",\n", " ld_block_size=20000,\n", ")\n", "\n", "# run the converter\n", "converter.run()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Downstream analyses\n", "The data file now contains 6M SNPs across 20 samples and N linkage blocks. By default the PCA tool subsamples a single SNP per linkage block. To explore variation over multiple random subsamplings we can use the `nreplicates` argument. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Samples: 20\n", "Sites before filtering: 6094152\n", "Filtered (indels): 0\n", "Filtered (bi-allel): 0\n", "Filtered (mincov): 794597\n", "Filtered (minmap): 0\n", "Filtered (combined): 794597\n", "Sites after filtering: 5299555\n", "Sites containing missing values: 0 (0.00%)\n", "Missing values in SNP matrix: 0 (0.00%)\n" ] } ], "source": [ "# init a PCA tool and filter to allow no missing data\n", "pca = ipa.pca(\n", " data=\"./analysis-vcf2hdf5/Macaque_LD20K.snps.hdf5\",\n", " mincov=1.0, \n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Run a single PCA analysis from subsampled unlinked SNPs" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Subsampling SNPs: 10841/5299555\n" ] }, { "data": { "text/html": [ "
SRR2981140SRR2981114nemestrina2SRR4454020SRR5947292SRR5947293SRR7588781SRR5947294SRR2981139sylvanusfasnoSRR4453966SRR4454026silenusfuscata2fassoSRR8285768DRR002233SRR5628058SRR1024051-250255075PC0 (25.4%) explained-25025PC1 (14.4%) explained
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "pca.run_and_plot_2D(0, 1, seed=123);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Run multiple PCAs over replicates of subsampled SNPs \n", "Here you can see the results for a *different* 10K SNPs that are sampled in each replicate iteration. If the signal in the data is robust then we should expect to see the points clustering at a similar place across replicates. Internally ipyrad will rotate axes to ensure the replicate plots align despite axes swapping (which is arbitrary in PCA space). You can see this provides a better view of uncertainty in our estimates than the plot above (and it looks cool!)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Subsampling SNPs: 10841/5299555\n" ] }, { "data": { "text/html": [ "
SRR2981140SRR2981114nemestrina2SRR4454020SRR5947292SRR5947293SRR7588781SRR5947294SRR2981139sylvanusfasnoSRR4453966SRR4454026silenusfuscata2fassoSRR8285768DRR002233SRR5628058SRR1024051-250255075PC0 (25.4%) explained-25025PC1 (14.4%) explained
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "pca.run_and_plot_2D(0, 1, seed=123, nreplicates=25);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "More details on running PCAs, toggling options, and styling plots can be found in our [ipyrad.analysis PCA tutorial](https://ipyrad.readthedocs.io/en/latest/API-analysis/cookbook-pca.html)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" } }, "nbformat": 4, "nbformat_minor": 2 }