{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ipyrad-analysis toolkit: vcf_to_hdf5\n", "\n", "[View as notebook](https://nbviewer.jupyter.org/github/dereneaton/ipyrad/blob/master/newdocs/API-analysis/cookbook-vcf2hdf5.ipynb)\n", "\n", "Many genome assembly tools will write variant SNP calls to the VCF format (variant call format). This is a plain text file that stores variant calls relative to a reference genome in tabular format. It includes a lot of additional information about the quality of SNP calls, etc., but is not very easy to read or efficient to parse. To make analyses run a bit faster ipyrad uses a simplified format to store this information in the form of an HDF5 database. You can easily convert any VCF file to this HDF5 format using the `ipa.vcf_to_hdf5()` tool. \n", "\n", "This tool includes an added benefit of allowing you to enter an (optional) `ld_block_size` argument when creating the file which will store information that can be used downstream by many other tools to subsample SNPs and perform bootstrap resampling in a way that reduces the effects of linkage among SNPs. If your data are assembled RAD data then the ld_block_size is not required, since we can simply use RAD loci as the linkage blocks. But if you want to combine reference-mapped RAD loci located nearby in the genome as being on the same linkage block then you can enter a value such as 50,000 to create 50Kb linkage block that will join many RAD loci together and sample only 1 SNP per block in each bootstrap replicate. If your data are not RAD data, e.g., whole genome data, then the ld_block_size argument will be required in order to encode linkage information as discrete blocks into your database. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Required software\n", "If you are converting a VCF file assembled from some other tool (e.g., GATK, freebayes, etc.) 
then you will need to install the `htslib` and `bcftools` software and use them as described below. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# conda install ipyrad -c bioconda \n", "# conda install htslib -c bioconda\n", "# conda install bcftools -c bioconda" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import ipyrad.analysis as ipa\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pre-filter data from other programs (e.g., FreeBayes, GATK)\n", "\n", "You can use the program `bcftools` to pre-filter your data to exclude indels and low-quality SNPs. If you ran the `conda install` commands above then you will have all of the required tools installed. To achieve the format that ipyrad expects, you will need to exclude indel-containing variants (this may change in the future). Further quality filtering is optional. \n", "\n", "The example below reduced the size of a VCF data file from 29Gb to 80Mb! VCF contains a lot of information that you do not need to retain through all of your analyses; here we keep only the final genotype calls. \n", "\n", "Note that the code below is a bash script. You can run it from a terminal, or in a jupyter notebook by adding the `%%bash` header as shown below. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "\n", "# compress the VCF file if not already done (creates .vcf.gz)\n", "bgzip data.vcf\n", "\n", "# tabix index the compressed VCF (creates .vcf.gz.tbi)\n", "tabix data.vcf.gz\n", "\n", "# remove multi-allelic SNPs and INDELs and PIPE to next command\n", "bcftools view -m2 -M2 -i'CIGAR=\"1X\" & QUAL>30' data.vcf.gz -Ou | \n", "\n", " # remove extra annotations/formatting info and save to new .vcf\n", " bcftools annotate -x FORMAT,INFO > data.cleaned.vcf\n", " \n", "# recompress the final file (create .vcf.gz)\n", "bgzip data.cleaned.vcf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A peek at the cleaned VCF file" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | NC_018152.2 | \n", "51273 | \n", ". | \n", "G | \n", "A | \n", "280.482 | \n", "..1 | \n", "..2 | \n", "GT | \n", "0/0 | \n", "... | \n", "0/0.9 | \n", "0/0.10 | \n", "0/0.11 | \n", "0/0.12 | \n", "0/0.13 | \n", "0/0.14 | \n", "0/0.15 | \n", "0/0.16 | \n", "0/0.17 | \n", "0/1.1 | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "NC_018152.2 | \n", "51292 | \n", ". | \n", "A | \n", "G | \n", "16750.300 | \n", ". | \n", ". | \n", "GT | \n", "1/1 | \n", "... | \n", "1/1 | \n", ". | \n", "1/1 | \n", "1/1 | \n", "1/1 | \n", "1/1 | \n", "0/0 | \n", "1/1 | \n", "1/1 | \n", "1/1 | \n", "
1 | \n", "NC_018152.2 | \n", "51349 | \n", ". | \n", "A | \n", "G | \n", "628.563 | \n", ". | \n", ". | \n", "GT | \n", "0/0 | \n", "... | \n", "0/0 | \n", "0/0 | \n", "0/0 | \n", "0/0 | \n", "0/0 | \n", "0/0 | \n", "0/0 | \n", "0/0 | \n", "0/0 | \n", "0/0 | \n", "
2 | \n", "NC_018152.2 | \n", "51351 | \n", ". | \n", "C | \n", "T | \n", "943.353 | \n", ". | \n", ". | \n", "GT | \n", "0/0 | \n", "... | \n", "0/0 | \n", "0/0 | \n", "0/0 | \n", "0/0 | \n", "0/0 | \n", "0/0 | \n", "0/0 | \n", "0/0 | \n", "0/0 | \n", "0/0 | \n", "
3 | \n", "NC_018152.2 | \n", "51352 | \n", ". | \n", "G | \n", "A | \n", "607.681 | \n", ". | \n", ". | \n", "GT | \n", "0/0 | \n", "... | \n", "0/0 | \n", "0/0 | \n", "0/0 | \n", "0/0 | \n", "0/0 | \n", "0/0 | \n", "0/0 | \n", "0/0 | \n", "0/0 | \n", "0/0 | \n", "
4 | \n", "NC_018152.2 | \n", "51398 | \n", ". | \n", "C | \n", "T | \n", "510.120 | \n", ". | \n", ". | \n", "GT | \n", "0/0 | \n", "... | \n", "0/0 | \n", "0/0 | \n", "0/0 | \n", "0/0 | \n", "0/0 | \n", "0/0 | \n", "0/0 | \n", "0/0 | \n", "0/0 | \n", "0/0 | \n", "
5 rows × 29 columns
\n", "