<!DOCTYPE html> <html> <head> <meta charset="utf-8" /> <meta name="generator" content="pandoc" /> <meta http-equiv="X-UA-Compatible" content="IE=EDGE" /> <title>Reading VCF data</title> <script src="site_libs/jquery-1.11.3/jquery.min.js"></script> <meta name="viewport" content="width=device-width, initial-scale=1" /> <link href="site_libs/bootstrap-3.3.5/css/sandstone.min.css" rel="stylesheet" /> <script src="site_libs/bootstrap-3.3.5/js/bootstrap.min.js"></script> <script src="site_libs/bootstrap-3.3.5/shim/html5shiv.min.js"></script> <script src="site_libs/bootstrap-3.3.5/shim/respond.min.js"></script> <script src="site_libs/navigation-1.1/tabsets.js"></script> <script src="site_libs/accessible-code-block-0.0.1/empty-anchor.js"></script> <link href="site_libs/anchor-sections-1.0/anchor-sections.css" rel="stylesheet" /> <script src="site_libs/anchor-sections-1.0/anchor-sections.js"></script> <!-- Global Site Tag (gtag.js) - Google Analytics --> <script async src="https://www.googletagmanager.com/gtag/js?id=UA-107144798-3"></script> <script> window.dataLayer = window.dataLayer || []; function gtag(){dataLayer.push(arguments)}; gtag('js', new Date()); gtag('config', 'UA-107144798-3'); </script> <style type="text/css"> code{white-space: pre-wrap;} span.smallcaps{font-variant: small-caps;} span.underline{text-decoration: underline;} div.column{display: inline-block; vertical-align: top; width: 50%;} div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;} ul.task-list{list-style: none;} </style> <style type="text/css">code{white-space: pre;}</style> <style type="text/css" data-origin="pandoc"> code.sourceCode > span { display: inline-block; line-height: 1.25; } code.sourceCode > span { color: inherit; text-decoration: inherit; } code.sourceCode > span:empty { height: 1.2em; } .sourceCode { overflow: visible; } code.sourceCode { white-space: pre; position: relative; } div.sourceCode { margin: 1em 0; } pre.sourceCode { margin: 0; } @media screen { div.sourceCode { overflow: auto; } } @media print { code.sourceCode { white-space: pre-wrap; } code.sourceCode > span { text-indent: -5em; padding-left: 5em; } } pre.numberSource code { counter-reset: source-line 0; } pre.numberSource code > span { position: relative; left: -4em; counter-increment: source-line; } pre.numberSource code > span > a:first-child::before { content: counter(source-line); position: relative; left: -1em; text-align: right; vertical-align: baseline; border: none; display: inline-block; -webkit-touch-callout: none; -webkit-user-select: none; -khtml-user-select: none; -moz-user-select: none; -ms-user-select: none; user-select: none; padding: 0 4px; width: 4em; color: #aaaaaa; } pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; } div.sourceCode { background-color: #f8f8f8; } @media screen { code.sourceCode > span > a:first-child::before { text-decoration: underline; } } code span.al { color: #ef2929; } /* Alert */ code span.an { color: #8f5902; font-weight: bold; font-style: italic; } /* Annotation */ code span.at { color: #c4a000; } /* Attribute */ code span.bn { color: #0000cf; } /* BaseN */ code span.cf { color: #204a87; font-weight: bold; } /* ControlFlow */ code span.ch { color: #4e9a06; } /* Char */ code span.cn { color: #000000; } /* Constant */ code span.co { color: #8f5902; font-style: italic; } /* Comment */ code span.cv { color: #8f5902; font-weight: bold; font-style: italic; } /* CommentVar */ code span.do { color: #8f5902; font-weight: bold; font-style: italic; } /* Documentation */ code span.dt { color: #204a87; } /* DataType */ code span.dv { color: #0000cf; } /* DecVal */ code span.er { color: #a40000; font-weight: bold; } /* Error */ code span.ex { } /* Extension */ code span.fl { color: #0000cf; } /* Float */ code span.fu { color: #000000; } /* Function */ code span.im { } /* Import */ code span.in { color: #8f5902; font-weight: bold; font-style: italic; } /* Information */ code span.kw { color: #204a87; font-weight: bold; } /* Keyword */ code span.op { color: #ce5c00; font-weight: bold; } /* Operator */ code span.ot { color: #8f5902; } /* Other */ code span.pp { color: #8f5902; font-style: italic; } /* Preprocessor */ code span.sc { color: #000000; } /* SpecialChar */ code span.ss { color: #4e9a06; } /* SpecialString */ code span.st { color: #4e9a06; } /* String */ code span.va { color: #000000; } /* Variable */ code span.vs { color: #4e9a06; } /* VerbatimString */ code span.wa { color: #8f5902; font-weight: bold; font-style: italic; } /* Warning */ </style> <script> // apply pandoc div.sourceCode style to pre.sourceCode instead (function() { var sheets = document.styleSheets; for (var i = 0; i < sheets.length; i++) { if (sheets[i].ownerNode.dataset["origin"] !== "pandoc") continue; try { var rules = sheets[i].cssRules; } catch (e) { continue; } for (var j = 0; j < rules.length; j++) { var rule = rules[j]; // check if there is a div.sourceCode rule if (rule.type !== rule.STYLE_RULE || rule.selectorText !== "div.sourceCode") continue; var style = rule.style.cssText; // check if color or background-color is set if (rule.style.color === '' && rule.style.backgroundColor === '') continue; // replace div.sourceCode by a pre.sourceCode rule sheets[i].deleteRule(j); sheets[i].insertRule('pre.sourceCode{' + style + '}', j); } } })(); </script> <style type="text/css"> pre:not([class]) { background-color: white; } </style> <style type="text/css"> h1 { font-size: 34px; } h1.title { font-size: 38px; } h2 { font-size: 30px; } h3 { font-size: 24px; } h4 { font-size: 18px; } h5 { font-size: 16px; } h6 { font-size: 12px; } .table th:not([align]) { text-align: left; } </style> <link rel="stylesheet" href="styles.css" type="text/css" /> <style type = "text/css"> .main-container { max-width: 940px; margin-left: auto; margin-right: auto; } code { color: inherit; background-color: rgba(0, 0, 0, 0.04); } img { max-width:100%; } .tabbed-pane { padding-top: 12px; } .html-widget { margin-bottom: 20px; } button.code-folding-btn:focus { outline: none; } summary { display: list-item; } </style> <style type="text/css"> /* padding for bootstrap navbar */ body { padding-top: 61px; padding-bottom: 40px; } /* offset scroll position for anchor links (for fixed navbar) */ .section h1 { padding-top: 66px; margin-top: -66px; } .section h2 { padding-top: 66px; margin-top: -66px; } .section h3 { padding-top: 66px; margin-top: -66px; } .section h4 { padding-top: 66px; margin-top: -66px; } .section h5 { padding-top: 66px; margin-top: -66px; } .section h6 { padding-top: 66px; margin-top: -66px; } .dropdown-submenu { position: relative; } .dropdown-submenu>.dropdown-menu { top: 0; left: 100%; margin-top: -6px; margin-left: -1px; border-radius: 0 6px 6px 6px; } .dropdown-submenu:hover>.dropdown-menu { display: block; } .dropdown-submenu>a:after { display: block; content: " "; float: right; width: 0; height: 0; border-color: transparent; border-style: solid; border-width: 5px 0 5px 5px; border-left-color: #cccccc; margin-top: 5px; margin-right: -10px; } .dropdown-submenu:hover>a:after { border-left-color: #ffffff; } .dropdown-submenu.pull-left { float: none; } .dropdown-submenu.pull-left>.dropdown-menu { left: -100%; margin-left: 10px; border-radius: 6px 0 6px 6px; } </style> <script> // manage active state of menu based on current page $(document).ready(function () { // active menu anchor href = window.location.pathname href = href.substr(href.lastIndexOf('/') + 1) if (href === "") href = "index.html"; var menuAnchor = $('a[href="' + href + '"]'); // mark it active menuAnchor.parent().addClass('active'); // if it's got a parent navbar menu mark it active as well menuAnchor.closest('li.dropdown').addClass('active'); }); </script> <!-- tabsets --> <style type="text/css"> .tabset-dropdown > .nav-tabs { display: inline-table; max-height: 500px; min-height: 44px; overflow-y: auto; background: white; border: 1px solid #ddd; border-radius: 4px; } .tabset-dropdown > .nav-tabs > li.active:before { content: ""; font-family: 'Glyphicons Halflings'; display: inline-block; padding: 10px; border-right: 1px solid #ddd; } .tabset-dropdown > .nav-tabs.nav-tabs-open > li.active:before { content: ""; border: none; } .tabset-dropdown > .nav-tabs.nav-tabs-open:before { content: ""; font-family: 'Glyphicons Halflings'; display: inline-block; padding: 10px; border-right: 1px solid #ddd; } .tabset-dropdown > .nav-tabs > li.active { display: block; } .tabset-dropdown > .nav-tabs > li > a, .tabset-dropdown > .nav-tabs > li > a:focus, .tabset-dropdown > .nav-tabs > li > a:hover { border: none; display: inline-block; border-radius: 4px; background-color: transparent; } .tabset-dropdown > .nav-tabs.nav-tabs-open > li { display: block; float: none; } .tabset-dropdown > .nav-tabs > li { display: none; } </style> <!-- code folding --> </head> <body> <div class="container-fluid main-container"> <div class="navbar navbar-default navbar-fixed-top" role="navigation"> <div class="container"> <div class="navbar-header"> <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar"> <span class="icon-bar"></span> <span class="icon-bar"></span> <span class="icon-bar"></span> </button> <a class="navbar-brand" href="index.html">Population genetics and genomics in R</a> </div> <div id="navbar" class="navbar-collapse collapse"> <ul class="nav navbar-nav"> <li> <a href="TOC.html">Table of contents</a> </li> <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-expanded="false"> Part I <span class="caret"></span> </a> <ul class="dropdown-menu" role="menu"> <li> <a href="Introduction.html">Introduction</a> </li> <li> <a href="Getting_ready_to_use_R.html">Getting ready to use R</a> </li> </ul> </li> <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-expanded="false"> Part II <span class="caret"></span> </a> <ul class="dropdown-menu" role="menu"> <li> <a href="Data_Preparation.html">Data preparation</a> </li> <li> <a href="First_Steps.html">First steps</a> </li> <li> <a href="Population_Strata.html">Population strata and clone correction</a> </li> <li> <a href="Locus_Stats.html">Locus-based statistics and missing data</a> </li> <li> <a href="Genotypic_EvenRichDiv.html">Genotypic evenness, richness, and diversity</a> </li> <li> <a href="Linkage_disequilibrium.html">Linkage disequilibrium</a> </li> <li> <a href="Pop_Structure.html">Population structure</a> </li> <li> <a href="Minimum_Spanning_Networks.html">Minimum Spanning Networks</a> </li> <li> <a href="AMOVA.html">AMOVA</a> </li> <li> <a href="DAPC.html">Discriminant analysis of principal components (DAPC)</a> </li> </ul> </li> <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-expanded="false"> Part III <span class="caret"></span> </a> <ul class="dropdown-menu" role="menu"> <li> <a href="intro_vcf.html">Population genomics and HTS</a> </li> <li> <a href="reading_vcf.html">Reading VCF data</a> </li> <li> <a href="analysis_of_genome.html">Analysis of genomic data</a> </li> <li> <a href="gbs_analysis.html">Analysis of GBS data</a> </li> <li> <a href="clustering_plot.html">Clustering plot</a> </li> </ul> </li> <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-expanded="false"> Workshops <span class="caret"></span> </a> <ul class="dropdown-menu" role="menu"> <li class="dropdown-submenu"> <a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-expanded="false">ICPP</a> <ul class="dropdown-menu" role="menu"> <li> <a href="workshop_icpp.html">Preparation</a> </li> <li> <a href="intro_vcf.html">Introduction</a> </li> <li> <a href="reading_vcf.html">VCF data</a> </li> <li> <a href="quality_control.html">Quality control</a> </li> <li> <a href="gbs_analysis.html">Analysis of GBS data</a> </li> <li> <a href="analysis_of_genome.html">Analysis of genome data</a> </li> </ul> </li> <li class="dropdown-submenu"> <a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-expanded="false">APS Southern Division</a> <ul class="dropdown-menu" role="menu"> <li> <a href="workshop_southernAPS.html">Preparation</a> </li> <li> <a href="intro_vcf.html">Introduction</a> </li> <li> <a href="reading_vcf.html">VCF data</a> </li> <li> <a href="quality_control.html">Quality control</a> </li> <li> <a href="gbs_analysis.html">Analysis of GBS data</a> </li> </ul> </li> </ul> </li> <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-expanded="false"> About <span class="caret"></span> </a> <ul class="dropdown-menu" role="menu"> <li> <a href="Authors.html">Authors</a> </li> </ul> </li> <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-expanded="false"> Appendices <span class="caret"></span> </a> <ul class="dropdown-menu" role="menu"> <li> <a href="intro_to_R.html">Introduction to R</a> </li> <li> <a href="Data_sets.html">Data sets</a> </li> <li> <a href="funpendix.html">Function glossary</a> </li> <li> <a href="background_functions.html">Background_functions</a> </li> <li> <a href="https://github.com/grunwaldlab/Population_Genetics_in_R/">Source Code</a> </li> </ul> </li> </ul> <ul class="nav navbar-nav navbar-right"> </ul> </div><!--/.nav-collapse --> </div><!--/.container --> </div><!--/.navbar --> <div class="fluid-row" id="header"> <h1 class="title toc-ignore">Reading VCF data</h1> <h3 class="subtitle"><em>BJ Knaus, JF Tabima and NJ Grünwald</em></h3> </div> <p>Genetic variation data is typically stored in <a href="http://samtools.github.io/hts-specs/" title="VCF format at hts-specs">variant call format (VCF)</a> files <span class="citation">(Danecek et al., 2011)</span>. This format is the preferred file format obtained from genome sequencing or high throughput genotyping. One advantage of using VCF files is that only variants (e.g., SNPs, indels, etc.) are reported which economizes files size relative to a format that may included invariant sites. Variant callers typically attempt to aggressively call variants with the perspective that a downstream quality control step will remove low quality variants. Note that VCF files come in different flavors and that each variant caller may report a slightly different information. A first step in working with this data is to understand their <a href="http://vcftools.sourceforge.net/VCF-poster.pdf">contents</a>.</p> <div id="vcf-file-structure" class="section level2"> <h2>VCF file structure</h2> <p>A VCF file can be thought of as having three sections: a <strong>vcf header</strong>, a <strong>fix region</strong> and a <strong>gt region</strong>. The VCF meta region is located at the top of the file and contains meta-data describing the body of the file. Each VCF meta line begins with a ‘##’. The information in the meta region defines the abbreviations used elsewhere in the file. It may also document software used to create the file as well as parameters used by this software. Below the metadata region, the data are tabular. The first eight columns of this table contain information about each variant. This data may be common over all variants, such as its chromosomal position, or a summary over all samples, such as quality metrics. These data are fixed, or the same, over all samples. The fix region is required in a VCF file, subsequent columns are optional but are common in our experience. Beginning at column ten is a column for every sample. The values in these columns are information for each sample and each variant. The organization of each cell containing a genotype and associated information is specified in column nine, the FORMAT column. The location of these three regions within a file can be represented by this cartoon:</p> <div class="figure" style="text-align: center"> <img src="reading_vcf_files/figure-html/unnamed-chunk-1-1.png" alt="Cartoon representation of VCF file organization" width="384" /> <p class="caption"> Cartoon representation of VCF file organization </p> </div> <p>The VCF file specification is flexible. This means that there are slots for certain types of data, but any particular software which creates a VCF file does not necessarily use them all. Similarly, authors have the opportunity to include new forms of data, forms which may not have been foreseen by the authors of the VCF specification. The result is that all VCF files do not contain the same information.</p> <p>For this example, we will use example data provided with the R package <em>vcfR</em> <span class="citation">(Knaus & Grünwald, 2017)</span>.</p> <div class="sourceCode" id="cb1"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb1-1"><a href="#cb1-1"></a><span class="kw">library</span>(vcfR)</span> <span id="cb1-2"><a href="#cb1-2"></a><span class="kw">data</span>(vcfR_example)</span> <span id="cb1-3"><a href="#cb1-3"></a>vcf</span></code></pre></div> <pre class="r-output"><code>## ***** Object of Class vcfR ***** ## 18 samples ## 1 CHROMs ## 2,533 variants ## Object size: 3.2 Mb ## 8.497 percent missing data ## ***** ***** ***** </code></pre> <p>The function <code>library()</code> loads libraries, in this case the package <em>vcfR</em>. The function <code>data()</code> loads datasets that were included with R and its packages. Our usage of <code>data()</code> loads the objects ‘gff’, ‘dna’ and ‘vcf’ from the ‘vcfR_example’ dataset. Here we’re only interested in the object ‘vcf’ which contains example VCF data. When we call the object name with no function it invokes the ‘show’ method which prints some summary information to the console.</p> </div> <div id="the-meta-region" class="section level2"> <h2>The meta region</h2> <p>The meta region contains information about the file, its creation, as well as information to interpret abbreviations used elsewhere in the file. Each line of the meta region begins with a double pound sign (‘##’). The example which comes with <em>vcfR</em> is shown below. (Only the first seven lines are shown for brevity.)</p> <div class="sourceCode" id="cb2"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb2-1"><a href="#cb2-1"></a><span class="kw">strwrap</span>(vcf<span class="op">@</span>meta[<span class="dv">1</span><span class="op">:</span><span class="dv">7</span>])</span></code></pre></div> <pre class="r-output"><code>## [1] "##fileformat=VCFv4.1" ## [2] "##source=\"GATK haplotype Caller, phased with beagle4\"" ## [3] "##FILTER=<ID=LowQual,Description=\"Low quality\">" ## [4] "##FORMAT=<ID=AD,Number=.,Type=Integer,Description=\"Allelic depths for the ref and alt alleles in the order listed\">" ## [5] "##FORMAT=<ID=DP,Number=1,Type=Integer,Description=\"Approximate read depth (reads with MQ=255 or with bad mates are filtered)\">" ## [6] "##FORMAT=<ID=GQ,Number=1,Type=Integer,Description=\"Genotype Quality\">" ## [7] "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">" </code></pre> <p>The first line contains the version of the VCF format used in the file. This line is required. The second line specifies the software which created the VCF file. This is not required, so not all VCF files include it. When they do, the file becomes self documenting. Note that the alignment software is not included here because it was used upstream of the VCF file’s creation (aligners typically create *.SAM or *.BAM format files). Because the file can only include information about the software that created it, the entire pipeline does not get documented. Some VCF files may contain a line for every chromosome (or supercontig or contig depending on your genome), so they may become rather long. Here, the remaining lines contain INFO and FORMAT specifications which define abbreviations used in the fix and gt portions of the file.</p> <p>The meta region may include long lines that may not be easy to view. In <em>vcfR</em> we’ve created a function to help press this data.</p> <div class="sourceCode" id="cb3"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb3-1"><a href="#cb3-1"></a><span class="kw">queryMETA</span>(vcf)</span></code></pre></div> <pre class="r-output"><code>## [1] "FILTER=ID=LowQual" "FORMAT=ID=AD" "FORMAT=ID=DP" "FORMAT=ID=GQ" "FORMAT=ID=GT" ## [6] "FORMAT=ID=PL" "GATKCommandLine=ID=HaplotypeCaller" "INFO=ID=AC" "INFO=ID=AF" "INFO=ID=AN" ## [11] "INFO=ID=BaseQRankSum" "INFO=ID=ClippingRankSum" "INFO=ID=DP" "INFO=ID=DS" "INFO=ID=FS" ## [16] "INFO=ID=HaplotypeScore" "INFO=ID=InbreedingCoeff" "INFO=ID=MLEAC" "INFO=ID=MLEAF" "INFO=ID=MQ" ## [21] "INFO=ID=MQ0" "INFO=ID=MQRankSum" "INFO=ID=QD" "INFO=ID=ReadPosRankSum" "INFO=ID=SOR" ## [26] "1 contig=<IDs omitted from queryMETA" </code></pre> <p>When the function <code>queryMETA()</code> is called with only a <em>vcfR</em> object as a parameter, it attempts to summarize the meta information. Not all of the information is returned. For example, ‘contig’ elements are not returned. This is an attempt to summarize information that may be most useful for comprehension of the file’s contents.</p> <div class="sourceCode" id="cb4"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb4-1"><a href="#cb4-1"></a><span class="kw">queryMETA</span>(vcf, <span class="dt">element =</span> <span class="st">'DP'</span>)</span></code></pre></div> <pre class="r-output"><code>## [[1]] ## [1] "FORMAT=ID=DP" "Number=1" ## [3] "Type=Integer" "Description=Approximate read depth (reads with MQ=255 or with bad mates are filtered)" ## ## [[2]] ## [1] "INFO=ID=DP" "Number=1" ## [3] "Type=Integer" "Description=Approximate read depth; some reads may have been filtered" </code></pre> <p>When an element parameter is included, only the information about that element is returned. In this example the element ‘DP’ is returned. We see that this acronym is defined as both a ‘FORMAT’ and ‘INFO’ acronym. We can narrow down our query by including more information in the element parameter.</p> <div class="sourceCode" id="cb5"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb5-1"><a href="#cb5-1"></a><span class="kw">queryMETA</span>(vcf, <span class="dt">element =</span> <span class="st">'FORMAT=<ID=DP'</span>)</span></code></pre></div> <pre class="r-output"><code>## [[1]] ## [1] "FORMAT=ID=DP" "Number=1" ## [3] "Type=Integer" "Description=Approximate read depth (reads with MQ=255 or with bad mates are filtered)" </code></pre> <p>Here we’ve isolated the definition of ‘DP’ as a ‘FORMAT’ element. Note that the function <code>queryMETA()</code> includes the parameter <code>nice</code> which by default is TRUE and attempts to present the data in a nicely formatted manner. However, our query is performed on the actual information in the ‘meta’ region. It is therefore sometimes appropriate to set <code>nice = FALSE</code> so that we can see the raw data. In the above example the angled bracket (‘<’) is omitted from the <code>nice = TRUE</code> representation but is essential to distinguishing the ‘FORMAT’ element from the ‘INFO’ element.</p> </div> <div id="the-fix-region" class="section level2"> <h2>The fix region</h2> <p>The fix region contains information for each variant which is sometimes summarized over all samples. The first eight columns of the fixed region are titled CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO. This is per variant information which is ‘fixed’, or the same, over all samples. The first two columns indicate the location of the variant by chromosome and position within that chromosome. Here, the ID field has not been used, so it consists of missing data (NA). The REF and ALT columns indicate the reference and alternate allelic states for a diploid sample. When multiple alternate allelic states are present they are delimited with commas. The QUAL column attempts to summarize the quality of each variant over all samples. The FILTER field is not used here but could contain information on whether a variant has passed some form of quality assessment.</p> <div class="sourceCode" id="cb6"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb6-1"><a href="#cb6-1"></a><span class="kw">head</span>(<span class="kw">getFIX</span>(vcf))</span></code></pre></div> <pre class="r-output"><code>## CHROM POS ID REF ALT QUAL FILTER ## [1,] "Supercontig_1.50" "2" NA "T" "A" "44.44" NA ## [2,] "Supercontig_1.50" "246" NA "C" "G" "144.21" NA ## [3,] "Supercontig_1.50" "549" NA "A" "C" "68.49" NA ## [4,] "Supercontig_1.50" "668" NA "G" "C" "108.07" NA ## [5,] "Supercontig_1.50" "765" NA "A" "C" "92.78" NA ## [6,] "Supercontig_1.50" "780" NA "G" "T" "58.38" NA </code></pre> <p>The eigth column, titled INFO, is a semicolon delimited list of information. It can be rather long and cumbersome. The function <code>getFIX()</code> will suppress this column by default. Each abbreviation in the INFO column should be defined in the meta section. We can validate this by querying the meta portion, as we did in the ‘meta’ section above.</p> </div> <div id="the-gt-region" class="section level2"> <h2>The gt region</h2> <p>The gt (genotype) region contains information about each variant for each sample. The values for each variant and each sample are colon delimited. Multiple types of data for each genotype may be stored in this manner. The format of the data is specified by the FORMAT column (column nine). Here we see that we have information for GT, AD, DP, GQ and PL. The definition of these acronyms can be referenced by querying the the meta region, as demonstrated previously. Every variant does not necessarily have the same information (e.g., SNPs and indels may be handled differently), so the rows are best treated independently. Different variant callers may include different information in this region.</p> <div class="sourceCode" id="cb7"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb7-1"><a href="#cb7-1"></a>vcf<span class="op">@</span>gt[<span class="dv">1</span><span class="op">:</span><span class="dv">6</span>, <span class="dv">1</span><span class="op">:</span><span class="dv">4</span>]</span></code></pre></div> <pre class="r-output"><code>## FORMAT BL2009P4_us23 DDR7602 IN2009T1_us22 ## [1,] "GT:AD:DP:GQ:PL" "0|0:62,0:62:99:0,190,2835" "0|0:12,0:12:39:0,39,585" "0|0:37,0:37:99:0,114,1709" ## [2,] "GT:AD:DP:GQ:PL" "1|0:5,5:10:99:111,0,114" NA "0|1:2,1:3:16:16,0,48" ## [3,] "GT:AD:DP:GQ:PL" NA NA "0|0:2,0:2:6:0,6,51" ## [4,] "GT:AD:DP:GQ:PL" "0|0:1,0:1:3:0,3,44" NA "1|1:0,1:1:3:25,3,0" ## [5,] "GT:AD:DP:GQ:PL" "0|0:2,0:2:6:0,6,49" "0|0:1,0:1:3:0,3,34" "0|0:1,0:1:3:0,3,31" ## [6,] "GT:AD:DP:GQ:PL" "0|0:2,0:2:6:0,6,49" "0|0:1,0:1:3:0,3,34" "0|0:3,0:3:9:0,9,85" </code></pre> </div> <div id="vcfr" class="section level2"> <h2>vcfR</h2> <p>Using the R package <em>vcfR</em>, we can read VCF format files into memory using the function <code>read.vcfR()</code>. Once in memory we can use the <code>head()</code> method to summarize the information in the three VCF regions.</p> <div class="sourceCode" id="cb8"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb8-1"><a href="#cb8-1"></a>vcf <-<span class="st"> </span><span class="kw">read.vcfR</span>(<span class="st">"pinfsc50_filtered.vcf.gz"</span>)</span></code></pre></div> <pre class="r-output"><code>## Scanning file to determine attributes. ## File attributes: ## meta lines: 29 ## header_line: 30 ## variant count: 2190 ## column count: 27 ## Meta line 29 read in. ## All meta lines processed. ## gt matrix initialized. ## Character matrix gt created. ## Character matrix gt rows: 2190 ## Character matrix gt cols: 27 ## skip: 0 ## nrows: 2190 ## row_num: 0 ## Processed variant 1000 Processed variant 2000 Processed variant: 2190 ## All variants processed </code></pre> <div class="sourceCode" id="cb9"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb9-1"><a href="#cb9-1"></a><span class="kw">head</span>(vcf)</span></code></pre></div> <pre class="r-output"><code>## [1] "***** Object of class 'vcfR' *****" ## [1] "***** Meta section *****" ## [1] "##fileformat=VCFv4.1" ## [1] "##source=\"GATK haplotype Caller, phased with beagle4\"" ## [1] "##FILTER=<ID=LowQual,Description=\"Low quality\">" ## [1] "##FORMAT=<ID=AD,Number=.,Type=Integer,Description=\"Allelic depths fo [Truncated]" ## [1] "##FORMAT=<ID=DP,Number=1,Type=Integer,Description=\"Approximate read [Truncated]" ## [1] "##FORMAT=<ID=GQ,Number=1,Type=Integer,Description=\"Genotype Quality\">" ## [1] "First 6 rows." ## [1] ## [1] "***** Fixed section *****" ## CHROM POS ID REF ALT QUAL FILTER ## [1,] "Supercontig_1.50" "80058" NA "T" "G,TACTG" "3480.23" NA ## [2,] "Supercontig_1.50" "80063" NA "C" "T" "3016.89" NA ## [3,] "Supercontig_1.50" "80067" NA "A" "C" "3555.08" NA ## [4,] "Supercontig_1.50" "80073" NA "C" "A" "104.72" NA ## [5,] "Supercontig_1.50" "80074" NA "A" "G" "2877.74" NA ## [6,] "Supercontig_1.50" "80089" NA "A" "ACG" "2250.92" NA ## [1] ## [1] "***** Genotype section *****" ## FORMAT BL2009P4_us23 DDR7602 IN2009T1_us22 LBUS5 ## [1,] "GT:AD:DP:GQ:PL" "1|0:25,3,0:28:45:45,0,1120,129,1134,1300" "1|0:19,7,0:26:99:237,0,777,300,804,1181" "0|1:29,6,0:35:99:162,0,1229,252,1252,1512" "0|1:19,7,0:26:99:237,0,777,300,804,1181" ## [2,] "GT:AD:DP:GQ:PL" "1|0:29,3:32:30:30,0,1335" "1|0:20,7:27:99:234,0,819" "0|1:27,7:34:99:204,0,1232" "0|1:20,7:27:99:234,0,819" ## [3,] "GT:AD:DP:GQ:PL" "1|0:31,3:34:27:27,0,1372" "1|0:19,6:25:99:189,0,864" "0|1:27,6:33:99:210,0,1155" "0|1:19,6:25:99:189,0,864" ## [4,] "GT:AD:DP:GQ:PL" "0|0:30,0:30:99:0,102,1530" "0|0:26,0:26:87:0,87,1305" "0|0:33,0:33:99:0,99,1485" "0|0:26,0:26:87:0,87,1305" ## [5,] "GT:AD:DP:GQ:PL" "0|0:30,0:30:93:0,93,1395" "1|0:21,4:25:99:147,0,867" "0|1:27,6:33:99:171,0,1116" "0|1:21,4:25:99:147,0,867" ## [6,] "GT:AD:DP:GQ:PL" "0|0:33,0:33:99:0,99,1485" "1|0:20,2:22:18:18,0,918" "0|1:27,7:34:99:213,0,1113" "0|1:20,2:22:18:18,0,918" ## NL07434 ## [1,] "0|1:45,19,0:64:99:643,0,1782,793,1866,2825" ## [2,] "0|1:42,18:60:99:655,0,1748" ## [3,] "0|1:41,16:57:99:584,0,1737" ## [4,] "0|0:56,0:56:99:0,172,2565" ## [5,] "0|1:39,16:55:99:629,0,1709" ## [6,] "0|1:34,12:46:99:393,0,1518" ## [1] "First 6 columns only." ## [1] ## [1] "Unique GT formats:" ## [1] "GT:AD:DP:GQ:PL" ## [1] </code></pre> <p>After we have made any manipulations of the file we can save it as a VCF file with the function <code>write.vcf()</code>.</p> <div class="sourceCode" id="cb10"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb10-1"><a href="#cb10-1"></a><span class="kw">write.vcf</span>(vcf, <span class="st">"myVCFdata_filtered.vcf.gz"</span>)</span></code></pre></div> <p><code>write.vcf()</code>will write a file to your active directory. We now have a summary of our VCF file which we can use to help understand what forms of information are contained within it. This information can be further explored with plotting functions and used to filter the VCF file for high quality variants as we will see in the next section.</p> </div> <div id="exercises" class="section level2"> <h2>Exercises</h2> <p><strong>1)</strong> How would we find more information about <code>read.vcfR()</code>?</p> <button class="btn btn-danger" data-toggle="collapse" data-target="#hide_buttonunnamed-chunk-12"> Show solution </button> <div id="hide_buttonunnamed-chunk-12" class="collapse"> <div class="sourceCode" id="cb11"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb11-1"><a href="#cb11-1"></a>?read.vcfR</span></code></pre></div> </div> <p><br /></p> <p><strong>2)</strong> How would we learn what the acronym “AD” stands for?</p> <button class="btn btn-danger" data-toggle="collapse" data-target="#hide_buttonunnamed-chunk-13"> Show solution </button> <div id="hide_buttonunnamed-chunk-13" class="collapse"> <div class="sourceCode" id="cb12"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb12-1"><a href="#cb12-1"></a><span class="kw">queryMETA</span>(vcf, <span class="dt">element =</span> <span class="st">'AD'</span>)</span></code></pre></div> <pre class="r-output"><code>## [[1]] ## [1] "FORMAT=ID=AD" "Number=." ## [3] "Type=Integer" "Description=Allelic depths for the ref and alt alleles in the order listed" </code></pre> </div> <p><br /></p> <p><strong>3)</strong> We used the <code>head()</code> function to view the <strong>first</strong> few lines of <code>fix</code> data. How would we view the <strong>last</strong> few lines of <code>fix</code> data?</p> <button class="btn btn-danger" data-toggle="collapse" data-target="#hide_buttonunnamed-chunk-14"> Show solution </button> <div id="hide_buttonunnamed-chunk-14" class="collapse"> <div class="sourceCode" id="cb13"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb13-1"><a href="#cb13-1"></a><span class="kw">tail</span>(vcf<span class="op">@</span>fix)</span></code></pre></div> <pre class="r-output"><code>## CHROM POS ID REF ALT QUAL FILTER ## [2185,] "Supercontig_1.50" "1041878" NA "T" "C" "219.68" NA ## [2186,] "Supercontig_1.50" "1041989" NA "G" "A" "803.29" NA ## [2187,] "Supercontig_1.50" "1041997" NA "GA" "G" "472.52" NA ## [2188,] "Supercontig_1.50" "1042000" NA "A" "T" "481.57" NA ## [2189,] "Supercontig_1.50" "1042001" NA "T" "C" "93.70" NA ## [2190,] "Supercontig_1.50" "1042002" NA "C" "T" "487.57" NA ## INFO ## [2185,] "AC=1;AF=0.028;AN=36;BaseQRankSum=-0.961;ClippingRankSum=-0.868;DP=502;FS=0.000;InbreedingCoeff=-0.0289;MLEAC=1;MLEAF=0.028;MQ=59.12;MQ0=0;MQRankSum=-3.092;QD=11.56;ReadPosRankSum=0.765;SOR=0.665" ## [2186,] "AC=2;AF=0.056;AN=36;BaseQRankSum=-4.021;ClippingRankSum=-1.419;DP=482;FS=0.000;InbreedingCoeff=0.9586;MLEAC=2;MLEAF=0.056;MQ=54.82;MQ0=0;MQRankSum=-0.944;QD=27.70;ReadPosRankSum=-2.468;SOR=0.823" ## [2187,] "AC=5;AF=0.139;AN=36;BaseQRankSum=1.722;ClippingRankSum=1.829;DP=490;FS=5.543;InbreedingCoeff=-0.1625;MLEAC=5;MLEAF=0.139;MQ=54.37;MQ0=0;MQRankSum=0.295;QD=6.14;ReadPosRankSum=-0.116;SOR=1.724" ## [2188,] "AC=5;AF=0.139;AN=36;BaseQRankSum=1.506;ClippingRankSum=-0.134;DP=475;FS=5.618;InbreedingCoeff=-0.1625;MLEAC=5;MLEAF=0.139;MQ=54.23;MQ0=0;MQRankSum=-0.193;QD=6.78;ReadPosRankSum=0.529;SOR=1.719" ## [2189,] "AC=1;AF=0.028;AN=36;BaseQRankSum=-1.590;ClippingRankSum=-0.469;DP=477;FS=3.099;InbreedingCoeff=-0.0304;MLEAC=1;MLEAF=0.028;MQ=54.21;MQ0=0;MQRankSum=1.586;QD=4.46;ReadPosRankSum=-3.047;SOR=1.591" ## [2190,] "AC=5;AF=0.139;AN=36;BaseQRankSum=1.731;ClippingRankSum=-0.674;DP=477;FS=5.655;InbreedingCoeff=-0.1625;MLEAC=5;MLEAF=0.139;MQ=54.21;MQ0=0;MQRankSum=0.245;QD=6.87;ReadPosRankSum=0.985;SOR=1.716" </code></pre> </div> <p><br /></p> <p><strong>4)</strong> There is a column in the <code>fix</code> portion of the data called <code>QUAL</code>. It is not defined in the <code>meta</code> portion of the data because it is defined in the <a href="http://samtools.github.io/hts-specs/">VCF specification</a>. It stands for ‘quality’. Does <code>QUAL</code> appear useful to us? Why or why not?</p> <button class="btn btn-danger" data-toggle="collapse" data-target="#hide_buttonunnamed-chunk-15"> Show solution </button> <div id="hide_buttonunnamed-chunk-15" class="collapse"> <div class="sourceCode" id="cb14"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb14-1"><a href="#cb14-1"></a><span class="kw">plot</span>(vcf)</span></code></pre></div> <p><img src="reading_vcf_files/figure-html/unnamed-chunk-15-1.png" width="672" /></p> <div class="sourceCode" id="cb15"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb15-1"><a href="#cb15-1"></a><span class="co"># Alternate ggplot2 answer.</span></span> <span id="cb15-2"><a href="#cb15-2"></a><span class="kw">library</span>(ggplot2)</span> <span id="cb15-3"><a href="#cb15-3"></a><span class="kw">qplot</span>(<span class="kw">getQUAL</span>(vcf), <span class="dt">geom =</span> <span class="st">"histogram"</span>)</span></code></pre></div> <img src="reading_vcf_files/figure-html/unnamed-chunk-15-2.png" width="672" /> </div> <p><br /></p> <p><strong>5)</strong> How would we query the sample names?</p> <button class="btn btn-danger" data-toggle="collapse" data-target="#hide_buttonunnamed-chunk-16"> Show solution </button> <div id="hide_buttonunnamed-chunk-16" class="collapse"> <div class="sourceCode" id="cb16"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb16-1"><a href="#cb16-1"></a><span class="kw">colnames</span>(vcf<span class="op">@</span>gt)</span></code></pre></div> <pre class="r-output"><code>## [1] "FORMAT" "BL2009P4_us23" "DDR7602" "IN2009T1_us22" "LBUS5" "NL07434" "P10127" "P10650" "P11633" "P12204" "P13527" "P1362" ## [13] "P13626" "P17777us22" "P6096" "P7722" "RS2009P1_us8" "blue13" "t30-4" </code></pre> </div> <p><br /></p> <p>Note that the first column is <code>FORMAT</code>. This tells us the format for data for each variant. According to the VCF specification this can be different for each variant.</p> </div> <div id="references" class="section level1 unnumbered"> <h1>References</h1> <div id="refs" class="references"> <div id="ref-danecek2011variant"> <p>Danecek P., Auton A., Abecasis G., Albers CA., Banks E., DePristo MA., Handsaker RE., Lunter G., Marth GT., Sherry ST. et al. 2011. The variant call format and VCFtools. <em>Bioinformatics</em> 27:2156–2158. Available at: <a href="https://doi.org/10.1093/bioinformatics/btr330">https://doi.org/10.1093/bioinformatics/btr330</a></p> </div> <div id="ref-knaus2017vcfr"> <p>Knaus BJ., Grünwald NJ. 2017. <span class="math inline">\({V}cfr\)</span>: A package to manipulate and visualize variant call format data in R. <em>Molecular Ecology Resources</em> 17:44–53. Available at: <a href="http://dx.doi.org/10.1111/1755-0998.12549">http://dx.doi.org/10.1111/1755-0998.12549</a></p> </div> </div> </div> </div> <script> // add bootstrap table styles to pandoc tables function bootstrapStylePandocTables() { $('tr.odd').parent('tbody').parent('table').addClass('table table-condensed'); } $(document).ready(function () { bootstrapStylePandocTables(); }); </script> <!-- tabsets --> <script> $(document).ready(function () { window.buildTabsets("TOC"); }); $(document).ready(function () { $('.tabset-dropdown > .nav-tabs > li').click(function () { $(this).parent().toggleClass('nav-tabs-open') }); }); </script> <!-- code folding --> <!-- dynamically load mathjax for compatibility with self-contained --> <script> (function () { var script = document.createElement("script"); script.type = "text/javascript"; script.src = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"; document.getElementsByTagName("head")[0].appendChild(script); })(); </script> </body> </html>