{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Population stratification illustration using slicing of remote VCF with Bioconductor\n", "\n", "We will acquire a modest collection of SNP calls from chr17, and project the genotype configurations via principal components to expose population substructure.\n", "\n", "## Create references to the EBI VCF repository\n", "\n", "Bioconductor’s ldblock package includes utilities for working with collections of VCF, typically decomposed by chromosome." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "library(BiocManager)\n", "inplist = rownames(installed.packages())\n", "if (!(\"snpStats\" %in% inplist)) install(\"snpStats\")\n", "if (!(\"ldblock\" %in% inplist)) install(\"vjcitn/ldblock\")\n", "if (!(\"terravar\" %in% inplist)) install(\"vjcitn/terravar\")\n" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "library(ldblock)\n", "library(GenomicFiles)\n", "library(VariantAnnotation)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Read a slice of VCF corresponding to a region of chr17\n", "\n", "EBI has updated 1000 genomes calls. We will interrogate a tabix-indexed VCF for chr17 via HTTP. An object that organizes multiple chromosomes is created using stack1kg from the ldblock package." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "st = stack1kg(\"17\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use a ScanVcfParam to define a slice of genome." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "sp = ScanVcfParam(geno=\"GT\", \n", " which=GRanges(\"17\", IRanges(32e6,33.75e6)))\n", "myread = readVcfStack(st, param=sp) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`readVcfStack` produces a CollapsedVCF instance in memory. See the documentation for the VariantAnnotation package for details on this data structure." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "class: CollapsedVCF \n", "dim: 49194 2548 \n", "rowRanges(vcf):\n", " GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER\n", "info(vcf):\n", " DataFrame with 12 columns: AF, AC, NS, AN, EAS_AF, EUR_AF, AFR_AF, AMR_AF,...\n", "info(header(vcf)):\n", " Number Type Description \n", " AF A Float Estimated allele frequency in the range (0,1) \n", " AC A Integer Total number of alternate alleles in called geno...\n", " NS 1 Integer Number of samples with data \n", " AN 1 Integer Total number of alleles in called genotypes \n", " EAS_AF A Float Allele frequency in the EAS populations calculat...\n", " EUR_AF A Float Allele frequency in the EUR populations calculat...\n", " AFR_AF A Float Allele frequency in the AFR populations calculat...\n", " AMR_AF A Float Allele frequency in the AMR populations calculat...\n", " SAS_AF A Float Allele frequency in the SAS populations calculat...\n", " VT . String indicates what type of variant the line represents \n", " EX_TARGET 0 Flag indicates whether a variant is within the exon p...\n", " DP 1 Integer Approximate read depth; some reads may have been...\n", "geno(vcf):\n", " SimpleList of length 1: GT\n", "geno(header(vcf)):\n", " Number Type Description \n", " GT 1 String Phased Genotype" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "myread" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Transform the VCF content to rare allele counts\n", "\n", "We use tools in David Clayton’s snpStats package to achieve a compact representation of (possibly uncertain) genotype calls. This enables us to filter to SNP with MAF exceeding a given threshold, and, later, to compute a PCA." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "non-single nucleotide variations are set to NA\n" ] }, { "data": { "text/html": [ "2909" ], "text/latex": [ "2909" ], "text/markdown": [ "2909" ], "text/plain": [ "[1] 2909" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "library(snpStats)\n", "mymat = genotypeToSnpMatrix(myread)\n", "## non-single nucleotide variations are set to NA\n", "cs = col.summary(mymat[[1]])\n", "sum(cs[,\"MAF\"]>.1, na.rm=TRUE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Build a matrix of B-allele counts\n", "\n", "We'll use SNP with MAF exceeding 10%, and build the sample x SNP matrix." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "kpsnp = which(cs[,\"MAF\"]>.1)\n", "cmat = matrix(0, nr=nrow(mymat[[1]]), nc=length(kpsnp))\n", "for (i in seq_len(length(kpsnp))) {\n", " cmat[,i] = as(mymat[[1]][,kpsnp[i]], \"numeric\")\n", " }\n", "rownames(cmat) = rownames(mymat[[1]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Compute principal components and plot" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "pp = prcomp(cmat)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "