{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# BayesC0\n", "##Rohan L. Fernando, Hao Cheng and Dorian Garrick\n", "## June 2015" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*You are free to use this code in your work, but please acknowledge the JWAS package and website until the associated paper, \"JWAS: Julia Implementation of Whole-Genome Analyses Software\", is published.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Simulating Genotypes and Phenotypes" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "using(Distributions)\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "nObs = 100\n", "nMarkers = 1000\n", "X = sample([0,1,2],(nObs,nMarkers))\n", "α = randn(nMarkers)\n", "a = X*α\n", "stdGen = std(a)\n", "a = a/stdGen\n", "y = a + randn(nObs)\n", "saveAlpha = α\n", "nothing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "####Centering Genotype Covariates" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "meanXCols = mean(X,1)\n", "X = X - ones(nObs,1)*meanXCols;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Priors" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [], "source": [ "seed = 10 # set the seed for the random number generator\n", "chainLength = 2000 # number of iterations\n", "probFixed = 0 # parameter \"pi\" the probability SNP effect is zero\n", "dfEffectVar = 4 # hyper parameter (degrees of freedom) for locus effect variance \n", "nuRes = 4 # hyper parameter (degrees of freedom) for residual variance\n", "varGenotypic = 1 # used to derive hyper parameter (scale) for locus effect variance \n", "varResidual = 1 # used to derive hyper parameter (scale) for locus effect variance \n", "scaleVar = varGenotypic*(dfEffectVar-2)/dfEffectVar # scale factor for locus effects\n", "scaleRes = varResidual*(nuRes-2)/nuRes # scale factor for residual variance\n", "nothing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Function for Sampling Marker Effects" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we make an array, *xArray*, that contains the columns of $\\mathbf{X}$. This is done without duplicating each column of $\\mathbf{X}$." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "get_column (generic function with 1 method)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "function get_column(X,nRows,j)\n", " indx = 1 + (j-1)*nRows\n", " ptr = pointer(X,indx)\n", " pointer_to_array(ptr,nRows)\n", "end" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "xpx = [(X[:,i]'X[:,i])[1]::Float64 for i=1:nMarkers]\n", "xArray = Array(Array{Float64,1},nMarkers)\n", "for i=1:nMarkers\n", " xArray[i] = get_column(X,nObs,i)\n", "end" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Array{Float64,1}" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "typeof(xArray[1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Computing the adjusted right-hand-side efficiently\n", "\n", "We want to compute:\n", "$$\n", "rhs = \\mathbf{X}_j'(\\mathbf{y}_{corr} + \\mathbf{X}_j\\mathbf{\\alpha}_j),\n", "$$\n", "where $\\mathbf{y}_{corr} = \\mathbf{y} - \\mathbf{X}\\mathbf{\\alpha}.$ This is more efficiently obtained as\n", "$$\n", "rhs = \\mathbf{X}_j'\\mathbf{y}_{corr} + \\mathbf{X}_j'\\mathbf{X}_j\\mathbf{\\alpha}_j,\n", "$$\n", "using the diagonals of $\\mathbf{X}'\\mathbf{X}$ that have already been computed (line 4 of the function below)." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "sampleEffects! (generic function with 1 method)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "function sampleEffects!(nMarkers,xArray,xpx,yCorr,α,meanAlpha,vare,varEffects)\n", " nObs = size(X,1)\n", " for j=1:nMarkers\n", " rhs::Float64 = dot(xArray[j],yCorr) + xpx[j]*α[j] \n", " lhs::Float64 = xpx[j] + vare/varEffects\n", " invLhs::Float64 = 1.0/lhs\n", " mean::Float64 = invLhs*rhs\n", " oldAlpha::Float64 = α[j] \n", " α[j] = mean + randn()*sqrt(invLhs*vare)\n", " BLAS.axpy!(oldAlpha-α[j],xArray[j],yCorr) \n", " end\n", " nothing\n", "end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Function for BayesC0\n", "\n", "The intercept is sampled first and the sampleEffects! function is called to sample the marker effects." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "BayesC0! (generic function with 1 method)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "chi1=Chisq(nObs+nuRes)\n", "chi2=Chisq(dfEffectVar+nMarkers)\n", "\n", "function BayesC0!(numIter,nMarkers,X,xpx,yCorr,mu,meanMu,α,meanAlpha,vare,varEffects)\n", " for i=1:numIter\n", " # sample residula variance\n", " vare = (dot(yCorr,yCorr)+nuRes*scaleRes)/rand(chi1)\n", " \n", " # sample intercept\n", " yCorr = yCorr+mu\n", " rhs = sum(yCorr) \n", " invLhs = 1.0/(nObs) \n", " mean = rhs*invLhs \n", " mu = mean + randn()*sqrt(invLhs*vare)\n", " yCorr = yCorr - mu \n", " meanMu = meanMu + (mu - meanMu)/i\n", " \n", " # sample effects\n", " sampleEffects!(nMarkers,xArray,xpx,yCorr,α,meanAlpha,vare,varEffects)\n", " meanAlpha = meanAlpha + (α - meanAlpha)/i\n", " \n", " #sameple locus effect variance\n", " varEffects = (scaleVar*dfEffectVar + dot(α,α))/rand(chi2) \n", " \n", " if (i%1000)==0\n", " yhat = meanMu+X*meanAlpha\n", " resCorr = cor(a,yhat)\n", " println (\"Correlation of between true and predicted breeding value: \", resCorr)\n", " end\n", " end\n", "end " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Run BayesC0" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Correlation of between true and predicted breeding value: 0.6344163465804081\n", "Correlation of between true and predicted breeding value: 0.6346902997156448\n", "elapsed time: 0.216529693 seconds (53211800 bytes allocated, 11.48% gc time)\n" ] } ], "source": [ "meanMu = 0\n", "meanAlpha = zeros(nMarkers) \n", "\n", "#initial valus\n", "vare = 1\n", "varEffects = 1\n", "mu = mean(y)\n", "yCorr = y - mu\n", "alpha = fill(0.0,nMarkers)\n", "\n", "#run it\n", "@time BayesC0!(chainLength,nMarkers,X,xpx,yCorr,mu,meanMu,alpha,meanAlpha,vare,varEffects)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Compare Runtime with R Implementation" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " user system elapsed \n", " 48.341 1.328 49.707 \n" ] } ], "source": [ ";Rscript RBayesC0/BayesC0.R" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "222.76497695852535" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "48.34/.217" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "# This code is for illustrative purposes and not efficient for large problems\n", "# Real life data analysis (using the same file formats) is available at \n", "# bigs.ansci.iastate.edu/login.html based on GenSel cpp software implementation\n", "# \n", "# Rohan Fernando (rohan@iastate.edu)\n", "# Dorian Garrick (dorian@iastate.edu) \n", "# copyright August 2012\n", "\n", "# Parameters\n", "setwd(\"RBayesC0\")\n", "seed = 10 # set the seed for the random number generator\n", "chainLength = 2000 # number of iterations\n", "dfEffectVar = 4 # hyper parameter (degrees of freedom) for locus effect variance \n", "nuRes = 4 # hyper parameter (degrees of freedom) for residual variance\n", "varGenotypic = 1 # used to derive hyper parameter (scale) for locus effect variance \n", "varResidual = 1 # used to derive hyper parameter (scale) for residual variance \n", "windowSize = 10 # number of consecutive markers in a genomic window\n", "outputFrequency = 100 # frequency for reporting performance and for computing genetic variances\n", "\n", "markerFileName = \"genotypes.dat\"\n", "trainPhenotypeFileName = \"trainPhenotypes.dat\"\n", "testPhenotypeFileName = \"testPhenotypes.dat\"\n", "\n", "set.seed(seed)\n", "\n", "genotypeFile = read.table(markerFileName, header=TRUE) # this is not efficient for large files!\n", "trainPhenotypeFile = read.table(trainPhenotypeFileName, skip=1)[,1:2] # skip the header as R dislikes # character\n", "testPhenotypeFile = read.table(testPhenotypeFileName, skip=1)[,1:2] # skip the header as R dislikes # character\n", "commonTrainingData = merge(trainPhenotypeFile, genotypeFile, by.x=1, by.y=1) # Only use animals with genotype and phenotype\n", "commonTestData = merge(testPhenotypeFile, genotypeFile, by.x=1, by.y=1) # Only use animals with genotype and phenotype\n", "\n", "\n", "remove(genotypeFile) # Free up space \n", "remove(trainPhenotypeFile) # Free up space \n", "remove(testPhenotypeFile) # Free up space \n", "animalID = unname(as.matrix(commonTrainingData[,1])) # First field is animal identifier\n", "y = commonTrainingData[, 2] # Second field is trait values\n", "Z = commonTrainingData[, 3: ncol(commonTrainingData)] # Remaining fields are GenSel-coded genotypes \n", "Z = unname(as.matrix((Z + 10)/10)); # Recode genotypes to 0, 1, 2 (number of B alleles)\n", "markerID = colnames(commonTrainingData)[3:ncol(commonTrainingData)] # Remember the marker locus identifiers\n", "remove(commonTrainingData)\n", "\n", "testID = unname(as.matrix(commonTestData[,1])) # First field is animal identifier\n", "yTest = commonTestData[, 2] # Second field is trait values\n", "ZTest = commonTestData[, 3: ncol(commonTestData)] # Remaining fields are GenSel-coded genotypes \n", "ZTest = unname(as.matrix((ZTest + 10)/10)); # Recode genotypes to 0, 1, 2 (number of B alleles)\n", "remove(commonTestData)\n", "\n", "nmarkers = ncol(Z) # number of markers\n", "nrecords = nrow(Z) # number of animals\n", "\n", "# center the genotype matrix to accelerate mixing \n", "markerMeans = colMeans(Z) # compute the mean for each marker\n", "Z = t(t(Z) - markerMeans) # deviate covariate from its mean\n", "p = markerMeans/2.0 # compute frequency B allele for each marker\n", "mean2pq = mean(2*p*(1-p)) # compute mean genotype variance\n", "\n", "varEffects = varGenotypic/(nmarkers*mean2pq) # variance of locus effects is computed from genetic variance \n", " #(e.g. Fernando et al., Acta Agriculturae Scand Section A, 2007; 57: 192-195)\n", "scaleVar = varEffects*(dfEffectVar-2)/dfEffectVar; # scale factor for locus effects\n", "scaleRes = varResidual*(nuRes-2)/nuRes # scale factor for residual variance\n", "\n", "numberWindows = nmarkers/windowSize # number of genomic windows\n", "numberSamples = chainLength/outputFrequency # number of samples of genetic variances\n", "\n", "\n", "alpha = array(0.0, nmarkers) # reserve a vector to store sampled locus effects \n", "meanAlpha = array(0.0, nmarkers) # reserve a vector to accumulate the posterior mean of locus effects\n", "modelFreq = array(0.0, nmarkers) # reserve a vector to store model frequency\n", "mu = mean(y) # starting value for the location parameter \n", "meanMu = 0 # reserve a scalar to accumulate the posterior mean\n", "geneticVar = array(0,numberSamples) # reserve a vector to store sampled genetic variances\n", " # reserve a matrix to store sampled proportion proportion of variance due to window \n", "windowVarProp = matrix(0,nrow=numberSamples,ncol=numberWindows)\n", "sampleCount = 0 # initialize counter for number of samples of genetic variances\n", " \n", "\n", "\n", "# adjust y for the fixed effect (ie location parameter)\n", "ycorr = y - mu\n", "\n", "\n", "ZPZ=t(Z)%*%Z\n", "zpz=diag(ZPZ)\n", "\n", "ptime=proc.time()\n", "# mcmc sampling\n", "for (iter in 1:chainLength){\n", "\t\n", "# sample residual variance\n", "\tvare = ( t(ycorr)%*%ycorr + nuRes*scaleRes )/rchisq(1,nrecords + nuRes) \n", "\t\n", "# sample intercept\n", "\tycorr = ycorr + mu # Unadjust y for the previous sample of mu\n", "\trhs = sum(ycorr) # Form X'y\n", "\tinvLhs = 1.0/nrecords # Form (X'X)-1\n", "\tmean = rhs*invLhs # Solve (X'X) mu = X'y \n", "\tmu = rnorm(1,mean,sqrt(invLhs*vare)) # Sample new location parameter \n", "\tycorr = ycorr - mu # Adjust y for the new sample of mu\n", "\tmeanMu = meanMu + mu # Accumulate the sum to compute posterior mean\n", "\t\n", "# sample effect for each locus\n", "\tfor (locus in 1:nmarkers){\n", "\t\t\n", "\t\trhs=t(Z[,locus])%*%ycorr +zpz[locus]*alpha[locus]\n", "\t\tmmeLhs = zpz[locus] + vare/varEffects # Form the coefficient matrix of MME\n", "\t\tinvLhs = 1.0/mmeLhs # Invert the coefficient matrix \n", "\t\tmean = invLhs*rhs # Solve the MME for locus effect\n", "\t\toldAlpha=alpha[locus]\n", "\t\talpha[locus]= rnorm(1,mean,sqrt(invLhs*vare)) # Sample the locus effect from data\n", "\t\tycorr = ycorr + Z[,locus]*(oldAlpha-alpha[locus]); # Adjust the data for this locus effect\n", "\t\tmeanAlpha[locus] = meanAlpha[locus] + alpha[locus]; # Accumulate the sum for posterior mean\n", "\t}\n", "\t\t\n", "\t# sample the common locus effect variance\t\t\n", "\tvarEffects = ( scaleVar*dfEffectVar + sum(alpha^2) )/rchisq(1,dfEffectVar+nmarkers) \n", "\t\t\n", "}\n", "\n", "proc.time()-ptime\n" ] } ], "source": [ ";cat RBayesC0/BayesC0.R" ] } ], "metadata": { "kernelspec": { "display_name": "Julia 0.3.7", "language": "julia", "name": "julia 0.3" }, "language_info": { "name": "julia", "version": "0.3.7" } }, "nbformat": 4, "nbformat_minor": 0 }