{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# BayesC0\n",
    "##Rohan L. Fernando, Hao Cheng and Dorian Garrick\n",
    "## June 2015"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*You are free to use this code in your work, but please acknowledge the JWAS package and website until the associated paper, \"JWAS: Julia Implementation of Whole-Genome Analyses Software\", is published.*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Simulating Genotypes and Phenotypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "using(Distributions)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "nObs     = 100\n",
    "nMarkers = 1000\n",
    "X = sample([0,1,2],(nObs,nMarkers))\n",
    "α  = randn(nMarkers)\n",
    "a  = X*α\n",
    "stdGen = std(a)\n",
    "a = a/stdGen\n",
    "y = a + randn(nObs)\n",
    "saveAlpha = α\n",
    "nothing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "####Centering Genotype Covariates"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "meanXCols = mean(X,1)\n",
    "X = X - ones(nObs,1)*meanXCols;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Priors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "seed            =   10    # set the seed for the random number generator\n",
    "chainLength     =  2000   # number of iterations\n",
    "probFixed       =    0    # parameter \"pi\" the probability SNP effect is zero\n",
    "dfEffectVar     =    4    # hyper parameter (degrees of freedom) for locus effect variance          \n",
    "nuRes           =    4    # hyper parameter (degrees of freedom) for residual variance\n",
    "varGenotypic    =    1    # used to derive hyper parameter (scale) for locus effect variance \n",
    "varResidual     =    1    # used to derive hyper parameter (scale) for locus effect variance \n",
    "scaleVar        =   varGenotypic*(dfEffectVar-2)/dfEffectVar    # scale factor for locus effects\n",
    "scaleRes        =   varResidual*(nuRes-2)/nuRes                 # scale factor for residual variance\n",
    "nothing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Function for Sampling Marker Effects"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First we make an array, *xArray*, that contains the columns of $\\mathbf{X}$. This is done without duplicating each column of $\\mathbf{X}$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "get_column (generic function with 1 method)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "function get_column(X,nRows,j)\n",
    "    indx = 1 + (j-1)*nRows\n",
    "    ptr = pointer(X,indx)\n",
    "    pointer_to_array(ptr,nRows)\n",
    "end"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "xpx = [(X[:,i]'X[:,i])[1]::Float64 for i=1:nMarkers]\n",
    "xArray = Array(Array{Float64,1},nMarkers)\n",
    "for i=1:nMarkers\n",
    "    xArray[i] = get_column(X,nObs,i)\n",
    "end"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Array{Float64,1}"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "typeof(xArray[1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Computing the adjusted right-hand-side efficiently\n",
    "\n",
    "We want to compute:\n",
    "$$\n",
    "rhs = \\mathbf{X}_j'(\\mathbf{y}_{corr} + \\mathbf{X}_j\\mathbf{\\alpha}_j),\n",
    "$$\n",
    "where $\\mathbf{y}_{corr} = \\mathbf{y} - \\mathbf{X}\\mathbf{\\alpha}.$ This is more efficiently obtained as\n",
    "$$\n",
    "rhs = \\mathbf{X}_j'\\mathbf{y}_{corr} + \\mathbf{X}_j'\\mathbf{X}_j\\mathbf{\\alpha}_j,\n",
    "$$\n",
    "using the diagonals of $\\mathbf{X}'\\mathbf{X}$ that have already been computed (line 4 of the function below)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "sampleEffects! (generic function with 1 method)"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "function sampleEffects!(nMarkers,xArray,xpx,yCorr,α,meanAlpha,vare,varEffects)\n",
    "    nObs = size(X,1)\n",
    "    for j=1:nMarkers\n",
    "        rhs::Float64      = dot(xArray[j],yCorr) + xpx[j]*α[j]   \n",
    "        lhs::Float64      = xpx[j] + vare/varEffects\n",
    "        invLhs::Float64   = 1.0/lhs\n",
    "        mean::Float64     = invLhs*rhs\n",
    "        oldAlpha::Float64 = α[j] \n",
    "        α[j] = mean + randn()*sqrt(invLhs*vare)\n",
    "        BLAS.axpy!(oldAlpha-α[j],xArray[j],yCorr)  \n",
    "    end\n",
    "    nothing\n",
    "end"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Function for BayesC0\n",
    "\n",
    "The intercept is sampled first and the sampleEffects! function is called to sample the marker effects."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BayesC0! (generic function with 1 method)"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chi1=Chisq(nObs+nuRes)\n",
    "chi2=Chisq(dfEffectVar+nMarkers)\n",
    "\n",
    "function BayesC0!(numIter,nMarkers,X,xpx,yCorr,mu,meanMu,α,meanAlpha,vare,varEffects)\n",
    "    for i=1:numIter\n",
    "        # sample residula variance\n",
    "        vare = (dot(yCorr,yCorr)+nuRes*scaleRes)/rand(chi1)\n",
    "        \n",
    "        # sample intercept\n",
    "        yCorr  = yCorr+mu\n",
    "        rhs    = sum(yCorr)                   \n",
    "        invLhs = 1.0/(nObs)               \n",
    "        mean   = rhs*invLhs   \n",
    "        mu     = mean + randn()*sqrt(invLhs*vare)\n",
    "        yCorr  = yCorr - mu                    \n",
    "        meanMu = meanMu + (mu - meanMu)/i\n",
    "        \n",
    "        # sample effects\n",
    "        sampleEffects!(nMarkers,xArray,xpx,yCorr,α,meanAlpha,vare,varEffects)\n",
    "        meanAlpha = meanAlpha + (α - meanAlpha)/i\n",
    "        \n",
    "        #sameple locus effect variance\n",
    "        varEffects = (scaleVar*dfEffectVar + dot(α,α))/rand(chi2)  \n",
    "               \n",
    "        if (i%1000)==0\n",
    "            yhat = meanMu+X*meanAlpha\n",
    "            resCorr = cor(a,yhat)\n",
    "            println (\"Correlation of between true and predicted breeding value: \", resCorr)\n",
    "        end\n",
    "    end\n",
    "end    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##Run BayesC0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Correlation of between true and predicted breeding value: 0.6344163465804081\n",
      "Correlation of between true and predicted breeding value: 0.6346902997156448\n",
      "elapsed time: 0.216529693 seconds (53211800 bytes allocated, 11.48% gc time)\n"
     ]
    }
   ],
   "source": [
    "meanMu    = 0\n",
    "meanAlpha = zeros(nMarkers)   \n",
    "\n",
    "#initial valus\n",
    "vare = 1\n",
    "varEffects = 1\n",
    "mu = mean(y)\n",
    "yCorr = y - mu\n",
    "alpha = fill(0.0,nMarkers)\n",
    "\n",
    "#run it\n",
    "@time BayesC0!(chainLength,nMarkers,X,xpx,yCorr,mu,meanMu,alpha,meanAlpha,vare,varEffects)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##Compare Runtime with R Implementation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   user  system elapsed \n",
      " 48.341   1.328  49.707 \n"
     ]
    }
   ],
   "source": [
    ";Rscript RBayesC0/BayesC0.R"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "222.76497695852535"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "48.34/.217"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false,
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "#   This code is for illustrative purposes and not efficient for large problems\n",
      "#   Real life data analysis (using the same file formats) is available at \n",
      "#   bigs.ansci.iastate.edu/login.html based on GenSel cpp software implementation\n",
      "# \n",
      "#               Rohan Fernando      (rohan@iastate.edu)\n",
      "#               Dorian Garrick      (dorian@iastate.edu) \n",
      "#               copyright August 2012\n",
      "\n",
      "# Parameters\n",
      "setwd(\"RBayesC0\")\n",
      "seed            =   10    # set the seed for the random number generator\n",
      "chainLength     =  2000    # number of iterations\n",
      "dfEffectVar     =    4    # hyper parameter (degrees of freedom) for locus effect variance          \n",
      "nuRes           =    4    # hyper parameter (degrees of freedom) for residual variance\n",
      "varGenotypic    =    1    # used to derive hyper parameter (scale) for locus effect variance \n",
      "varResidual     =    1    # used to derive hyper parameter (scale) for residual variance \n",
      "windowSize      =   10    # number of consecutive markers in a genomic window\n",
      "outputFrequency =   100    # frequency for reporting performance and for computing genetic variances\n",
      "\n",
      "markerFileName         = \"genotypes.dat\"\n",
      "trainPhenotypeFileName = \"trainPhenotypes.dat\"\n",
      "testPhenotypeFileName  = \"testPhenotypes.dat\"\n",
      "\n",
      "set.seed(seed)\n",
      "\n",
      "genotypeFile         = read.table(markerFileName, header=TRUE)                 # this is not efficient for large files!\n",
      "trainPhenotypeFile   = read.table(trainPhenotypeFileName, skip=1)[,1:2]        # skip the header as R dislikes # character\n",
      "testPhenotypeFile    = read.table(testPhenotypeFileName,  skip=1)[,1:2]        # skip the header as R dislikes # character\n",
      "commonTrainingData   = merge(trainPhenotypeFile, genotypeFile, by.x=1, by.y=1) # Only use animals with genotype and phenotype\n",
      "commonTestData       = merge(testPhenotypeFile,  genotypeFile, by.x=1, by.y=1) # Only use animals with genotype and phenotype\n",
      "\n",
      "\n",
      "remove(genotypeFile)                                                # Free up space \n",
      "remove(trainPhenotypeFile)                                          # Free up space \n",
      "remove(testPhenotypeFile)                                           # Free up space \n",
      "animalID = unname(as.matrix(commonTrainingData[,1]))                # First field is animal identifier\n",
      "y        = commonTrainingData[, 2]                                  # Second field is trait values\n",
      "Z        = commonTrainingData[, 3: ncol(commonTrainingData)]        # Remaining fields are GenSel-coded genotypes \n",
      "Z        = unname(as.matrix((Z + 10)/10));                          # Recode genotypes to 0, 1, 2 (number of B alleles)\n",
      "markerID = colnames(commonTrainingData)[3:ncol(commonTrainingData)] # Remember the marker locus identifiers\n",
      "remove(commonTrainingData)\n",
      "\n",
      "testID = unname(as.matrix(commonTestData[,1]))                  # First field is animal identifier\n",
      "yTest        = commonTestData[, 2]                              # Second field is trait values\n",
      "ZTest        = commonTestData[, 3: ncol(commonTestData)]        # Remaining fields are GenSel-coded genotypes \n",
      "ZTest        = unname(as.matrix((ZTest + 10)/10));              # Recode genotypes to 0, 1, 2 (number of B alleles)\n",
      "remove(commonTestData)\n",
      "\n",
      "nmarkers = ncol(Z)                                              # number of markers\n",
      "nrecords = nrow(Z)                                              # number of animals\n",
      "\n",
      "# center the genotype matrix to accelerate mixing \n",
      "markerMeans = colMeans(Z)                             # compute the mean for each marker\n",
      "Z = t(t(Z) - markerMeans)                             # deviate covariate from its mean\n",
      "p = markerMeans/2.0                                   # compute frequency B allele for each marker\n",
      "mean2pq = mean(2*p*(1-p))                             # compute mean genotype variance\n",
      "\n",
      "varEffects  = varGenotypic/(nmarkers*mean2pq)         # variance of locus effects is computed from genetic variance \n",
      "                                                      #(e.g. Fernando et al., Acta Agriculturae Scand Section A, 2007; 57: 192-195)\n",
      "scaleVar    = varEffects*(dfEffectVar-2)/dfEffectVar; # scale factor for locus effects\n",
      "scaleRes    = varResidual*(nuRes-2)/nuRes             # scale factor for residual variance\n",
      "\n",
      "numberWindows = nmarkers/windowSize                   # number of genomic windows\n",
      "numberSamples = chainLength/outputFrequency           # number of samples of genetic variances\n",
      "\n",
      "\n",
      "alpha           = array(0.0, nmarkers) # reserve a vector to store sampled locus effects \n",
      "meanAlpha       = array(0.0, nmarkers) # reserve a vector to accumulate the posterior mean of locus effects\n",
      "modelFreq       = array(0.0, nmarkers) # reserve a vector to store model frequency\n",
      "mu              = mean(y)              # starting value for the location parameter \n",
      "meanMu          = 0                    # reserve a scalar to accumulate the posterior mean\n",
      "geneticVar      = array(0,numberSamples) # reserve a vector to store sampled genetic variances\n",
      "                                       # reserve a matrix to store sampled proportion proportion of variance due to window \n",
      "windowVarProp   = matrix(0,nrow=numberSamples,ncol=numberWindows)\n",
      "sampleCount     = 0                    # initialize counter for number of samples of genetic variances\n",
      "                  \n",
      "\n",
      "\n",
      "# adjust y for the fixed effect (ie location parameter)\n",
      "ycorr = y - mu\n",
      "\n",
      "\n",
      "ZPZ=t(Z)%*%Z\n",
      "zpz=diag(ZPZ)\n",
      "\n",
      "ptime=proc.time()\n",
      "# mcmc sampling\n",
      "for (iter in 1:chainLength){\n",
      "\t\n",
      "# sample residual variance\n",
      "\tvare = ( t(ycorr)%*%ycorr + nuRes*scaleRes )/rchisq(1,nrecords + nuRes)  \n",
      "\t\n",
      "# sample intercept\n",
      "\tycorr = ycorr + mu                    # Unadjust y for the previous sample of mu\n",
      "\trhs    = sum(ycorr)                   # Form X'y\n",
      "\tinvLhs = 1.0/nrecords                 # Form (X'X)-1\n",
      "\tmean = rhs*invLhs                     # Solve (X'X) mu = X'y                    \n",
      "\tmu = rnorm(1,mean,sqrt(invLhs*vare))  # Sample new location parameter                     \n",
      "\tycorr = ycorr - mu                    # Adjust y for the new sample of mu\n",
      "\tmeanMu = meanMu + mu                  # Accumulate the sum to compute posterior mean\n",
      "\t\n",
      "# sample effect for each locus\n",
      "\tfor (locus in 1:nmarkers){\n",
      "\t\t\n",
      "\t\trhs=t(Z[,locus])%*%ycorr +zpz[locus]*alpha[locus]\n",
      "\t\tmmeLhs = zpz[locus] + vare/varEffects                        # Form the coefficient matrix of MME\n",
      "\t\tinvLhs = 1.0/mmeLhs                                   # Invert the coefficient matrix \n",
      "\t\tmean = invLhs*rhs                                     # Solve the MME for locus effect\n",
      "\t\toldAlpha=alpha[locus]\n",
      "\t\talpha[locus]= rnorm(1,mean,sqrt(invLhs*vare))         # Sample the locus effect from data\n",
      "\t\tycorr = ycorr + Z[,locus]*(oldAlpha-alpha[locus]);               # Adjust the data for this locus effect\n",
      "\t\tmeanAlpha[locus] = meanAlpha[locus] + alpha[locus];   # Accumulate the sum for posterior mean\n",
      "\t}\n",
      "\t\t\n",
      "\t# sample the common locus effect variance\t\t\n",
      "\tvarEffects = ( scaleVar*dfEffectVar + sum(alpha^2) )/rchisq(1,dfEffectVar+nmarkers)  \n",
      "\t\t\n",
      "}\n",
      "\n",
      "proc.time()-ptime\n"
     ]
    }
   ],
   "source": [
    ";cat RBayesC0/BayesC0.R"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Julia 0.3.7",
   "language": "julia",
   "name": "julia 0.3"
  },
  "language_info": {
   "name": "julia",
   "version": "0.3.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}