{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# LMM predictions and prediction intervals\n",
    "\n",
    "Below we will fit a linear mixed model using the Ruby gem [mixed\\_models](https://github.com/agisga/mixed_models) and demonstrate the available prediction methods."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data and linear mixed model\n",
    "\n",
    "We use the same data and model formulation as in several previous examples, where we have looked at various parameter estimates ([1](http://nbviewer.ipython.org/github/agisga/mixed_models/blob/master/notebooks/LMM_model_fitting.ipynb)) and demostrated many types hypotheses tests as well as confidence intervals ([2](http://nbviewer.ipython.org/github/agisga/mixed_models/blob/master/notebooks/LMM_tests_and_intervals.ipynb)).\n",
    "\n",
    "The data set, which is simulated, contains two numeric variables *Age* and *Aggression*, and two categorical variables *Location* and *Species*. These data are available for 100 (human and alien) individuals.\n",
    "\n",
    "We model the *Aggression* level of an individual of *Species* $spcs$ who is at the *Location* $lctn$ as:\n",
    "\n",
    "$$Aggression = \\beta_{0} + \\gamma_{spcs} + Age \\cdot \\beta_{1} + b_{lctn,0} + Age \\cdot b_{lctn,1} + \\epsilon,$$\n",
    "\n",
    "where $\\epsilon$ is a random residual, and the random vector $(b_{lctn,0}, b_{lctn,1})^T$ follows a multivariate normal distribution (the same distribution but different realizations of the random vector for each *Location*).\n",
    "\n",
    "We fit this model in `mixed_models` using a syntax familiar from the `R` package `lme4`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table><tr><th colspan=\"5\">Daru::DataFrame:47316264430760  rows: 5  cols: 4</th></tr><tr><th></th><th>coef</th><th>sd</th><th>z_score</th><th>WaldZ_p_value</th></tr><tr><td>intercept</td><td>1016.2867207023459</td><td>60.19727495769054</td><td>16.882603430415077</td><td>0.0</td></tr><tr><td>Age</td><td>-0.06531615342788907</td><td>0.0898848636725299</td><td>-0.7266646547504374</td><td>0.46743141066211646</td></tr><tr><td>Species_lvl_Human</td><td>-499.69369529020855</td><td>0.2682523406941929</td><td>-1862.774781375937</td><td>0.0</td></tr><tr><td>Species_lvl_Ood</td><td>-899.5693213535765</td><td>0.28144708140043684</td><td>-3196.2289922406003</td><td>0.0</td></tr><tr><td>Species_lvl_WeepingAngel</td><td>-199.58895804200702</td><td>0.27578357795259995</td><td>-723.7158917283725</td><td>0.0</td></tr></table>"
      ],
      "text/plain": [
       "\n",
       "#<Daru::DataFrame:47316264430760 @name = 2b874891-0fa8-4cca-9dc1-44a797bac352 @size = 5>\n",
       "                 coef         sd    z_score WaldZ_p_va \n",
       " intercept 1016.28672 60.1972749 16.8826034        0.0 \n",
       "       Age -0.0653161 0.08988486 -0.7266646 0.46743141 \n",
       "Species_lv -499.69369 0.26825234 -1862.7747        0.0 \n",
       "Species_lv -899.56932 0.28144708 -3196.2289        0.0 \n",
       "Species_lv -199.58895 0.27578357 -723.71589        0.0 \n"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "require 'mixed_models'\n",
    "\n",
    "alien_species = Daru::DataFrame.from_csv '../examples/data/alien_species.csv'\n",
    "# mixed_models expects that all variable names in the data frame are ruby Symbols:\n",
    "alien_species.vectors = Daru::Index.new(alien_species.vectors.map { |v| v.to_sym })\n",
    "\n",
    "model_fit = LMM.from_formula(formula: \"Aggression ~ Age + Species + (Age | Location)\", \n",
    "                             data: alien_species)\n",
    "model_fit.fix_ef_summary"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Predictions and prediction intervals\n",
    "\n",
    "Often, the objective of a statistical model is the prediction of future observations based on new data input.\n",
    "\n",
    "We consider the following new data set containing age, geographic location and species for ten individuals."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table><tr><th colspan=\"4\">Daru::DataFrame:47316263806300  rows: 10  cols: 3</th></tr><tr><th></th><th>Age</th><th>Species</th><th>Location</th></tr><tr><td>0</td><td>209</td><td>Dalek</td><td>OodSphere</td></tr><tr><td>1</td><td>90</td><td>Ood</td><td>Earth</td></tr><tr><td>2</td><td>173</td><td>Ood</td><td>Asylum</td></tr><tr><td>3</td><td>153</td><td>Human</td><td>Asylum</td></tr><tr><td>4</td><td>255</td><td>WeepingAngel</td><td>OodSphere</td></tr><tr><td>5</td><td>256</td><td>WeepingAngel</td><td>Asylum</td></tr><tr><td>6</td><td>37</td><td>Dalek</td><td>Earth</td></tr><tr><td>7</td><td>146</td><td>WeepingAngel</td><td>Earth</td></tr><tr><td>8</td><td>127</td><td>WeepingAngel</td><td>Asylum</td></tr><tr><td>9</td><td>41</td><td>Ood</td><td>Asylum</td></tr></table>"
      ],
      "text/plain": [
       "\n",
       "#<Daru::DataFrame:47316263806300 @name = c92169c8-786d-4326-b9bd-0d0c197a2f88 @size = 10>\n",
       "                  Age    Species   Location \n",
       "         0        209      Dalek  OodSphere \n",
       "         1         90        Ood      Earth \n",
       "         2        173        Ood     Asylum \n",
       "         3        153      Human     Asylum \n",
       "         4        255 WeepingAng  OodSphere \n",
       "         5        256 WeepingAng     Asylum \n",
       "         6         37      Dalek      Earth \n",
       "         7        146 WeepingAng      Earth \n",
       "         8        127 WeepingAng     Asylum \n",
       "         9         41        Ood     Asylum \n"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "newdata = Daru::DataFrame.from_csv '../examples/data/alien_species_newdata.csv'\n",
    "newdata.vectors = Daru::Index.new(newdata.vectors.map { |v| v.to_sym })\n",
    "newdata"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Point estimates\n",
    "\n",
    "Based on the fitted linear mixed model we can predict the aggression levels for the inidividuals, where we can specify whether the random effects estimates should be included in the calculations or not."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Predictions of aggression levels on a new data set:\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[1070.9125752531208, 182.45206492790737, -17.06446875476354, 384.7881586199103, 876.1240725686446, 674.7113391148862, 1092.6985606350866, 871.1508855262363, 687.4629975728096, -4.016260100144294]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "puts \"Predictions of aggression levels on a new data set:\"\n",
    "pred =  model_fit.predict(newdata: newdata, with_ran_ef: true)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can add the computed predictions to the data set, in order to see better which of the individuals are likely to be particularly dangerous."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table><tr><th colspan=\"5\">Daru::DataFrame:47316262633840  rows: 10  cols: 4</th></tr><tr><th></th><th>Age</th><th>Species</th><th>Location</th><th>Predicted_Agression</th></tr><tr><td>0</td><td>209</td><td>Dalek</td><td>OodSphere</td><td>1070.9125752531208</td></tr><tr><td>1</td><td>90</td><td>Ood</td><td>Earth</td><td>182.45206492790737</td></tr><tr><td>2</td><td>173</td><td>Ood</td><td>Asylum</td><td>-17.06446875476354</td></tr><tr><td>3</td><td>153</td><td>Human</td><td>Asylum</td><td>384.7881586199103</td></tr><tr><td>4</td><td>255</td><td>WeepingAngel</td><td>OodSphere</td><td>876.1240725686446</td></tr><tr><td>5</td><td>256</td><td>WeepingAngel</td><td>Asylum</td><td>674.7113391148862</td></tr><tr><td>6</td><td>37</td><td>Dalek</td><td>Earth</td><td>1092.6985606350866</td></tr><tr><td>7</td><td>146</td><td>WeepingAngel</td><td>Earth</td><td>871.1508855262363</td></tr><tr><td>8</td><td>127</td><td>WeepingAngel</td><td>Asylum</td><td>687.4629975728096</td></tr><tr><td>9</td><td>41</td><td>Ood</td><td>Asylum</td><td>-4.016260100144294</td></tr></table>"
      ],
      "text/plain": [
       "\n",
       "#<Daru::DataFrame:47316262633840 @name = e668eb36-b2da-4270-b771-4c8ff3a482f0 @size = 10>\n",
       "                  Age    Species   Location Predicted_ \n",
       "         0        209      Dalek  OodSphere 1070.91257 \n",
       "         1         90        Ood      Earth 182.452064 \n",
       "         2        173        Ood     Asylum -17.064468 \n",
       "         3        153      Human     Asylum 384.788158 \n",
       "         4        255 WeepingAng  OodSphere 876.124072 \n",
       "         5        256 WeepingAng     Asylum 674.711339 \n",
       "         6         37      Dalek      Earth 1092.69856 \n",
       "         7        146 WeepingAng      Earth 871.150885 \n",
       "         8        127 WeepingAng     Asylum 687.462997 \n",
       "         9         41        Ood     Asylum -4.0162601 \n"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "newdata = Daru::DataFrame.from_csv '../examples/data/alien_species_newdata.csv'\n",
    "newdata.vectors = Daru::Index.new(newdata.vectors.map { |v| v.to_sym })\n",
    "newdata[:Predicted_Agression] = pred\n",
    "newdata"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Interval estimates\n",
    "\n",
    "Since the estimated fixed and random effects coefficients most likely are not exactly the true values, we probably should look at interval estimates of the predictions, rather than the point estimates computed above.\n",
    "\n",
    "Two types of such interval estimates are currently available in `LMM`. On the one hand, a *confidence interval* is an interval estimate of the mean value of the response for given covariates (i.e. a population parameter); on the other hand, a *prediction interval* is an interval estimate of a future observation (for further explanation of this distinction see for example <https://stat.ethz.ch/education/semesters/ss2010/seminar/06_Handout.pdf>)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "88% confidence intervals for the predictions:\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<table><tr><th colspan=\"4\">Daru::DataFrame:47316259596660  rows: 10  cols: 3</th></tr><tr><th></th><th>pred</th><th>lower88</th><th>upper88</th></tr><tr><td>0</td><td>1002.6356446359171</td><td>906.275473617091</td><td>1098.995815654743</td></tr><tr><td>1</td><td>110.83894554025937</td><td>17.15393113018095</td><td>204.5239599503378</td></tr><tr><td>2</td><td>105.41770480574462</td><td>10.164687937713381</td><td>200.67072167377586</td></tr><tr><td>3</td><td>506.59965393767027</td><td>411.8519191795299</td><td>601.3473886958107</td></tr><tr><td>4</td><td>800.0421435362272</td><td>701.9091174988788</td><td>898.1751695735755</td></tr><tr><td>5</td><td>799.9768273827992</td><td>701.8009453018722</td><td>898.1527094637263</td></tr><tr><td>6</td><td>1013.870023025514</td><td>920.443931319159</td><td>1107.296114731869</td></tr><tr><td>7</td><td>807.1616042598671</td><td>712.571759209002</td><td>901.7514493107321</td></tr><tr><td>8</td><td>808.402611174997</td><td>714.191640124036</td><td>902.613582225958</td></tr><tr><td>9</td><td>114.03943705822599</td><td>20.614034870631627</td><td>207.46483924582034</td></tr></table>"
      ],
      "text/plain": [
       "\n",
       "#<Daru::DataFrame:47316259596660 @name = 8d245d6c-289f-4825-b646-12c920e2015d @size = 10>\n",
       "                 pred    lower88    upper88 \n",
       "         0 1002.63564 906.275473 1098.99581 \n",
       "         1 110.838945 17.1539311 204.523959 \n",
       "         2 105.417704 10.1646879 200.670721 \n",
       "         3 506.599653 411.851919 601.347388 \n",
       "         4 800.042143 701.909117 898.175169 \n",
       "         5 799.976827 701.800945 898.152709 \n",
       "         6 1013.87002 920.443931 1107.29611 \n",
       "         7 807.161604 712.571759 901.751449 \n",
       "         8 808.402611 714.191640 902.613582 \n",
       "         9 114.039437 20.6140348 207.464839 \n"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "puts \"88% confidence intervals for the predictions:\"\n",
    "ci = model_fit.predict_with_intervals(newdata: newdata, level: 0.88, type: :confidence)\n",
    "Daru::DataFrame.new(ci, order: [:pred, :lower88, :upper88])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "88% prediction intervals for the predictions:\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<table><tr><th colspan=\"4\">Daru::DataFrame:47316258683700  rows: 10  cols: 3</th></tr><tr><th></th><th>pred</th><th>lower88</th><th>upper88</th></tr><tr><td>0</td><td>1002.6356446359171</td><td>809.9100501459104</td><td>1195.3612391259237</td></tr><tr><td>1</td><td>110.83894554025937</td><td>-76.53615884686141</td><td>298.2140499273802</td></tr><tr><td>2</td><td>105.41770480574462</td><td>-85.09352864481423</td><td>295.92893825630347</td></tr><tr><td>3</td><td>506.59965393767027</td><td>317.0988995529618</td><td>696.1004083223787</td></tr><tr><td>4</td><td>800.0421435362272</td><td>603.7713980881146</td><td>996.3128889843398</td></tr><tr><td>5</td><td>799.9768273827992</td><td>603.6203777073699</td><td>996.3332770582285</td></tr><tr><td>6</td><td>1013.870023025514</td><td>827.0127232317805</td><td>1200.7273228192475</td></tr><tr><td>7</td><td>807.1616042598671</td><td>617.9767304115936</td><td>996.3464781081406</td></tr><tr><td>8</td><td>808.402611174997</td><td>619.9754792487822</td><td>996.8297431012118</td></tr><tr><td>9</td><td>114.03943705822599</td><td>-72.8161447158925</td><td>300.8950188323445</td></tr></table>"
      ],
      "text/plain": [
       "\n",
       "#<Daru::DataFrame:47316258683700 @name = ef48b9a6-5554-4203-b935-bbb4eee42fe1 @size = 10>\n",
       "                 pred    lower88    upper88 \n",
       "         0 1002.63564 809.910050 1195.36123 \n",
       "         1 110.838945 -76.536158 298.214049 \n",
       "         2 105.417704 -85.093528 295.928938 \n",
       "         3 506.599653 317.098899 696.100408 \n",
       "         4 800.042143 603.771398 996.312888 \n",
       "         5 799.976827 603.620377 996.333277 \n",
       "         6 1013.87002 827.012723 1200.72732 \n",
       "         7 807.161604 617.976730 996.346478 \n",
       "         8 808.402611 619.975479 996.829743 \n",
       "         9 114.039437 -72.816144 300.895018 \n"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "puts \"88% prediction intervals for the predictions:\"\n",
    "pi = model_fit.predict_with_intervals(newdata: newdata, level: 0.88, type: :prediction)\n",
    "Daru::DataFrame.new(pi, order: [:pred, :lower88, :upper88])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Remark**: You might notice that `#predict` with `with_ran_ef: true` produces some values outside of the confidence intervals, because the confidence intervals are computed from `#predict` with `with_ran_ef: false`.\n",
    "However, `#predict` with `with_ran_ef: false` should always give values which lie in the center of the confidence or prediction intervals."
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Ruby 2.3.1",
   "language": "ruby",
   "name": "ruby"
  },
  "language_info": {
   "file_extension": ".rb",
   "mimetype": "application/x-ruby",
   "name": "ruby",
   "version": "2.3.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}