{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# LMM usage example\n", "\n", "Below we will fit a linear mixed model using the Ruby gem [mixed\\_models](https://github.com/agisga/mixed_models), and demostrate many inference and prediction methods available for objects of class `LMM`.\n", "\n", "\n", "## Table of Contents\\*\n", "\n", "* Example Data\n", "\n", "* Linear Mixed Model\n", "\n", "* Some model attributes\n", "\n", "* Fitted values and residuals\n", "\n", "* Fixed effects hypotheses tests and confidence intervals\n", "\n", "* Predictions and prediction intervals\n", "\n", "\\* linking the titles to the respective sections messes up the display in nbviewer for some reason; would appreciate any hint on how to make it work..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example data\n", "\n", "The data set, which is simulated, contains two numeric variables *Age* and *Aggression*, and two categorical variables *Location* and *Species*. These data are available for 100 (human and alien) individuals.\n", "\n", "We will fit the model with the method `LMM#from_formula`, which mimics the behaviour of the function `lmer` from the `R` package `lme4`.\n", "\n", "The data is supplied to `LMM#from_formula` as a `Daru::DataFrame` (from the excellent Ruby gem [daru](https://github.com/v0dro/daru.git)). We load the data, and display the first 10 lines with:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::DataFrame:46955646780340 rows: 10 cols: 4
AggressionAgeSpeciesLocation
0877.54242028595204.95DalekAsylum
1852.52839218820639.88WeepingAngelOodSphere
2388.791416909388107.34HumanAsylum
3170.010124622982210.01OodOodSphere
41078.31219494376270.22DalekOodSphere
5164.924992952256157.65OodOodSphere
6865.838374677443136.15WeepingAngelOodSphere
71052.36035549029241.31DalekEarth
8-8.5725199338256786.84OodAsylum
91070.71900405899206.7DalekOodSphere
" ], "text/plain": [ "\n", "#\n", " Aggression Age Species Location \n", " 0 877.542420 204.95 Dalek Asylum \n", " 1 852.528392 39.88 WeepingAng OodSphere \n", " 2 388.791416 107.34 Human Asylum \n", " 3 170.010124 210.01 Ood OodSphere \n", " 4 1078.31219 270.22 Dalek OodSphere \n", " 5 164.924992 157.65 Ood OodSphere \n", " 6 865.838374 136.15 WeepingAng OodSphere \n", " 7 1052.36035 241.31 Dalek Earth \n", " 8 -8.5725199 86.84 Ood Asylum \n", " 9 1070.71900 206.7 Dalek OodSphere \n" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "require 'daru'\n", "alien_species = Daru::DataFrame.from_csv '../examples/data/alien_species.csv'\n", "# mixed_models expects that all variable names in the data frame are ruby Symbols:\n", "alien_species.vectors = Daru::Index.new(alien_species.vectors.map { |v| v.to_sym })\n", "alien_species.head" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Linear mixed model\n", "\n", "We model the *Aggression* level of an individual as a linear function of the *Age* (*Aggression* decreases with *Age*), with a different constant added for each *Species* (i.e. each species has a different base level of aggression). Moreover, we assume that there is a random fluctuation in *Aggression* due to the *Location* that an individual is at. Additionally, there is a random fluctuation in how *Age* affects *Aggression* at each different *Location*. \n", "\n", "Thus, the *Aggression* level of an individual of *Species* $spcs$ who is at the *Location* $lctn$ can be expressed as:\n", "$$Aggression = \\beta_{0} + \\gamma_{spcs} + Age \\cdot \\beta_{1} + b_{lctn,0} + Age \\cdot b_{lctn,1} + \\epsilon,$$\n", "where $\\epsilon$ is a random residual, and the random vector $(b_{lctn,0}, b_{lctn,1})^T$ follows a multivariate normal distribution (the same distribution but different realizations of the random vector for each *Location*). That is, we have a linear mixed model with fixed effects $\\beta_{0}, \\beta_{1}, \\gamma_{Dalek}, \\gamma_{Ood}, \\dots$, and random effects $b_{Asylum,0}, b_{Asylum,1}, b_{Earth,0},\\dots$.\n", "\n", "We fit this model in `mixed_models` using a syntax familiar from the `R` package `lme4`, and display the estimated fixed and random effects coefficients:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fixed effects:\n", "{:intercept=>1016.2867207023459, :Age=>-0.06531615342788907, :Species_lvl_Human=>-499.69369529020855, :Species_lvl_Ood=>-899.5693213535765, :Species_lvl_WeepingAngel=>-199.58895804200702}\n", "Random effects:\n", "{:intercept_Asylum=>-116.68080676073654, :Age_Asylum=>-0.0335339121374082, :intercept_Earth=>83.86571636827462, :Age_Earth=>-0.136139966451407, :intercept_OodSphere=>32.81508999155884, :Age_OodSphere=>0.16967387859160207}\n" ] } ], "source": [ "require 'mixed_models'\n", "model_fit = LMM.from_formula(formula: \"Aggression ~ Age + Species + (Age | Location)\", \n", " data: alien_species)\n", "puts \"Fixed effects:\"\n", "puts model_fit.fix_ef\n", "puts \"Random effects:\"\n", "puts model_fit.ran_ef" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Some model attributes\n", "\n", "Apart from the fixed and random effects coefficients (seen above), we can access many attributes of the fitted model. Among others:\n", "\n", "* `fix_ef_names` and `ran_ef_names` are Arrays of names of the fixed and random effects.\n", "\n", "* `reml` is an indicator whether the profiled REML criterion or the profiled deviance function was optimized by the model fitting algorithm.\n", "\n", "* `formula` returns the R-like formula used to fit the model as a String.\n", "\n", "* `model_data`, `optimization_result` and `dev_fun` store the various model matrices in an `LMMData` object, the results of the utilized optimization algorithm, and the corresponding objective function as a `Proc`.\n", "\n", "* `sigma2` is the residual variance (unless `weights` was specified in the model fit).\n", "\n", "* `sigma_mat` is the covariance matrix of the multivariate normal random effects vector.\n", "\n", "We can look at some of these parameters for our example model:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "REML criterion used: \ttrue\n", "Residual variance: \t0.9496833447256825\n", "Formula: \tAggression ~ Age + Species + (Age | Location)\n", "Variance of the intercept due to 'location' (i.e. variance of b0): \t10870.932406181171\n", "Variance of the effect of 'age' due to 'location' (i.e. variance of b1): \t0.024233390356817094\n", "Covariance of b0 and b1: \t-0.9716403033290799\n" ] } ], "source": [ "puts \"REML criterion used: \\t#{model_fit.reml}\"\n", "puts \"Residual variance: \\t#{model_fit.sigma2}\"\n", "puts \"Formula: \\t\" + model_fit.formula\n", "puts \"Variance of the intercept due to 'location' (i.e. variance of b0): \\t#{model_fit.sigma_mat[0,0]}\"\n", "puts \"Variance of the effect of 'age' due to 'location' (i.e. variance of b1): \\t#{model_fit.sigma_mat[1,1]}\"\n", "puts \"Covariance of b0 and b1: \\t#{model_fit.sigma_mat[0,1]}\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some further convenience methods are:\n", "\n", "* `sigma` returns the square root of `sigma2`.\n", "\n", "* `theta` returns the optimal solution of the minimization of the deviance function or the REML criterion (whichever was used to fit the model).\n", "\n", "* `deviance` returns the value of the deviance function or the REML criterion at the optimal solution.\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Residual standard deviation: \t0.9745169802141379\n", "REML criterion: \t333.7155391015166\n" ] } ], "source": [ "puts \"Residual standard deviation: \\t#{model_fit.sigma}\"\n", "puts \"REML criterion: \\t#{model_fit.deviance}\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fitted values and residual\n", "\n", "There are methods to get the fitted values of the response variable, with or without inclusion of random effects, as well as the model residuals." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitted values at the population level:\n" ] }, { "data": { "text/plain": [ "[1002.9001750573, 814.0929544616347, 509.58198950318774, 103.00035396737837, 998.6369897230617, 106.42030776086267, 807.8049683711317, 1000.525279718662, 111.04534458509147, 1002.7858717888012, 1008.0797460241316, 498.79959889531176, 100.90435860387743, 111.03358767747443, 102.37593154060778, 812.6239941710414, 97.21856806594167, 800.9297900613121, 510.0032786927976, 502.6950542857511, 503.7329279637203, 999.8864877381372, 497.2875299434561, 1003.3325679929926, 804.1328942254158, 811.9113949371432, 99.91285939484214, 803.9356394420636, 998.4671677241491, 99.50267395131493, 500.71140270614603, 1014.38536747606, 112.41241167633723, 1010.303107886817, 114.52212343205804, 813.8107886788262, 811.7526766843134, 498.6990120190328, 1005.0836940663943, 110.67826780282678, 501.00205958890024, 810.6305451684223, 511.318092861301, 1000.7623773556052, 508.8465296155897, 1008.3730155530228, 1011.249538949987, 102.4595362169955, 1010.9490846442187, 1007.0346875692853, 506.2286581861998, 502.15031756616247, 813.0589997528712, 499.75256157382466, 812.9329395767553, 498.25682166032595, 514.1586923738798, 499.76366531990743, 510.7733561417124, 798.2159038863833, 511.4147607683742, 506.48077853843154, 1012.1763751671288, 1013.9856326170814, 101.35504006252984, 1013.1593832762185, 1006.1091576752121, 106.82265526597848, 1002.7754212042528, 806.3314359497986, 1002.5860043593118, 514.9699189994543, 1007.2214917680891, 806.255016050288, 502.0582217898292, 1011.9458091455283, 804.6221122145907, 997.8786691817638, 802.2648522373782, 1011.1705064043393, 806.8657220848387, 1011.1763848581478, 798.7684785443834, 1009.0999843406752, 806.0166120902761, 102.78481066106633, 99.6640048502818, 506.1835900403346, 998.8949885291019, 499.4357782296994, 814.6050331045093, 1012.1182437905779, 109.60251075586939, 501.6904918460301, 98.0232630761733, 106.71945574356243, 99.98274767900989, 114.889853375857, 798.1950027172865, 106.6208283518863]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "puts \"Fitted values at the population level:\"\n", "model_fit.fitted(with_ran_ef: false)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model residuals:\n" ] }, { "data": { "text/plain": [ "[-1.8041727180516318, -1.1462465432206272, -0.5102357042338213, -1.438530578977577, 1.0108397561168658, -1.059491760131607, 2.1172177445057514, 0.8212947077424815, -0.024972828168071004, 0.04645157374602604, -0.2521512808752959, 0.5337610637689068, -0.5306298541011536, -0.8746282732136024, -0.5173146938876982, -1.076703511382675, -0.37352823345474917, 0.4342880104853748, -0.008686417241847266, 0.9633162152576915, -1.8591988675074163, -0.06035456182189591, -0.8747549066542888, 2.156663624530097, -0.12014582672816232, 1.5012618544112684, 0.5916114379897799, -1.4808531218288863, -0.16478112799086375, 2.0314537895963163, 0.010773844423852097, 0.8539551421497436, -1.2568149547061997, 0.07403493882179646, -0.11003692956987265, 1.027133585975207, 1.2893893800322758, -0.14846806796293777, -0.4342951733146947, 1.7157974749037237, 1.017366682325644, -0.6984875406660649, 0.16845508494236583, -1.423216161059372, -0.14893911781359748, 0.1337897922242064, -0.21024649931018757, 0.40638495225081783, -0.40370493998443635, -1.5086471211782282, 0.22131784542784771, -0.3343567440182369, -1.3354988341066019, 1.1064057146309096, 0.3472979813135453, 0.21318943374672017, -0.44914337321006315, -0.3144066453174901, -0.8024013676736104, -1.138768647945767, -0.270266988866922, 0.32795235324385885, -0.07521784973812373, 1.8354848431990831, 0.5871444360181073, -0.30368225978327246, -0.8505153742187304, -0.5012451945714531, -0.8060105059505531, 0.5330080478594255, 1.4260343762148295, -1.050545171033491, -0.6140910982546757, -0.2504994695145797, 0.6581363886527924, 0.4752780907743954, 0.7314233385786792, -0.7084492435404854, -1.2272348687666863, -0.7943531062586544, 0.24601716633890192, 0.45095282087356736, 0.2410007816512234, 0.5882243348621614, -0.045955678732184424, 2.3011995430929595, -0.805550441639582, 1.2857127416019694, 0.9605154956996103, -0.05181584374145132, -0.04108522357819311, -0.4196304756428617, 1.2953557251604657, 0.3168318472534679, -1.6086909528895887, 0.39499568456598766, -1.2512383164572825, -0.1762592429552683, 0.09344137531570595, 1.2049892111455733]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "puts \"Model residuals:\"\n", "model_fit.residuals" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can assess the goodness of the model fit (to some extent) by plotting the residuals agains the fitted values, and checking for unexpected patterns. We use the gem [gnuplotrb](https://github.com/dilcom/gnuplotrb) for plotting." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "Gnuplot\n", "Produced by GNUPLOT 5.0 patchlevel 3 \n", "\n", "\n", "\n", "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t \n", "\t \n", "\t\n", "\t\n", "\t \n", "\t \n", "\t\n", "\n", "\n", "\n", "\n", "\t\t\n", "\t\t-2\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t-1.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t-1\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t-0.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 0\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 0.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 1\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 1.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 2\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 2.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t-200\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 0\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 200\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 400\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 600\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 800\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 1000\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 1200\n", "\t\n", "\n", "\n", "\n", "\n", "\t\n", "\n", "\t\n", "\t\tResiduals\n", "\t\n", "\n", "\n", "\t\n", "\t\tFitted\n", "\t\n", "\n", "\n", "\n", "\tgnuplot_plot_1\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "\t\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\t\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "# \"Fitted\", :ylabel => \"Residuals\"], @datasets=Hamster::Vector[#, @options=Hamster::Hash[:pointtype => 6, :notitle => true, :with => \"points\"]>], @cmd=\"plot \">" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "require 'gnuplotrb'\n", "include GnuplotRB\n", "\n", "x, y = model_fit.fitted, model_fit.residuals\n", "fitted_vs_residuals = Plot.new([[x,y], with: 'points', pointtype: 6, notitle: true],\n", " xlabel: 'Fitted', ylabel: 'Residuals')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that the residuals look more or less like noise, which is good.\n", "\n", "We can further analyze the validity of the linear mixed model somewhat, by checking if the residuals appear to be approximately normally distributed." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "Gnuplot\n", "Produced by GNUPLOT 5.0 patchlevel 3 \n", "\n", "\n", "\n", "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t \n", "\t \n", "\t\n", "\t\n", "\t \n", "\t \n", "\t\n", "\n", "\n", "\n", "\n", "\t\t\n", "\t\t 0.02\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 0.04\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 0.06\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 0.08\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 0.1\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 0.12\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 0.14\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 0.16\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 0.18\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 0.2\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t-2\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t-1.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t-1\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t-0.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 0\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 0.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 1\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 1.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 2\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 2.5\n", "\t\n", "\n", "\n", "\n", "\n", "\t\n", "\n", "\n", "\tgnuplot_plot_1\n", "\n", "\t\n", "\t\t\n", "\t\n", "\t\t\n", "\t\t\n", "\t\n", "\t\t\n", "\t\t\n", "\t\n", "\t\t\n", "\t\t\n", "\t\n", "\t\t\n", "\t\t\n", "\t\n", "\t\t\n", "\t\t\n", "\t\n", "\t\t\n", "\t\t\n", "\t\n", "\t\t\n", "\t\t\n", "\t\n", "\t\t\n", "\t\t\n", "\t\n", "\t\t\n", "\t\t\n", "\t\n", "\t\n", "\t\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\t\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "# \"fill solid 0.5\"], @datasets=Hamster::Vector[#, @options=Hamster::Hash[:notitle => true, :with => \"boxes\"]>], @cmd=\"plot \">" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bin_width = (y.max - y.min)/10.0\n", "bins = (y.min..y.max).step(bin_width).to_a\n", "rel_freq = Array.new(bins.length-1){0.0}\n", "y.each do |r|\n", " 0.upto(bins.length-2) do |i|\n", " if r >= bins[i] && r < bins[i+1] then\n", " rel_freq[i] += 1.0/y.length\n", " end\n", " end\n", "end\n", "bins_center = bins[0...-1].map { |b| b + bin_width/2.0 }\n", " \n", "residuals_hist = Plot.new([[bins_center, rel_freq], with: 'boxes', notitle: true],\n", " style: 'fill solid 0.5')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The histogram does not appear to be too different from a bell shaped curve, although it might be slightly skewed to the right.\n", "\n", "We can further explore the validity of the normality assumption by looking at the Q-Q plot of the residuals." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "Gnuplot\n", "Produced by GNUPLOT 5.0 patchlevel 3 \n", "\n", "\n", "\n", "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t \n", "\t \n", "\t\n", "\t\n", "\t \n", "\t \n", "\t\n", "\n", "\n", "\n", "\n", "\t\t\n", "\t\t-2.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t-2\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t-1.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t-1\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t-0.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 0\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 0.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 1\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 1.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 2\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 2.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t-2.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t-2\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t-1.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t-1\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t-0.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 0\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 0.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 1\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 1.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 2\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 2.5\n", "\t\n", "\n", "\n", "\n", "\n", "\t\n", "\n", "\t\n", "\t\tObserved quantiles\n", "\t\n", "\n", "\n", "\t\n", "\t\tNormal theoretical quantiles\n", "\t\n", "\n", "\n", "\t\n", "\t\tQ-Q plot of the residuals\n", "\t\n", "\n", "\tgnuplot_plot_1\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "\t\n", "\tgnuplot_plot_2\n", "\n", "\t\n", "\t\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\t\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "# \"Normal theoretical quantiles\", :ylabel => \"Observed quantiles\", :title => \"Q-Q plot of the residuals\"], @datasets=Hamster::Vector[#, @options=Hamster::Hash[:pointtype => 6, :notitle => true, :with => \"points\"]>, # true, :with => \"lines\"]>], @cmd=\"plot \">" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "require 'distribution'\n", "\n", "observed = model_fit.residuals.sort\n", "n = observed.length\n", "theoretical = (1..n).to_a.map { |t| Distribution::Normal.p_value(t.to_f/n.to_f) * model_fit.sigma}\n", "qq_plot = Plot.new([[theoretical, observed], with: 'points', pointtype: 6, notitle: true],\n", " ['x', with: 'lines', notitle: true],\n", " xlabel: 'Normal theoretical quantiles', ylabel: 'Observed quantiles',\n", " title: 'Q-Q plot of the residuals')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The straight line in the above plot is simply the diagonal. We see that the observed quantiles aggree with the theoretical values fairly well, as expected from a \"good\" model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fixed effects hypotheses tests and confidence intervals\n", "\n", "Often statistical models are used in order to determine which of the predictor variables have a significant relationship with the response variable. `LMM` has a number of methods to aid with this kind of statistical inference." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Variances and covariances of the fixed effects coefficient estimates\n", "\n", "The covariance matrix of the fixed effects estimates is returned by `LMM#fix_ef_cov_mat`, and the standard deviations of the fixed effects coefficients are returned by `LMM#fix_ef_sd`. Methods for hypotheses tests and confidence intervals can be based on these values." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{:intercept=>60.19727495769054, :Age=>0.0898848636725299, :Species_lvl_Human=>0.2682523406941929, :Species_lvl_Ood=>0.28144708140043684, :Species_lvl_WeepingAngel=>0.27578357795259995}" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model_fit.fix_ef_sd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Wald tests and confidence intervals\n", "\n", "The [Wald Z test statistics](https://en.wikipedia.org/wiki/Wald_test#Test_on_a_single_parameter) for the fixed effects coefficients can be computed with:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{:intercept=>16.882603430415077, :Age=>-0.7266646547504374, :Species_lvl_Human=>-1862.774781375937, :Species_lvl_Ood=>-3196.2289922406003, :Species_lvl_WeepingAngel=>-723.7158917283725}" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model_fit.fix_ef_z" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Based on the above Wald Z test statistics, we can carry out hypotheses tests for each fixed effects terms $\\beta_{i}$ or $\\gamma_{species}$, testing the null $H_{0} : \\beta_{i} = 0$ against the alternative $H_{a} : \\beta_{i} \\neq 0$, or respectively the null $H_{0} : \\gamma_{species} = 0$ against the alternative $H_{a} : \\gamma_{species} \\neq 0$.\n", "\n", "The corresponding (approximate) p-values are obtained with:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{:intercept=>0.0, :Age=>0.46743141066211646, :Species_lvl_Human=>0.0, :Species_lvl_Ood=>0.0, :Species_lvl_WeepingAngel=>0.0}" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model_fit.fix_ef_p(method: :wald)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that the aggression level of each species is significantly different from the base level (which is the species `Dalek` in this model), while statistically we don't have enough evidence to conclude that the age of an individual is a good predictor of his/her/its aggression level." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the Wald method for confidence intervals as well. For example 90% confidence intervals for each fixed effects coefficient estimate can be computed as follows." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{:intercept=>[917.2710134723027, 1115.302427932389], :Age=>[-0.21316359921454495, 0.0825312923587668], :Species_lvl_Human=>[-500.1349311310106, -499.2524594494065], :Species_lvl_Ood=>[-900.0322606117453, -899.1063820954076], :Species_lvl_WeepingAngel=>[-200.04258166587707, -199.13533441813698]}" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "conf_int = model_fit.fix_ef_conf_int(level: 0.9, method: :wald)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For greated visual clarity we can put the coefficient estimates and the confidence intervals into a `Daru::DataFrame`:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::DataFrame:46955645328660 rows: 5 cols: 3
lower90upper90coef
intercept917.27101347230271115.3024279323891016.2867207023459
Age-0.213163599214544950.0825312923587668-0.06531615342788907
Species_lvl_Human-500.1349311310106-499.2524594494065-499.69369529020855
Species_lvl_Ood-900.0322606117453-899.1063820954076-899.5693213535765
Species_lvl_WeepingAngel-200.04258166587707-199.13533441813698-199.58895804200702
" ], "text/plain": [ "\n", "#\n", " lower90 upper90 coef \n", " intercept 917.271013 1115.30242 1016.28672 \n", " Age -0.2131635 0.08253129 -0.0653161 \n", "Species_lv -500.13493 -499.25245 -499.69369 \n", "Species_lv -900.03226 -899.10638 -899.56932 \n", "Species_lv -200.04258 -199.13533 -199.58895 \n" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = Daru::DataFrame.rows(conf_int.values, order: [:lower90, :upper90], index: model_fit.fix_ef_names)\n", "df[:coef] = model_fit.fix_ef.values\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Predictions and prediction intervals\n", "\n", "One might also fit a statistical model in order to predict future observations based on new data input.\n", "\n", "We consider the following new data set containing age, geographic location and species for ten individuals." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::DataFrame:46955644970560 rows: 10 cols: 3
AgeSpeciesLocation
0209DalekOodSphere
190OodEarth
2173OodAsylum
3153HumanAsylum
4255WeepingAngelOodSphere
5256WeepingAngelAsylum
637DalekEarth
7146WeepingAngelEarth
8127WeepingAngelAsylum
941OodAsylum
" ], "text/plain": [ "\n", "#\n", " Age Species Location \n", " 0 209 Dalek OodSphere \n", " 1 90 Ood Earth \n", " 2 173 Ood Asylum \n", " 3 153 Human Asylum \n", " 4 255 WeepingAng OodSphere \n", " 5 256 WeepingAng Asylum \n", " 6 37 Dalek Earth \n", " 7 146 WeepingAng Earth \n", " 8 127 WeepingAng Asylum \n", " 9 41 Ood Asylum \n" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "newdata = Daru::DataFrame.from_csv '../examples/data/alien_species_newdata.csv'\n", "newdata.vectors = Daru::Index.new(newdata.vectors.map { |v| v.to_sym })\n", "newdata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Point estimates\n", "\n", "Based on the fitted linear mixed model we can predict the aggression levels for the inidividuals, where we can specify whether the random effects estimates should be included in the calculations or not." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Predictions of aggression levels on a new data set:\n" ] }, { "data": { "text/plain": [ "[1070.9125752531208, 182.45206492790737, -17.06446875476354, 384.7881586199103, 876.1240725686446, 674.7113391148862, 1092.6985606350866, 871.1508855262363, 687.4629975728096, -4.016260100144294]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "puts \"Predictions of aggression levels on a new data set:\"\n", "pred = model_fit.predict(newdata: newdata, with_ran_ef: true)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can add the computed predictions to the data set, in order to see better which of the individuals are likely to be particularly dangerous." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::DataFrame:46955644185960 rows: 10 cols: 4
AgeSpeciesLocationPredicted_Agression
0209DalekOodSphere1070.9125752531208
190OodEarth182.45206492790737
2173OodAsylum-17.06446875476354
3153HumanAsylum384.7881586199103
4255WeepingAngelOodSphere876.1240725686446
5256WeepingAngelAsylum674.7113391148862
637DalekEarth1092.6985606350866
7146WeepingAngelEarth871.1508855262363
8127WeepingAngelAsylum687.4629975728096
941OodAsylum-4.016260100144294
" ], "text/plain": [ "\n", "#\n", " Age Species Location Predicted_ \n", " 0 209 Dalek OodSphere 1070.91257 \n", " 1 90 Ood Earth 182.452064 \n", " 2 173 Ood Asylum -17.064468 \n", " 3 153 Human Asylum 384.788158 \n", " 4 255 WeepingAng OodSphere 876.124072 \n", " 5 256 WeepingAng Asylum 674.711339 \n", " 6 37 Dalek Earth 1092.69856 \n", " 7 146 WeepingAng Earth 871.150885 \n", " 8 127 WeepingAng Asylum 687.462997 \n", " 9 41 Ood Asylum -4.0162601 \n" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "newdata = Daru::DataFrame.from_csv '../examples/data/alien_species_newdata.csv'\n", "newdata.vectors = Daru::Index.new(newdata.vectors.map { |v| v.to_sym })\n", "newdata[:Predicted_Agression] = pred\n", "newdata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Interval estimates\n", "\n", "Since the estimated fixed and random effects coefficients most likely are not exactly the true values, we probably should look at interval estimates of the predictions, rather than the point estimates computed above.\n", "\n", "Two types of such interval estimates are currently available in `LMM`. On the one hand, a *confidence interval* is an interval estimate of the mean value of the response for given covariates (i.e. a population parameter); on the other hand, a *prediction interval* is an interval estimate of a future observation (for further explanation of this distinction see for example )." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "88% confidence intervals for the predictions:\n" ] }, { "data": { "text/html": [ "
Daru::DataFrame:46955643614720 rows: 10 cols: 3
predlower88upper88
01002.6356446359171906.2754736170911098.995815654743
1110.8389455402593717.15393113018095204.5239599503378
2105.4177048057446210.164687937713381200.67072167377586
3506.59965393767027411.8519191795299601.3473886958107
4800.0421435362272701.9091174988788898.1751695735755
5799.9768273827992701.8009453018722898.1527094637263
61013.870023025514920.4439313191591107.296114731869
7807.1616042598671712.571759209002901.7514493107321
8808.402611174997714.191640124036902.613582225958
9114.0394370582259920.614034870631627207.46483924582034
" ], "text/plain": [ "\n", "#\n", " pred lower88 upper88 \n", " 0 1002.63564 906.275473 1098.99581 \n", " 1 110.838945 17.1539311 204.523959 \n", " 2 105.417704 10.1646879 200.670721 \n", " 3 506.599653 411.851919 601.347388 \n", " 4 800.042143 701.909117 898.175169 \n", " 5 799.976827 701.800945 898.152709 \n", " 6 1013.87002 920.443931 1107.29611 \n", " 7 807.161604 712.571759 901.751449 \n", " 8 808.402611 714.191640 902.613582 \n", " 9 114.039437 20.6140348 207.464839 \n" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "puts \"88% confidence intervals for the predictions:\"\n", "ci = model_fit.predict_with_intervals(newdata: newdata, level: 0.88, type: :confidence)\n", "Daru::DataFrame.new(ci, order: [:pred, :lower88, :upper88])" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "88% prediction intervals for the predictions:\n" ] }, { "data": { "text/html": [ "
Daru::DataFrame:46955643188840 rows: 10 cols: 3
predlower88upper88
01002.6356446359171809.91005014591041195.3612391259237
1110.83894554025937-76.53615884686141298.2140499273802
2105.41770480574462-85.09352864481423295.92893825630347
3506.59965393767027317.0988995529618696.1004083223787
4800.0421435362272603.7713980881146996.3128889843398
5799.9768273827992603.6203777073699996.3332770582285
61013.870023025514827.01272323178051200.7273228192475
7807.1616042598671617.9767304115936996.3464781081406
8808.402611174997619.9754792487822996.8297431012118
9114.03943705822599-72.8161447158925300.8950188323445
" ], "text/plain": [ "\n", "#\n", " pred lower88 upper88 \n", " 0 1002.63564 809.910050 1195.36123 \n", " 1 110.838945 -76.536158 298.214049 \n", " 2 105.417704 -85.093528 295.928938 \n", " 3 506.599653 317.098899 696.100408 \n", " 4 800.042143 603.771398 996.312888 \n", " 5 799.976827 603.620377 996.333277 \n", " 6 1013.87002 827.012723 1200.72732 \n", " 7 807.161604 617.976730 996.346478 \n", " 8 808.402611 619.975479 996.829743 \n", " 9 114.039437 -72.816144 300.895018 \n" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "puts \"88% prediction intervals for the predictions:\"\n", "pi = model_fit.predict_with_intervals(newdata: newdata, level: 0.88, type: :prediction)\n", "Daru::DataFrame.new(pi, order: [:pred, :lower88, :upper88])" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Ruby 2.3.1", "language": "ruby", "name": "ruby" }, "language_info": { "file_extension": ".rb", "mimetype": "application/x-ruby", "name": "ruby", "version": "2.3.1" } }, "nbformat": 4, "nbformat_minor": 0 }