{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Fitting a linear mixed model and accessing the parameter estimates \n", "\n", "Below we demonstrate how to fit linear mixed models using the Ruby gem [mixed\\_models](https://github.com/agisga/mixed_models). We also show how to access various parameters estimated by the linear mixed model, and we use them to assess the goodness of fit of the resulting model.\n", "\n", "We will fit the model with the user friendly method `LMM#from_formula`, which mimics the behaviour of the function `lmer` from the `R` package `lme4`. The formula language used by `mixed_models` is in fact a subset of the `lme4` formula interface. It allows to fit all of the same models, but does not allow for cetain shortcuts, namely the symbols `*`, `||` and `/`. However, for all of these shortcuts longer alternative formulations exist in all cases (e.g. you need to use the equivalent formulation `a + b + a:b` instead of `a*b`). An expanation of the `lme4` formula interface can be found in the [`lme4` vignette](https://cran.r-project.org/web/packages/lme4/vignettes/lmer.pdf). Additionally, there is a multitude of posts on stackexchange ([such as this](http://stats.stackexchange.com/questions/13166/rs-lmer-cheat-sheet)) discussing the formula language of `lme4`. \n", "\n", "\n", "## Table of Contents\n", "\n", "* Example data\n", "\n", "* Linear mixed model\n", "\n", "* Model attributes\n", "\n", "* Fitted values and residuals\n", "\n", "* Assessing the quality of the model fit" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example data\n", "\n", "The data set, which is simulated, contains two numeric variables *Age* and *Aggression*, and two categorical variables *Location* and *Species*. These data are available for 100 (human and alien) individuals.\n", "\n", "The data is supplied to `LMM#from_formula` as a `Daru::DataFrame` (from the excellent Ruby gem [daru](https://github.com/v0dro/daru.git)). We load the data, and display the first 10 lines with:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::DataFrame:47084194581820 rows: 10 cols: 4
AggressionAgeSpeciesLocation
0877.54242028595204.95DalekAsylum
1852.52839218820639.88WeepingAngelOodSphere
2388.791416909388107.34HumanAsylum
3170.010124622982210.01OodOodSphere
41078.31219494376270.22DalekOodSphere
5164.924992952256157.65OodOodSphere
6865.838374677443136.15WeepingAngelOodSphere
71052.36035549029241.31DalekEarth
8-8.5725199338256786.84OodAsylum
91070.71900405899206.7DalekOodSphere
" ], "text/plain": [ "\n", "#\n", " Aggression Age Species Location \n", " 0 877.542420 204.95 Dalek Asylum \n", " 1 852.528392 39.88 WeepingAng OodSphere \n", " 2 388.791416 107.34 Human Asylum \n", " 3 170.010124 210.01 Ood OodSphere \n", " 4 1078.31219 270.22 Dalek OodSphere \n", " 5 164.924992 157.65 Ood OodSphere \n", " 6 865.838374 136.15 WeepingAng OodSphere \n", " 7 1052.36035 241.31 Dalek Earth \n", " 8 -8.5725199 86.84 Ood Asylum \n", " 9 1070.71900 206.7 Dalek OodSphere \n" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "require 'daru'\n", "alien_species = Daru::DataFrame.from_csv '../examples/data/alien_species.csv'\n", "# mixed_models expects that all variable names in the data frame are ruby Symbols:\n", "alien_species.vectors = Daru::Index.new(alien_species.vectors.map { |v| v.to_sym })\n", "alien_species.head" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Linear mixed model\n", "\n", "We model the *Aggression* level of an individual as a linear function of the *Age* (*Aggression* decreases with *Age*), with a different constant added for each *Species* (i.e. each species has a different base level of aggression). Moreover, we assume that there is a random fluctuation in *Aggression* due to the *Location* that an individual is at. Additionally, there is a random fluctuation in how *Age* affects *Aggression* at each different *Location*. \n", "\n", "Thus, the *Aggression* level of an individual of *Species* $spcs$ who is at the *Location* $lctn$ can be expressed as:\n", "$$Aggression = \\beta_{0} + \\gamma_{spcs} + Age \\cdot \\beta_{1} + b_{lctn,0} + Age \\cdot b_{lctn,1} + \\epsilon,$$\n", "where $\\epsilon$ is a random residual, and the random vector $(b_{lctn,0}, b_{lctn,1})^T$ follows a multivariate normal distribution (the same distribution but different realizations of the random vector for each *Location*). That is, we have a linear mixed model with fixed effects $\\beta_{0}, \\beta_{1}, \\gamma_{Dalek}, \\gamma_{Ood}, \\dots$, and random effects $b_{Asylum,0}, b_{Asylum,1}, b_{Earth,0},\\dots$.\n", "\n", "We fit this model in `mixed_models` using a syntax familiar from the `R` package `lme4` (as described above), and display the estimated fixed and random effects coefficients:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fixed effects:\n", "{:intercept=>1016.2867207023459, :Age=>-0.06531615342788907, :Species_lvl_Human=>-499.69369529020855, :Species_lvl_Ood=>-899.5693213535765, :Species_lvl_WeepingAngel=>-199.58895804200702}\n", "Random effects:\n", "{:intercept_Asylum=>-116.68080676073654, :Age_Asylum=>-0.0335339121374082, :intercept_Earth=>83.86571636827462, :Age_Earth=>-0.136139966451407, :intercept_OodSphere=>32.81508999155884, :Age_OodSphere=>0.16967387859160207}\n" ] } ], "source": [ "require 'mixed_models'\n", "model_fit = LMM.from_formula(formula: \"Aggression ~ Age + Species + (Age | Location)\", \n", " data: alien_species)\n", "puts \"Fixed effects:\"\n", "puts model_fit.fix_ef\n", "puts \"Random effects:\"\n", "puts model_fit.ran_ef" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A summary of some important information about the fixed and random effects of the model can be conveniently displayed with the methods `LMM#fix_ef_summary` and `LMM#ran_ef_summary`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`LMM#fix_ef_summary` contains the fixed effects coefficient estimates, the standard deviations of the estimates, as well as the corresponding z scores and Wald Z p-values testing the significance of each fixed effects term." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::DataFrame:47084187305780 rows: 5 cols: 4
coefsdz_scoreWaldZ_p_value
intercept1016.286720702345960.1972749576905416.8826034304150770.0
Age-0.065316153427889070.0898848636725299-0.72666465475043740.46743141066211646
Species_lvl_Human-499.693695290208550.2682523406941929-1862.7747813759370.0
Species_lvl_Ood-899.56932135357650.28144708140043684-3196.22899224060030.0
Species_lvl_WeepingAngel-199.588958042007020.27578357795259995-723.71589172837250.0
" ], "text/plain": [ "\n", "#\n", " coef sd z_score WaldZ_p_va \n", " intercept 1016.28672 60.1972749 16.8826034 0.0 \n", " Age -0.0653161 0.08988486 -0.7266646 0.46743141 \n", "Species_lv -499.69369 0.26825234 -1862.7747 0.0 \n", "Species_lv -899.56932 0.28144708 -3196.2289 0.0 \n", "Species_lv -199.58895 0.27578357 -723.71589 0.0 \n" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model_fit.fix_ef_summary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`LMM#ran_ef_summary` summarizes the correlation structure of the random effects terms, that is, the correlation matrix of the vector $(b_{lctn,0}, b_{lctn,1})^T$ in the present data analysis." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Daru::DataFrame:47084184227680 rows: 2 cols: 2
LocationLocation_Age
Location104.26376362946607-0.05986390378124856
Location_Age-0.059863903781248560.15567077553868963
" ], "text/plain": [ "\n", "#\n", " Location Location_A \n", " Location 104.263763 -0.0598639 \n", "Location_A -0.0598639 0.15567077 \n" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model_fit.ran_ef_summary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In particular, we see that the standard deviation of the constant term corresponding to the variable *Location* is rather big (104.26...), and therefore *Location* must contribute hugely to the variance in *Aggression*. The variability of the effect of *Age* with respect to *Location*, however, seems rather small (0.15567...) in comparison." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model attributes\n", "\n", "Apart from the fixed and random effects coefficients (seen above), we can access many attributes of the fitted model. Among others:\n", "\n", "* `fix_ef_names` and `ran_ef_names` are Arrays of names of the fixed and random effects.\n", "\n", "* `reml` is an indicator whether the profiled REML criterion or the profiled deviance function was optimized by the model fitting algorithm.\n", "\n", "* `formula` returns the R-like formula used to fit the model as a String.\n", "\n", "* `model_data`, `optimization_result` and `dev_fun` store the various model matrices in an `LMMData` object, the results of the utilized optimization algorithm, and the corresponding objective function as a `Proc`.\n", "\n", "* `sigma2` is the residual variance (unless `weights` was specified in the model fit).\n", "\n", "* `sigma_mat` is the covariance matrix of the multivariate normal random effects vector.\n", "\n", "We can look at some of these parameters for our example model:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "REML criterion used: \ttrue\n", "Residual variance: \t0.9496833447256825\n", "Formula: \tAggression ~ Age + Species + (Age | Location)\n", "Variance of the intercept due to 'location' (i.e. variance of b0): \t10870.932406181171\n", "Variance of the effect of 'age' due to 'location' (i.e. variance of b1): \t0.024233390356817094\n", "Covariance of b0 and b1: \t-0.9716403033290799\n" ] } ], "source": [ "puts \"REML criterion used: \\t#{model_fit.reml}\"\n", "puts \"Residual variance: \\t#{model_fit.sigma2}\"\n", "puts \"Formula: \\t\" + model_fit.formula\n", "puts \"Variance of the intercept due to 'location' (i.e. variance of b0): \\t#{model_fit.sigma_mat[0,0]}\"\n", "puts \"Variance of the effect of 'age' due to 'location' (i.e. variance of b1): \\t#{model_fit.sigma_mat[1,1]}\"\n", "puts \"Covariance of b0 and b1: \\t#{model_fit.sigma_mat[0,1]}\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some further convenience methods are (apart from `#fix_ef_summary` and `#ran_ef_summary` as seen above):\n", "\n", "* `#sigma` returns the square root of `sigma2`.\n", "\n", "* `#theta` returns the optimal solution of the minimization of the deviance function or the REML criterion (whichever was used to fit the model).\n", "\n", "* `#deviance` returns the value of the deviance function or the REML criterion at the optimal solution.\n", "\n", "* `#ran_ef_cov` is a version of `#ran_ef_summary` that returns the covariance structure of the random effects rather than the correlation structure.\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Residual standard deviation: \t0.9745169802141379\n", "REML criterion: \t333.7155391015166\n" ] } ], "source": [ "puts \"Residual standard deviation: \\t#{model_fit.sigma}\"\n", "puts \"REML criterion: \\t#{model_fit.deviance}\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fitted values and residual\n", "\n", "There are methods to get the fitted (i.e. predicted) values of the response variable, with or without inclusion of random effects, as well as the model residuals (which are differences between the true and the predicted values)." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitted values at the population level:\n" ] }, { "data": { "text/plain": [ "[1002.9001750573, 814.0929544616347, 509.58198950318774, 103.00035396737837, 998.6369897230617, 106.42030776086267, 807.8049683711317, 1000.525279718662, 111.04534458509147, 1002.7858717888012, 1008.0797460241316, 498.79959889531176, 100.90435860387743, 111.03358767747443, 102.37593154060778, 812.6239941710414, 97.21856806594167, 800.9297900613121, 510.0032786927976, 502.6950542857511, 503.7329279637203, 999.8864877381372, 497.2875299434561, 1003.3325679929926, 804.1328942254158, 811.9113949371432, 99.91285939484214, 803.9356394420636, 998.4671677241491, 99.50267395131493, 500.71140270614603, 1014.38536747606, 112.41241167633723, 1010.303107886817, 114.52212343205804, 813.8107886788262, 811.7526766843134, 498.6990120190328, 1005.0836940663943, 110.67826780282678, 501.00205958890024, 810.6305451684223, 511.318092861301, 1000.7623773556052, 508.8465296155897, 1008.3730155530228, 1011.249538949987, 102.4595362169955, 1010.9490846442187, 1007.0346875692853, 506.2286581861998, 502.15031756616247, 813.0589997528712, 499.75256157382466, 812.9329395767553, 498.25682166032595, 514.1586923738798, 499.76366531990743, 510.7733561417124, 798.2159038863833, 511.4147607683742, 506.48077853843154, 1012.1763751671288, 1013.9856326170814, 101.35504006252984, 1013.1593832762185, 1006.1091576752121, 106.82265526597848, 1002.7754212042528, 806.3314359497986, 1002.5860043593118, 514.9699189994543, 1007.2214917680891, 806.255016050288, 502.0582217898292, 1011.9458091455283, 804.6221122145907, 997.8786691817638, 802.2648522373782, 1011.1705064043393, 806.8657220848387, 1011.1763848581478, 798.7684785443834, 1009.0999843406752, 806.0166120902761, 102.78481066106633, 99.6640048502818, 506.1835900403346, 998.8949885291019, 499.4357782296994, 814.6050331045093, 1012.1182437905779, 109.60251075586939, 501.6904918460301, 98.0232630761733, 106.71945574356243, 99.98274767900989, 114.889853375857, 798.1950027172865, 106.6208283518863]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "puts \"Fitted values at the population level:\"\n", "model_fit.fitted(with_ran_ef: false)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model residuals:\n" ] }, { "data": { "text/plain": [ "[-1.8041727180516318, -1.1462465432206272, -0.5102357042338213, -1.438530578977577, 1.0108397561168658, -1.059491760131607, 2.1172177445057514, 0.8212947077424815, -0.024972828168071004, 0.04645157374602604, -0.2521512808752959, 0.5337610637689068, -0.5306298541011536, -0.8746282732136024, -0.5173146938876982, -1.076703511382675, -0.37352823345474917, 0.4342880104853748, -0.008686417241847266, 0.9633162152576915, -1.8591988675074163, -0.06035456182189591, -0.8747549066542888, 2.156663624530097, -0.12014582672816232, 1.5012618544112684, 0.5916114379897799, -1.4808531218288863, -0.16478112799086375, 2.0314537895963163, 0.010773844423852097, 0.8539551421497436, -1.2568149547061997, 0.07403493882179646, -0.11003692956987265, 1.027133585975207, 1.2893893800322758, -0.14846806796293777, -0.4342951733146947, 1.7157974749037237, 1.017366682325644, -0.6984875406660649, 0.16845508494236583, -1.423216161059372, -0.14893911781359748, 0.1337897922242064, -0.21024649931018757, 0.40638495225081783, -0.40370493998443635, -1.5086471211782282, 0.22131784542784771, -0.3343567440182369, -1.3354988341066019, 1.1064057146309096, 0.3472979813135453, 0.21318943374672017, -0.44914337321006315, -0.3144066453174901, -0.8024013676736104, -1.138768647945767, -0.270266988866922, 0.32795235324385885, -0.07521784973812373, 1.8354848431990831, 0.5871444360181073, -0.30368225978327246, -0.8505153742187304, -0.5012451945714531, -0.8060105059505531, 0.5330080478594255, 1.4260343762148295, -1.050545171033491, -0.6140910982546757, -0.2504994695145797, 0.6581363886527924, 0.4752780907743954, 0.7314233385786792, -0.7084492435404854, -1.2272348687666863, -0.7943531062586544, 0.24601716633890192, 0.45095282087356736, 0.2410007816512234, 0.5882243348621614, -0.045955678732184424, 2.3011995430929595, -0.805550441639582, 1.2857127416019694, 0.9605154956996103, -0.05181584374145132, -0.04108522357819311, -0.4196304756428617, 1.2953557251604657, 0.3168318472534679, -1.6086909528895887, 0.39499568456598766, -1.2512383164572825, -0.1762592429552683, 0.09344137531570595, 1.2049892111455733]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "puts \"Model residuals:\"\n", "model_fit.residuals" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Assessing the quality of the model fit\n", "\n", "We can assess the goodness of the model fit (to some extent) by plotting the residuals agains the fitted values, and checking for unexpected patterns. We use the gem [gnuplotrb](https://github.com/dilcom/gnuplotrb) for plotting." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "Gnuplot\n", "Produced by GNUPLOT 5.0 patchlevel 3 \n", "\n", "\n", "\n", "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t \n", "\t \n", "\t\n", "\t\n", "\t \n", "\t \n", "\t\n", "\n", "\n", "\n", "\n", "\t\t\n", "\t\t-2\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t-1.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t-1\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t-0.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 0\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 0.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 1\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 1.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 2\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 2.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t-200\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 0\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 200\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 400\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 600\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 800\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 1000\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 1200\n", "\t\n", "\n", "\n", "\n", "\n", "\t\n", "\n", "\t\n", "\t\tResiduals\n", "\t\n", "\n", "\n", "\t\n", "\t\tFitted\n", "\t\n", "\n", "\n", "\n", "\tgnuplot_plot_1\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "\t\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\t\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "# \"Fitted\", :ylabel => \"Residuals\"], @datasets=Hamster::Vector[#, @options=Hamster::Hash[:pointtype => 6, :notitle => true, :with => \"points\"]>], @cmd=\"plot \">" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "require 'gnuplotrb'\n", "include GnuplotRB\n", "\n", "x, y = model_fit.fitted, model_fit.residuals\n", "fitted_vs_residuals = Plot.new([[x,y], with: 'points', pointtype: 6, notitle: true],\n", " xlabel: 'Fitted', ylabel: 'Residuals')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that the residuals look more or less like noise, which is good.\n", "\n", "We can further analyze the validity of the linear mixed model somewhat, by checking if the residuals appear to be approximately normally distributed." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "Gnuplot\n", "Produced by GNUPLOT 5.0 patchlevel 3 \n", "\n", "\n", "\n", "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t \n", "\t \n", "\t\n", "\t\n", "\t \n", "\t \n", "\t\n", "\n", "\n", "\n", "\n", "\t\t\n", "\t\t 0.02\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 0.04\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 0.06\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 0.08\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 0.1\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 0.12\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 0.14\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 0.16\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 0.18\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 0.2\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t-2\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t-1.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t-1\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t-0.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 0\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 0.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 1\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 1.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 2\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 2.5\n", "\t\n", "\n", "\n", "\n", "\n", "\t\n", "\n", "\n", "\tgnuplot_plot_1\n", "\n", "\t\n", "\t\t\n", "\t\n", "\t\t\n", "\t\t\n", "\t\n", "\t\t\n", "\t\t\n", "\t\n", "\t\t\n", "\t\t\n", "\t\n", "\t\t\n", "\t\t\n", "\t\n", "\t\t\n", "\t\t\n", "\t\n", "\t\t\n", "\t\t\n", "\t\n", "\t\t\n", "\t\t\n", "\t\n", "\t\t\n", "\t\t\n", "\t\n", "\t\t\n", "\t\t\n", "\t\n", "\t\n", "\t\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\t\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "# \"fill solid 0.5\"], @datasets=Hamster::Vector[#, @options=Hamster::Hash[:notitle => true, :with => \"boxes\"]>], @cmd=\"plot \">" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bin_width = (y.max - y.min)/10.0\n", "bins = (y.min..y.max).step(bin_width).to_a\n", "rel_freq = Array.new(bins.length-1){0.0}\n", "y.each do |r|\n", " 0.upto(bins.length-2) do |i|\n", " if r >= bins[i] && r < bins[i+1] then\n", " rel_freq[i] += 1.0/y.length\n", " end\n", " end\n", "end\n", "bins_center = bins[0...-1].map { |b| b + bin_width/2.0 }\n", " \n", "residuals_hist = Plot.new([[bins_center, rel_freq], with: 'boxes', notitle: true],\n", " style: 'fill solid 0.5')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The histogram does not appear to be too different from a bell shaped curve, although it might be slightly skewed to the right.\n", "\n", "We can further explore the validity of the normality assumption by looking at the Q-Q plot of the residuals." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "Gnuplot\n", "Produced by GNUPLOT 5.0 patchlevel 3 \n", "\n", "\n", "\n", "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t \n", "\t \n", "\t\n", "\t\n", "\t \n", "\t \n", "\t\n", "\n", "\n", "\n", "\n", "\t\t\n", "\t\t-2.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t-2\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t-1.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t-1\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t-0.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 0\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 0.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 1\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 1.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 2\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 2.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t-2.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t-2\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t-1.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t-1\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t-0.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 0\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 0.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 1\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 1.5\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 2\n", "\t\n", "\n", "\n", "\t\t\n", "\t\t 2.5\n", "\t\n", "\n", "\n", "\n", "\n", "\t\n", "\n", "\t\n", "\t\tObserved quantiles\n", "\t\n", "\n", "\n", "\t\n", "\t\tNormal theoretical quantiles\n", "\t\n", "\n", "\n", "\t\n", "\t\tQ-Q plot of the residuals\n", "\t\n", "\n", "\tgnuplot_plot_1\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "\t\n", "\tgnuplot_plot_2\n", "\n", "\t\n", "\t\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\t\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "# \"Normal theoretical quantiles\", :ylabel => \"Observed quantiles\", :title => \"Q-Q plot of the residuals\"], @datasets=Hamster::Vector[#, @options=Hamster::Hash[:pointtype => 6, :notitle => true, :with => \"points\"]>, # true, :with => \"lines\"]>], @cmd=\"plot \">" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "require 'distribution'\n", "\n", "observed = model_fit.residuals.sort\n", "n = observed.length\n", "theoretical = (1..n).to_a.map { |t| Distribution::Normal.p_value(t.to_f/n.to_f) * model_fit.sigma}\n", "qq_plot = Plot.new([[theoretical, observed], with: 'points', pointtype: 6, notitle: true],\n", " ['x', with: 'lines', notitle: true],\n", " xlabel: 'Normal theoretical quantiles', ylabel: 'Observed quantiles',\n", " title: 'Q-Q plot of the residuals')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The straight line in the above plot is simply the diagonal. We see that the observed quantiles aggree with the theoretical values fairly well, as expected from a \"good\" model." ] } ], "metadata": { "kernelspec": { "display_name": "Ruby 2.3.1", "language": "ruby", "name": "ruby" }, "language_info": { "file_extension": ".rb", "mimetype": "application/x-ruby", "name": "ruby", "version": "2.3.1" } }, "nbformat": 4, "nbformat_minor": 0 }