{ "metadata": { "name": "", "signature": "sha256:5b33d116834aaf77cd394c7959d16b3a68926a99ff8ead6d821e47b75c2e0836" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Regression diagnostics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This example file shows how to use a few of the ``statsmodels`` regression diagnostic tests in a real-life context. You can learn about more tests and find out more information abou the tests here on the [Regression Diagnostics page.](http://statsmodels.sourceforge.net/stable/diagnostic.html) \n", "\n", "Note that most of the tests described here only return a tuple of numbers, without any annotation. A full description of outputs is always included in the docstring and in the online ``statsmodels`` documentation. For presentation purposes, we use the ``zip(name,test)`` construct to pretty-print short descriptions in the examples below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Estimate a regression model" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from __future__ import print_function\n", "from statsmodels.compat import lzip\n", "import statsmodels\n", "import numpy as np\n", "import pandas as pd\n", "import statsmodels.formula.api as smf\n", "import statsmodels.stats.api as sms\n", "import matplotlib.pyplot as plt\n", "\n", "# Load data\n", "url = 'http://vincentarelbundock.github.io/Rdatasets/csv/HistData/Guerry.csv'\n", "dat = pd.read_csv(url)\n", "\n", "# Fit regression model (using the natural log of one of the regressaors)\n", "results = smf.ols('Lottery ~ Literacy + np.log(Pop1831)', data=dat).fit()\n", "\n", "# Inspect the results\n", "print(results.summary())" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Normality of the residuals" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Jarque-Bera test:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "name = ['Jarque-Bera', 'Chi^2 two-tail prob.', 'Skew', 'Kurtosis']\n", "test = sms.jarque_bera(results.resid)\n", "lzip(name, test)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Omni test:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "name = ['Chi^2', 'Two-tail probability']\n", "test = sms.omni_normtest(results.resid)\n", "lzip(name, test)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Influence tests\n", "\n", "Once created, an object of class ``OLSInfluence`` holds attributes and methods that allow users to assess the influence of each observation. For example, we can compute and extract the first few rows of DFbetas by:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from statsmodels.stats.outliers_influence import OLSInfluence\n", "test_class = OLSInfluence(results)\n", "test_class.dfbetas[:5,:]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Explore other options by typing ``dir(influence_test)``\n", "\n", "Useful information on leverage can also be plotted:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from statsmodels.graphics.regressionplots import plot_leverage_resid2\n", "fig, ax = plt.subplots(figsize=(8,6))\n", "fig = plot_leverage_resid2(results, ax = ax)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Other plotting options can be found on the [Graphics page.](http://statsmodels.sourceforge.net/stable/graphics.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multicollinearity\n", "\n", "Condition number:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "np.linalg.cond(results.model.exog)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Heteroskedasticity tests\n", "\n", "Breush-Pagan test:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "name = ['Lagrange multiplier statistic', 'p-value', \n", " 'f-value', 'f p-value']\n", "test = sms.het_breushpagan(results.resid, results.model.exog)\n", "lzip(name, test)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Goldfeld-Quandt test" ] }, { "cell_type": "code", "collapsed": false, "input": [ "name = ['F statistic', 'p-value']\n", "test = sms.het_goldfeldquandt(results.resid, results.model.exog)\n", "lzip(name, test)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Linearity\n", "\n", "Harvey-Collier multiplier test for Null hypothesis that the linear specification is correct:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "name = ['t value', 'p value']\n", "test = sms.linear_harvey_collier(results)\n", "lzip(name, test)" ], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }