{ "metadata": { "name": "", "signature": "sha256:01f649fcdf6c5cf3869fb9023ad91d38fa92857e406da36c6cc98be773a3d5fc" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Statistics in Python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are two major modules for doing statistical analyses in Python:\n", "\n", "* [Scipy](http://docs.scipy.org/doc/scipy/reference/stats.html) - basic statistics and distribution fitting\n", "* [Statsmodels](http://statsmodels.sourceforge.net/stable/index.html) - advanced statistical modeling focused on linear models\n", "(including ANOVA, multiple regression, generalized linear models, etc.)\n", "\n", "To see the full functionality of these modules you'll need to look through\n", "their pages, but here are a few examples to show you that a lot of the\n", "standard statistical tests and models you need to perform can be easily\n", "done using Python." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Imports\n", "-------\n", "You'll want the `stats` module from Scipy and the `api` and `formula.api`\n", "modulesfrom statsmodels. We'll also go ahead and import Numpy for use in\n", "generating data and Matplotlib for some graphing." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import scipy.stats as stats\n", "import statsmodels.api as sm\n", "import statsmodels.formula.api as smf\n", "\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "\n", "%matplotlib inline" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Descriptive Statistics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`stats.describe()` gives basic descriptive statistics on any list of numbers. It returns (in the following order) the size of the data, it's min and max, the mean, the variance, skewness, and kurtosis." ] }, { "cell_type": "code", "collapsed": false, "input": [ "x = [1, 2, 3, 4, 5]\n", "stats.describe(x)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 2, "text": [ "(5, (1, 5), 3.0, 2.5, 0.0, -1.3)" ] } ], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also get this kind of information for columns in a DataFrame using the `.describe()` method." ] }, { "cell_type": "code", "collapsed": false, "input": [ "data = pd.DataFrame([[1, 2, 3.5], [2, 2.4, 3.1], [3, 1.8, 2.5]], columns=['a', 'b', 'c'])\n", "print data\n", "data.describe()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ " a b c\n", "0 1 2.0 3.5\n", "1 2 2.4 3.1\n", "2 3 1.8 2.5\n", "\n", "[3 rows x 3 columns]\n" ] }, { "html": [ "
<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>a</th><th>b</th><th>c</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>count</th><td>3.0</td><td>3.000000</td><td>3.000000</td></tr>\n",
"    <tr><th>mean</th><td>2.0</td><td>2.066667</td><td>3.033333</td></tr>\n",
"    <tr><th>std</th><td>1.0</td><td>0.305505</td><td>0.503322</td></tr>\n",
"    <tr><th>min</th><td>1.0</td><td>1.800000</td><td>2.500000</td></tr>\n",
"    <tr><th>25%</th><td>1.5</td><td>1.900000</td><td>2.800000</td></tr>\n",
"    <tr><th>50%</th><td>2.0</td><td>2.000000</td><td>3.100000</td></tr>\n",
"    <tr><th>75%</th><td>2.5</td><td>2.200000</td><td>3.300000</td></tr>\n",
"    <tr><th>max</th><td>3.0</td><td>2.400000</td><td>3.500000</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>8 rows \u00d7 3 columns</p>\n",
"</div>
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 3, "text": [ " a b c\n", "count 3.0 3.000000 3.000000\n", "mean 2.0 2.066667 3.033333\n", "std 1.0 0.305505 0.503322\n", "min 1.0 1.800000 2.500000\n", "25% 1.5 1.900000 2.800000\n", "50% 2.0 2.000000 3.100000\n", "75% 2.5 2.200000 3.300000\n", "max 3.0 2.400000 3.500000\n", "\n", "[8 rows x 3 columns]" ] } ], "prompt_number": 3 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "T-tests" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Standard 1-sample and 2-sample t-tests (as well as many other basic statistical tests) are available in `scipy.stats`.\n", "T-tests return two values, the t-statistic and the p-value. First, let's generate some data that is normally distributed around zero." ] }, { "cell_type": "code", "collapsed": false, "input": [ "x1 = np.random.randn(100, 1)\n", "x2 = np.random.randn(100, 1)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "One-sample t-test" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To determine if the mean of `x1` is different from some number use a one-sample t-test.\n", "We'll compare the mean of `x1` to both 0 (which it shouldn't be different from) and to 1 (which it should be different from)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "tstat, pval = stats.ttest_1samp(x1, 0)\n", "print \"Comparison of the mean of x1 to 0.\\nT-statistic = %s; P-value = %s.\" % (tstat, pval)\n", "\n", "tstat, pval = stats.ttest_1samp(x1, 1)\n", "print \"Comparison of the mean of x1 to 5.\\nT-statistic = %s; P-value = %s.\" % (tstat, pval)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Comparison of the mean of x1 to 0.\n", "T-statistic = [ 1.17526772]; P-value = [ 0.24270646].\n", "Comparison of the mean of x1 to 5.\n", "T-statistic = [-9.46674519]; P-value = [ 1.59375523e-15].\n" ] } ], "prompt_number": 5 }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Two-sample t-test" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To determine if the mean of two different sets of numbers are different from one another wuse a two-sample t-test. We'll compare the means of `x1` and `x2`, which should be different from one another since they should both be about zero." ] }, { "cell_type": "code", "collapsed": false, "input": [ "tstat, pval = stats.ttest_ind(x1, x2)\n", "print \"Comparison of the means of x1 and x2.\\nT-statistic = %s; P-value = %s.\" % (tstat, pval)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Comparison of the means of x1 and x2.\n", "T-statistic = [ 1.54942032]; P-value = [ 0.12287759].\n" ] } ], "prompt_number": 6 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Distribution fitting and analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`scipy.stats` also includes a powerful general system for working with statistical distribution.\n", "Let's generate some random normally distributed numbers to analyze." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "u, sigma = 4, 2\n", "random_numbers = stats.norm.rvs(u, sigma, size=50)\n", "random_numbers" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 7, "text": [ "array([ -2.69438161e-02, 2.92683231e+00, 6.59328450e+00,\n", " 5.09170230e+00, 4.99668460e+00, 5.50528668e+00,\n", " 7.50782290e+00, 4.63995490e+00, 3.17923549e+00,\n", " 7.95144616e+00, 2.19070256e+00, 6.40680973e+00,\n", " 4.65721099e+00, 1.77901732e+00, 2.63237191e+00,\n", " 4.49198740e+00, 5.56909943e+00, -1.83542925e-01,\n", " 6.42398538e+00, 6.55107469e+00, 3.45866471e+00,\n", " 3.66296653e+00, 4.46982893e+00, 1.11183439e-01,\n", " 3.14154367e+00, 3.35583976e+00, 4.99528367e+00,\n", " 3.56179539e+00, 5.21900454e+00, 2.35695456e+00,\n", " 4.38756723e+00, 5.83026447e+00, 1.46130374e+00,\n", " 1.21498908e+00, 7.73665200e+00, -2.39077287e-03,\n", " 3.45132547e+00, 4.62012917e+00, 8.59613520e-01,\n", " 3.73481950e+00, 7.26835158e+00, 3.26567510e+00,\n", " -6.84996675e-01, 3.19824323e+00, 6.74505956e+00,\n", " 3.50754549e+00, 4.60034178e+00, 4.63561359e+00,\n", " 6.11115427e+00, 5.12909660e+00])" ] } ], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can then fit distributions to this data." ] }, { "cell_type": "code", "collapsed": false, "input": [ "stats.norm.fit(random_numbers)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 8, "text": [ "(4.0057489130411792, 2.1683842014426378)" ] } ], "prompt_number": 8 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can do simple univariate OLS regression in Scipy,\n", "but for anything more complicated you'll need statsmodels,\n", "so we'll just do the basics in statsmodels as well.\n", "\n", "First, we'll generate some data and plot it." 
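] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an aside before we start: for the simple univariate case, Scipy's `stats.linregress` is often all you need; it returns the slope, intercept, correlation coefficient, p-value, and standard error in one call.\n", "A rough sketch on throwaway data (the names `xs` and `ys` are just placeholders):" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "# simple univariate OLS using scipy alone\n", "xs = np.arange(10)\n", "ys = 2 * xs + np.random.randn(10)\n", "slope, intercept, r_value, p_value, std_err = stats.linregress(xs, ys)\n", "print slope, intercept, r_value ** 2" ], "language": "python", "metadata": {}, "outputs": [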
] }, { "cell_type": "code", "collapsed": false, "input": [ "#generate some data\n", "x = np.array(range(20))\n", "y = 3 + 0.5 * x + 2 * np.random.randn(20)\n", "data = pd.DataFrame(zip(x, y), columns=['x', 'y'])\n", "\n", "#plot the data\n", "plt.plot(data['x'], data['y'], 'bo')\n", "plt.show()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "display_data", "png": "iVBORw0KGgoAAAANSUhEUgAAAXAAAAD/CAYAAADoiI2GAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAECNJREFUeJzt3X2MI/ddx/H3Ze9qSyENVUqrVkSccCgVqqBFVNcGCK6a\n3b32CKlOPAmK1CAECLBXKhDI7p1uo2T/qKpEjV2aopZCqSCRUoUQznDnpT2flTYKT0Wl4iFkEA8V\nBZWKEiWqjXa7/DG+Pe/i3NqzMx7/xu+XZMWe88M3I9/nxt/f7zcDkiRJkiRJkiRJkiRJkiRJL+kE\ncGnftp8APptDLZKkgaMH/PndwLuBF4a2vQn46cwqkiSl4jRwC/D04PFNwB8DbxjaJkmaUceJw/o6\n4Ang9UPbJEkz7DhxWL8Z+AJxP/xp4H+AB/MrS5Lm20E98GF/Ttw6AfgW4FHgvaOeWKlUdqIoOmRp\nkjR3IuK29ViuG/N5O/seHxmx7WoFUcTOzo63lG7nzp3LvYai3NyXxdif589fplJZHcRQfKtUVjl/\n/nLu++QwN6AybniPG+D/DNw6xjZJmopGo00UbezZFkUbNJubOVWUj3GPwCVpZvT7o7u/vd7ClCvJ\nlwEegGq1mncJheG+TFde+7NU2hq5vVzennIl+TqS0fvuDPo5kjRSq9Wl0WjT7x+lVNqiXl/i1Knb\nxn7tysrFPW2USmWVhx46OfZ7zKIjR47ABLk8ySwUSUrFqACOojWAsQL4ynOazbP0eguUy9vUamGH\ndxIegUuauuXlM7Tb94/YfpYLF+7LoaLZMOkRuD1wSVPnIGQ6DHBJU+cgZDoMcElTV68vUams7dlW\nqaxSqy3mVFGY7IFLykWr1aXZ3BwahFycu0HI/SbtgRvgkubWYaYypvkeVziNUJLGcNipjGm9x2HY\nA5c0l9I4n0re52QxwCXNpTSmMuY9HdIAlzSX0pjKmPd0SANc0lxKYypj3tMhnYUiaW6lMZUxzemQ\nTiOUpEB5LhRJmhMGuCQFygCXpEAZ4JIUKANckgJlgEtSoAxwSQrUOAF+Arg0uP9GoDt4fAF4VUZ1\nSZIOcFCA3w18BCgNHn8A+CXgbcDjwK9lV5ok6VoOCvDngNNcXRn048DnB/ePAV/LqC5J0gEOuqDD\n48Dxocf/MfjvrcAvAt+fQU2SpDEkuSLPjwGrwDuBr7zUk9bX13fvV6tVqtVqgo+SpOLqdDp0Op3E\nrx/npCnHgUeAtwLvBn4WuBP472u8xpNZSRlI8/qLmj1ZXRNzh7hf/hDwL8StFYDLwPr45UlKKu/r\nL2r2eDpZKRDLy2dot+8fsf0sFy7cl0NFSpunk5UKKu/rL2r2GOBSIPK+/qJmjwGuudBqdVlePkO1\nus7y8hlarW7eJU0s7+svavYkmUYoBaUog39Xam02zw5df/FkUP8PSpeDmCo8B/8UCgcxpX0c/FNR\nGeAqPAf/VFQGuArPwT8VlT1wzYVWq0uzuTk0+Lfo4J9mzqQ9cANckmaEg5iSNCcMcEkKlAt5JE3M\n09rOBgNc0kSKsrK1CGyhSJpIo9HeE94AUbRBs7mZU0XzywCXNBFXts4OA1zSRFzZOjsMcEkTcWXr\n7HAhj6SJubI1G67ElKRAuRJTkuaEAS5JgTLAJSlQBrgkBWqcAD8BXBrcvwV4CugCHyK7QVBJ0gEO\nCvC7gY8ApcHjB4FV4Dbi8L4zu9IkSddyUIA/B5zm6pH2dxMffQP8CXB7RnVJkg5wUIA/Dgyvmx1u\nmbwA3Jh6RZKksUx6OtmvD92/AfjqSz1xfX199361WqVarU74UZJUbJ1Oh06nk/j14wxCHgceAd4K\nPAk8AFwGPgx8CnhsxGtciSlJE5p0Jea4R+BX0viXiQc1Xwb8LfDJSYqTJKXHc6FI0ozwXCiSNCcM\ncEkKlAEuSYEywCUpUAa4JAXKAJekQE26ElOaW61Wl0ajTb9/lFJpi3p9yetAKlcGuDSGVqvLyspF\nomhjd1sUxVdmN8SVF1so0hgajfae8AaIog2azc2cKpI8ApfG0u+P/qvS6y1MuZLDsxVUHAa4NIZS\naWvk9nJ5e8qVHI6toGKxhSKNoV5folJZ27OtUlmlVlvMqaJkbAUVi0fg0hiuHJ02m2fp9RYol7ep\n1U4Gd9RapFaQDHBpbKdO3RZcYO9XlFaQYrZQpDlSlFaQYp4PXJozrVaXZnNzqBW0GPwvi6KY9Hzg\nBrgkzQgv6CBJc8IAl6RAOQulwFxxJxWbAV5QrriTis9BzIJaXj5Du33/iO1nuXDhvhwqEvirSNc2\n6SCmR+AF5Yq72eOvIqXNQcyCcsXd7PE8JEpbkgC/DvgY8BTQBb491YqUClfczR5/FSltSVooS8D1\nwPcBtwMbwA+nWZQOrygnXyoSfxUpbUkC/GvAjcSN9huB/021IqWmCCdfKpJ6fYkoWtvTRol/FZ3M\nsSqFLMkslKPAnwKvAW4C7gCe3vccZ6FII3geEl3LNM6FskrcQlkDvhn4NPAG9h6J75w7d273QbVa\npVqtJvgoSSquTqdDp9PZfXzvvfdCxgG+ATwPvI84yL8AfAdxa+UKj8AlaULTOAL/RuC3gVcCx4AP\nAI/ue44BLkkT8nSykhQoTycrSXPCAJekQBngkhQoA1ySAmWAS1KgDHBJCpQBLkmBMsAlKVAGuCQF\nygCXpEAZ4JIUKANckgJlgEtSoAxwSQqUAS5JgTLAJSlQBrgkBcoAl6RAHc27AOkgrVaXRqNNv3+U\nUmmLen2JU6duy7ssKXcGuGZaq9VlZeUiUbSxuy2K1gAMcc09L2o8wzzyhOXlM7Tb94/YfpYLF+7L\noSIpO5Ne1Ngj8BnlkWes3x/9Fe31FqZciTR7HMScUY1Ge094A0TRBs3mZk4V5aNU2hq5vVzennIl\n0uwxwGfUrBx5tlpdlpfPUK2us7x8hlarO9XPr9eXqFTW9myrVFap1RanWoc0i5K2UO4B7gCOAR8E\nPp5aRQJm48hzFto4Vz6n2TxLr7dAubxNrXZyrtpIUpqqwJOD+9cD9454zo4O5/z5yzuVyuoO7Oze\nKpV7ds6fvzy1GpaW1vZ8/pXb8vK
ZqdUgzRNgotkfSY7Al4C/AZ4AXg78aoL30AFm4chzVto4kkZL\nEuDfBNwM/CDwrcRH469PsyjFTp26LddWwSy0cSS9tCQB/l/A3wFbwLNAD3jlYPuu9fX13fvVapVq\ntZq0RuWkXl8iitb29MDjAcSTOVYlFUen06HT6SR+fZKFPKeAFeJWymuBy8Dr2Nu7GbRzFLpWq0uz\nuTnUxll0AFHKyKQLeZKuxHwf8DbiaYj3APsnJxvgkjShaQX4QQxwSZrQpAHuQh5JCpQBLkmBMsAl\nKVAGuCQFygCXpEAZ4JIUKANckgJlgEtSoAxwSQqUAS5JgTLAJSlQBrgkBcoAl6RAGeCSFKikV6XP\nTKvVpdFo0+8fpVTaol5f8gICkjTCTAV4q9VlZeXinkt4RdEagCEuSfvMVAul0WjvCW+AKNqg2dx/\nwR9J0kwdgff7o8vp9RbGfo80WjC2cSSFYKYCvFTaGrm9XN4e6/VptGBs40gKxUy1UOr1JSqVtT3b\nKpVVarXFsV6fRgvGNo6kUMzUEfiVI9xm8yy93gLl8ja12smxj3zTaMGk8R6SNA0zFeAQh3jSVsVh\nWzBpvYckTcNMtVAO67AtmLTeQ5Km4UhG77uzs7OT0VtfW6vVpdncHGrBLCaahXLY95CkSR05cgQm\nyOXDBPirgL8E3g48u+/PcgtwSQrVpAGetIVyDPhN4MWEr5ckHVLSAH8/8DDwpRRrkSRNIEmAvwf4\nMtAePM6qjy5JuoYk0wjvAnaA24E3Ah8H7gT+c/hJ6+vru/er1SrVajVpjZJUSJ1Oh06nk/j1hz16\nvgT8HA5iStKhTWsQU5KUs8LNA5ekUHkELklzwgCXpEAZ4JIUKANckgJlgEtSoAxwSQqUAS5JgTLA\nJSlQBrgkBcoAl6RAzdxFjVU8rVaXRqNNv3+UUmmLen3JS9RJKTDAlalWq8vKykWiaGN3WxTFF402\nxKXDsYWiTDUa7T3hDRBFGzSbmzlVJBWHAa5M9fujf+T1egtTrkQqHgNcmSqVtkZuL5e3p1yJVDwG\nuDJVry9Rqazt2VaprFKrLeZUkVQcXtBBmWu1ujSbm/R6C5TL29Rqiw5gSiNMekEHA1ySZoRX5JGk\nOWGAS1KgDHBJCpQBLkmBMsAlKVAGuCQFKkmAHwM+AXSBZ4A7Uq1IkjSWJGcj/Engy8BPAa8A/hr4\nozSLkiQdLMlCnusHr3sBuAn4M6Cy7zku5JGkCU26kCfJEfiLg//eADwGrF3juZKkjCS9oMPNwOPA\nbwCPjnrC+vr67v1qtUq1Wk34UZJUTJ1Oh06nk/j1SVoorwY6wC8Al17iObZQJGlC0ziZ1UPAjwD/\nMLTtHUBv6LEBLkkT8myEkhQoz0YoSXMiswBfXj5Dq9XN6u0lae4lnYVyoHb7fqIonmHo1VckKX2Z\ntlCiaINmczPLj5CkuZV5D7zXW8j6IyRpLmUe4OXydtYfIUlzKdMAr1RWqdUWs/wISZpbmQ1iLi+f\npVY76QCmJGXEhTySNCOmcTZCjaHV6tJotOn3j1IqbVGvL/lrRFKqDPAMtFpdVlYuEkUbu9ucEy8p\nbS6lz0Cj0d4T3uCceEnpM8Az0O+P/mHjnHhJaTLAM1AqbY3c7px4SWkywDNQry9Rqey90pxz4iWl\nzWmEGWm1ujSbm/R6C5TL29Rqiw5gSromL+ggSYHygg6SNCcMcEkKlAEuSYEywCUpUAa4JAXKAJek\nQCUJ8OuADwOfBS4BlVQrkiSNJUmAvwt4GXAr8OvAA6lWpP+n0+nkXUJhuC/T5f7MV5IA/17gwuD+\nM8D3pFeORvEvSXrcl+lyf+YrSYC/HHh+6PF2wveRJB1CkuB9Hrhh33t8PZ1yJEnjSnIulNPAHcBd\nwFuAs8Cpfc95Dgc3JWlSEXBLlh9wBHgY+Mzg9rosP0ySJEmSJKlYXOCTvr8i3peXgN/KuZZQnSDe\nfxD3Fp8CusCHyO58+EU2vD/fBHyRq9/RH82rqAAdAz5B/F18hnhcMdfv52ngY4P7J4AnpvnhBVQm\nDnAldzfweeKDCoAngSuXRXqYeFGaxrd/f/4M8N78ygnae4AHB/dfAfwr8Ifk+P18gL3/An9xmh9e\nQCeAvwcuAp8aPNZkThMf1Tw9eDz8nfwh4INTryhs+/fnw8TfzcvAR4FvyKmuEF3P1f11E/Hsk38b\n+vOxvp9pLsBxgU+6XgTeDywDPw/8Hu7PST0ObA09Hv5J+gJw43TLCd7+/fkM8CvADwD/BJzLo6hA\nvUj8HbwBeAw4w96/32N9P9MMBBf4pOtZ4tAG+EfgK8Br8iunEIa/jzcAX82rkIL4A+Bzg/tPEPfE\nNb6bgU8Dvws8QoLvZ5oB/hngnYP7byHulSm5u7h6orDXEv/C+VJ+5RTC54iPFgHeQTxYpOQuAG8e\n3H878Bc51hKaVwNt4nGF3xlsy/X76QKfdB3l6ih1l/gfRU3uOFcH3b4N6AwefxRnoSRxnKv787uI\nZ01cAn4fe+CTeAj4d67O4LkEfCd+PyVJkiRJkiRJkiRJkiRJkiRJkqTZ83/wBRi9MlvpAQAAAABJ\nRU5ErkJggg==\n", "text": [ "" ] } ], "prompt_number": 9 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's do a regression. We'll use formulas to specify regression models.\n", "This allows us to write our code in simple and intuitive ways,\n", "and means the we don't have to remember how to create design matrices for more complicated models.\n", "To do this we'll need to be using Pandas to manage the data.\n", "The regression function is name `ols` for \"ordinary least-squares\" the standard approach to regression.\n", "It takes two arguments. The first is a formula describing the regression we want to fit.\n", "In this case we just want to model the effect of `x` on `y` so we use `y ~ x`,\n", "where `x` and `y` are the names of columns in a data frame.\n", "The second argument is the name of the data frame that contains the columns `x` and `y`." 
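] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For contrast, here is roughly what the same fit looks like without formulas, using the array interface: you build the design matrix yourself (`sm.add_constant` adds the column of ones that represents the intercept) and pass the response and the matrix to `sm.OLS`.\n", "This sketch is only for comparison and the variable names are arbitrary; the formula version of the same fit follows." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "# the non-formula route: build the design matrix by hand\n", "X = sm.add_constant(data['x'])  # column of ones plus x\n", "array_fit = sm.OLS(data['y'], X).fit()\n", "print array_fit.params" ], "language": "python", "metadata": {}, "outputs": [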
] }, { "cell_type": "code", "collapsed": false, "input": [ "results = smf.ols('y ~ x', data).fit()\n", "print results.summary()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: y R-squared: 0.785\n", "Model: OLS Adj. R-squared: 0.773\n", "Method: Least Squares F-statistic: 65.84\n", "Date: Sat, 04 Oct 2014 Prob (F-statistic): 2.00e-07\n", "Time: 11:41:12 Log-Likelihood: -40.471\n", "No. Observations: 20 AIC: 84.94\n", "Df Residuals: 18 BIC: 86.93\n", "Df Model: 1 \n", "==============================================================================\n", " coef std err t P>|t| [95.0% Conf. Int.]\n", "------------------------------------------------------------------------------\n", "Intercept 2.2481 0.832 2.704 0.015 0.501 3.995\n", "x 0.6071 0.075 8.114 0.000 0.450 0.764\n", "==============================================================================\n", "Omnibus: 1.724 Durbin-Watson: 2.053\n", "Prob(Omnibus): 0.422 Jarque-Bera (JB): 1.008\n", "Skew: -0.549 Prob(JB): 0.604\n", "Kurtosis: 2.940 Cond. No. 21.5\n", "==============================================================================\n" ] } ], "prompt_number": 10 }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, ``.summary()`` presents us with most of the information we would want about the regression.\n", "We can also pull this information out in individual pieces. For example," ] }, { "cell_type": "code", "collapsed": false, "input": [ "intercept, slope = results.params\n", "r2 = results.rsquared\n", "print slope, intercept, r2" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0.607133890043 2.24811810288 0.78529268393\n" ] } ], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This makes it easy to store the key results of large numbers of analyses, or present\n", "the results in alternative ways, like graphs." 
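] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a sketch of the first idea, here is one way to fit the same model to several datasets (simulated here, just to have something to loop over) and collect the key numbers in a DataFrame.\n", "The names `collected` and `sim` are arbitrary:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "# fit the same model to several simulated datasets and keep slope and R-squared\n", "collected = []\n", "for i in range(3):\n", "    xs = np.arange(20)\n", "    sim = pd.DataFrame({'x': xs, 'y': 3 + 0.5 * xs + 2 * np.random.randn(20)})\n", "    fit = smf.ols('y ~ x', sim).fit()\n", "    collected.append([fit.params['x'], fit.rsquared])\n", "print pd.DataFrame(collected, columns=['slope', 'rsquared'])" ], "language": "python", "metadata": {}, "outputs": [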
] }, { "cell_type": "code", "collapsed": false, "input": [ "plt.plot(data['x'], data['y'], 'bo')\n", "plt.hold(True)\n", "x = np.array([min(x), max(x)])\n", "y = intercept + slope * x\n", "plt.plot(x, y, 'r-')\n", "plt.show()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "display_data", "png": "iVBORw0KGgoAAAANSUhEUgAAAXAAAAD/CAYAAADoiI2GAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAHYpJREFUeJzt3Xl4VFWax/GvbBWlEVG7naGbAS0UUAFREYQWi0FSLCKK\nCy7Y7o5bQstOEiAq0VYEJCWjPSrTjmNjjz600pRCaEwRRUUFJLKJlooLLgiCLCaQ5M4fJ0CCgVRV\n6ta9VfX7PE8eK9daXsvi5dR7znsOiIiIiIiIiIiIiIiIiIiIiIiIiIgcVg+g+JBr1wJvORCLiIhU\na1LPvx8HjAB21bjWDbjZtohERCQuhgHtgberfz8BeBU4s8Y1ERFxqXaYZN0IeBnoWOOaiIi4WDtM\nsu4OrMHUw98GdgAznAtLRCS91VcDr+k9TOkEoC3wAjCqrjt6vV4rHA43MDQRkbQTxpStI9IowvtZ\nh/x+VB3XDkYQDmNZln7i9DNlyhTHY0iVH72XqfF+LliwFK83pzoNmR+vN4cFC5Y6/p405AfwRpq8\nI03gnwO9IrgmIpIQhYVFhMMFta6FwwUEAosTH4xlwSuvwIYNCX/pSEfgIiKuUV5ed/W3rKxxYgNZ\nuxYyMyEnB3buTOxrowSeFHw+n9MhpAy9l/Hl1Pvp8VTUeT0jozIxAWzbBllZ0LcvDB0Kq1dD9+6J\nee0alMCTgJJO/Oi9jK+GvJ/BYAl+fx4+Xz5+fx7BYEnEj83OzsTrza11zevNISurf8zxRKSiAmbP\nhk6dTOlk/Xq45x5oEs16kPhx5lVFJK0FgyWMHLmoVh07HDYJefDgPvU+fv99AoFJlJU1JiOjkqys\nARE9NmZLlsDIkXDSSfDPf0Lnzva9VoSOsul5reoZVRGRX/D78ygqmlrH9UksXPiAAxEdQTgMY8ZA\naSlMn25KJkfZkzqPMs8b8ZOrhCIiCeeaScgj2bkTJk6EHj3Mz9q1cOmltiXvWCiBi0jCOT4JeSRV\nVfDss9CxI3zzjRl5T5gAGRlOR/YLqoGLSMJlZ2cSDufWqoGbScgBDkYFvPMOZGdDo0bw97/Deec5\nG089VAMXEUcEgyUEAotrTEL2t3cS8ki+/tqMsouL4aGH4LrrTBJPsGhr4ErgIpK2Xvv7Yr4d9yDD\nPl9OsM3ZHP9IHgOuiO5bQDBYQmFhEeXlTfB4KsjOzoz5L6JoE7hKKCKSfiyLFblTOXPadHZXXMRZ\nrOXzz07GOyGXWUcfE3ECbuhyyIbSJKaIpJfSUujXj1aBx7mhYh5X8hKfczIQ/X4qTu/JogQuIunh\nhx/grrugf3+48kpuO/s/KObff3G3aJYyOr0cUglcRFLbvn1QWAinnw5Nm5r29zvvpElGVZ13j2Yp\no9PLIZXARSR1LVoEXbvCggUQCsGsWXD88UB89lNxbE+WalqFIiKp5+OPYfRoM9qeMQMuvrjODsp4\nLGWM53JILSMUkfT1008wdSrMmQPjx5umHI/H6agipr1QRCT9VFWZpN2xI2zdCmvWwNixSZW8Y6F1\n4CKS3JYtM9u8ejwwfz6ce67TESWMEriIJKcvvzRlkjfegEcegauvdtVOgYmgEoqIJJc9e+D++6Fb\nNzj1VHOY8DXXpF3yBo3ARSRZWBa8+KKpbffsCStWQNu2TkflKCVwEXG/VatMnXvnTnjuOejj0K6F\nLqMSioi41/ffw+23w8CBcP318P77St41RJLAewDF1bfPAkqqf18I/MamuEQkne3daxpwzjgDWrQw\nde7bboPGLjpyzQXqS+DjgKeA/YspHwPuAfoC84Dx9oUmImnp1VfNie9LlsCbb5qDhI87zumoXKm+\nGvgnwDDguerfrwa+rb7dFPjZprhEJN1s2ACjRplT4GfOhEGDnI7I9eobgc8Dam63tT959wLuBmba\nEZSIpJHt203ivuACs9Xrhx8qeUcollUow4EcYBCw9XB3ys/PP3Db5/Ph8/lieCkRSVmVlfDMMzB5\nMgwdCmvXwm/Sa1otFAoRCoVifnwkK9/bAXOB84ERwO3AUODHIzxGm1mJ2CCe5y86aulSsyzw2GPN\nFq/dujkdkSvYdSamhSm3zAI2YUorAEuB/MjDE5FYOX3+Ylxs2mQacZYvh2nT4Mor07KDMl60naxI\nkvD78ygqmlrH9UksXPiAAxFFYfdus1/J7Nlmi9cxY+CYY5yOynV0Kr1IinL6/MWYWBbMnQsTJphJ\nylWroE0bp6NKGUrgIknC6fMXo/b++6bOXVZmknjv3k5HlHLUSi9pIRgswe/Pw+fLx+/PIxgscTqk\nqDl9/mLEvv0Wbr4ZhgyBW26B995T8raJRuCS8lJi8o+DsQYCk2qcvzjAPf8N5eXm9PeHHzYJ/KOP\nzCoTsY0mMSXlJfXkXzKwLHPq+6hR0KmTaX0/9VSno0pKmsQUOURSTv4li3Xr4N57zek4jz8Ofr/T\nEaUV1cAl5SXd5F8y+PFHM0Hp88HgwbB6tZK3A5TAJeUlzeRfMqiogCeeMKe/79tnRuDZ2dC0qdOR\npSWVUCTluX7yL1m8/jr88Y9w4omweDF06eJ0RGlPk5gicmSffWY6J1euNBOUl12m9nebRDuJqRKK\niNRt1y7IzYXu3eGcc2D9ehg2TMnbRZTARaS2qipzcHDHjmZ1yerVkJMDGRlORyaHUA1cRA5avtys\nLrEseOkl6NmzzrulzLa2SU4JXERg82aYOBH++U946CEYMQIa1f0FPVU6W1OBSigi6ayszCTsLl3g\nt78151L+4Q+HTd4AhYVFtZI3QDhcQCCw2O5o5RAagYukI8uCl1+G0aPhrLPg3XfhlFMieqg6W91D\nCVwk3Xz4oVnP/f338NRT0K9fVA9XZ6t7qIQiki62boW774aLLoLLLzeHK0SZvEGdrW6iEbhIqtu3\nD558Eh54AK6+2qznPv74mJ9Ona3uoU5MkVS2eLEpl7RuDY89Bmec4XREcgTaTlZE4JNPzATl2rUw\nY4Y5HUcdlClHNXCRVLJzJ4wfbxpwevUyCfySS5S8U5QSuEgqqKqCv/wFOnSALVvMSpPx48HjcToy\nsZFKKCLJ7q23TPt706bwyitm8ylJC5GMwHsAxdW32wNvAiXAf2LfJKiI1Oerr+C66+Cqq8xE5bJl\nSt5ppr4EPg54Ctj/PWw
GkAP0wSTvofaFJiJ1+vlnsySwa1fTPblhg0nkqnOnnfoS+CfAMA6OtM/G\njL4BXgMusikuETnU/h0CO3WC0lJYscIk8l/9yunIxCH11cDnAe1q/F7zr/hdQMt4ByQidfjgA1Mm\n2b4dnn0WLrzQ6YjEBaKdxKyqcbsFsP1wd8zPzz9w2+fz4fP5onwpEWHLFpg0yWw8dd99cOut0Fib\nRqWKUChEKBSK+fGRFM3aAXOB84H5wHRgKfAksAR4sY7HqBNTpCH27YPZs6GgwOzNPXkytGrldFRi\nM7s6Mfdn49GYSc1mwDrgpWiCE5EILFwI994LbdtCSYmpeYvUQXuhiLjFxo0wapT558yZMGiQVpak\nGZ1KL5JsduyAMWOgd2/o2xfWrIHBg5W8pV5K4CJOqayEp582p7/v2GES9+jR0KyZ05FJklArvYgT\n3njDtL83bw7BIJx9ttMRSRJSAhdJpC++gHHjzP4l06aZNniVSiRGKqGIJMKePZCfb0banTqZ9vfh\nw5W8pUE0Ahexk2XB3/5mRt29e8PKlfBv/+Z0VJIilMBFIhQMllBYWER5eRM8ngqyszOPfA7kihWm\nzr1nDzz/PFxwQeKClbSgBC4SgWCwhJEjFxEOFxy4Fg6bk9l/kcS/+w5yc83k5NSpcOONan8XW6gG\nLhKBwsKiWskbIBwuIBBYfPDC3r3w6KPm4OBWrUyd+5ZblLzFNhqBi0SgvLzuPyplZY1NnTsYNF2U\nHTqYFSannZbgCCMXdSlIXEsJXCQCHk9FnddPrfgOBg6ETZugsBAGDEhwZNGJqhQkrqcSikgEsrMz\n8XpzD/x+HD8yp2V3Hv9wrkngpaWuT94QYSlIkoZG4CIR2D86nV2YS7/PVnPTphA7z++H539ehV//\n2uHoInfEUpAkHSVwkQgNbl7F4G8XQOtW8OIyju/a1emQona4UlBGRmWCI5F4UAlFpD6ffQZXXGGW\nA06aBMXF5kDhJHRoKQjA680hK6u/QxFJQ2g/cJHD2b0bHnoInnjCHLAwejQcfbTTUTVYMFhCILCY\nsrLGZGRUkpXVXxOYLhHtfuBK4CKHsiz4619hwgRzePCf/gS/+53TUUkasOtINZH08N57pv193z6z\nh0mvXk5HJHJYqoGLAHzzDdx0EwwdCrffDsuXK3mL62kEnsLUcReB8nJ47DGzN/ett5r292OPdToq\nkYgogacoddzVw7Jg/nwzMXnmmfDOO9C+vdNRiURFk5gpyu/Po6hoah3XJ7Fw4QMOROQia9aYVSWb\nN5vRd//ELaHTtyI5Ek1iCqCOuzpt2wZTppjJycmT4Y47oEni/gjoW5HEmyYxU5Q67mqoqIDZs83p\n75YF69fDPfckNHmD9iGR+IslgTcC5gBvAiVAh7hGJHGhjrtqS5bAWWfBvHnm9uOPwwknOBKKvhVJ\nvMUyBMkEmgO/By4CCoAr4hmUNNz+r+SBwKQaHXcD0uerejgMY8aYXQKnTzfLAx0+QFjfiiTeYkng\nPwMtMYX2lsDeuEYkcTN4cJ/0Sdj77dwJDz4ITz1lEvjcuZCR4XRUgPlWFA7n1iqjmG9F7t+GVtwp\nlgS+DMgANgAnAEPiGpFILKqq4LnnICfHrCopLYXWrZ2Oqpa0/1YkcRfLd8ocTAklF/gd8DpwJrVH\n4taUKVMO/OLz+fD5fLFHKXIk77wD2dnQqJE5Fee885yOSCQioVCIUCh04Pf77rsPbN7MqgD4CXgY\nk8jXAKdjSiv7aR242O/rr82GU8XFZsOpa681SVwkSUW7DjyWT/s0oCfwBrAEmEjt5C1ir7IyKCgw\ne3K3bWva30eMUPKWtBNLDXw7cFm8AxGpl2WZ5YBjxsA555idA08+2emoRByjTkxJDqWlZpvXrVth\nzhzo29fpiEQcp++c4m4//AB33mlWlgwfDitXKnmLVFMCF3fatw9mzYJOnaBZM9P+nuC9S0TcTn8a\nxH0WLTK7BbZpA0uXwumnOx2RiCspgYt7fPwxjBplVpXMnAmDBzve/i7iZiqhiPN27ICxY+H886FP\nH7Nf98UXK3mL1EMJXJxTVQXPPGO2ed22zSTusWPB43E6MpGkoBKKOGPZMrMs0OOBf/wDzj3X6YhE\nko4SuCTWl1/C+PHw5pvw8MNw9dUqlYjESCUUSYw9e+D++6FbNzj1VLMs8JprlLxFGkAjcLGXZcGL\nL5rads+esGKF2b9ERBpMCVzss2qVqXPv3Gn26u6jfa9F4kklFIm/77+H22+HgQPh+uvh/feVvEVs\noAQu8bN3L8yYAWecAS1amIac226Dxjq0V8QOKqFIfLz6qml/b9/erDDp0MHpiERSnhK4NMyGDab9\n/dNPTfv7oEFORySSNlRCkdhs324S9wUXHDxEWMlbJKGUwCU6lZXwX/9l2t9374a1a03ppFkzpyMT\nSTsqoUjkli41ywJbtoTXXjNNOQkQDJZQWFhEeXkTPJ4KsrMzGTxYq1pElMClfps2mUacd9+FadPg\niisS1kEZDJYwcuQiwuGCA9fC4VwAJXFJeyqhuFgwWILfn4fPl4/fn0cwWJLYAHbvhsmTzQHCnTub\n9vcrr0xo+3thYVGt5A0QDhcQCCxOWAwibqURuEs5OvK0LJg712w61aeP6ahs08be1zyM8vK6P6Jl\nZVpbLqIRuEs5NvJ8/334/e9NQ84LL8DzzzuWvAE8noo6r2dkVCY4EhH3UQJ3qYSPPL/9Fm6+GS65\nBG691dS7e/d2vIyTnZ2J15tb65rXm0NWVv+ExiHiRrGWUCYCQ4CmwOPAs3GLSIAEjjzLy83p7488\nArfcYhpzjj0WcMcE4v7XCQQmUVbWmIyMSrKyBmgCUyRGPmB+9e3mwH113MeShlmwYKnl9eZYpiBt\nfrzeidaCBUvj8wJVVZY1f75ltW9vWUOGWNbGjb+4S2Zmbq3X3//j9+fFJwYRqQWwoknGsYzAM4EP\ngZeBY4GxMTyH1MPWkee6dab55ssvYfZsyMys826aQBRxt1gS+K+BNsDFwCmY0XjHeAYlxuDBfeJb\nKti2DfLzzQqTSZPgzjuhadPD3l0TiCLuFksC/wFYD1QAG4Ey4MTq6wfk5+cfuO3z+fD5fLHGKA1V\nUQFPPWWS9+WXm/XcJ55Y78OyszMJh3Nr1cDNBOIAG4MVSR+hUIhQKBTz42PpyBgMjMSUUloDS4HT\nqF27qS7niONefx3++EeTsB97DLp0ierhwWAJgcDiGmWc/ppAFLHJUaZJLuK8HGtL3cNAX8wyxInA\noYuTlcCd9umnpv191Sp49FG47DIdICzicolK4PVRAnfKrl3w0EPw5z+b7V5HjYKMDKejEpEIRJvA\n1ciTKqqqzMHBHTua1SWrV0NOjpK3SArTXiipYPlys82rZcFLL0HPnk5HJCIJoBF4Mtu8GW64AYYN\ng7vugrffVvIWSSNK4MmorMzUubt0gd/+1rS//+EP0Ej/O0XSiUooycSy4OWXYfRo
OOsss+HUKac4\nHZWIOEQJPFl8+KFZz/3996Ypp18/pyMSEYfpO7fbbd0Kd98NF11kuihXrVLyFhFACdy99u2DQAA6\ndYLGjU37+113QRN9aRIRQ9nAjRYvNuWS1q2huBjOOMPpiETEhZTA3eSTT8wE5dq15kizIUPU/i4i\nh6USihv89JM5QLhnT+jd2yTwSy5R8haRI1ICd1JVFfz3f5v29y1bzEqTcePA43E6MhFJAiqhOOWt\nt0z7e9Om8Mor0L270xGJSJJRAk+0r74y5ZKSEnj4YbjmGpVKRCQmriuhBIMl+P15+Hz5+P15BIMl\nTocUHz//DA88AF27mu7JDRvg2muVvEUkZq4agQeDJYwcuajWEV7hcC5A8p4Cs3+HwLFjTZlkxQpo\n187pqEQkBbjqQAe/P4+ioql1XJ/EwoUPxCOuxPrgA7Oee/t2mDULLrzQ6YhExMWiPdDBVSPw8vK6\nwykraxzxcwSDJRQWFlFe3gSPp4Ls7MyoR+8Nfo4tWyAvz0xO3n8/3HKL6aYUEYkjVyVwj6eizusZ\nGZURPT4eJZgGPcfevTB7Njz4IIwYYercxx0X0euKiLiFFYsFC5ZaXm+OZQrH5sfrnWgtWLA0osdn\nZubWeuz+H78/L+IYYn6O116zrA4dLMvvt6x16yJ+PRGR/YCoas+uGoHvH+EGApMoK2tMRkYlWVkD\nIh49x6MEE/VzbNxoDg7euBFmzoRBg7SyREQSwlUJHEwSj3XFSUNLMFE9x44dZlngs8/ChAkwbx40\naxbx64iINJTr1oE3RHZ2Jl5vbq1rXm8OWVn94/cclZXw9NOm/X3HDlizxmxApeQtIgnmqmWE8RAM\nlhAILK5Rgukf0yqUOp/jjTdM+3vz5mZZ4Nln2/RfISLpKNplhA1J4L8BVgD9gI2H/DvHErgtvvjC\nbDL19tvwyCNw1VWqc4tI3EWbwGMtoTQF/gzsjvHxyWHPHsjPNyPtTp3MqTjDhyt5i4grxJrApwFP\nAN/EMRb3sCx44QVT5/7oI1i5EqZMgWOOcToyEZEDYlmFciOwBSgCJmJfHd0ZK1aYOveePfD883DB\nBU5HJCJSp1gS+E2YxeYXAWcBzwJDge9q3ik/P//AbZ/Ph8/nizXGxPjuO8jNhWAQpk6FG29U+7uI\n2CoUChEKhWJ+fENHz8XAf5DMk5h790JhIfzpT3DTTWYPk5YtnY5KRNJQUm9mlVCWZUbbo0ZBhw7m\nhJzTTnM6KhGRiKXcOvCIrF8P994LmzaZ9vcBA5yOSEQkYcsIk9OPP5r9ufv0gYEDobRUyVtEklZ6\nJPDKSnjySbMssLwc1q07eKCwiEiSSv0aeChkknWrVlBUZM6kFBFJAambwD/7zJxDuWIFPPooDBum\nDkoRSSmpV0LZtcssBezeHbp1M+WSyy9X8haRlJM6CbyqCv73f02d+/PPYfVq05hz9NFORyYiYovU\nKKG8+66pc1dUwP/9H/Tq5XREIiK2S+4R+DffmO7Jyy6DO+6A5cuVvEUkbSRnAi8rM63vnTvDv/yL\nOf39hhugUXL+54iIxCK5SiiWBfPnm/b3zp3NiNvrdToqERFHJE8CX7PGtL9v3myacvpHfs6lOCsY\nLKGwsIjy8iZ4PBVkZ2fGfHC1iBzk/gS+bZs5TOFvf4PJk02tu4n7wxYjGCxh5MhFhMMFB66Fw+bQ\naCVxkYZxb9G4ogJmzzbLAi3LbEB1zz1K3kmmsLCoVvIGCIcLCAQWOxSRSOpwZzZcssQsCzzpJHO7\nc2enI5IYlZfX/RErK9NhGSIN5a4EHg7DmDFml8Dp02HoUHVQJjmPp6LO6xkZlQmORCT1uKOEsnMn\nTJwIPXqYn7Vr4dJLlbxTQHZ2Jl5vbq1rXm8OWVmahBZpKGdH4FVV8NxzkJNjVpWUlkLr1o6GJPG1\nf6IyEJhEWVljMjIqycoaoAlMkThw7kSed96B7GzTfFNYCOedZ1MoIiLJwf1nYn79NUyYAMXFppvy\n2mvVQSkiEoPEZc6ff4aCAnOgQtu2pv19xAglbxGRGNk/ArcsmDfPrC455xx47z04+WTbX1ZEJNXZ\nm8BLS8167q1bYc4c6NvX1pcTEUkn9tUv7rzTrCwZPhxWrlTyFhGJM/tG4B6PqXO3amXbS4iIpLNY\nlhE2BeYAbQEPMBX4xyH3qX8ZoYiI1JKIZYTXAVuA64FWwAf8MoGLiIjNYhmBN69+3C7gBOBd4NBT\nFTQCFxGJUiJG4Lur/9kCeBHIPcJ9RUTEJrFOYrYB5gGzgRfqukN+fv6B2z6fD5/PF+NLiYikplAo\nRCgUivnxsZRQTgJCwF1A8WHuoxKKiEiUoi2hxJLAZwFXAh/VuDYQKKvxuxK4iEiUEpHAI6EELiIS\npWgTuHaSEhFJUrYlcL8/j2CwxK6nFxFJe7a10hcVTSUcNisMdfqKiEj82VpCCYcLCAQW2/kSIiJp\ny/YaeFlZY7tfQkQkLdmewDMyKu1+CRGRtGRrAvd6c8jK6m/nS4iIpC3bJjH9/klkZQ3QBKaIiE3U\nyCMi4hKJ2I1QIhAMllBYWER5eRM8ngqyszP1bURE4koJ3AbBYAkjRy4iHC44cE1r4kUk3tRKb4PC\nwqJayRu0Jl5E4k8J3Abl5XV/sdGaeBGJJyVwG3g8FXVe15p4EYknJXAbZGdn4vXWPmlOa+JFJN60\njNAmwWAJgcBiysoak5FRSVZWf01gisgR6UAHEZEkpQMdRETShBK4iEiSUgIXEUlSSuAiIklKCVxE\nJEkpgYuIJKlYEngj4EngLaAY8MY1IhERiUgsCfxSoBnQC5gATI9rRPILoVDI6RBSht7L+NL76axY\nEnhvYGH17eXAufELR+qiPyTxo/cyvvR+OiuWBH4s8FON3ytjfB4REWmAWBLvT0CLQ56jKj7hiIhI\npGLZC2UYMAS4CegJTAIGH3KfT9DkpohItMJAeztf4CjgCWBZ9c9pdr6YiIiIiIiIiEhqUYNP/K3E\nvJfFwDMOx5KsemDePzC1xTeBEuA/sW8//FRW8/3sBnzFwc/oVU4FlYSaAs9hPovLMfOKjn4+hwFz\nqm/3AF5O5IunoAxMApfYjQNKMYMKgPnA/mORnsA0pUnkDn0/bwVGORdOUrsRmFF9uxXwBfAKDn4+\np1P7b+CvEvniKagHsAFYBCyp/l2iMwwzqnm7+vean8lLgMcTHlFyO/T9fALz2VwKPA38yqG4klFz\nDr5fJ2BWn3xZ499H9PmMZwOOGnziazcwDfADdwDPo/czWvOAihq/1/xKugtomdhwkt6h7+dyYAxw\nIfApMMWJoJLUbsxnsAXwIpBH7T/fEX0+45kQ1OATXxsxSRvgY2Ar8K/OhZMSan4eWwDbnQokRfwd\nWFV9+2VMTVwi1wZ4HfgfYC4xfD7jmcCXAYOqb/fE1MokdjdxcKOw1phvON84F05KWIUZLQIMxEwW\nSewWAt2rb/cD3ncwlmRzElCEmVf4S/U
1Rz+favCJryYcnKUuwfylKNFrx8FJt1OBUPXvT6NVKLFo\nx8H3sytm1UQx8FdUA4/GLGAzB1fwFANd0OdTRERERERERERERERERERERERERERERERExH3+H2Ek\nJ0Y+fw0dAAAAAElFTkSuQmCC\n", "text": [ "" ] } ], "prompt_number": 12 }, { "cell_type": "markdown", "metadata": {}, "source": [ "You'll notice that in order to plot the regression line what we actually do is\n", "plot a line with the appropriate slope and intercept by:\n", "\n", "1. taking the minimum and maximum values of x\n", "2. calculating their values of y based on the regression results\n", "3. and plotting those two points with a straight line connecting them and no symbols" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Multiple-regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Multiple-regression works the same way, but with additional terms in the formula." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "\n", "#generate some data\n", "x = np.random.randn(50)\n", "z = np.random.randn(50)\n", "noise = np.random.randn(50)\n", "y = 3 + 0.5 * x + 1.5 * z + noise\n", "\n", "data = pd.DataFrame(zip(x, y, z), columns=['x', 'y', 'z'])\n", "results = smf.ols('y ~ x + z', data).fit()\n", "print results.summary()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: y R-squared: 0.765\n", "Model: OLS Adj. R-squared: 0.755\n", "Method: Least Squares F-statistic: 76.36\n", "Date: Sat, 04 Oct 2014 Prob (F-statistic): 1.72e-15\n", "Time: 11:41:12 Log-Likelihood: -65.717\n", "No. Observations: 50 AIC: 137.4\n", "Df Residuals: 47 BIC: 143.2\n", "Df Model: 2 \n", "==============================================================================\n", " coef std err t P>|t| [95.0% Conf. Int.]\n", "------------------------------------------------------------------------------\n", "Intercept 2.8254 0.133 21.294 0.000 2.558 3.092\n", "x 0.5812 0.124 4.699 0.000 0.332 0.830\n", "z 1.4573 0.123 11.868 0.000 1.210 1.704\n", "==============================================================================\n", "Omnibus: 2.028 Durbin-Watson: 1.739\n", "Prob(Omnibus): 0.363 Jarque-Bera (JB): 1.936\n", "Skew: 0.452 Prob(JB): 0.380\n", "Kurtosis: 2.663 Cond. No. 1.20\n", "==============================================================================\n" ] } ], "prompt_number": 13 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This makes it easy to do more complicated designs, including interactions." ] }, { "cell_type": "code", "collapsed": false, "input": [ "results = smf.ols('y ~ x + z + x*z', data).fit()\n", "print results.summary()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: y R-squared: 0.766\n", "Model: OLS Adj. R-squared: 0.751\n", "Method: Least Squares F-statistic: 50.17\n", "Date: Sat, 04 Oct 2014 Prob (F-statistic): 1.52e-14\n", "Time: 11:41:12 Log-Likelihood: -65.585\n", "No. Observations: 50 AIC: 139.2\n", "Df Residuals: 46 BIC: 146.8\n", "Df Model: 3 \n", "==============================================================================\n", " coef std err t P>|t| [95.0% Conf. 
Int.]\n", "------------------------------------------------------------------------------\n", "Intercept 2.8336 0.135 21.021 0.000 2.562 3.105\n", "x 0.5671 0.128 4.432 0.000 0.310 0.825\n", "z 1.4522 0.124 11.690 0.000 1.202 1.702\n", "x:z 0.0541 0.110 0.493 0.624 -0.167 0.275\n", "==============================================================================\n", "Omnibus: 1.898 Durbin-Watson: 1.788\n", "Prob(Omnibus): 0.387 Jarque-Bera (JB): 1.844\n", "Skew: 0.425 Prob(JB): 0.398\n", "Kurtosis: 2.596 Cond. No. 1.46\n", "==============================================================================\n" ] } ], "prompt_number": 14 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also include transformations in formulas using functions from Numpy." ] }, { "cell_type": "code", "collapsed": false, "input": [ "results = smf.ols('y ~ x + np.log10(z)', data).fit()\n", "print results.summary()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: y R-squared: 0.336\n", "Model: OLS Adj. R-squared: 0.278\n", "Method: Least Squares F-statistic: 5.825\n", "Date: Sat, 04 Oct 2014 Prob (F-statistic): 0.00898\n", "Time: 11:41:12 Log-Likelihood: -41.444\n", "No. Observations: 26 AIC: 88.89\n", "Df Residuals: 23 BIC: 92.66\n", "Df Model: 2 \n", "===============================================================================\n", " coef std err t P>|t| [95.0% Conf. Int.]\n", "-------------------------------------------------------------------------------\n", "Intercept 4.5062 0.261 17.288 0.000 3.967 5.045\n", "x 0.6071 0.232 2.616 0.015 0.127 1.087\n", "np.log10(z) 2.2221 0.786 2.826 0.010 0.596 3.849\n", "==============================================================================\n", "Omnibus: 2.703 Durbin-Watson: 1.307\n", "Prob(Omnibus): 0.259 Jarque-Bera (JB): 1.915\n", "Skew: 0.664 Prob(JB): 0.384\n", "Kurtosis: 2.960 Cond. No. 3.64\n", "==============================================================================\n" ] } ], "prompt_number": 15 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ANOVA\n", "Using formulas also makes it easy to conduct statistical tests that are based on linear models.\n", "ANOVA is simply a linear model with appropriate dummy variables set up for each factor.\n", "To do this in statsmodels we simply use ``C()`` to tell the module that the\n", "variable of interest is categorical.\n", "\n", "This time, let's start by grabbing some data from the web." ] }, { "cell_type": "code", "collapsed": false, "input": [ "url = 'http://stats191.stanford.edu/data/rehab.csv'\n", "rehab_table = pd.read_table(url, delimiter=\",\")\n", "rehab_table" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>Fitness</th><th>Time</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>29</td></tr>\n",
"    <tr><th>1</th><td>1</td><td>42</td></tr>\n",
"    <tr><th>2</th><td>1</td><td>38</td></tr>\n",
"    <tr><th>3</th><td>1</td><td>40</td></tr>\n",
"    <tr><th>4</th><td>1</td><td>43</td></tr>\n",
"    <tr><th>5</th><td>1</td><td>40</td></tr>\n",
"    <tr><th>6</th><td>1</td><td>30</td></tr>\n",
"    <tr><th>7</th><td>1</td><td>42</td></tr>\n",
"    <tr><th>8</th><td>2</td><td>30</td></tr>\n",
"    <tr><th>9</th><td>2</td><td>35</td></tr>\n",
"    <tr><th>10</th><td>2</td><td>39</td></tr>\n",
"    <tr><th>11</th><td>2</td><td>28</td></tr>\n",
"    <tr><th>12</th><td>2</td><td>31</td></tr>\n",
"    <tr><th>13</th><td>2</td><td>31</td></tr>\n",
"    <tr><th>14</th><td>2</td><td>29</td></tr>\n",
"    <tr><th>15</th><td>2</td><td>35</td></tr>\n",
"    <tr><th>16</th><td>2</td><td>29</td></tr>\n",
"    <tr><th>17</th><td>2</td><td>33</td></tr>\n",
"    <tr><th>18</th><td>3</td><td>26</td></tr>\n",
"    <tr><th>19</th><td>3</td><td>32</td></tr>\n",
"    <tr><th>20</th><td>3</td><td>21</td></tr>\n",
"    <tr><th>21</th><td>3</td><td>20</td></tr>\n",
"    <tr><th>22</th><td>3</td><td>23</td></tr>\n",
"    <tr><th>23</th><td>3</td><td>22</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>24 rows \u00d7 2 columns</p>\n",
"</div>
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 16, "text": [ " Fitness Time\n", "0 1 29\n", "1 1 42\n", "2 1 38\n", "3 1 40\n", "4 1 43\n", "5 1 40\n", "6 1 30\n", "7 1 42\n", "8 2 30\n", "9 2 35\n", "10 2 39\n", "11 2 28\n", "12 2 31\n", "13 2 31\n", "14 2 29\n", "15 2 35\n", "16 2 29\n", "17 2 33\n", "18 3 26\n", "19 3 32\n", "20 3 21\n", "21 3 20\n", "22 3 23\n", "23 3 22\n", "\n", "[24 rows x 2 columns]" ] } ], "prompt_number": 16 }, { "cell_type": "markdown", "metadata": {}, "source": [ "To see if the time to rehabilitate an injury is related to the fitness category we do the same as above,\n", "but wrapping the predictor in ``C()``." ] }, { "cell_type": "code", "collapsed": false, "input": [ "results = smf.ols('Time ~ C(Fitness)', rehab_table).fit()\n", "print results.summary()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: Time R-squared: 0.618\n", "Model: OLS Adj. R-squared: 0.581\n", "Method: Least Squares F-statistic: 16.96\n", "Date: Sat, 04 Oct 2014 Prob (F-statistic): 4.13e-05\n", "Time: 11:41:13 Log-Likelihood: -68.286\n", "No. Observations: 24 AIC: 142.6\n", "Df Residuals: 21 BIC: 146.1\n", "Df Model: 2 \n", "===================================================================================\n", " coef std err t P>|t| [95.0% Conf. Int.]\n", "-----------------------------------------------------------------------------------\n", "Intercept 38.0000 1.574 24.149 0.000 34.728 41.272\n", "C(Fitness)[T.2] -6.0000 2.111 -2.842 0.010 -10.390 -1.610\n", "C(Fitness)[T.3] -14.0000 2.404 -5.824 0.000 -18.999 -9.001\n", "==============================================================================\n", "Omnibus: 0.163 Durbin-Watson: 2.209\n", "Prob(Omnibus): 0.922 Jarque-Bera (JB): 0.211\n", "Skew: -0.163 Prob(JB): 0.900\n", "Kurtosis: 2.675 Cond. No. 3.80\n", "==============================================================================\n" ] } ], "prompt_number": 17 }, { "cell_type": "markdown", "metadata": {}, "source": [ "While all of the information that we want is technically present in this table,\n", "we typically want it presented in more standard fashion for ANOVA.\n", "We can do this using the ``anova_lm`` function." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from statsmodels.stats.anova import anova_lm\n", "\n", "anova_lm(results)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>df</th><th>sum_sq</th><th>mean_sq</th><th>F</th><th>PR(&gt;F)</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>C(Fitness)</th><td>2</td><td>672</td><td>336.000000</td><td>16.961538</td><td>0.000041</td></tr>\n",
"    <tr><th>Residual</th><td>21</td><td>416</td><td>19.809524</td><td>NaN</td><td>NaN</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>2 rows \u00d7 5 columns</p>\n",
"</div>
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 18, "text": [ " df sum_sq mean_sq F PR(>F)\n", "C(Fitness) 2 672 336.000000 16.961538 0.000041\n", "Residual 21 416 19.809524 NaN NaN\n", "\n", "[2 rows x 5 columns]" ] } ], "prompt_number": 18 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### And lots more\n", "\n", "Statsmodels includes much more advanced functionality including:\n", "\n", "* Generalized Least Squares (i.e., correlated error structures such as for spatial and comparative analysis)\n", "* Generalized Linear Models (i.e., non-normal error)\n", "* Robust Linear Models\n", "* Regression with Discrete Dependent Variable (e.g., logistic regression)\n", "* Time Series analysis" ] } ], "metadata": {} } ] }