{ "metadata": { "name": "", "signature": "sha256:d48ca7165bce5adbf6b30594f316a588aeadb01fd81e0b38b6374f24583b8e45" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Formulas: Fitting models using R-style formulas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since version 0.5.0, ``statsmodels`` allows users to fit statistical models using R-style formulas. Internally, ``statsmodels`` uses the [patsy](http://patsy.readthedocs.org/) package to convert formulas and data to the matrices that are used in model fitting. The formula framework is quite powerful; this tutorial only scratches the surface. A full description of the formula language can be found in the ``patsy`` docs: \n", "\n", "* [Patsy formula language description](http://patsy.readthedocs.org/)\n", "\n", "## Loading modules and functions" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from __future__ import print_function\n", "import numpy as np\n", "import statsmodels.api as sm" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Import convention" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can import explicitly from statsmodels.formula.api" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from statsmodels.formula.api import ols" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, you can just use the `formula` namespace of the main `statsmodels.api`." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "sm.formula.ols" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or you can use the following convention:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import statsmodels.formula.api as smf" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These names are just a convenient way to get access to each model's `from_formula` classmethod. See, for instance:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "sm.OLS.from_formula" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All of the lower case models accept ``formula`` and ``data`` arguments, whereas upper case ones take ``endog`` and ``exog`` design matrices. ``formula`` accepts a string which describes the model in terms of a ``patsy`` formula. ``data`` takes a [pandas](http://pandas.pydata.org/) data frame or any other data structure that defines a ``__getitem__`` for variable names, such as a structured array or a dictionary of variables. \n", "\n", "``dir(sm.formula)`` lists the available models. \n", "\n", "Formula-compatible models have the following generic call signature: ``(formula, data, subset=None, *args, **kwargs)``" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## OLS regression using formulas\n", "\n", "To begin, we fit the linear model described on the [Getting Started](gettingstarted.html) page. 
Download the data, subset columns, and list-wise delete to remove missing observations:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "dta = sm.datasets.get_rdataset(\"Guerry\", \"HistData\", cache=True)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "df = dta.data[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()\n", "df.head()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fit the model:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "mod = ols(formula='Lottery ~ Literacy + Wealth + Region', data=df)\n", "res = mod.fit()\n", "print(res.summary())" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Categorical variables\n", "\n", "Looking at the summary printed above, notice that ``patsy`` determined that elements of *Region* were text strings, so it treated *Region* as a categorical variable. `patsy` also includes an intercept by default, so one of the *Region* categories was dropped automatically to serve as the reference level.\n", "\n", "If *Region* had been an integer variable that we wanted to treat explicitly as categorical, we could have done so by using the ``C()`` operator: " ] }, { "cell_type": "code", "collapsed": false, "input": [ "res = ols(formula='Lottery ~ Literacy + Wealth + C(Region)', data=df).fit()\n", "print(res.params)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Patsy's more advanced features for categorical variables are discussed in: [Patsy: Contrast Coding Systems for categorical variables](contrasts.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Operators\n", "\n", "We have already seen that \"~\" separates the left-hand side of the model from the right-hand side, and that \"+\" adds new columns to the design matrix. 
\n", "\n", "### Removing variables\n", "\n", "The \"-\" sign can be used to remove columns/variables. For instance, we can remove the intercept from a model by: " ] }, { "cell_type": "code", "collapsed": false, "input": [ "res = ols(formula='Lottery ~ Literacy + Wealth + C(Region) -1 ', data=df).fit()\n", "print(res.params)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multiplicative interactions\n", "\n", "\":\" adds a new column to the design matrix with the interaction of the other two columns. \"*\" will also include the individual columns that were multiplied together:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "res1 = ols(formula='Lottery ~ Literacy : Wealth - 1', data=df).fit()\n", "res2 = ols(formula='Lottery ~ Literacy * Wealth - 1', data=df).fit()\n", "print(res1.params, '\\n')\n", "print(res2.params)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Many other things are possible with operators. Please consult the [patsy docs](https://patsy.readthedocs.org/en/latest/formulas.html) to learn more." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Functions\n", "\n", "You can apply vectorized functions to the variables in your model: " ] }, { "cell_type": "code", "collapsed": false, "input": [ "res = smf.ols(formula='Lottery ~ np.log(Literacy)', data=df).fit()\n", "print(res.params)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define a custom function:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def log_plus_1(x):\n", "    return np.log(x) + 1.\n", "\n", "res = smf.ols(formula='Lottery ~ log_plus_1(Literacy)', data=df).fit()\n", "print(res.params)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Any function that is in the calling namespace is available to the formula." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using formulas with models that do not (yet) support them\n", "\n", "Even if a given `statsmodels` function does not support formulas, you can still use `patsy`'s formula language to produce design matrices. Those matrices \n", "can then be fed to the fitting function as `endog` and `exog` arguments. 
\n", "\n", "To generate ``numpy`` arrays: " ] }, { "cell_type": "code", "collapsed": false, "input": [ "import patsy\n", "f = 'Lottery ~ Literacy * Wealth'\n", "y, X = patsy.dmatrices(f, df, return_type='matrix')\n", "print(y[:5])\n", "print(X[:5])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To generate pandas data frames: " ] }, { "cell_type": "code", "collapsed": false, "input": [ "f = 'Lottery ~ Literacy * Wealth'\n", "y, X = patsy.dmatrices(f, df, return_type='dataframe')\n", "print(y[:5])\n", "print(X[:5])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print(sm.OLS(y, X).fit().summary())" ], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }