{
 "metadata": {
  "name": "",
  "signature": "sha256:d48ca7165bce5adbf6b30594f316a588aeadb01fd81e0b38b6374f24583b8e45"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Formulas: Fitting models using R-style formulas"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Since version 0.5.0, ``statsmodels`` allows users to fit statistical models using R-style formulas. Internally, ``statsmodels`` uses the [patsy](http://patsy.readthedocs.org/) package to convert formulas and data to the matrices that are used in model fitting. The formula framework is quite powerful; this tutorial only scratches the surface. A full description of the formula language can be found in the ``patsy`` docs: \n",
      "\n",
      "* [Patsy formula language description](http://patsy.readthedocs.org/)\n",
      "\n",
      "## Loading modules and functions"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from __future__ import print_function\n",
      "import numpy as np\n",
      "import statsmodels.api as sm"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Import convention"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "You can import a model explicitly from ``statsmodels.formula.api``:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from statsmodels.formula.api import ols"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Alternatively, you can use the `formula` namespace of the main `statsmodels.api`:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "sm.formula.ols"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Or you can use the following convention:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import statsmodels.formula.api as smf"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "These names are just a convenient way to access each model's `from_formula` classmethod. See, for instance:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "sm.OLS.from_formula"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The lowercase model functions (such as ``ols``) accept ``formula`` and ``data`` arguments, whereas the uppercase classes (such as ``OLS``) take ``endog`` and ``exog`` design matrices. ``formula`` accepts a string that describes the model in terms of a ``patsy`` formula. ``data`` takes a [pandas](http://pandas.pydata.org/) data frame or any other data structure that defines ``__getitem__`` for variable names, such as a structured array or a dictionary of variables. \n",
      "\n",
      "``dir(sm.formula)`` lists the available models. \n",
      "\n",
      "Formula-compatible models have the following generic call signature: ``(formula, data, subset=None, *args, **kwargs)``"
     ]
    },
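    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As a minimal sketch of the ``data`` argument, the cell below passes a plain dictionary of arrays instead of a data frame. The data are synthetic, and the names ``toy`` and ``res_toy`` are hypothetical, introduced only for this illustration:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import numpy as np\n",
      "import statsmodels.formula.api as smf\n",
      "\n",
      "# Synthetic data (an assumption for illustration): y is roughly 2 + 3*x\n",
      "rng = np.random.RandomState(0)\n",
      "x = rng.uniform(1, 10, size=50)\n",
      "toy = {'x': x, 'y': 2.0 + 3.0 * x + rng.normal(scale=0.1, size=50)}\n",
      "\n",
      "# The formula looks up the names 'y' and 'x' in the dictionary\n",
      "res_toy = smf.ols(formula='y ~ x', data=toy).fit()\n",
      "print(res_toy.params)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },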
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "\n",
      "## OLS regression using formulas\n",
      "\n",
      "To begin, we fit the linear model described on the [Getting Started](gettingstarted.html) page. Download the data, subset columns, and list-wise delete to remove missing observations:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "dta = sm.datasets.get_rdataset(\"Guerry\", \"HistData\", cache=True)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df = dta.data[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()\n",
      "df.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Fit the model:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "mod = ols(formula='Lottery ~ Literacy + Wealth + Region', data=df)\n",
      "res = mod.fit()\n",
      "print(res.summary())"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Categorical variables\n",
      "\n",
      "Looking at the summary printed above, notice that ``patsy`` determined that elements of *Region* were text strings, so it treated *Region* as a categorical variable. ``patsy`` also includes an intercept by default, so one of the *Region* categories was dropped automatically to keep the design matrix full rank.\n",
      "\n",
      "If *Region* had been an integer variable that we wanted to treat explicitly as categorical, we could have done so by using the ``C()`` operator: "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "res = ols(formula='Lottery ~ Literacy + Wealth + C(Region)', data=df).fit()\n",
      "print(res.params)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Patsy's more advanced features for categorical variables are discussed in: [Patsy: Contrast Coding Systems for categorical variables](contrasts.html)"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Operators\n",
      "\n",
      "We have already seen that \"~\" separates the left-hand side of the model from the right-hand side, and that \"+\" adds new columns to the design matrix. \n",
      "\n",
      "### Removing variables\n",
      "\n",
      "The \"-\" sign can be used to remove columns/variables. For instance, we can remove the intercept from a model by: "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "res = ols(formula='Lottery ~ Literacy + Wealth + C(Region) -1 ', data=df).fit()\n",
      "print(res.params)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Multiplicative interactions\n",
      "\n",
      "\":\" adds a new column to the design matrix containing the product of the two columns it joins. \"*\" adds that product as well as the individual columns that were multiplied together:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "res1 = ols(formula='Lottery ~ Literacy : Wealth - 1', data=df).fit()\n",
      "res2 = ols(formula='Lottery ~ Literacy * Wealth - 1', data=df).fit()\n",
      "print(res1.params, '\\n')\n",
      "print(res2.params)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Many other things are possible with operators. Please consult the [patsy docs](https://patsy.readthedocs.org/en/latest/formulas.html) to learn more."
     ]
    },
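    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "One such feature is ``I()``, which tells ``patsy`` to evaluate the enclosed expression as ordinary Python arithmetic rather than as formula operators. The cell below is a minimal sketch that uses it to add a squared term. The data are synthetic, and the names ``toy_df`` and ``res_quad`` are hypothetical, introduced only for this illustration:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import numpy as np\n",
      "import pandas as pd\n",
      "import statsmodels.formula.api as smf\n",
      "\n",
      "# Synthetic data (an assumption for illustration): y is roughly 1 + 2*x + 0.5*x**2\n",
      "rng = np.random.RandomState(0)\n",
      "toy_df = pd.DataFrame({'x': np.linspace(0, 5, 60)})\n",
      "noise = rng.normal(scale=0.1, size=60)\n",
      "toy_df['y'] = 1.0 + 2.0 * toy_df['x'] + 0.5 * toy_df['x'] ** 2 + noise\n",
      "\n",
      "# I(x**2) is computed as plain arithmetic and added as one column\n",
      "res_quad = smf.ols(formula='y ~ x + I(x**2)', data=toy_df).fit()\n",
      "print(res_quad.params)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },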
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Functions\n",
      "\n",
      "You can apply vectorized functions to the variables in your model: "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "res = smf.ols(formula='Lottery ~ np.log(Literacy)', data=df).fit()\n",
      "print(res.params)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Define a custom function:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def log_plus_1(x):\n",
      "    return np.log(x) + 1.\n",
      "res = smf.ols(formula='Lottery ~ log_plus_1(Literacy)', data=df).fit()\n",
      "print(res.params)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Any function that is in the calling namespace is available to the formula."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Using formulas with models that do not (yet) support them\n",
      "\n",
      "Even if a given `statsmodels` function does not support formulas, you can still use `patsy`'s formula language to produce design matrices. Those matrices \n",
      "can then be fed to the fitting function as `endog` and `exog` arguments. \n",
      "\n",
      "To generate ``numpy`` arrays: "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import patsy\n",
      "f = 'Lottery ~ Literacy * Wealth'\n",
      "# return_type='matrix' yields numpy-based design matrices\n",
      "y, X = patsy.dmatrices(f, df, return_type='matrix')\n",
      "print(y[:5])\n",
      "print(X[:5])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "To generate pandas data frames: "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "f = 'Lottery ~ Literacy * Wealth'\n",
      "y,X = patsy.dmatrices(f, df, return_type='dataframe')\n",
      "print(y[:5])\n",
      "print(X[:5])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print(sm.OLS(y, X).fit().summary())"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}