{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#### New to Plotly?\n", "Plotly's Python library is free and open source! [Get started](https://plotly.com/python/getting-started/) by downloading the client and [reading the primer](https://plotly.com/python/getting-started/).\n", "
You can set up Plotly to work in [online](https://plotly.com/python/getting-started/#initialization-for-online-plotting) or [offline](https://plotly.com/python/getting-started/#initialization-for-offline-plotting) mode, or in [Jupyter notebooks](https://plotly.com/python/getting-started/#start-plotting-online).\n", "
We also have a quick-reference [cheatsheet](https://images.plot.ly/plotly-documentation/images/python_cheat_sheet.pdf) (new!) to help you get started!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Imports\n", "The tutorial below imports [NumPy](http://www.numpy.org/), [Pandas](https://plotly.com/pandas/intro-to-pandas-tutorial/), [SciPy](https://www.scipy.org/), and [Statsmodels](http://statsmodels.sourceforge.net/stable/)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import plotly.plotly as py\n", "import plotly.graph_objs as go\n", "from plotly.tools import FigureFactory as FF\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import scipy\n", "\n", "import statsmodels\n", "import statsmodels.api as sm\n", "from statsmodels.formula.api import ols" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### One-Way ANOVA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An `Analysis of Variance` test, or `ANOVA`, generalizes the t-test to more than two groups. The null hypothesis states that the populations from which the groups of data were sampled all have the same mean. More succinctly:\n", "\n", "$$\n", "\\begin{align*}\n", "\\mu_1 = \\mu_2 = \\dots = \\mu_n\n", "\\end{align*}\n", "$$\n", "\n", "for $n$ groups of data. The alternative hypothesis is that at least one of the equalities in the above equation fails to hold." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " sum_sq df F \\\n", "C(fcategory, Sum) 11.614700 2.0 0.276958 \n", "C(partner_status, Sum) 212.213778 1.0 10.120692 \n", "C(fcategory, Sum):C(partner_status, Sum) 175.488928 2.0 4.184623 \n", "Residual 817.763961 39.0 NaN \n", "\n", " PR(>F) \n", "C(fcategory, Sum) 0.759564 \n", "C(partner_status, Sum) 0.002874 \n", "C(fcategory, Sum):C(partner_status, Sum) 0.022572 \n", "Residual NaN \n" ] } ], "source": [ "moore = sm.datasets.get_rdataset(\"Moore\", \"car\", cache=True)\n", "\n", "data = moore.data\n", "data = data.rename(columns={\"partner.status\": \"partner_status\"}) # make name pythonic\n", "\n", "moore_lm = ols('conformity ~ C(fcategory, Sum)*C(partner_status, Sum)', data=data).fit()\n", "table = sm.stats.anova_lm(moore_lm, typ=2) # Type 2 ANOVA DataFrame\n", "\n", "print(table)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ANOVA table reports an `F-Statistic` for each term together with its `p-value`, shown in the `PR(>F)` column. The two are closely connected, since they are two ways of expressing the same result. When we set a `significance level` at the start of a statistical test (usually 0.05), we are saying that if the test statistic falls in the most extreme 5% of its distribution under the null hypothesis, then we can start to make the case that there is evidence against the null.\n", "\n", "The p-value is the area under the curve of the F distribution to the right of the observed F value. Therefore:\n", "\n", "$$\n", "\\begin{align*}\n", "Pr(>F) = p\n", "\\end{align*}\n", "$$\n", "\n", "In the table above, `partner_status` ($p \\approx 0.003$) and the interaction term ($p \\approx 0.023$) fall below the 0.05 level, while `fcategory` on its own ($p \\approx 0.76$) does not.\n", "\n", "For more information on the choice of 0.05 for a significance level, check out [this page](http://www.investopedia.com/exam-guide/cfa-level-1/quantitative-methods/hypothesis-testing.asp)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we load a dataset on tooth growth for the following analysis:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/tooth_growth_csv')\n", "df = data[0:10] # first ten rows, for display only\n", "\n", "table = FF.create_table(df)\n", "py.iplot(table, filename='tooth-data-sample')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Two-Way ANOVA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a `Two-Way ANOVA`, there are two factors to consider. The question is whether the response variable (tooth length `len`) depends on the two factors `supp` and `dose`, and on their interaction, according to the model:\n", "\n", "$$\n", "\\begin{align*}\n", "\\text{len} = \\text{supp} + \\text{dose} + \\text{supp} \\times \\text{dose}\n", "\\end{align*}\n", "$$" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " sum_sq df F PR(>F)\n", "C(supp) 205.350000 1.0 15.571979 2.311828e-04\n", "C(dose) 2426.434333 2.0 91.999965 4.046291e-18\n", "C(supp):C(dose) 108.319000 2.0 4.106991 2.186027e-02\n", "Residual 712.106000 54.0 NaN NaN\n" ] } ], "source": [ "formula = 'len ~ C(supp) + C(dose) + C(supp):C(dose)'\n", "model = ols(formula, data=data).fit()\n", "aov_table = sm.stats.anova_lm(model, typ=2) # Type 2 ANOVA DataFrame\n", "print(aov_table)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.10" } }, "nbformat": 4, "nbformat_minor": 0 }