{ "metadata": { "name": "", "signature": "sha256:01f649fcdf6c5cf3869fb9023ad91d38fa92857e406da36c6cc98be773a3d5fc" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Statistics in Python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are two major modules for doing statistical analyses in Python:\n", "\n", "* [Scipy](http://docs.scipy.org/doc/scipy/reference/stats.html) - basic statistics and distribution fitting\n", "* [Statsmodels](http://statsmodels.sourceforge.net/stable/index.html) - advanced statistical modeling focused on linear models\n", "(including ANOVA, multiple regression, generalized linear models, etc.)\n", "\n", "To see the full functionality of these modules you'll need to look through\n", "their pages, but here are a few examples to show you that a lot of the\n", "standard statistical tests and models you need to perform can be easily\n", "done using Python." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Imports\n", "-------\n", "You'll want the `stats` module from Scipy and the `api` and `formula.api`\n", "modules from statsmodels. We'll also go ahead and import Numpy for use in\n", "generating data, Matplotlib for some graphing, and Pandas for DataFrames." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import scipy.stats as stats\n", "import statsmodels.api as sm\n", "import statsmodels.formula.api as smf\n", "\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "\n", "%matplotlib inline" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Descriptive Statistics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`stats.describe()` gives basic descriptive statistics on any list of numbers. 
It returns (in the following order) the size of the data, its min and max, the mean, the variance, skewness, and kurtosis." ] }, { "cell_type": "code", "collapsed": false, "input": [ "x = [1, 2, 3, 4, 5]\n", "stats.describe(x)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 2, "text": [ "(5, (1, 5), 3.0, 2.5, 0.0, -1.3)" ] } ], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also get this kind of information for columns in a DataFrame using the `.describe()` method." ] }, { "cell_type": "code", "collapsed": false, "input": [ "data = pd.DataFrame([[1, 2, 3.5], [2, 2.4, 3.1], [3, 1.8, 2.5]], columns=['a', 'b', 'c'])\n", "print(data)\n", "data.describe()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ " a b c\n", "0 1 2.0 3.5\n", "1 2 2.4 3.1\n", "2 3 1.8 2.5\n", "\n", "[3 rows x 3 columns]\n" ] }, { "html": [ "
\n", "
|       | a   | b        | c        |
|-------|-----|----------|----------|
| count | 3.0 | 3.000000 | 3.000000 |
| mean  | 2.0 | 2.066667 | 3.033333 |
| std   | 1.0 | 0.305505 | 0.503322 |
| min   | 1.0 | 1.800000 | 2.500000 |
| 25%   | 1.5 | 1.900000 | 2.800000 |
| 50%   | 2.0 | 2.000000 | 3.100000 |
| 75%   | 2.5 | 2.200000 | 3.300000 |
| max   | 3.0 | 2.400000 | 3.500000 |

8 rows \u00d7 3 columns
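Returning to `stats.describe()`: in recent SciPy releases the result can also be read by field name rather than by position, which avoids having to remember the order listed earlier. The following is a minimal sketch under that assumption; the variable names are only illustrative.

```python
from scipy import stats

x = [1, 2, 3, 4, 5]
result = stats.describe(x)

# Read each statistic by name instead of by tuple position.
n = result.nobs
low, high = result.minmax
mean = result.mean
variance = result.variance  # sample variance (ddof=1 is the default)
skewness = result.skewness
kurtosis = result.kurtosis  # Fisher definition: a normal distribution scores 0

print(n, low, high, mean, variance, skewness, kurtosis)
```

The values should match the tuple shown for the same list earlier in this section.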
\n", "\n", "
|    | Fitness | Time |
|----|---------|------|
| 0  | 1       | 29   |
| 1  | 1       | 42   |
| 2  | 1       | 38   |
| 3  | 1       | 40   |
| 4  | 1       | 43   |
| 5  | 1       | 40   |
| 6  | 1       | 30   |
| 7  | 1       | 42   |
| 8  | 2       | 30   |
| 9  | 2       | 35   |
| 10 | 2       | 39   |
| 11 | 2       | 28   |
| 12 | 2       | 31   |
| 13 | 2       | 31   |
| 14 | 2       | 29   |
| 15 | 2       | 35   |
| 16 | 2       | 29   |
| 17 | 2       | 33   |
| 18 | 3       | 26   |
| 19 | 3       | 32   |
| 20 | 3       | 21   |
| 21 | 3       | 20   |
| 22 | 3       | 23   |
| 23 | 3       | 22   |

24 rows \u00d7 2 columns
\n", "\n", "
|            | df | sum_sq | mean_sq    | F         | PR(>F)   |
|------------|----|--------|------------|-----------|----------|
| C(Fitness) | 2  | 672    | 336.000000 | 16.961538 | 0.000041 |
| Residual   | 21 | 416    | 19.809524  | NaN       | NaN      |

2 rows \u00d7 5 columns
\n", "
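The ANOVA table above can be cross-checked by hand. The sketch below is not the call that produced the table; it swaps in a manual sum-of-squares calculation with Numpy plus `scipy.stats.f_oneway`, using the `Time` values grouped by `Fitness` level from the data shown earlier. The variable names (`fitness_1`, `ss_between`, and so on) are illustrative only.

```python
import numpy as np
from scipy import stats

# Time values from the Fitness/Time table, grouped by Fitness level.
fitness_1 = [29, 42, 38, 40, 43, 40, 30, 42]
fitness_2 = [30, 35, 39, 28, 31, 31, 29, 35, 29, 33]
fitness_3 = [26, 32, 21, 20, 23, 22]
groups = [np.array(g, dtype=float) for g in (fitness_1, fitness_2, fitness_3)]

all_times = np.concatenate(groups)
grand_mean = all_times.mean()

# Between-group and within-group (residual) sums of squares.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Degrees of freedom: (number of groups - 1) and (observations - groups).
df_between = len(groups) - 1
df_within = len(all_times) - len(groups)

# F is the ratio of the two mean squares.
f_manual = (ss_between / df_between) / (ss_within / df_within)

# SciPy's one-way ANOVA should agree with the manual calculation.
f_scipy, p_scipy = stats.f_oneway(*groups)

print(ss_between, ss_within, f_manual, f_scipy, p_scipy)
```

The two sums of squares correspond to the `sum_sq` column (`C(Fitness)` and `Residual`), and the F statistic and p-value correspond to the `F` and `PR(>F)` columns.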