{ "metadata": { "name": "Khan Academy - Descriptive Statistics" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#Khan Academy: Descriptive statistics\n", "https://www.khanacademy.org/math/probability/descriptive-statistics" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%pylab inline\n", "%load_ext sympyprinting" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\n", "Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.zmq.pylab.backend_inline].\n", "For more information, type 'help(pylab)'.\n" ] } ], "prompt_number": 237 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Statistics Intro: mean, median, and mode\n", "https://www.khanacademy.org/math/probability/descriptive-statistics/central_tendency/v/statistics-intro--mean--median-and-mode\n", "\n", "Statistics deals with data.\n", "\n", "Descriptive statistics: can we somehow describe the data with a smaller set of numbers?\n", "\n", "Inferential statistics: use the data to start drawing conclusions about it.\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import numpy as np" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 238 }, { "cell_type": "markdown", "metadata": {}, "source": [ "How can we describe data?\n", "\n", "We have a set of numbers, e.g. 
the heights of plants in a garden, in inches:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "a = np.array([23, 29, 20, 32, 23, 21, 33, 25])\n", "print a.sum()\n", "a.size" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "206\n" ] }, { "output_type": "pyout", "png": "iVBORw0KGgoAAAANSUhEUgAAAAwAAAASCAYAAABvqT8MAAAABHNCSVQICAgIfAhkiAAAAOlJREFU\nKJHN0bErxHEcxvHXufQruispUgbUDQpllQw2Wa8sdgYLk/sPbCaTVRaLgTJQtyslGaRMig0by53l\ne/n265Oyebbv+/N5nudTX/6oSuk9jha6KDCAPdxF5mGcYixjk3hAoweq2XALj7jK2DvqWMAl9GXD\nKSwHzZ/oj07aSLcfYyixAjeYiwwFrpPpBes4wkq03FMNF8nUxRlGfzPs4hCreMraZqPlbZxn70Hs\no4Pb8nIFb5gJgjZT03QOR/z8bhT2gfkyfMViYKjhOQpbwz0mMlbHCZp5cq4l7OArnVjFAdpB83/R\nNzuhKkqWv4vmAAAAAElFTkSuQmCC\n", "prompt_number": 239, "text": [ "8" ] } ], "prompt_number": 239 }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Average - a \"typical\" or \"middle\" value => central tendency\n", "\n", "An average tries to represent the \"center\" of the numbers in the set.\n", "\n", "\n", "Arithmetic Mean - sum of all numbers divided by the number (count) of numbers.\n", "\n", "(4 + 3 + 1 + 6 + 1 + 7) / 6 == 22 / 6 == 3 4/6 == 3 2/3 ≈ 3.67\n", "\n", "\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "mean = a.sum(dtype=float) / a.size\n", "print mean\n", "print np.mean(a)\n", "assert mean == np.mean(a)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "25.75\n", "25.75\n" ] } ], "prompt_number": 243 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Median - the middle number of the sorted data. 
If you have an even number of values, take the arithmetic mean of the two middle values.\n", "\n", "[4, 3, 1, 6, 1, 7] => [1, 1, 3, 4, 6, 7] => 3.5\n", "\n", "[0, 7, 50, 10000, 1000000] => 50" ] }, { "cell_type": "code", "collapsed": false, "input": [ "a.sort()  # sorts in place and returns None, so there is nothing to print\n", "print np.median(a)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "24.0\n" ] } ], "prompt_number": 39 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Mode - Most common number in the data set.\n", "\n", "[4, 3, 1, 6, 1, 7] => 1" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# bincount counts occurrences of each non-negative integer; argmax picks the most frequent one\n", "np.bincount(a).argmax()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 50, "text": [ "23" ] } ], "prompt_number": 50 }, { "cell_type": "code", "collapsed": false, "input": [ "c = np.array([3, 2, 7, 9, 5, 1, 2])\n", "print np.median(c)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "3.0\n" ] } ], "prompt_number": 52 }, { "cell_type": "code", "collapsed": false, "input": [ "c = np.array([5, 6, 6, 2, 9])\n", "print np.mean(c)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "5.6\n" ] } ], "prompt_number": 53 }, { "cell_type": "code", "collapsed": false, "input": [ "c = np.array([2, 5, 10, 9, 2, 9, 4, 9])\n", "np.bincount(c).argmax()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 54, "text": [ "9" ] } ], "prompt_number": 54 }, { "cell_type": "code", "collapsed": false, "input": [ "c = np.array([9, 7, 6, 6, 6, 8, 8, 4, 6, 2])\n", "print np.mean(c)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "6.2\n" ] } ], "prompt_number": 56 }, { "cell_type": "code", "collapsed": false, "input": [ "c = np.array([10, 3, 2, 5, 1, 8, 1, 9, 7])\n", "print 
np.median(c)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "5.0\n" ] } ], "prompt_number": 57 }, { "cell_type": "code", "collapsed": false, "input": [ "c = np.array([8, 6, 5, 9, 10, 1, 2, 4, 10])\n", "np.mean(c)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 58, "text": [ "6.1111111111111107" ] } ], "prompt_number": 58 }, { "cell_type": "code", "collapsed": false, "input": [ "c = np.array([6, 2, 2, 5, 1, 2, 8, 8])\n", "np.bincount(c).argmax()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 59, "text": [ "2" ] } ], "prompt_number": 59 }, { "cell_type": "code", "collapsed": false, "input": [ "c = np.array([4, 7, 1, 9, 8, 6, 1])\n", "np.median(c)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 60, "text": [ "6.0" ] } ], "prompt_number": 60 }, { "cell_type": "code", "collapsed": false, "input": [ "b = (81 + 81 + 81 + 81 + 91) / 5" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 76 }, { "cell_type": "code", "collapsed": false, "input": [ "b" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 77, "text": [ "83" ] } ], "prompt_number": 77 }, { "cell_type": "code", "collapsed": false, "input": [ "c = np.array([85, 77, 94, 88, 91])\n", "np.mean(c)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 78, "text": [ "87.0" ] } ], "prompt_number": 78 }, { "cell_type": "code", "collapsed": false, "input": [ "c = np.array([82, 82, 82, 82, 100, 100])\n", "np.mean(c)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 79, "text": [ "88.0" ] } ], "prompt_number": 79 }, { "cell_type": "code", "collapsed": false, "input": [ "c = np.array([83, 98, 80, 81, 91, 95])\n", "np.mean(c)" ], "language": "python", 
"metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 80, "text": [ "88.0" ] } ], "prompt_number": 80 }, { "cell_type": "code", "collapsed": false, "input": [ "c = np.array([82, 82, 82, 90])\n", "np.mean(c)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 81, "text": [ "84.0" ] } ], "prompt_number": 81 }, { "cell_type": "code", "collapsed": false, "input": [ "c = np.array([92, 88, 86, 95, 97, 76])\n", "np.mean(c)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 82, "text": [ "89.0" ] } ], "prompt_number": 82 }, { "cell_type": "code", "collapsed": false, "input": [ "c = np.array([85, 85, 85, 100, 100])\n", "np.mean(c)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 83, "text": [ "91.0" ] } ], "prompt_number": 83 }, { "cell_type": "code", "collapsed": false, "input": [ "c = np.array([84, 84, 84, 96])\n", "np.mean(c)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 89, "text": [ "87.0" ] } ], "prompt_number": 89 }, { "cell_type": "markdown", "metadata": {}, "source": [ "#Reading Box-and-Whisker Plots\n", "https://www.khanacademy.org/math/probability/descriptive-statistics/Box-and-whisker%20plots/v/reading-box-and-whisker-plots\n", "\n", "##Constructing a box-and-whisker plot" ] }, { "cell_type": "code", "collapsed": false, "input": [ "d = np.array([14, 6, 3, 2, 4, 15, 11, 8, 1, 7, 2, 1, 3, 4, 10, 22, 20], dtype=float)\n", "boxplot(d, vert=False, )" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 103, "text": [ "{'boxes': [],\n", " 'caps': [,\n", " ],\n", " 'fliers': [,\n", " ],\n", " 'medians': [],\n", " 'whiskers': [,\n", " ]}" ] }, { "output_type": "display_data", "png": 
"iVBORw0KGgoAAAANSUhEUgAAAWsAAAD5CAYAAADhnxSEAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAACO9JREFUeJzt3TFolecawPHnu+hmhxT0NDRCQA22Gk8CUosgKFU3bYsd\nLFSk2qVboYijmUodOmhxKlUcnawOGpyC4uLQCAWHFmogSHRQC9oOtvDe4dJE7zX2+iWe7zw5vx8c\nMMec48Pjy5/jZ5JTlVJKANDV/tX0AAD8M7EGSECsARIQa4AExBoggWULeXBVVYs1B0DPqPNFeAt+\nZV1K6fnbsWPHGp+hW252YRd28eJbXS6DACQg1gAJiPUi2L59e9MjdA27mGMXc+xi4aqygIsoVVUt\n6BoMQK+p202vrAESEGuABMQaIAGxBkhArAESEGuABMQaIAGxBkhArAESEGuABMQaIAGxBkhArAES\nEGuABMQaIAGxBkhArAESEGuABMQaIAGxBkhArAESEGuABMQaIAGxBkhArAESEGuABMQaIAGxBkhA\nrAESEGuABMQaIAGxBkhArAESEGuABMQaIAGxBkhArAESEGuABMQaIAGxBkhArAESEGuABMQaIAGx\nBkhArAESEGuABMQaIAGxBkhArAESEGuABMQaIAGxBkhArAESEGuABMQaIAGxBkhArAESEGuABMQa\nIAGxBkhArAESEGuABMQaIAGxBkhArAESEGuABMQaIAGxBkhArAESEGuABMQaIAGxBkhArAESEGuA\nBMQaIAGxBkhArAESEGuABMQaIAGxBkhArAESEGuABMQaIAGxBkhArAESEGuABMQaIAGxBkhgWdMD\nvEqvvx7x8GHTU8yvRBVVlKbH6Ap9fREPHjQ9BXSvqpRSuxZVVcUCHv7KVVVEF4+XYMDOsQp6Rd1u\nugwCkIBYAyQg1gAJiDVAAmINkIBYAyQg1gAJdDzWVVV1+o8EeshSbYxX1gAJiDVAAmINkIBYAyTw\nwlgfOnQoWq1WDA8Pd2oeAJ7jhbH+9NNPY3x8vFOzADCPF8Z627Zt0dfX16lZAJjHgq9Zj42Nzd4m\nJib+r8dU1X9uY2PzPefc5zx9e9nPJ5fF+Dv3+T6/20xMTDzTybr+8c0HpqamYs+ePfHTTz/974Nr\n/BDtTr5hQdf/QPuuH7BzrILF0v1viuLNBwCWLLEGSOCFsf74449j69at8fPPP8fq1avjzJkznZoL\ngKd0/A1zXbN+StcP2DlWwWJxzRqAxog1QAJiDZCAWAMk0PFYd/OFfyC/pdoYr6wBEhBrgATEGiAB\nsQZIQKwBEhBrgASWNT3Aq9aNP4z8byW6e75O8oZE8GJLOtbd/+WWJbp+RKAruAwCkIBYAyQg1gAJ\niDVAAmINkIBYAyQg1gAJiDVAAmINkIBYAyQg1gAJiDVAAmINkIBYAyQg1gAJiDVAAmINkIBYAyQg\n1gAJiDVAAmINkIBYAyQg1gAJiDVAAmINkIBYAyQg1gAJiDVAAmINkIBYAyQg1gAJiDVAAmINkIBY\nAyQg1gAJiDVAAmINkIBYAyQg1gAJiDVAAmINkIBYAyQg1gAJiDVAAmINkIBYAyQg1gAJiDVAAmIN\nkIBYAyQg1gAJiDVAAmINkIBYAyQg1gAJiDVAAmINkIBYAyQg1gAJiDVAAmINkIBYAyQg1gAJiDVA\nAmINkIBYAyQg1gAJiDVAAmINkIBYAyQg1gAJiDVAAmINkIBYAyQg1gAJiDVAAmINkIBYAyQg1gAJ\niDVAAmINkIBYAyQg1gAJiDVAAmINkIBYAyQg1gAJiDVAAmINkIBYL4KJiYmmR+gadjHHLubYxcKJ\n9SJwEOfYxRy7mGMXCyfWAAmINUACVSml1H5wVS3mLAA9oU52l3X6DwTg5bkMApCAWAMkINYACdSO\n9fj4eKxfvz7WrVsXx48fX8yZ0hkcHIxNmzbF6OhovPPOO02P0
zGHDh2KVqsVw8PDs/c9ePAgdu3a\nFUNDQ7F79+747bffGpyws563j7GxsRgYGIjR0dEYHR2N8fHxBifsjOnp6dixY0ds2LAhNm7cGCdP\nnoyI3jwb8+2i1rkoNfz1119lzZo15fbt2+XJkyel3W6XW7du1XmqJWFwcLDcv3+/6TE67urVq+XH\nH38sGzdunL3vyJEj5fjx46WUUr7++uty9OjRpsbruOftY2xsrHzzzTcNTtV5MzMzZXJyspRSyqNH\nj8rQ0FC5detWT56N+XZR51zUemV948aNWLt2bQwODsby5ctj//79ceHChTpPtWSUHvzKmG3btkVf\nX98z9128eDEOHjwYEREHDx6MH374oYnRGvG8fUT03tl44403YmRkJCIiVqxYEW+99VbcuXOnJ8/G\nfLuIePlzUSvWd+7cidWrV89+PDAwMDtAL6qqKnbu3BmbN2+O7777rulxGnXv3r1otVoREdFqteLe\nvXsNT9S8b7/9Ntrtdhw+fLgn/un/tKmpqZicnIwtW7b0/Nn4exfvvvtuRLz8uagVa98M86zr16/H\n5ORkXL58OU6dOhXXrl1reqSuUFVVz5+Vzz//PG7fvh03b96M/v7++PLLL5seqWMeP34c+/btixMn\nTsRrr732zO/12tl4/PhxfPTRR3HixIlYsWJFrXNRK9ZvvvlmTE9Pz348PT0dAwMDdZ5qSejv74+I\niJUrV8aHH34YN27caHii5rRarbh7925ERMzMzMSqVasanqhZq1atmg3TZ5991jNn488//4x9+/bF\ngQMH4oMPPoiI3j0bf+/ik08+md1FnXNRK9abN2+OX375JaampuLJkydx7ty52Lt3b52nSu+PP/6I\nR48eRUTE77//HleuXHnmqwF6zd69e+Ps2bMREXH27NnZw9mrZmZmZn99/vz5njgbpZQ4fPhwvP32\n2/HFF1/M3t+LZ2O+XdQ6F3X/l/PSpUtlaGiorFmzpnz11Vd1nya9X3/9tbTb7dJut8uGDRt6ahf7\n9+8v/f39Zfny5WVgYKCcPn263L9/v7z33ntl3bp1ZdeuXeXhw4dNj9kx/72P77//vhw4cKAMDw+X\nTZs2lffff7/cvXu36TFfuWvXrpWqqkq73S4jIyNlZGSkXL58uSfPxvN2cenSpVrnYkE/yAmAzvAd\njAAJiDVAAmINkIBYAyQg1gAJiDVAAv8GnoJsfn+if18AAAAASUVORK5CYII=\n" } ], "prompt_number": 103 }, { "cell_type": "code", "collapsed": false, "input": [ "d = np.array([3, 5, 6, 6, 7, 7, 7, 7, 7, 8, 8, 9, 9, 10, 11], dtype=float)\n", "boxplot(d, vert=False)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 104, "text": [ "{'boxes': [],\n", " 'caps': [,\n", " ],\n", " 'fliers': [,\n", " ],\n", " 'medians': [],\n", " 'whiskers': [,\n", " ]}" ] }, { "output_type": "display_data", "png": 
"iVBORw0KGgoAAAANSUhEUgAAAWoAAAD5CAYAAAAOXX+6AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAACzJJREFUeJzt3U2IlfXfx/Hv8QGkIJ/Q0RpDiczmwXHKEgTJSqUII8Mi\nXejfoTbtIkJa5cq0CDJoFVYGkRBk2INRIuaQSNgkuSilUkwqI81KRzPt9190/73v/+1oNR69vuO8\nXnA245zLz5kZ3hyvc8arVkopAUBaA6oeAMC5CTVAckINkJxQAyQn1ADJDTqfO9dqtXrtAOg3/umb\n7c77GXUpJfXtiSeeqHyDnXbaaeN/br3h1AdAckINkNwlH+qZM2dWPeFvsbO+7KyvvrCzL2zsrVrp\n7UmT+PPFxPO4O0C/05tuXvLPqAH6OqEGSE6oAZITaoDkhBogOaEGSE6oAZITaoDkhBogOaEGSE6o\nAZITaoDkhBogOaEGSE6oAZITaoDkhBogOaEGSE6oAZITaoDkhBogOaEGSE6oAZITaoDkhBogOaEG\nSE6oAZITaoDkhBogOaEGSE6oAZITaoDkhBogOaEGSE6oAZITaoDkhBogOaEGSE6oAZITaoDkhBog\nOaEGSE6oAZITaoDkhBogOaEGSE6oAZITaoDkhBogOaEGSE6oAZITaoDkhBogOaEGSE6oAZITaoDk\nhBogOaEGSE6oAZITaoDkhBogOaEGSE6oAZITaoDkhBogOaEGSE6oAZITaoDkhBogOaEGSE6oAZIT\naoDkhBogOaEGSE6oAZITaoDkhBogOaEGSE6oAZITaoDkhBogOaEGSE6oAZITaoDkhBogOaEGSE6o\nAZITaoDkBlU9AEaMiPjpp/oes0QtalHqe9A+YvjwiEOHql5BPdVKKb3+aa7VanEed4eIiKjVIur+\nY3RBDto39OOH3if0pptOfQAkJ9QAyQk1QHJCDZCcUAMkJ9QAyQk1QHJCnUStVqt6ApCUUAMkJ9QA\nyQk1QHJCDZDcOUPd0dERDQ0N0draerH2APD/nDPUS5Ysiffee+9ibQGgB+cM9YwZM2L48OEXawsA\nPTjvc9QzZy6LmTOXxb/+tSw2b95ch0n9V632523Zsp7/fNmy//2c/3vr659P/fWF73v/+fzNUav9\n2cllZ7vTX/jLCwfs3bs35s6dGzt37jzzzi4cUDf9+WvpwgH11Y8fep/gwgEAlyChBkjunKFesGBB\nTJ8+PXbv3h3jxo2Ll1566WLtAuB/uLhtEv35a+kcdX3144feJzhHDXAJEmqA5IQaIDmhBkhOqJPo\nry8kAn9NqAGSE2qA5IQaIDmhBkhOqAGSE2qA5AZVPQAi6n8BgXIBjtlXuCjTpUeoqdyFeQt5Ce9M\n51Lh1AdAckINkJxQAyQn1ADJCTVAckINkJxQAyQn1ADJCTVAckINkJxQAyQn1ADJCTVAckINkJxQ\nAyQn1ADJCTVAckINkJxQAyQn1ADJCTVAckINkJxQAyQn1ADJCTVAckINkJxQAyQn1ADJCTVAckIN\nkJxQAyQn1ADJCTVAckINkJxQAyQn1ADJCTVAckINkJxQAyQn1ADJCTVAckINkJxQAyQn1ADJCTVA\nckINkJxQAyQn1ADJCTVAckINkJxQAyQn1ADJCTVAckINkJxQAyQn1ADJCTVAckINkJxQAyQn1ADJ\nCTVAckINkJxQAyQn1ADJCTVAckINkJxQAyQn1ADJCTVAckINkJxQAyQn1ADJCTVAckINkJxQAyQn\n1ADJCTVAckINkJxQAyQn1ADJCTVAckINkJxQAyQn1ADJCTVAckINkJxQAyQn1ADJCTVAcpd8qDdv\n3lz1hL/Fzvqys776ws6+sLG3hDoJO+vLzvrqCzv7wsbeuuRDDdDXCTVAcrVSSun1nWu1em4B6Bf+\naXYHXcy/DIB/zqkPgOSEGiA5oQZIrlehPn78eEybNi2mTJkST
U1N8fjjj9d7V92cOnUq2tvbY+7c\nuVVPOavx48fH5MmTo729PW6++eaq55zV4cOHY/78+XH99ddHU1NTbNu2repJZ9i1a1e0t7efvg0d\nOjSee+65qmf16Mknn4zm5uZobW2NhQsXxm+//Vb1pB6tWrUqWltbo6WlJVatWlX1nNM6OjqioaEh\nWltbT3/s0KFDMXv27Jg4cWLMmTMnDh8+XOHCnje+/vrr0dzcHAMHDoyurq6/d6DSS0ePHi2llPL7\n77+XadOmlc7Ozt4e6oJ65plnysKFC8vcuXOrnnJW48ePLwcPHqx6xl9atGhRWb16dSnlz+/74cOH\nK150bqdOnSpjxowp+/btq3rKGfbs2VMmTJhQjh8/Xkop5f777y8vv/xyxavOtHPnztLS0lKOHTtW\nTp48WWbNmlW+/PLLqmeVUkrZsmVL6erqKi0tLac/9thjj5WVK1eWUkpZsWJFWbp0aVXzSik9b/z8\n88/Lrl27ysyZM8snn3zyt47T61Mfl112WUREnDhxIk6dOhUjRozo7aEumP3798e7774bDz74YPp3\nqGTf9/PPP0dnZ2d0dHRERMSgQYNi6NChFa86t40bN8Y111wT48aNq3rKGa644ooYPHhwdHd3x8mT\nJ6O7uzuuuuqqqmed4Ysvvohp06bFkCFDYuDAgXHLLbfEG2+8UfWsiIiYMWNGDB8+/L8+tn79+li8\neHFERCxevDjefPPNKqad1tPGSZMmxcSJE//RcXod6j/++COmTJkSDQ0Nceutt0ZTU1NvD3XBPPLI\nI/H000/HgAG5T8XXarWYNWtWTJ06NV544YWq5/Roz549MWrUqFiyZEnccMMN8dBDD0V3d3fVs85p\n7dq1sXDhwqpn9GjEiBHx6KOPxtVXXx1XXnllDBs2LGbNmlX1rDO0tLREZ2dnHDp0KLq7u+Odd96J\n/fv3Vz3rrA4cOBANDQ0REdHQ0BAHDhyoeFF99LpgAwYMiB07dsT+/ftjy5Yt6X7P/u23347Ro0dH\ne3t7+merH330UXz66aexYcOGeP7556Ozs7PqSWc4efJkdHV1xcMPPxxdXV1x+eWXx4oVK6qedVYn\nTpyIt956K+67776qp/Toq6++imeffTb27t0b3377bRw5ciReffXVqmedYdKkSbF06dKYM2dO3Hnn\nndHe3p7+ic9/1Gq1S+aX8s77Kz506NC46667Yvv27fXYUzdbt26N9evXx4QJE2LBggWxadOmWLRo\nUdWzejR27NiIiBg1alTMmzcvPv7444oXnamxsTEaGxvjpptuioiI+fPn//0XQiqwYcOGuPHGG2PU\nqFFVT+nR9u3bY/r06TFy5MgYNGhQ3HvvvbF169aqZ/Woo6Mjtm/fHh9++GEMGzYsrrvuuqonnVVD\nQ0N8//33ERHx3XffxejRoyteVB+9CvWPP/54+tXUY8eOxQcffBDt7e11HXa+li9fHt98803s2bMn\n1q5dG7fddlu88sorVc86Q3d3d/z6668REXH06NF4//33/+sV4izGjBkT48aNi927d0fEn+d/m5ub\nK151dq+99losWLCg6hlnNWnSpNi2bVscO3YsSimxcePGlKcPIyJ++OGHiIjYt29frFu3Lu3ppIiI\nu+++O9asWRMREWvWrIl77rmn4kXn9rf/td+bVzI/++yz0t7eXtra2kpra2t56qmnenOYi2bz5s1p\n3/Xx9ddfl7a2ttLW1laam5vL8uXLq550Vjt27ChTp04tkydPLvPmzUv7ro8jR46UkSNHll9++aXq\nKee0cuXK0tTUVFpaWsqiRYvKiRMnqp7UoxkzZpSmpqbS1tZWNm3aVPWc0x544IEyduzYMnjw4NLY\n2FhefPHFcvDgwXL77beXa6+9tsyePbv89NNPqTauXr26rFu3rjQ2NpYhQ4aUhoaGcscdd/zlcc7r\nP2UC4MLrG68KAPRjQg2Qn
FADJCfUAMkJNUByQg2Q3L8BT66dOZCPL/4AAAAASUVORK5CYII=\n" } ], "prompt_number": 104 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Variance of population\n", "https://www.khanacademy.org/math/probability/descriptive-statistics/variance_std_deviation/v/variance-of-a-population\n", "\n", "Variance as a measure of average, how far the data points in a population are from the population mean. 'N' is the total population.\n", "\n", "$$\n", "\\begin{align*}\n", "mean = \\mu = \\frac{\\displaystyle\\sum_{i=1}^{N} x^2} {N} = \\\\\n", "\\frac{x_1 + x_2 + x_3 + x_4 + x_5} 5 \n", "\\end{align*}\n", "$$" ] }, { "cell_type": "code", "collapsed": false, "input": [ "sample = [1, 3, 5, 7, 14]\n", "population_mean = sum(sample)/len(sample)\n", "print population_mean" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "6\n" ] } ], "prompt_number": 235 }, { "cell_type": "markdown", "metadata": {}, "source": [ "variance $\\begin{align*}= \\sigma^2 = \\frac{\\displaystyle\\sum_{i=1}^{N} (x_i - \\mu)^2} {N}\n", "\\end{align*}$\n", "\n", "$\\mu =$ population mean" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def population_variance(population):\n", " mean = np.mean(population)\n", " pv = sum([(float(i) - mean)**2 for i in population]) / len(population)\n", " return pv\n", "population_variance(sample)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "png": 
"iVBORw0KGgoAAAANSUhEUgAAACUAAAASCAYAAADc4RcWAAAABHNCSVQICAgIfAhkiAAAAcdJREFU\nSInt1c+LTlEcx/HXPH5MGoaZRESUZ7I2CgtJVv4HP8pSyUIpIZuh2c7axkbMYqw0NWMhWVjYUEgp\nTeTHPCjCLBBjcc70nOd07zz3WlG+dfve7/d83+d+zu2c7+EvtJ4s3oUzWIFNuI8LeJ3VbccoXmAe\na3EarQrfrMUO4xbWxHgl7uIdtiZ1q/EKh5PcWTzG8i6CarOTaGa5HcJqxpPcpSh0aZIbxA8c7yKq\nNvsVL7Euy3/EhyR+hpsF/CPc7iKqEttIBmawHn0Z8E3YY7AKQ4L43N5g5yKCKrPpb9wTwdkktzEK\nvRPjLdF/Lph4Dv3oFRaSW2W2kQ3MZsUn8QvnYtwf/feSiWkflNwqs42CggVr4oRwfO/F3M/o5wvq\nl0W/pGS+ymyZqF5cw2WcT/LvS+pp78UvJeOV2SJRPbiCKZzKxlrCSgdKJv60iKjKbJGoETwVOvmC\nHY1+Dg+wuYBr4mGJoFpsLuqYsLFHsvze5H0Su3VeUdvixyYybki7ndRlwQGhSV7NnnFcT+o2CL/6\nSJIbwxOdV8U+YYHTddm0T90QjuShAsEXk/e32C9cGcNCbxvEQZ3HvSVs7ud/wP63f9d+A8Qvdpge\nZ/OWAAAAAElFTkSuQmCC\n", "prompt_number": 236, "text": [ "20.0" ] } ], "prompt_number": 236 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sample Variance\n", "https://www.khanacademy.org/math/probability/descriptive-statistics/variance_std_deviation/v/sample-variance\n", "\n", "### TV Watching\n", "Population of TV watchers: ~300 million, with (unknown) population mean μ.\n", "\n", "Take a sample of size n.\n", "\n", "What is the sample mean?\n", "\n", "$$\n", "mean = {\\bar x} = \\frac{\\displaystyle\\sum_{i=1}^{n} x_i} {n}\n", "$$" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Hours of television\n", "sample = [1.5, 4, 1, 2.5, 2, 1]\n", "print 'mean: %s' % np.mean(sample)\n", "print 'variance: %s' % population_variance(sample)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "mean: 2.0\n", "variance: 1.08333333333\n" ] } ], "prompt_number": 246 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Is this variance (dividing by n) the best estimate we can make given the data we have?\n", "\n", "$$\n", "s^2_n = \\frac{\\displaystyle\\sum_{i=1}^{n} (x_i - {\\bar x})^2} {n}\n", "$$\n", "\n", "Unbiased sample variance:\n", "\n", "$$\n", "s^2 = s^2_{n-1} = \\frac{\\displaystyle\\sum_{i=1}^{n} (x_i - {\\bar x})^2} {n-1}\n", "$$" ] }, { "cell_type": "code", 
"collapsed": false, "input": [ "def unbiased_variance(sample):\n", "    # sum of squared deviations from the sample mean, divided by n - 1\n", "    mean = np.mean(sample)\n", "    pv = sum([(float(i) - mean)**2 for i in sample]) / (len(sample) - 1)\n", "    return pv\n", "print 'sample variance: %s' % unbiased_variance(sample)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "sample variance: 1.3\n" ] } ], "prompt_number": 247 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Review and intuition why we divide by n-1 for the unbiased sample variance\n", "https://www.khanacademy.org/math/probability/descriptive-statistics/variance_std_deviation/v/review-and-intuition-why-we-divide-by-n-1-for-the-unbiased-sample-variance\n", "\n", "N = population size\n", "\n", "n = sample size\n", "\n", "Population Mean (parameter):\n", "$$\n", "\\mu = \\frac{\\displaystyle\\sum_{i=1}^{N} x_i} {N}\n", "$$\n", "\n", "Sample Mean (statistic):\n", "$$\n", "{\\bar x} = \\frac{\\displaystyle\\sum_{i=1}^{n} x_i} {n}\n", "$$\n", "\n", "Population variance (parameter):\n", "$$\n", "\\sigma^2 = \\frac{\\displaystyle\\sum_{i=1}^{N} (x_i - \\mu)^2} {N}\n", "$$\n", "\n", "Sample variance estimate (statistic):\n", "$$\n", "s^2_n = \\frac{\\displaystyle\\sum_{i=1}^{n} (x_i - {\\bar x})^2} {n}\n", "$$\n", "\n", "Unbiased sample variance estimate:\n", "$$\n", "s^2 = s^2_{n-1} = \\frac{\\displaystyle\\sum_{i=1}^{n} (x_i - {\\bar x})^2} {n-1}\n", "$$\n", "\n", "When you divide by a smaller number, e.g. n-1, you will get a larger value. When someone says \"sample variance\" without specifying which one, they are probably talking about the unbiased estimate. 
With a larger sample, decreasing the denominator by 1 will have a smaller effect.\n", "\n", "Dividing by n gives a biased estimate. Because the deviations are measured from the sample mean ${\\bar x}$ rather than from $\\mu$, on average it comes out smaller than $\\sigma^2$:\n", "\n", "$$\n", "E[s^2_n] = \\frac{(n-1)\\sigma^2} {n}\n", "$$\n", "\n", "Dividing by n-1 instead removes the bias, so on average the estimate equals $\\sigma^2$:\n", "\n", "$$\n", "variance = s^2_{n-1} = \\frac{\\displaystyle\\sum_{i=1}^{n} (x_i - {\\bar x})^2} {n-1}\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Variance Problem Set\n", "https://www.khanacademy.org/math/probability/descriptive-statistics/variance_std_deviation/e/variance\n", "\n", "###population variance" ] }, { "cell_type": "code", "collapsed": false, "input": [ "sample = [8,10,16,2,23,4]\n", "print '%.2f years old' % np.mean(sample)\n", "print '%.2f years^2' % population_variance(sample)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "10.50 years old\n", "51.25 years^2\n" ] } ], "prompt_number": 218 }, { "cell_type": "markdown", "metadata": {}, "source": [ "###unbiased variance" ] }, { "cell_type": "code", "collapsed": false, "input": [ "sample = [30,17,1,3,11]\n", "print '%.2f years old' % np.mean(sample)\n", "print '%.2f years^2' % unbiased_variance(sample)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "12.40 years old\n", "137.80 years^2\n" ] } ], "prompt_number": 216 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Statistics: Standard Deviation\n", "https://www.khanacademy.org/math/probability/descriptive-statistics/variance_std_deviation/v/statistics--standard-deviation" ] }, { "cell_type": "code", "collapsed": false, "input": [ "sample = [1, 2, 3, 8, 7]" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 142 }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\n", "{\\mu} = \\frac{1 + 2 + 3 + 8 + 7} {5} = 4.2\n", "$$\n", "$$\n", "{\\sigma^2} = \\frac{(1 - 
4.2)^2 + (2 - 4.2)^2 + (3 - 4.2)^2 + (8 - 4.2)^2 + (7 - 4.2)^2} {5} = 7.76\n", "$$" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from IPython.display import display, Math\n", "# Render the statistics as LaTeX; raw strings keep the backslashes intact\n", "display(Math(r'\\mu = %.2f' % np.mean(sample)))\n", "display(Math(r'\\sigma^2 = %.2f' % population_variance(sample)))\n", "display(Math(r's^2_{n-1} = %.2f' % unbiased_variance(sample)))" ], "language": "python", "metadata": {}, "outputs": [ { "latex": [ "$$\\mu = 4.20$$" ], "output_type": "display_data", "text": [ "" ] }, { "latex": [ "$$\\sigma^2 = 7.76$$" ], "output_type": "display_data", "text": [ "" ] }, { "latex": [ "$$s^2_{n-1} = 9.70$$" ], "output_type": "display_data", "text": [ "" ] } ], "prompt_number": 215 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Standard deviation is simply the square root of the variance." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import math\n", "math.sqrt(7.76)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 62, "text": [ "2.7856776554368237" ] } ], "prompt_number": 62 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Standard Deviation Problem Set\n", "https://www.khanacademy.org/math/probability/descriptive-statistics/variance_std_deviation/e/standard_deviation" ] }, { "cell_type": "code", "collapsed": false, "input": [ "sample = [30,17,1,3,11]\n", "mean = np.mean([float(i) for i in sample])\n", "print '%.2f' % mean\n", "v = unbiased_variance([float(i) for i in sample])\n", "print '%.2f' % v\n", "print '%.2f' % math.sqrt(v)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "12.40\n", "137.80\n", "11.74\n" ] } ], "prompt_number": 232 }, { "cell_type": "code", "collapsed": false, "input": [ "sample = [8,10,16,2,23,4]\n", "mean = np.mean([float(i) for i in sample])\n", "print '%.2f' % mean\n", "v = 
population_variance([float(i) for i in sample])\n", "print '%.2f' % v\n", "print '%.2f' % math.sqrt(v)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "10.50\n", "51.25\n", "7.16\n" ] } ], "prompt_number": 230 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Alternate Variance Formulas\n", "https://www.khanacademy.org/math/probability/descriptive-statistics/variance_std_deviation/v/statistics--alternate-variance-formulas\n", "\n", "$$\n", "\\begin{align*}\n", "\\sigma^2 = \\\\\n", "\\frac{\\displaystyle\\sum_{i=1}^{N} (x_i - \\mu)^2} {N} = \\\\\n", "\\frac{\\displaystyle\\sum_{i=1}^{N}(x_{i}^2 - 2\\mu{x_i} + \\mu^2)} {N}\n", "\\end{align*}\n", "$$\n", "\n", "For now, look only at the term inside the sum in the numerator, ignoring the N, and expand it.\n", "$$\n", "\\begin{align*}\n", "(x_i - \\mu)^2 = \\\\\n", "(x_i - \\mu)(x_i - \\mu) = \\\\\n", "(x_i^2 - 2\\mu{x_i} + \\mu^2)\n", "\\end{align*}\n", "$$\n", "\n", "Picking up the full expression again, and using $\\frac{\\sum_{i=1}^{N}x_i} {N} = \\mu$ to simplify the middle term:\n", "$$\n", "\\begin{align*}\n", "\\sigma^2 = \\\\\n", "\\frac{\\displaystyle\\sum_{i=1}^{N} (x_i - \\mu)^2} {N} = \\\\\n", "\\frac{\\displaystyle\\sum_{i=1}^{N}(x_{i}^2 - 2\\mu{x_i} + \\mu^2)} {N} = \\\\\n", "\\frac{\\displaystyle\\sum_{i=1}^{N}x_i^2 - 2{\\mu}\\sum_{i=1}^{N}x_i + \\mu^2\\sum_{i=1}^{N}1} {N} = \\\\\\\\\n", "\\frac{\\sum_{i=1}^{N}x_i^2} {N} - \\frac{2\\mu\\sum_{i=1}^{N}x_i} {N} + \\frac{\\mu^2N} {N} = \\\\\n", "\\frac{\\sum_{i=1}^{N}x_i^2} {N} - 2\\mu^2 + \\mu^2 = \\\\\n", "\\frac{\\sum_{i=1}^{N}x_i^2} {N} - \\mu^2\n", "\\end{align*}\n", "$$\n", "\n", "Substituting $\\mu = \\frac{\\sum_{i=1}^{N}x_i} {N}$ gives the \"Raw Score Method\":\n", "\\begin{align*}\n", "\\frac{\\displaystyle\\sum_{i=1}^{N}x_i^2} {N} - \\frac{\\left(\\displaystyle\\sum_{i=1}^{N}{x_i}\\right)^2} {N^2}\n", "\\end{align*}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Notes, Tests, and Experiments" ] }, { "cell_type": "code", "collapsed": false, "input": [ "display(Math(r'F(k) = \\int_{-\\infty}^{\\infty} f(x) e^{2\\pi 
i k} dx'))" ], "language": "python", "metadata": {}, "outputs": [ { "latex": [ "$$F(k) = \\int_{-\\infty}^{\\infty} f(x) e^{2\\pi i k} dx$$" ], "output_type": "display_data", "text": [ "" ] } ], "prompt_number": 78 }, { "cell_type": "code", "collapsed": false, "input": [ "import sympy as sym\n", "\n", "x, y, z = sym.symbols(\"x y z\")\n", "# Use the sym. prefix so the cell does not depend on an earlier `from sympy import *`\n", "print sym.latex(sym.Rational(3,2)*sym.pi + sym.exp(sym.I*x) / (x**2 + y))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\\frac{3}{2} \\pi + \\frac{e^{\\mathbf{\\imath} x}}{x^{2} + y}\n" ] } ], "prompt_number": 196 }, { "cell_type": "code", "collapsed": false, "input": [ "# str(Symbol) gives the symbol name, which is already valid LaTeX here\n", "display(Math('%s^2' % sym.Symbol(\"s_n\")))" ], "language": "python", "metadata": {}, "outputs": [ { "latex": [ "$$s_n^2$$" ], "output_type": "display_data", "text": [ "" ] } ], "prompt_number": 199 }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }