{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Skew test\n", "\n", "Allen Downey\n", "\n", "[MIT License](https://en.wikipedia.org/wiki/MIT_License)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import pandas as pd\n", "import numpy as np\n", "\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "sns.set(style='white')\n", "\n", "from thinkstats2 import Pmf, Cdf\n", "\n", "import thinkstats2\n", "import thinkplot\n", "\n", "decorate = thinkplot.config" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose you buy a loaf of bread every day for a year, take it\n", "home, and weigh it. You suspect that the distribution of weights is\n", "more skewed than a normal distribution with the same mean and\n", "standard deviation.\n", "\n", "To test your suspicion, write a definition for a class named\n", "`SkewTest` that extends `HypothesisTest` (defined in the next cell,\n", "a variant of the class in `thinkstats2`) and provides two methods:\n", "\n", "* `TestStatistic` should compute the skewness of a given sample.\n", "\n", "* `RunModel` should simulate the null hypothesis and return\n", "  simulated data."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class HypothesisTest(object):\n", "    \"\"\"Represents a hypothesis test.\"\"\"\n", "\n", "    def __init__(self, data):\n", "        \"\"\"Initializes.\n", "\n", "        data: data in whatever form is relevant\n", "        \"\"\"\n", "        self.data = data\n", "        self.MakeModel()\n", "        self.actual = self.TestStatistic(data)\n", "        self.test_stats = None\n", "\n", "    def PValue(self, iters=1000):\n", "        \"\"\"Computes the distribution of the test statistic and p-value.\n", "\n", "        iters: number of iterations\n", "\n", "        returns: float p-value\n", "        \"\"\"\n", "        self.test_stats = np.array([self.TestStatistic(self.RunModel())\n", "                                    for _ in range(iters)])\n", "\n", "        # One-sided: fraction of simulated statistics at least as large as the observed one\n", "        count = np.sum(self.test_stats >= self.actual)\n", "        return count / iters\n", "\n", "    def MaxTestStat(self):\n", "        \"\"\"Returns the largest test statistic seen during simulations.\n", "        \"\"\"\n", "        return np.max(self.test_stats)\n", "\n", "    def PlotHist(self, label=None):\n", "        \"\"\"Draws a histogram of the simulated test statistics with a\n", "        vertical line at the observed test statistic.\n", "        \"\"\"\n", "        plt.hist(self.test_stats, color='C4', alpha=0.5, label=label)\n", "        plt.axvline(self.actual, linewidth=3, color='0.8')\n", "        plt.xlabel('Test statistic')\n", "        plt.ylabel('Count')\n", "        plt.title('Distribution of the test statistic under the null hypothesis')\n", "\n", "    def TestStatistic(self, data):\n", "        \"\"\"Computes the test statistic.\n", "\n", "        data: data in whatever form is relevant\n", "        \"\"\"\n", "        # Subclasses must override this method\n", "        raise NotImplementedError()\n", "\n", "    def MakeModel(self):\n", "        \"\"\"Build a model of the null hypothesis.\n", "        \"\"\"\n", "        pass\n", "\n", "    def RunModel(self):\n", "        \"\"\"Run the model of the null hypothesis.\n", "\n", "        returns: simulated data\n", "        \"\"\"\n", "        # Subclasses must override this method\n", "        raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "markdown", "metadata": {}, "source": 
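[ "One possible solution, offered as a minimal sketch rather than the author's reference answer. It assumes the null model is a normal distribution with the sample's own mean and standard deviation, and it uses the signed sample skewness (the third central moment divided by the 1.5 power of the second). With `PValue` as defined above, that makes the test one-sided toward positive skew; wrapping the statistic in `abs` would make it two-sided." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class SkewTest(HypothesisTest):\n", "    \"\"\"Sketch: tests for positive skew against a normal null model.\"\"\"\n", "\n", "    def TestStatistic(self, data):\n", "        # Sample skewness: third central moment over the 1.5 power of the second\n", "        deviations = np.asarray(data) - np.mean(data)\n", "        m2 = np.mean(deviations**2)\n", "        m3 = np.mean(deviations**3)\n", "        return m3 / m2**1.5\n", "\n", "    def MakeModel(self):\n", "        # Assumed null model: normal with the sample's mean and standard deviation\n", "        self.mu = np.mean(self.data)\n", "        self.sigma = np.std(self.data)\n", "        self.n = len(self.data)\n", "\n", "    def RunModel(self):\n", "        # Draw a simulated sample of the same size from the null model\n", "        return np.random.normal(self.mu, self.sigma, size=self.n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": 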
[ "To test this class, I'll generate a sample from an actual Gaussian distribution, so the null hypothesis is true." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mu = 1000\n", "sigma = 35\n", "data = np.random.normal(mu, sigma, size=365)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can make a `SkewTest` and compute the observed skewness." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test = SkewTest(data)\n", "test.actual" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's the p-value." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "test = SkewTest(data)\n", "test.PValue()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And the distribution of the test statistic under the null hypothesis." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "test.PlotHist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most of the time the p-value exceeds 5%, so we would conclude that the observed skewness could plausibly be due to random sampling.\n", "\n", "But let's see how often we get a false positive." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "iters = 100\n", "count = 0\n", "\n", "# Repeat the experiment with fresh Gaussian samples and count rejections at the 5% level\n", "for i in range(iters):\n", "    data = np.random.normal(mu, sigma, size=365)\n", "    test = SkewTest(data)\n", "    p_value = test.PValue()\n", "    if p_value < 0.05:\n", "        count += 1\n", "\n", "print(count/iters)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With only 100 runs this estimate is noisy, but in the long run the false positive rate is the threshold we used, 5%."
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2" } }, "nbformat": 4, "nbformat_minor": 2 }