{ "cells": [ { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# HIDDEN\n", "from datascience import *\n", "%matplotlib inline\n", "import matplotlib.pyplot as plots\n", "plots.style.use('fivethirtyeight')\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Terminology of Testing ###\n", "We have developed some of the fundamental concepts of statistical tests of hypotheses, in the context of examples about jury selection. Using statistical tests as a way of making decisions is standard in many fields and has a standard terminology. Here is the sequence of the steps in most statistical tests, along with some terminology and examples." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 1: The Hypotheses ###\n", "\n", "All statistical tests attempt to choose between two views of the world. Specifically, the choice is between two views about how the data were generated. These two views are called *hypotheses*.\n", "\n", "**The null hypothesis.** This says that the data were generated at random under clearly specified assumptions that make it possible to compute chances. The word \"null\" reinforces the idea that if the data look different from what the null hypothesis predicts, the difference is due to *nothing* but chance.\n", "\n", "In the examples about jury selection in Alameda County, the null hypothesis is that the panels were selected at random from the population of eligible jurors. Though the ethnic composition of the panels was different from that of the populations of eligible jurors, there was no reason for the difference other than chance variation.\n", "\n", "**The alternative hypothesis.** This says that some reason other than chance made the data differ from what was predicted by the null hypothesis. 
Informally, the alternative hypothesis says that the observed difference is \"real.\"\n", "\n", "In our examples about jury selection in Alameda County, the alternative hypothesis is that the panels were not selected at random. Something other than chance led to the differences between the ethnic composition of the panels and the ethnic composition of the populations of eligible jurors. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 2: The Test Statistic ###\n", "\n", "In order to decide between the two hypotheses, we must choose a statistic upon which we will base our decision. This is called the **test statistic**.\n", "\n", "In the example about jury panels in Alameda County, the test statistic we used was the total variation distance between the ethnic distributions in the panels and in the population of eligible jurors. \n", "\n", "Calculating the observed value of the test statistic is often the first computational step in a statistical test. In our example, the observed value of the total variation distance between the distributions in the panels and the population was 0.14. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 3: The Probability Distribution of the Test Statistic, Under the Null Hypothesis ###\n", "\n", "This step sets aside the observed value of the test statistic, and instead focuses on *what the value of the statistic might be if the null hypothesis were true*. Under the null hypothesis, the sample could have come out differently due to chance. So the test statistic could have come out differently. This step consists of figuring out all possible values of the test statistic and all their probabilities, under the null hypothesis of randomness.\n", "\n", "In other words, in this step we calculate the probability distribution of the test statistic under the assumption that the null hypothesis is true. For many test statistics, this can be a daunting task both mathematically and computationally. 
Therefore, we approximate the probability distribution of the test statistic by the empirical distribution of the statistic based on a large number of repetitions of the sampling procedure.\n", "\n", "In our example, we visualized this distribution with a histogram. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 4: The Conclusion of the Test ###\n", "\n", "The choice between the null and alternative hypotheses depends on the comparison between the results of Steps 2 and 3: the observed value of the test statistic and its distribution as predicted by the null hypothesis. \n", "\n", "If the two are consistent with each other, then the observed test statistic is in line with what the null hypothesis predicts. In other words, the test does not point towards the alternative hypothesis; the null hypothesis is better supported by the data.\n", "\n", "But if the two are not consistent with each other, as is the case in our example about Alameda County jury panels, then the data do not support the null hypothesis. That is why we concluded that the jury panels were not selected at random. Something other than chance affected their composition.\n", "\n", "If the data do not support the null hypothesis, we say that the test *rejects* the null hypothesis." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Mendel's Pea Flowers ###\n", "[Gregor Mendel](https://en.wikipedia.org/wiki/Gregor_Mendel) (1822-1884) was an Austrian monk who is widely recognized as the founder of the modern field of genetics. Mendel performed careful and large-scale experiments on plants to come up with fundamental laws of genetics. \n", "\n", "Many of his experiments were on varieties of pea plants. He formulated sets of assumptions about each variety; these are known as *models*. 
He then tested the validity of his models by growing the plants and gathering data.\n", "\n", "Let's analyze the data from one such experiment to see if Mendel's model was good.\n", "\n", "In a particular variety, each plant has either purple flowers or white. The color in each plant is unaffected by the colors in other plants. Mendel hypothesized that the plants should bear purple or white flowers at random, in the ratio 3:1. \n", "\n", "Mendel's model can be formulated as a hypothesis that we can test.\n", "\n", "**Null Hypothesis.** For every plant, there is a 75% chance that it will have purple flowers, and a 25% chance that the flowers will be white, regardless of the colors in all the other plants.\n", "\n", "That is, the null hypothesis says that Mendel's model is good. Any observed deviation from the model is the result of chance variation.\n", "\n", "Of course, there is an opposing point of view.\n", "\n", "**Alternative Hypothesis.** Mendel's model isn't valid.\n", "\n", "Let's see which of these hypotheses is better supported by the data that Mendel gathered." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The table `flowers` contains the proportions predicted by the model, as well as the data on the plants that Mendel grew." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Color | Model Proportion | Plants\n", "
---|---|---\n", "
Purple | 0.75 | 705\n", "
White | 0.25 | 224\n", "
Section | Midterm\n", "
---|---\n", "
1 | 22\n", "
2 | 12\n", "
2 | 23\n", "
2 | 14\n", "
1 | 20\n", "
3 | 25\n", "
4 | 19\n", "
1 | 24\n", "
5 | 8\n", "
6 | 14\n", "
... (349 rows omitted)
\n", " \n", "Section | count\n", "
---|---\n", "
1 | 32\n", "
2 | 32\n", "
3 | 27\n", "
4 | 30\n", "
5 | 33\n", "
6 | 32\n", "
7 | 24\n", "
8 | 29\n", "
9 | 30\n", "
10 | 34\n", "
... (2 rows omitted)
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "section_3_mean = 13.6667\n", "\n", "repetitions = 10000\n", "\n", "means = make_array()\n", "\n", "for i in np.arange(repetitions):\n", " new_mean = scores.sample(27, with_replacement=False).column('Midterm').mean()\n", " means = np.append(means, new_mean)\n", " \n", "emp_p_value = np.count_nonzero(means <= section_3_mean)/repetitions\n", "print('Empirical P-value:', emp_p_value)\n", "results = Table().with_column('Random Sample Mean', means)\n", "results.hist() \n", "\n", "#Plot the observed statistic as a large red point on the horizontal axis\n", "plots.scatter(section_3_mean, 0, color='red', s=30);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the histogram, the low mean in section 3 looks somewhat unusual, but the conventional 5% cut-off gives the GSI's hypothesis the benefit of the doubt. With that cut-off, we say that the result is not statistically significant." ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [Root]", "language": "python", "name": "Python [Root]" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }