{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# P-Hacking\n", "\n", "## Author [Jean-François Puget](https://www.ibm.com/developerworks/community/blogs/jfp)\n", "\n", "This notebook is the code used in my blog post on [Green dice are loaded (welcome to p-hacking)](https://www.ibm.com/developerworks/community/blogs/jfp/entry/green_dice_are_loaded_welcome_to_p_hacking)\n", "\n", "It demonstrates how to mislead people about a scientific experiment by hacking p-values following the methodlogy from this [xkcd comic](http://xkcd.com/882/):\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, some useful imports." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from numpy.random import random_integers, seed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The experiment\n", "\n", "We are given 20 dice colors, and we roll dice for each color 1,000 times. We report the number of six.\n", "We set the random seed to make sure results are reproducible." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Number of Six
Purple151
Brown167
Pink158
Blue167
Teal181
Salmon162
Red170
Turquoise161
Magenta165
Yellow180
Grey172
Tan164
Cyan181
Green188
Mauve165
Beige172
Lilac178
Black176
Peach173
Orange157
\n", "
" ], "text/plain": [ " Number of Six\n", "Purple 151\n", "Brown 167\n", "Pink 158\n", "Blue 167\n", "Teal 181\n", "Salmon 162\n", "Red 170\n", "Turquoise 161\n", "Magenta 165\n", "Yellow 180\n", "Grey 172\n", "Tan 164\n", "Cyan 181\n", "Green 188\n", "Mauve 165\n", "Beige 172\n", "Lilac 178\n", "Black 176\n", "Peach 173\n", "Orange 157" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dice = ['Purple', 'Brown', 'Pink', 'Blue', 'Teal', \n", " 'Salmon', 'Red', 'Turquoise', 'Magenta', 'Yellow', \n", " 'Grey', 'Tan', 'Cyan', 'Green', 'Mauve',\n", " 'Beige', 'Lilac', 'Black', 'Peach', 'Orange']\n", "\n", "def dice_experiments(dice, n):\n", " df = pd.DataFrame(index=dice)\n", " for die in dice:\n", " result = random_integers(1,6,n)\n", " df.ix[die,'Number of Six'] = np.sum(result[result==6])//6\n", " return df\n", "\n", "np.random.seed(75000)\n", "df = dice_experiments(dice, 1000)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The green color has 188 times a six, which is rather high. Indeed, average value is 1,000/6 = 167.\n", "\n", "The probability that green dice have at least 188 six is about 4% as we will see below. This is the *p-value* for the green color result. If we decided to watch the green color before the experiement, then this low p-value would warrant us a scientific publication. Read my [blog](https://www.ibm.com/developerworks/community/blogs/jfp/entry/green_dice_are_loaded_welcome_to_p_hacking) to see why.\n", "\n", "Let's see how unlikely this result is by running our experiment many times." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.041" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def get_fraction_one_color(color, dice, k, n, repeat):\n", " success = 0.0\n", " for experiment in range(repeat):\n", " df = dice_experiments(dice, n)\n", " if df.loc['Green','Number of Six'] >= k:\n", " success += 1\n", " return success/repeat\n", "\n", "get_fraction_one_color('Green', dice, 188, 1000, 1000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is about 4.1%, i.e. rather unlikely. There must be a reason why green dice favor six that much!\n", "\n", "## The Catch\n", "\n", "We misinterpreted what we did and our conclusion is misleading. What we did was to run the experiment for 20 colors, then select the color with the highest numpber of six. The p-value should be the probability that at least one color gets at least 188 six. \n", "\n", "Let's evaluate this probability experimentally." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.556" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def get_fraction_any_color(dice, k, n, repeat):\n", " success = 0.0\n", " for experiment in range(repeat):\n", " df = dice_experiments(dice, n)\n", " if df['Number of Six'].max() >= k:\n", " success += 1\n", " return success/repeat\n", "\n", "get_fraction_any_color(dice, 188, 1000, 1000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This probability is about 56%, i.e. very likely. There is nothing special about our dice afterall.\n", "\n", "This little code may look contrived, but it is not. scientists got publications using a similar methodology, see my [blog](https://www.ibm.com/developerworks/community/blogs/jfp/entry/green_dice_are_loaded_welcome_to_p_hacking) for details." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exact probabilities\n", "\n", "Let's compute the above p-values using probability theory. \n", "\n", "The probability that k out of n are six is given by the function below. Indeed, in order to get exactly $k$ times a six among $n$ rolls, we must first decide where those six appear in the sequence of 1,000 rolls. There are exactly $n$ choose $k = n!/(k!.(n-k)!$ ways to do it. Then for each of these sequence, the probability that all of the $k$ occurrences are a six is $1/6^k$ and the probability that the remaining $n-k$ occurrences are not a six is $(5/6)^{n-k}$. The probability to get exactly $k$ times a six out of $n$ dice rolls is therefore equal to $1/6^k(5/6)^{n-k}n!/(k!(n-k)!)$." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.0066\n" ] } ], "source": [ "from scipy.misc import comb\n", "\n", "def proba_k_among_n(k, n):\n", " p = 1/6\n", " q = 1-p\n", " result = p**k * q**(n-k) * comb(n, k)\n", " return result\n", "\n", "print(\"%0.4f\" % proba_k_among_n(188,1000))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The probability that at least k out of n are six is the sum of the probability for each possible value greater than or equal to k." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.0401\n" ] } ], "source": [ "def proba_at_least_k_among_n(k, n):\n", " proba = 0.0\n", " for i in range(k, n+1):\n", " proba += proba_k_among_n(i, n)\n", " return proba\n", "\n", "print(\"%0.4f\" % proba_at_least_k_among_n(188,1000))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Probability that one specific color gets at least k six out of n is what we just computed." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.040\n" ] } ], "source": [ "def proba_one_color(k, n):\n", " return proba_at_least_k_among_n(k, n)\n", "\n", "print(\"%0.3f\" % proba_one_color(188, 1000))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Probability that at least one color among ncolors gets at least k six out of n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.559\n" ] } ], "source": [ "def proba_at_least_one_color(ncolors, k, n):\n", " p = proba_one_color(k, n)\n", " proba_no_color = (1-p)**ncolors\n", " return 1 - proba_no_color\n", "\n", "print(\"%0.3f\" % proba_at_least_one_color(20, 188, 1000))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is close to what we found experimentally.\n", "\n", "## Using built-in binomial distribution\n", "\n", "A more direct code, using binomial distribution." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from scipy import stats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One experiment gets more than 188 times a six." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.040142990286396493" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stats.binom(1000, 1/6.0).sf(187)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At least one in 20 experiments gets more than 188 times a six." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.55931241410111465" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "1 - (stats.binom(1000, 1/6.0).cdf(187)**20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Ultimate p-hacking\n", "\n", "We wanted to mimick the xkcd comic, hence we looked for a random seed that yields the result we want. This is really bad p-hacking, as we define first the p-value, then find experiment settings that produce it. " ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "75000" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def find_seed(dice, k, n, repeat, rounding):\n", " nseed = 0.0\n", " for seed in range(0,repeat,rounding):\n", " np.random.seed(seed)\n", " df = dice_experiments(dice, n)\n", " m = df['Number of Six'].max()\n", " if m == k and m == df.loc['Green','Number of Six']:\n", " return seed\n", " \n", "res = find_seed(dice, 188, 1000, 1000000, 1000)\n", "res" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" } }, "nbformat": 4, "nbformat_minor": 0 }