{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# P-Hacking\n",
    "\n",
    "## Author [Jean-François Puget](https://www.ibm.com/developerworks/community/blogs/jfp)\n",
    "\n",
    "This notebook is the code used in my blog post on [Green dice are loaded (welcome to p-hacking)](https://www.ibm.com/developerworks/community/blogs/jfp/entry/green_dice_are_loaded_welcome_to_p_hacking)\n",
    "\n",
    "It demonstrates how to mislead people about a scientific experiment by hacking p-values following the methodlogy from this [xkcd comic](http://xkcd.com/882/):\n",
    "\n",
    "<img src=\"http://imgs.xkcd.com/comics/significant.png\"></img>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, some useful imports."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from numpy.random import random_integers, seed"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The experiment\n",
    "\n",
    "We are given 20 dice colors, and we roll dice for each color 1,000 times.  We report the number of six.\n",
    "We set the random seed to make sure results are reproducible."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Number of Six</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Purple</th>\n",
       "      <td>151</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Brown</th>\n",
       "      <td>167</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pink</th>\n",
       "      <td>158</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Blue</th>\n",
       "      <td>167</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Teal</th>\n",
       "      <td>181</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Salmon</th>\n",
       "      <td>162</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Red</th>\n",
       "      <td>170</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Turquoise</th>\n",
       "      <td>161</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Magenta</th>\n",
       "      <td>165</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Yellow</th>\n",
       "      <td>180</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Grey</th>\n",
       "      <td>172</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Tan</th>\n",
       "      <td>164</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Cyan</th>\n",
       "      <td>181</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Green</th>\n",
       "      <td>188</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Mauve</th>\n",
       "      <td>165</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Beige</th>\n",
       "      <td>172</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Lilac</th>\n",
       "      <td>178</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Black</th>\n",
       "      <td>176</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Peach</th>\n",
       "      <td>173</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Orange</th>\n",
       "      <td>157</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           Number of Six\n",
       "Purple               151\n",
       "Brown                167\n",
       "Pink                 158\n",
       "Blue                 167\n",
       "Teal                 181\n",
       "Salmon               162\n",
       "Red                  170\n",
       "Turquoise            161\n",
       "Magenta              165\n",
       "Yellow               180\n",
       "Grey                 172\n",
       "Tan                  164\n",
       "Cyan                 181\n",
       "Green                188\n",
       "Mauve                165\n",
       "Beige                172\n",
       "Lilac                178\n",
       "Black                176\n",
       "Peach                173\n",
       "Orange               157"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dice = ['Purple', 'Brown', 'Pink', 'Blue', 'Teal', \n",
    "        'Salmon', 'Red', 'Turquoise', 'Magenta', 'Yellow', \n",
    "        'Grey', 'Tan', 'Cyan', 'Green', 'Mauve',\n",
    "        'Beige', 'Lilac', 'Black', 'Peach', 'Orange']\n",
    "\n",
    "def dice_experiments(dice, n):\n",
    "    df = pd.DataFrame(index=dice)\n",
    "    for die in dice:\n",
    "        result = random_integers(1,6,n)\n",
    "        df.ix[die,'Number of Six'] = np.sum(result[result==6])//6\n",
    "    return df\n",
    "\n",
    "np.random.seed(75000)\n",
    "df = dice_experiments(dice, 1000)\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The green color has 188 times a six, which is rather high. Indeed, average value is 1,000/6 = 167.\n",
    "\n",
    "The probability that green dice have at least 188 six is about 4% as we will see below.  This is the *p-value* for the green color result. If we decided to watch the green color before the experiement, then this low p-value would warrant us a scientific publication.  Read my [blog](https://www.ibm.com/developerworks/community/blogs/jfp/entry/green_dice_are_loaded_welcome_to_p_hacking) to see why.\n",
    "\n",
    "Let's see how unlikely this result is by running our experiment many times."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.041"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def get_fraction_one_color(color, dice, k, n, repeat):\n",
    "    success = 0.0\n",
    "    for experiment in range(repeat):\n",
    "        df = dice_experiments(dice, n)\n",
    "        if df.loc['Green','Number of Six'] >= k:\n",
    "            success += 1\n",
    "    return success/repeat\n",
    "\n",
    "get_fraction_one_color('Green', dice, 188, 1000, 1000)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is about 4.1%, i.e. rather unlikely. There must be a reason why green dice favor six that much!\n",
    "\n",
    "## The Catch\n",
    "\n",
    "We misinterpreted what we did and our conclusion is misleading.  What we did was to run the experiment for 20 colors, then select the color with the highest numpber of six.  The p-value should be the probability that at least one color gets at least 188 six. \n",
    "\n",
    "Let's evaluate this probability experimentally."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.556"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def get_fraction_any_color(dice, k, n, repeat):\n",
    "    success = 0.0\n",
    "    for experiment in range(repeat):\n",
    "        df = dice_experiments(dice, n)\n",
    "        if df['Number of Six'].max() >= k:\n",
    "            success += 1\n",
    "    return success/repeat\n",
    "\n",
    "get_fraction_any_color(dice, 188, 1000, 1000)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This probability is about 56%, i.e. very likely.  There is nothing special about our dice afterall.\n",
    "\n",
    "This little code may look contrived, but it is not.  scientists got publications using a similar methodology, see my [blog](https://www.ibm.com/developerworks/community/blogs/jfp/entry/green_dice_are_loaded_welcome_to_p_hacking)  for details."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exact probabilities\n",
    "\n",
    "Let's compute the above p-values using probability theory.  \n",
    "\n",
    "The probability that k out of n are six is given by the function below.  Indeed, in order to get exactly $k$ times a six among $n$ rolls, we must first decide where those six appear in the sequence of 1,000 rolls.  There are exactly $n$ choose $k =  n!/(k!.(n-k)!$ ways to do it.  Then for each of these sequence, the probability that all of the $k$ occurrences are a six is $1/6^k$ and the probability that the remaining $n-k$ occurrences are not a six is $(5/6)^{n-k}$.  The probability to get exactly $k$ times a six out of $n$ dice rolls is therefore equal to $1/6^k(5/6)^{n-k}n!/(k!(n-k)!)$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.0066\n"
     ]
    }
   ],
   "source": [
    "from scipy.misc import comb\n",
    "\n",
    "def proba_k_among_n(k, n):\n",
    "    p = 1/6\n",
    "    q = 1-p\n",
    "    result = p**k * q**(n-k) * comb(n, k)\n",
    "    return result\n",
    "\n",
    "print(\"%0.4f\" % proba_k_among_n(188,1000))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The probability that at least k out of n are six is the sum of the probability for each possible value greater than or equal to k."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.0401\n"
     ]
    }
   ],
   "source": [
    "def proba_at_least_k_among_n(k, n):\n",
    "    proba = 0.0\n",
    "    for i in range(k, n+1):\n",
    "        proba += proba_k_among_n(i, n)\n",
    "    return proba\n",
    "\n",
    "print(\"%0.4f\" % proba_at_least_k_among_n(188,1000))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Probability that one specific color gets at least k six out of n is what we just computed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.040\n"
     ]
    }
   ],
   "source": [
    "def proba_one_color(k, n):\n",
    "    return proba_at_least_k_among_n(k, n)\n",
    "\n",
    "print(\"%0.3f\" % proba_one_color(188, 1000))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Probability that at least one color among ncolors gets at least k six out of n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.559\n"
     ]
    }
   ],
   "source": [
    "def proba_at_least_one_color(ncolors, k, n):\n",
    "    p = proba_one_color(k, n)\n",
    "    proba_no_color = (1-p)**ncolors\n",
    "    return 1 - proba_no_color\n",
    "\n",
    "print(\"%0.3f\" % proba_at_least_one_color(20, 188, 1000))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is close to what we found experimentally.\n",
    "\n",
    "## Using built-in  binomial distribution\n",
    "\n",
    "A more direct code, using binomial distribution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from scipy import stats"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One experiment gets more than 188 times a six."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.040142990286396493"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "stats.binom(1000, 1/6.0).sf(187)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At least one in 20 experiments gets more than 188 times a six."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.55931241410111465"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "1 - (stats.binom(1000, 1/6.0).cdf(187)**20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Ultimate p-hacking\n",
    "\n",
    "We wanted to mimick the xkcd comic, hence we looked for a random seed that yields the result we want.  This is really bad p-hacking, as we define first the p-value, then find experiment settings that produce it. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "75000"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def find_seed(dice, k, n, repeat, rounding):\n",
    "    nseed = 0.0\n",
    "    for seed in range(0,repeat,rounding):\n",
    "        np.random.seed(seed)\n",
    "        df = dice_experiments(dice, n)\n",
    "        m = df['Number of Six'].max()\n",
    "        if m == k and m == df.loc['Green','Number of Six']:\n",
    "            return seed\n",
    "    \n",
    "res = find_seed(dice, 188, 1000, 1000000, 1000)\n",
    "res"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}