{
 "metadata": {
  "name": "AB Test Sample Size Calculator"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "This notebook computes an estimate of A/B test duration in a manner similar to Dan McKinley's online tool http://www.experimentcalculator.com.  The calculations are adapted from Casagrande, Pike, and Smith (1978), _An improved approximate formula for calculating sample sizes for comparing two binomial distributions_ [(link to PDF)](http://www.bios.unc.edu/~mhudgens/bios/662/2008fall/casagrande.pdf). \n\nThe adjustable parameters are as follows: \n\n* The number of test subjects expected per day (to be split into 50% control, 50% test)\n* Baseline rate of conversion\n* Percentage change anticipated in the baseline rate\n* Desired false alarm rate\n* Desired probability of correct detection\n\nThe false alarm rate (denoted $\\alpha$) is the statistical significance level for rejecting the null hypothesis. In other words, if it turns out that there really is no difference in our conversion rate in the test group, this is the chance we might think there is one anyways.\n\nThe correct detection rate (denoted $\\beta$, also called the \"power\" of the test) is the probability that, given there _does_ turn out to be a difference between test and control, we would actually correctly notice it."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "import numpy as np\nfrom scipy import stats\n\nn_per_day = 10000  # Number of subjects per day\n\ncontrol_rate = 0.05  # Baseline conversion rate\nrate_diff = 0.03  # Anticipated percent change to baseline rate\n\nalpha = 0.05  # False alarm rate\nbeta = 0.95  # Correct detection rate",
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "Let $p_1$ and $p_2$ be the conversion rates of the two groups (test and control), with the test and control rates assigned such that $p_1 > p_2$."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "test_rate = control_rate*(1.0 + rate_diff)\np1 = max(control_rate, test_rate)\np2 = min(control_rate, test_rate)",
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "We want to test whether $p_1 = p_2$ (the null hypothesis) or $p_1 > p_2$ (the alternate hypothesis).  The number of samples we will need depends on how many successes we expect to see in each of the groups, which in turn depends on $p_1$, $p_2$, our desired false positive rate $\\alpha$, and our desired correct detection rate $\\beta$.\n\nWe compute required sample size using the following approximate formula from Casagrande et al (1978):\n$$n = A \\left[\\frac{1 + \\sqrt{1 + \\frac{4(p_1 - p_2)}{A}}}{2(p_1 - p_2)}\\right]^2$$\nwhere $A$ is a $\\chi^2$ \"correction factor\" given by  \n$$A = \\left[z_{1-\\alpha} \\sqrt{2\\bar{p}(1 - \\bar{p})} + z_{\\beta} \\sqrt{p_1 (1-p_1) + p_2 (1-p_2)} \\right]^2,$$\nwith $\\bar{p} = (p_1+p_2)/2$ and where $z_p$ denotes the standard normal quantile function, i.e. $z_p = \\Phi^{-1}(p)$ is location of the $p$-th quantile for $N(0, 1)$."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "p_bar = (p1 + p2)/2.0\n\n# PPF is the \"percent point function\", a.k.a. quantile function.\n# Note that we divide alpha by 2 if we desire a two-sided test.\nza = stats.norm.ppf(1 - alpha/2)  \nzb = stats.norm.ppf(beta)\n\nA = (za*np.sqrt(2*p_bar*(1-p_bar)) + zb*np.sqrt(p1*(1-p1) + p2*(1-p2)))**2\nn = A*(((1 + np.sqrt(1 + 4*(p1-p2)/A))) / (2*(p1-p2)))**2\nn_act = int(np.ceil(n))\n\n# Print results\nprint('Number of users needed each for control and test groups = {:,d} ({:,d} total)'.format(n_act, 2*n_act))\nprint('Estimated duration at {:,d} subjects per day: {:,d} days'.format(int(n_per_day), int(np.ceil(2.0*n_act/n_per_day))))",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "Number of users needed each for control and test groups = 557,786 (1,115,572 total)\nEstimated duration at 10,000 subjects per day: 112 days\n"
      }
     ],
     "prompt_number": 6
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "To validate these results, we can run $n_s$ trials of the experiment to estimate empirical false alarm and missed detection rates.  To perform the actual statistical test, we compute the 2x2 contigency table for each pair of binomial variates and use [Fisher's exact test](http://en.wikipedia.org/wiki/Fisher's_exact_test).  We can then make a decision for each trial where the null an alternate are actually true, and compute the frequency of false alarms and missed detections."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "dns = 1000\nastr = 'greater' if (rate_diff < 0) else 'less'  # The tail probability we use depends on whether the change is positive or negative\n\n# Experimental results when null is true\ncontrol0 = stats.binom.rvs(n, control_rate, size=ns)\ntest0 = stats.binom.rvs(n, control_rate, size=ns)  # Test and control are the same\ntables0 = [[[a, n_act-a], [b, n_act-b]] for a, b in zip(control0, test0)]  # Contingency tables\nresults0 = [stats.fisher_exact(T, alternative=astr) for T in tables0]\ndecisions0 = [x[1] <= alpha for x in results0]\n         \n# Experimental results when alternate is true\ncontrol1 = stats.binom.rvs(n, control_rate, size=ns)\ntest1 = stats.binom.rvs(n, test_rate, size=ns)  # Test and control are different\ntables1 = [[[a, n_act-a], [b, n_act-b]] for a, b in zip(control1, test1)]  # Contingency tables\nresults1 = [stats.fisher_exact(T, alternative=astr) for T in tables1]\ndecisions1 = [x[1] <= alpha for x in results1]\n\n# Compute false alarm and correct detection rates\nalpha_est = sum(decisions0)/float(ns)\nbeta_est = sum(decisions1)/float(ns)\n\nprint('True false alarm rate = %0.2f, estimated false alarm rate = %0.2f' % (alpha, alpha_est))\nprint('True correct detection rate = %0.2f, estimated correct detection rate = %0.2f' % (beta, beta_est))",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "True false alarm rate = 0.05, estimated false alarm rate = 0.05\nTrue correct detection rate = 0.95, estimated correct detection rate = 0.97\n"
      }
     ],
     "prompt_number": 4
    }
   ],
   "metadata": {}
  }
 ]
}