{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Chapter 4: Expectation\n",
    " \n",
    "This Jupyter notebook is the Python equivalent of the R code in section 4.11 R, pp. 175 - 178, [Introduction to Probability, 1st Edition](https://www.crcpress.com/Introduction-to-Probability/Blitzstein-Hwang/p/book/9781466575578), Blitzstein & Hwang.\n",
    "\n",
    "----"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Geometric, Negative Binomial, and Poisson"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "The three functions for the Geometric distribution in SciPy's [`scipy.stats.geom`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.geom.html#scipy.stats.geom) are `pmf`, `cdf`, and `rvs`, corresponding to the PMF, CDF, and random number generation. For `pmf` and `cdf`, we need to supply the following as inputs: (1) the value `k` at which to evaluate the PMF or CDF, and (2) the parameter `p`. For `rvs`, we need to input (1) the number of random variables to generate and (2) the parameter `p`.\n",
    "\n",
    "For example, to calculate $P(X = 3)$ and $P(X \\leq 3)$ where $X \\sim Geom(0.5)$, we use `geom.pmf(3, 0.5)` and `geom.cdf(3, 0.5)`, respectively. To generate 100 i.i.d. $Geom(0.8)$ r.v.s, we use `geom.rvs(0.8, size=100)`. If instead we want 100 i.i.d. $FS(0.8)$ r.v.s, we just need to add 1 to include the success: `geom.rvs(0.8, size=100) + 1`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "geom.pmf(3, 0.5) = 0.125\n",
      "\n",
      "geom.cdf(3, 0.5) = 0.875\n",
      "\n",
      "geom.rvs(0.8, size=100) = [1 1 1 1 1 1 1 1 2 2 1 1 2 1 2 1 3 1 1 2 1 1 1 1 1 1 3 1 1 1 1 2 1 1 1 1 1\n",
      " 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 3 1 3 1 2 1 1 1 1 3 1 1 2 1 1 1 1 1\n",
      " 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2]\n",
      "\n",
      "geom.rvs(0.8, size=100) + 1 = [2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 3\n",
      " 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 3 2 2 2 2 2 2 4 2 2 3 2 2 3 2 2 3 2\n",
      " 2 3 2 2 2 2 2 3 3 2 3 3 3 2 2 3 2 4 2 2 2 2 3 2 2 2]\n"
     ]
    }
   ],
   "source": [
    "# seed the random number generator\n",
    "np.random.seed(610)\n",
    "\n",
    "from scipy.stats import geom\n",
    "\n",
    "# to learn more about scipy.stats.geom, un-comment ouf the following line\n",
    "#print(geom.__doc__)\n",
    "\n",
    "print('geom.pmf(3, 0.5) = {}'.format(geom.pmf(3, 0.5)))\n",
    "\n",
    "print('\\ngeom.cdf(3, 0.5) = {}'.format(geom.cdf(3, 0.5)))\n",
    "\n",
    "print('\\ngeom.rvs(0.8, size=100) = {}'.format(geom.rvs(0.8, size=100)))\n",
    "\n",
    "print('\\ngeom.rvs(0.8, size=100) + 1 = {}'.format(geom.rvs(0.8, size=100)+1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For the Negative Binomial distribution, we have [`scipy.stats.nbinom`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.nbinom.html#scipy.stats.nbinom) `pmf`, `cdf`, and `rvs`. These take three inputs. For example, to calculate the $NBin(5, 0.5)$ PMF at 3, we type `nbimom.pmf(3, 5, 0.5)`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "nbinom.pmf(3, 5, 0.5) = 0.1367187500000001\n"
     ]
    }
   ],
   "source": [
    "from scipy.stats import nbinom\n",
    "\n",
    "# to learn more about scipy.stats.nbinom, un-comment ouf the following line\n",
    "#print(nbinom.__doc__)\n",
    "\n",
    "print('nbinom.pmf(3, 5, 0.5) = {}'.format(nbinom.pmf(3, 5, 0.5)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, for the Poisson distribution and [`scipy.stats.poisson`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.poisson.html#scipy.stats.poisson), the three functions are `pmf`, `cdf`, and `rvs`. These take two inputs. For example, to find the $Pois(10)$ CDF at 2, we type `poisson.cdf(2, 10)`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "poisson.cdf(2, 10) = 0.0027693957155115775\n"
     ]
    }
   ],
   "source": [
    "from scipy.stats import poisson\n",
    "\n",
    "# to learn more about scipy.stats.poisson, un-comment ouf the following line\n",
    "#print(poisson.__doc__)\n",
    "\n",
    "print('poisson.cdf(2, 10) = {}'.format(poisson.cdf(2, 10)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Matching simulation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Continuing with Example 4.4.4, let's use simulation to calculate the expected number of matches in a deck of cards. As in Chapter 1, we let $n$ be the number of cards in the deck and perform the experiment 10<sup>4</sup> times using iteration with for-loop. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.random.seed(987)\n",
    "\n",
    "n = 100\n",
    "trials = 10**4\n",
    "\n",
    "ordered = np.arange(1, n+1)\n",
    "\n",
    "r = []\n",
    "for i in range(trials):    \n",
    "    shuffled = np.random.permutation(np.arange(1, n+1))\n",
    "    m = np.sum(shuffled == ordered)\n",
    "    r.append(m)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now $r$ contains the number of matches from each of the 10<sup>4</sup> simulations. But instead of looking at the probability of at least one match, as in Chapter 1, we now want to find the expected number of matches. We can approximate this by the average of all the simulation results, that is, the arithmetic mean of the elements of $r$. This is accomplished with the [`numpy.mean`](https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.mean.html) function: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.0127"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.mean(r)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The command `numpy.mean(r)` is equivalent to `numpy.sum(r)/len(r)`. The result we get is very close to 1, confirming the calculation we did in Example 4.4.4 using indicator r.v.s. You can verify that no matter what value of $n$ you choose, `numpy.mean(r)` will be very close to 1."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Distinct birthdays simulation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's calculate the expected number of distinct birthdays in a group of $k$ people by simulation. We'll let $k = 20$, but you can choose whatever value of $k$ you like."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.random.seed(1597)\n",
    "\n",
    "k = 20\n",
    "trials = 10**4\n",
    "\n",
    "r = []\n",
    "for i in range(trials):\n",
    "    bdays = np.random.choice(np.arange(1,365+1), k)\n",
    "    uniqs = len(np.unique(bdays))\n",
    "    r.append(uniqs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use a for loop to iterate 10<sup>4</sup> times, so we just need to understand what is inside the body of the for loop. First, we sample `k` times with replacement from the numbers 1 through 365 and call these the birthdays of the `k` people, `bdays`. Then, [`numpy.unique(bdays)`](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.unique.html) removes duplicates in the array `bdays`, and `len(numpy.unique(bdays))` returns the length of the array after duplicates have been removed (number of unique birthdays). This count of unique birthdays is appended to array `r`. \n",
    "\n",
    "Now `r` contains the number of distinct birthdays that we observed in each of the 10<sup>4</sup> simulations. The average number of distinct birthdays across the 10<sup>4</sup> simulations is `numpy.mean(r)`. We can compare the simulated value to the theoretical value that we found in Example 4.4.5 using indicator r.v.s:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "simulated: 19.4921\n",
      "theoretical: 19.487910239138333\n"
     ]
    }
   ],
   "source": [
    "simulated = np.mean(r)\n",
    "theoretical = 365*(1-(364/365)**k)\n",
    "\n",
    "print('simulated: {}'.format(simulated))\n",
    "print('theoretical: {}'.format(theoretical))     "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When we ran the code, both the simulated and theoretical values gave us approximately 19.5."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "&copy; Blitzstein, Joseph K.; Hwang, Jessica. Introduction to Probability (Chapman & Hall/CRC Texts in Statistical Science)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}