{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Homework 6\n",
    "====\n",
    "#### CHE 116: Numerical Methods and Statistics\n",
    "Version 1.0 (2/17/2016)\n",
    "\n",
    "----"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. Misc Distribution Problems (10 Points)\n",
    "====\n",
    "Answer symbolically first, indicating what equations your Python program is using, and then compute the answer in Python. If not specified, say which distribution you're assuming.\n",
    "\n",
    "1. [1] The time between traffic tickets is exponentially distributed. Based on past experience, you receive a traffic ticket about every 3 years. What's the probability of having a ticket within 12 months? For 2 bonus points, what's the probability of having 2 tickets within 12 months? Use scipy stats.\n",
    "\n",
    "2. [2] You see two deer per day. How many days must pass before you have a 99% of having seen a deer? Answer in days, hours, and minutes.\n",
    "\n",
    "3. [1] The expected score on a test is 90% with a standard deviation of 15%. You cannot receive more than 100% on this test. What's the probability failing (< 60%)?\n",
    "\n",
    "4. [2] Using the above parameters, what's the probability of getting an A (93%-100%)?\n",
    "\n",
    "7. [4] Using the definition of expected value, write a for loop that computes the expected value of a binomial distribution with $N = 10$ and $p = 0.3$. Do not use scipy stats. Compre with the fomula $E[x] = pN$ for binomial."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1.1 Answer\n",
    "**One ticket:**\n",
    "We are being asked:\n",
    "\n",
    "$$\\int_0^{12} \\lambda e^{-\\lambda t}\\, dt$$\n",
    "\n",
    "where $\\lambda = \\frac{1}{36}$, based on the prompt. The answer is $0.28$\n",
    "\n",
    "**Two tickets :**\n",
    "We can view this as kind of a binomial distribution, where each month we can get a traffic ticket. That would give these parameters:\n",
    "\n",
    "$$N = 12$$\n",
    "$$ p = \\frac{1}{36}$$\n",
    "\n",
    "$$P(x = 2) = \\binom{x}{N} p^x \\,\\left(1 - p\\right)^{N - x} = 0.03$$\n",
    "\n",
    "However, of course it's possible to get two tickets in a month. SO, let's try breaking it down by day:\n",
    "\n",
    "$$N = 365$$\n",
    "$$ p = \\frac{1}{3\\times365}$$\n",
    "\n",
    "$$P(x = 2) = \\binom{x}{N} p^x \\,\\left(1 - p\\right)^{N - x} = 0.0398$$\n",
    "\n",
    "Ah, we see that it's approximately the same. To go to the extreme, we can use a Poisson distribution, where $\\mu = \\frac{1}{3}$. That gives $0.0398$, which is the same. To make sure we're sane, we can check if the two distributions are consistent. This Poisson distribution gives 0.24 (instead of 0.28) for a single ticket. Sort of close. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.283468689426\n",
      "0.0384232741801\n",
      "0.0397647849595\n",
      "0.0398072950319\n",
      "0.238843770191\n"
     ]
    }
   ],
   "source": [
    "from scipy import stats as ss\n",
    "\n",
    "print(ss.expon.cdf(12, scale=36))\n",
    "print(ss.binom.pmf(2, p=1 / 36, n=12))\n",
    "print(ss.binom.pmf(2, p=1 / (3 * 365), n=365))\n",
    "print(ss.poisson.pmf(2, mu=1 / 3))\n",
    "print(ss.poisson.pmf(1, mu=1 / 3))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "ss.binom?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1.2 Answer\n",
    "\n",
    "This is an exponential distribution with $\\lambda = \\frac{2}{24 \\times 60}$. We are being aksed\n",
    "\n",
    "$$P(t > a) = 0.99$$\n",
    "\n",
    "which is 2 days, 7 hours, and 16 minutes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2 7 15.7225339114\n"
     ]
    }
   ],
   "source": [
    "result = ss.expon.ppf(0.99, scale = 24 * 60 / 2)\n",
    "days = int(result / 24 / 60)\n",
    "hours = int((result / 60 - days * 24))\n",
    "minutes = (result - days * 24 * 60 - hours * 60)\n",
    "print(days, hours, minutes)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1.3 Answer\n",
    "\n",
    "We are being asked:\n",
    "\n",
    "$$\\int_{-\\infty}^{60} \\cal{N}(90, 12)$$\n",
    "\n",
    "which is 2.3%"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.022750131948179195"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ss.norm.cdf(60, scale=15, loc=90)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1.4 Answer\n",
    "\n",
    "We are being asked:\n",
    "\n",
    "$$\\int_{93}^{\\infty} \\cal{N}(90, 12)$$\n",
    "\n",
    "The reason it goes to infinity is that many who got 100%, would have gotten higher scores had the test allowed it. So they would be greater than 100% if the test scores truly followed a normal distribution. The answer is 42%."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.42074029056089701"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "1 - ss.norm.cdf(93, scale=15, loc=90)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1.7 Answer\n",
    "\n",
    "The definition of expected value is:\n",
    "\n",
    "$$E[x] = \\sum_0^N x P(x)$$\n",
    "\n",
    "where $P(x)$ is the binomial distribution for us"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3.0 3.0\n"
     ]
    }
   ],
   "source": [
    "from scipy.special import comb\n",
    "\n",
    "sum = 0\n",
    "N = 10\n",
    "p = 0.3\n",
    "\n",
    "for i in range(0,N+1):\n",
    "    sum += i * comb(N, i) * p**i * (1 - p)**(N - i)\n",
    "print(sum, p * N)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. CLT Theory (4 Points)\n",
    "====\n",
    "Indicate if the CLT applies with yes or no. If no, state why.\n",
    "\n",
    "1. You measure the density of a solution 50 times and take the average\n",
    "2. You sum the number of students who attended the last 20 lectures\n",
    "3. Flip a coin 25 times and consider a heads 0 and a tails 1 and take the average\n",
    "4. Your final grade, which is the average of 25 homeworks, tests, and a project"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2.1 Answer\n",
    "\n",
    "CLT applies. Enough samples and you took an average\n",
    "\n",
    "#### 2.2 Answer\n",
    "\n",
    "CLT applies. It's proportional to an average (you didn't divide by 20)\n",
    "\n",
    "\n",
    "#### 2.4 Answer\n",
    "\n",
    "CLT applies\n",
    "\n",
    "#### 2.5 Answer\n",
    "\n",
    "CLT applies. You have 27 rvs averaged together"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "3. Confidence Intervals (12 Points)\n",
    "====\n",
    "Report the given confidence interval for error in the mean using the data in the next cell and describe in words what the confidence interval is for each example\n",
    "\n",
    "1. 80% Double\n",
    "2. 99% Upper ( a value such that the mean lies above that value 99% of the time)\n",
    "3. 95% Double\n",
    "4. Redo part 3 with a known standard deviation of 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "data_3_1 = [93.14,94.66, 102.1, 79.98, 96.85, 106.79, 101.92, 91.99, 97.22, 99.1, 88.7, 123.66, 99.7, 115.03, 99.28, 114.59, 102.25, 88.4, 111.06, 75.19, 107.32, 81.21, 100.49, 109.04, 105.09, 96.17, 78.13, 98.37, 104.47, 95.41]\n",
    "data_3_2 = [2.24,3.86, 2.19, 1.5, 2.34, 2.55, 1.8, 3.99, 2.64, 3.8]\n",
    "data_3_3 = [53.43,50.49, 52.55, 51.73]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "#### 3.1 Answer\n",
    "\n",
    "This is a normal distribution because $N > 25$. Our interval will run from 10% to 90% to have an area of 80%"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "95.9671913504 to 101.18680865\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "from scipy import stats as ss\n",
    "\n",
    "sample_mean = np.mean(data_3_1)\n",
    "sample_std = np.std(data_3_1, ddof=1)\n",
    "Zlo = ss.norm.ppf(0.1)\n",
    "Xlo = Zlo * sample_std / np.sqrt(len(data_3_1)) + sample_mean\n",
    "Xhi = -Zlo * sample_std / np.sqrt(len(data_3_1)) + sample_mean\n",
    "\n",
    "print(Xlo, 'to', Xhi)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.2 Answer\n",
    "\n",
    "This is a t distribution because $N < 25$. Our interval will run from 1% to 100% to have an area of 99%"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.91493832637\n"
     ]
    }
   ],
   "source": [
    "sample_mean = np.mean(data_3_2)\n",
    "sample_std = np.std(data_3_2, ddof=1)\n",
    "Tlo = ss.t.ppf(0.01, len(data_3_2) - 1)\n",
    "Xlo = Tlo * sample_std /np.sqrt(len(data_3_2)) + sample_mean\n",
    "print(Xlo)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.3 Answer\n",
    "\n",
    "This is a t distribution because $N < 25$. Our interval will run from 2.5% to 97.5% to have an area of 95%"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "50.3141851129 to 53.7858148871\n"
     ]
    }
   ],
   "source": [
    "sample_mean = np.mean(data_3_3)\n",
    "sample_std = np.std(data_3_3, ddof=1)\n",
    "Tlo = ss.t.ppf(0.025, len(data_3_3) - 1)\n",
    "Xlo = Tlo * sample_std / np.sqrt(len(data_3_3)) + sample_mean\n",
    "Xhi = -Tlo * sample_std / np.sqrt(len(data_3_3)) + sample_mean\n",
    "print(Xlo, 'to', Xhi)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.4 Answer\n",
    "\n",
    "We can now use a normal distribution because we know the true standard deviation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "50.0900360155 to 54.0099639845\n"
     ]
    }
   ],
   "source": [
    "sample_mean = np.mean(data_3_3)\n",
    "true_std = 2\n",
    "Zlo = ss.norm.ppf(0.025)\n",
    "Xlo = Zlo * true_std / np.sqrt(len(data_3_3)) + sample_mean\n",
    "Xhi = -Zlo * true_std / np.sqrt(len(data_3_3)) + sample_mean\n",
    "print(Xlo, 'to', Xhi)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "4. Sample Statistics (17 Points)\n",
    "=====\n",
    "\n",
    "Answer the following questions using the data given in the next cell.\n",
    "\n",
    "1. [1] What is the sample correlation between $X$ and $Y$?\n",
    "4. [5] Write your own method to compute sample covariance using a for loop. Use the code in the second cell below as a starting point. Do  not use any numpy methods except to check your answer. *Hint: You will need to use two loops*\n",
    "3. [5 + 1] What is the median of $Y$? Use Python and Numpy. Be careful if you use the sort method, since it will permanently alter your $Y$ array.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "X = [1.6,0.4, -1.05, -0.08, 0.99, -1.89, 0.29, 0.71, -0.47, 1.15]\n",
    "Y = [3.59,1.49, -2.57, -0.0, 2.0, -3.48, 0.14, 1.38, -1.48, 2.6]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.6 3.59\n",
      "0.4 1.49\n",
      "-1.05 -2.57\n",
      "-0.08 -0.0\n",
      "0.99 2.0\n",
      "-1.89 -3.48\n",
      "0.29 0.14\n",
      "0.71 1.38\n",
      "-0.47 -1.48\n",
      "1.15 2.6\n"
     ]
    }
   ],
   "source": [
    "for xi, yi in zip(X, Y): \n",
    "    print(xi, yi)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 4.1 Answer\n",
    "\n",
    "$$r_{xy} = -0.16$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.985277421286\n"
     ]
    }
   ],
   "source": [
    "ans = np.corrcoef(X, Y, ddof=1)\n",
    "print(ans[0,1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 4.2 Answer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2.4106833333333326\n"
     ]
    }
   ],
   "source": [
    "#compute the mean first\n",
    "xmean = 0\n",
    "ymean = 0\n",
    "for xi, yi in zip(X, Y): \n",
    "    xmean += xi\n",
    "    ymean += yi\n",
    "    \n",
    "xmean /= len(X)\n",
    "ymean /= len(Y)\n",
    "\n",
    "#now compute covariance using our previous calculation.\n",
    "cov = 0\n",
    "for xi, yi in zip(X, Y): \n",
    "    cov += (xi - xmean) * (yi - ymean)\n",
    "cov /= len(X) - 1\n",
    "\n",
    "print(cov)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 4.3 Answer\n",
    "\n",
    "It is an even number, so there is no center element"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.76\n"
     ]
    }
   ],
   "source": [
    "from math import *\n",
    "\n",
    "YSort = Y[:]\n",
    "YSort.sort()\n",
    "N = len(YSort)\n",
    "print((YSort[int(N / 2)] + YSort[int(N / 2 - 1)]) / 2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.4.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}