{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Error Propogation\n", "====\n", "\n", "## Unit 13, Lecture 2\n", "\n", "*Numerical Methods and Statistics*\n", "\n", "----\n", "\n", "#### Prof. Andrew White, April 26 2018" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Goals:\n", "---\n", "\n", "0. Be able to propogate uncertainty using partial derivatives\n", "1. Account for sampling bias when designing experiments\n", "2. Realize that correlation is used to disprove causation, not prove it\n", "3. Design tests to disprove your hypothesis, not prove it\n", "4. Understand the effects of dependent samples" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "%matplotlib inline\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from math import sqrt, pi, erf\n", "import seaborn\n", "seaborn.set_context(\"talk\")\n", "seaborn.set_style(\"whitegrid\")\n", "import scipy.stats" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Estimating Statistical Uncertainty\n", "===\n", "\n", "Assume for a moment that $X$ is a random variable with a PDF $P(X = x)$. Next, we have $y = f(x)$ and we want to know what $P(y)$ is.\n", "For example, let's say $x$ is mass measured on a balance, $y$ is the number of moles, and $f(x) = x \\,/\\, \\mathrm{MW}$, $\\mathrm{MW}=18$. If we have 5 measurements of $x$ that are [3.5g, 3.4g, 3.3g, 3.6g, 3.2g], What is the confidence interval for $y$?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Error Propogation\n", "---\n", "\n", "The method is to propogate standard deviation or confidence interval widths using derivatives. We use the following approximation:\n", "\n", "$$\\frac{\\Delta y}{\\Delta x} \\approx \\left.\\frac{dy}{dx}\\right|_{x=\\hat{x}}$$\n", "\n", "which is valid when $\\Delta x$ is small relative to $\\frac{dy}{dx}$.\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "This approximation can be rewritten as:\n", "\n", "$$\\Delta y \\approx \\left.\\frac{dy}{dx}\\right|_{x=\\hat{x}}\\Delta x$$\n", "\n", "so a distance in $x$ can be turned into a distance in $y$ using the derivative. To solve our example, we'll get a confidence interval in $x$ and turn it into a confidence interval in $y$. Notice that we evaluate the derivative at the sample-mean (the most likely point)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.196324316148\n", "3.4\n" ] } ], "source": [ "data = np.array([3.5, 3.4, 3.3, 3.6, 3.2])\n", "sigma_x = sqrt(np.var(data, ddof=1))\n", "mean = np.mean(data)\n", "\n", "T_hi = scipy.stats.t.ppf(0.975, len(data) - 1)\n", "ci_width = T_hi * sigma_x / sqrt(len(data))\n", "\n", "print(ci_width)\n", "print(mean)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "So the mean in $x$ is $3.4 \\pm 0.18$. Then we find the expected $y$ using the $f(x)$ equation:\n", "\n", "$$\\hat{y} = \\frac{\\hat{x}}{18}$$\n", "\n", "and get the confidence interval width using our new distance conversion formula\n", "\n", "$$w_y = \\frac{1}{18} w_x$$" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.188888888889\n", "0.0109069064527\n" ] } ], "source": [ "print(mean / 18.)\n", "print(ci_width / 18.)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "So value for the number of moles is $0.19 \\pm 0.01$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Fraction Uncertainty Example\n", "===\n", "\n", "We measure the volume of a sample with known mass. We obtain the following volumes: [1.2 ml, 0.9 ml, 1.3 ml, 1.0 ml, 1.3 ml]. The mass is 1g. What is the density and what is the uncertainty?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "$$\\rho = \\frac{1.0}{v}$$\n", "\n", "$$\\Delta \\rho = \\left.\\frac{d\\rho}{dv}\\right|_{v = \\hat{v}} \\Delta v = -\\frac{1.0}{\\hat{v}^2} \\Delta v$$" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.22555946663\n", "1.14\n" ] } ], "source": [ "data = np.array([1.2, 0.9, 1.3, 1.0, 1.3])\n", "sigma_x = sqrt(np.var(data, ddof=1))\n", "mean = np.mean(data)\n", "\n", "T_hi = scipy.stats.t.ppf(0.975, len(data) - 1)\n", "v_width = T_hi * sigma_x / sqrt(len(data))\n", "\n", "print(v_width)\n", "print(mean)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.877192982456 -0.173560685311\n" ] } ], "source": [ "rho_width = -1.0 / mean**2 * v_width\n", "\n", "print(1.0 / mean, rho_width)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Notice that the negative sign has no influence on the confidence inerval. The answer is $\\rho = 0.88 \\pm 0.16$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Error Propogation in Multiple Dimensions\n", "===" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The general formula for $N$-dimensions is:\n", "\n", "$$\\Delta f(x_1, \\ldots, x_N) = \\sqrt{\\sum_i^N \\left(\\left.\\frac{\\partial f(x_1, \\ldots, x_N)}{\\partial x_i}\\right|_{f(\\hat{x})}\\right)^2 \\left(\\Delta x_i\\right)^2}$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "So, for examlpe, in 2D it would be:\n", "\n", "$$\\Delta f(x,y) = \\sqrt{\\left(\\left.\\frac{\\partial f(x,y)}{\\partial x}\\right|_{f(\\hat{x}, \\hat{y})}\\right)^2 \\left(\\Delta x\\right)^2 + \\left(\\left.\\frac{\\partial f(x,y)}{\\partial y}\\right|_{f(\\hat{x}, \\hat{y})}\\right)^2 \\left(\\Delta y\\right)^2 }$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Example\n", "----\n", "\n", "The formula for density is:\n", "\n", "$$\\rho = \\frac{m}{v}$$\n", "\n", "and I have the following measurements:\n", "\n", "* $m = $ [3.5g, 3.4g, 3.3g, 3.6g, 3.2g]\n", "* $v = $ [1.2 ml, 0.9 ml, 1.3 ml, 1.0 ml, 1.3 ml]\n", "\n", "\n", "What is the 95% confidence interval for density?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The partial derviatves are:\n", "\n", "$$\\frac{\\partial \\rho}{\\partial m} = \\frac{1}{v}$$\n", "\n", "$$\\frac{\\partial \\rho}{\\partial v} = -\\frac{m}{v^2}$$" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.196324316148 0.22555946663\n" ] } ], "source": [ "masses = np.array([3.5, 3.4, 3.3, 3.6, 3.2])\n", "volumes = np.array([1.2, 0.9, 1.3, 1.0, 1.3])\n", "\n", "m_ci_width = scipy.stats.t.ppf(0.975, len(masses) - 1) * np.sqrt(np.var(masses, ddof=1) / len(masses))\n", "\n", "v_ci_width = scipy.stats.t.ppf(0.975, len(volumes) - 1) * np.sqrt(np.var(volumes, ddof=1) / len(volumes))\n", "\n", "print(m_ci_width, v_ci_width)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Now, we check if the errors are small relative to the partial derivatives" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.877192982456 0.196324316148\n", "2.6161895968 0.22555946663\n" ] } ], "source": [ "m = np.mean(masses)\n", "v = np.mean(volumes)\n", "\n", "dm = (1.0 / v)\n", "dv = (m / v**2)\n", "\n", "print(dm, m_ci_width)\n", "print(dv, v_ci_width)\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Looks good enough!" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2.98245614035 0.6147220918203058\n" ] } ], "source": [ "drho = sqrt( dm**2 * m_ci_width**2 + dv**2 * v_ci_width**2)\n", "\n", "print(m / v, drho)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "So the density is $$2.98 \\pm 0.57 \\; \\textrm{g / ml}$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Statistical Intuition\n", "===" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Sampling Bias\n", "====\n", "\n", "Sampling bias is when your samples are NOT independnet from one another, possibly due to some hidden variable.\n", "\n", "* Mailed surveys (only weirdos will fill out and return a mailed in survey, giving a biased sample)\n", "* Telephone voting surveys (Older people will sit by their home phone and answer the survey. Young people only have cellphones and don't have time to answer surverys)\n", "* Repeated measurements (If you read the value off a balance 100 times, those aren't independent)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Experimantal Randomization\n", "====\n", "\n", "A common sampling bias is the accidental conflation of experimental variables. For example, let's say you're studying a protein assay and want to know if it depends on temperature. So you prepare your experiments like this:\n", "\n", "
Day | Replicate Number | Temperature | \n", "
Monday | 0 | 25$^\\circ{}$C | \n", "
Monday | 1 | 25$^\\circ{}$C | \n", "
Monday | 2 | 25$^\\circ{}$C | \n", "
Monday | 3 | 25$^\\circ{}$C | \n", "
Tuesday | 0 | 30$^\\circ{}$C | \n", "
Tuesday | 1 | 30$^\\circ{}$C | \n", "
Tuesday | 2 | 30$^\\circ{}$C | \n", "
Tuesday | 3 | 30$^\\circ{}$C | \n", "
Wednesday | 0 | 35$^\\circ{}$C | \n", "
Wednesday | 1 | 35$^\\circ{}$C | \n", "
Wednesday | 2 | 35$^\\circ{}$C | \n", "
Wednesday | 3 | 35$^\\circ{}$C | \n", "