{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Homework 3\n", "\n", "## Intro to Statistics - Part 2\n", "
\n", "This notebook is arranged in cells. Text is usually written in Markdown cells, where you can also use HTML tags (to make it bold, italic, colored, etc.). You can double-click on this cell to see the formatting.
\n", "
\n", "Ellipses (...) are provided where you are expected to write your solution, but feel free to change the template (within reason) if this style is not to your taste.
\n", "
\n", "Hit \"Shift-Enter\" on a code cell to evaluate it. Double-click a Markdown cell to edit it.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "### Link Okpy" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from client.api.notebook import Notebook\n", "ok = Notebook('hw3.ok')\n", "_ = ok.auth(inline = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Imports" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "from scipy.integrate import quad\n", "# For plotting\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Problem 1 - Fitting Gaussian Contours to 2D Data\n", "\n", "The Gaussian distribution plays a central role in statistics and is the most ubiquitous distribution in physics. It often provides a good approximation to the true probability density function (pdf) even in cases where its application is not strictly correct.

\n", "In this problem, suppose that you have measured 1000 pairs of values $(x_1, y_1), ... , (x_{1000}, y_{1000})$ of two variables $x, y$. You saved these measurements to a .dat file (\"Problem1_data.dat\"). Plot their 1-dimensional pdfs and determine how well a Gaussian pdf can approximate them. Compute the mean, variance, median, mode, 68% and 95% confidence intervals, and correlation coefficient." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " 1. Plot the 1-dimensional pdf for $x$.
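\n", "\n", "For instance, a minimal sketch of such a plot (illustrative only, not the required solution; recent matplotlib uses density=True where older versions use normed=True):\n", "\n", "```python\n", "# Sketch: normalized histogram of x (about 25 of the 1000 points per bin)\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "x = np.loadtxt(\"Problem1_data.dat\")[:, 0]\n", "plt.hist(x, bins=40, density=True, rwidth=0.8)\n", "plt.xlabel(\"$x$\")\n", "plt.ylabel(\"probability density\")\n", "plt.show()\n", "```\n", "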
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Load a given 2D data\n", "data = np.loadtxt(\"Problem1_data.dat\")\n", "x = data[:,0]\n", "y = data[:,1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Plot a normalized histogram (Hint - https://matplotlib.org/devdocs/api/_as_gen/matplotlib.pyplot.hist.html)\n", "# Try to have 25 elements per each bin\n", "\n", "..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " 2. Calculate mean, variance, and median of $x$. First, do it \"by hand\" without using any in-built functions. Then, check your answers using in-built functions from numpy. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Calculating \"by hand\"\n", "mean_x = ...\n", "variance_x = ...\n", "\n", "median_x = ...\n", "\n", "print(\"For x, mean = \", mean_x, \", variance = \", variance_x, \", and median = \", median_x)\n", "\n", "# Using in-built functions from numpy\n", "mean_x = ...\n", "variance_x = ...\n", "median_x = ...\n", "\n", "print(\"For x, mean = \", mean_x, \", variance = \", variance_x, \", and median = \", median_x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " 3. Smoothly interpolate the discrete probability density from Part 1. Then, find the mode and symmetric 68%, 95% confidence intervals. (Suggestion - Read https://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram.html and https://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.UnivariateSpline.html)

\n", "(Hint - For the 68% confidence interval, find the range within which 68% of the given sample lies. We assume that this interval is placed symmetrically around the mean.
\n", "In other words, find $a$ such that\n", "$$ 0.68 = \\int_{\\mu-a}^{\\mu+a} P(x)\\,dx $$\n", "where $P(x)$ is the pdf of $x$, and $\\mu$ is the mean.
\n", "One way to find $a$ is to define a cumulative distribution function (cdf) $G(x)$ and find $a$ such that $G(\\mu+a)-G(\\mu-a) = .68$.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# NOTE: The following skeleton code is only a suggestion.\n", "from scipy.interpolate import UnivariateSpline\n", "\n", "p, xvar = np.histogram( ... , normed = True)\n", "# Find x values at the center of each bin (call it xvar)\n", "xvar = xvar[:-1] + (xvar[1] - xvar[0])/2\n", "\n", "# Find the interpolated function f\n", "f = UnivariateSpline(xvar, p, s=n)\n", "# Change the amount of smoothing\n", "f.set_smoothing_factor(1.e-8)\n", "\n", "# Plot both histogram and interpolated function\n", "plt.hist( ... , normed = True, rwidth = 0.8)\n", "plt.plot( ... , '--', label = \"Interpolated\")\n", "...\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Find the mode\n", "mode_x = ...\n", "\n", "print(\"For x, mode = \", mode_x)\n", "\n", "# Find 68% and 95% confidence intervals\n", "\n", "..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Assuming that the distribution is Gaussian, 68% and 95% confidence interval corresponds to $\\mu \\pm 1\\sigma$ and $\\mu \\pm 2\\sigma$. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "print(\"Assuming Gaussian distribution, 68% confidence interval is\", mean_x, \"±\", np.sqrt(variance_x), \n", " \", and 95% interval is\", mean_x, \"±\", 2*np.sqrt(variance_x))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should find that the Gaussian distribution is a reasonable approximation in this case." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " 4. Plot Gaussian distribution with the mean and variance from Part 1 on top of probability density histogram. Make sure to label each plot. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Define Gaussian distribution\n", "def gaussian(x, mu, sigma):\n", " return ..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Plot histogram\n", "...\n", "# Plot Gaussian distribution on top\n", "..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " 5. Repeat part 1-4 for $y$. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Find mean, variance and median\n", "...\n", "# Plot histogram (discrete pdf) and interpolate it\n", "...\n", "# Find mode and 68%, 95% confidence interval\n", "...\n", "# Plot Gaussian fit on top of the histogram\n", "..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " 6. Make a 2-d scatter plot with Gaussian contours (ellipses). Then, compute the covariance ($C_{xy}$) of $x$ and $y$ as well as the correlation coefficient $\\rho = \\frac{C_{xy}}{\\sigma_{x}\\sigma_{y}}$. 
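\n", "\n", "If you want to cross-check your by-hand results, numpy's built-ins give the same quantities directly (a sketch, assuming x and y are the loaded data columns):\n", "\n", "```python\n", "# Sketch: covariance and correlation coefficient via numpy built-ins\n", "import numpy as np\n", "\n", "cov_xy = np.cov(x, y)[0, 1]       # sample covariance C_xy\n", "rho_xy = np.corrcoef(x, y)[0, 1]  # correlation coefficient\n", "print(cov_xy, rho_xy)\n", "```\n", "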
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Compute the covariance and correlation coefficient\n", "cov = ...\n", "\n", "print(\"The covariance between x and y is\", cov)\n", "\n", "print(\"The correlation coefficient of x and y is\", ...)\n", "\n", "# Make a 2-d scatter plot with Gaussian contours\n", "\n", "import matplotlib.mlab as mlab\n", "\n", "# Create coordinate matrices from coordinate vectors.\n", "gridx = np.linspace(9000, 11000, 100)\n", "gridy = np.linspace(-1500, 1500, 100)\n", "X, Y = np.meshgrid(gridx, gridy)\n", "\n", "# Create bivariate Gaussian distribution for equal shape X, Y (https://matplotlib.org/api/mlab_api.html)\n", "Z = mlab.bivariate_normal(X, Y, np.sqrt(variance_x), np.sqrt(variance_y), mean_x, mean_y, cov)\n", "\n", "# Make plot\n", "plt.figure(figsize = (10, 7))\n", "# Scatter plot\n", "plt.scatter(x, y, marker = 'x')\n", "# Gaussian contour plots\n", "plt.contour(X, Y, Z)\n", "\n", "plt.xlabel('$x$')\n", "plt.ylabel('$y$')\n", "plt.show()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above contour plot is a bird eye view of the 3-d mesh plot; these are ellipses of equal probability. The coloring represents the intensity. Yellow central ellipse is the region of highest probability; the peak of 2-d Gaussian distribution is at the center of this ellipse. As we move away from the peak, the probability lowers." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Problem 2 - Central Limit Theorem\n", "\n", "Plot the binomial distribution $P(N_A, N)$ for different values of $N$ and plot the Gaussian with mean and variance for the binomial. Similarly, plot the Poisson distribution with the mean varying from 1 to 10. See if both binomial and Poisson approach Gaussian as the mean/$N$ increases.

\n", "(Reference - Kardar p. 41) For the binomial distribution, consider a random variable with two outcomes $A$ and $B$ of relative probabilities $p_A$ and $p_B = 1 - p_A$. The probability that in $N$ trials the event $A$ occurs exactly $N_A$ times is given by the binomial distribution:\n", "$$ p_N(N_A) = \\binom{N}{N_A} p_A^{N_A}(1-p_A)^{N-N_A}. $$\n", "
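\n", "\n", "For reference, this pmf is straightforward to evaluate numerically, e.g. with scipy.special.binom (a sketch with illustrative names, not the required solution):\n", "\n", "```python\n", "# Sketch: binomial pmf p_N(N_A) for one choice of N and p_A\n", "import numpy as np\n", "from scipy.special import binom\n", "\n", "def binom_pmf(N_A, N, p_A):\n", "    return binom(N, N_A) * p_A**N_A * (1 - p_A)**(N - N_A)\n", "\n", "N_A = np.arange(0, 21)\n", "print(binom_pmf(N_A, 20, 0.5).sum())  # sums to 1 over N_A = 0..N\n", "```\n", "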
\n", " 1. Plot the binomial distribution $P(N_A, N)$ for $N = 5, 20, 40, 100, 300$ and plot the Gaussian with mean and variance for the binomial. Let $p_A = 0.5$ and $0.1$. Make sure to label each plot. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Import packages for the bionomial coefficient\n", "from scipy.special import binom\n", "\n", "# Define the probability for the binomial distribution\n", "def pdf_binom(p_A, N, N_A):\n", " return ...\n", "\n", "# Define Gaussian distribution\n", "def gaussian(x, mu, sigma):\n", " return ..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "N = [5, 20, 40, 100, 300]\n", "\n", "# Make plot\n", "\n", "# For p_A = 0.1\n", "N_A = np.linspace(0, 40, 1000)\n", "plt.figure(figsize= (10, 8))\n", "p_A = 0.1\n", "...\n", "plt.show()\n", "\n", "# For p_A = 0.5\n", "N_A = np.linspace(0, 70, 1000)\n", "plt.figure(figsize= (10, 8))\n", "p_A = 0.5\n", "...\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In class, we find that the binomial distribution is approximately normal (with mean $Np_A$ and variance $Np_A(1-p_A)$) as $N \\rightarrow \\infty$, by the central limit theorem. The proof of this theorem can be carried out using Stirling's approximation:\n", "$$ N! \\approx N^N e^{-N}\\sqrt{2\\pi N} $$\n", "
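\n", "\n", "As a quick numerical sanity check of this formula (a sketch only; scipy.special.gamma gives N! = Gamma(N+1) for non-integer N as well):\n", "\n", "```python\n", "# Sketch: residual of Stirling's approximation at a single N\n", "import numpy as np\n", "from scipy.special import gamma\n", "\n", "N = 10.0\n", "actual = gamma(N + 1)  # N!\n", "estimate = N**N * np.exp(-N) * np.sqrt(2 * np.pi * N)\n", "print((actual - estimate) / actual)  # about 0.008 for N = 10\n", "```\n", "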
\n", " 2. Check the accuracy of Stirling's approximation: compare $N!$ with the formula above by computing and plotting the residual (actual - estimate)/actual as a function of $N$.
\n", "(Hint: $\\Gamma(n+1) = n!$)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from scipy.special import gamma\n", "\n", "Nvals = np.linspace(1, 40, 1000)\n", "\n", "actual = ...\n", "estimate = ...\n", "\n", "plt.plot(Nvals, (actual-estimate)/actual)\n", "plt.xlabel('$N$')\n", "plt.ylabel('residual')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should find that residual $\\rightarrow 0$ as $N \\rightarrow \\infty$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, consider the Poisson distribution (Kardar p. 42):\n", "$$ P(\\lambda) = \\frac{\\lambda^k e^{-\\lambda}}{k!} $$\n", "where $k$ is the number of occurrences. Its mean and variance are $\\lambda$.

\n", " 3. Plot $P(\\lambda)$ as a function of $k$ for $\\lambda = 1, 3, 5, 10, 20$ and plot the Gaussian with mean and variance for the Poisson. Make sure to label.
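\n", "\n", "For example, the pmf itself can be evaluated with scipy.special.factorial (a sketch with illustrative names; scipy.stats.poisson.pmf would work equally well):\n", "\n", "```python\n", "# Sketch: Poisson pmf for a single value of lambda\n", "import numpy as np\n", "from scipy.special import factorial\n", "\n", "def poisson_pmf(k, lam):\n", "    return lam**k * np.exp(-lam) / factorial(k)\n", "\n", "k = np.arange(0, 30)\n", "print(poisson_pmf(k, 5.0).sum())  # close to 1\n", "```\n", "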
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Define the Poisson distribution\n", "...\n", "\n", "# Make plot\n", "..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " 4. What happens as the mean/$N$ increases?
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Answer:
\n", "..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Problem 3 - Fitting Data to a Straight Line (Linear Regression)\n", "\n", "(Reference - NR 15.2) We fit a set of 50 data points $(x_i, y_i)$ to a straight-line model $y(x) = a + bx$. The uncertainty $\sigma_i$ associated with each measurement $y_i$ is known, and we assume that the $x_i$'s are known exactly. To measure how well the model agrees with the data, we use the chi-square merit function:
\n", "$$ \\chi^2(a,b) = \\sum_{i=0}^{N-1} \\big( \\frac{y_i-a-bx_i}{\\sigma_i} \\big)^2. $$\n", "
\n", "Make a scatter plot of the data (including uncertainties) and find the best-fit line. Compute the errors on the two parameters $a$ and $b$ and plot lines where the two are changed by $\pm 1\sigma$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " 1. Plot the data (make sure to include error bars). (Hint - https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.errorbar.html)
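\n", "\n", "A minimal sketch of such a plot (assuming x, y, and sig_y have been loaded as in the next cell):\n", "\n", "```python\n", "# Sketch: scatter plot with y error bars\n", "import matplotlib.pyplot as plt\n", "\n", "plt.errorbar(x, y, yerr=sig_y, fmt='o', capsize=3)\n", "plt.xlabel('$x$')\n", "plt.ylabel('$y$')\n", "plt.show()\n", "```\n", "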
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Load a given 2D data\n", "data = np.loadtxt(\"Problem3_data.dat\")\n", "x = data[:,0]\n", "y = data[:,1]\n", "sig_y = data[:,2]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Make plot\n", "plt.figure(figsize = (10, 7))\n", "# Scatter plot\n", "...\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(NR p. 781) We should minimize the above chi-square function to determine $a$ and $b$. At its minimum, derivatives of $\\chi^2$ with respect to $a, b$ vanish:\n", "$$ \\frac{\\partial{\\chi^2}}{\\partial{a}} = -2 \\sum \\frac{y_i - a - bx_i}{\\sigma_i^2} = 0 \\ \\ \\ \\ \\ \\ \\ \\ \\ \\ \\ \\ (1) $$\n", "$$ \\frac{\\partial{\\chi^2}}{\\partial{b}} = -2 \\sum \\frac{x_i(y_i - a - bx_i)}{\\sigma_i^2} = 0 \\ \\ \\ \\ \\ \\ \\ \\ \\ (2) $$\n", "
\n", "These conditions can be rewritten in a convenient form if we define the following sums:\n", "$$ S = \sum \frac{1}{\sigma_i^2},\ S_x = \sum \frac{x_i}{\sigma_i^2},\ S_y = \sum \frac{y_i}{\sigma_i^2} $$\n", "$$ S_{xx} = \sum \frac{x_i^2}{\sigma_i^2},\ S_{xy} = \sum \frac{x_iy_i}{\sigma_i^2} $$\n", "
With these, we can rewrite (1), (2) as:\n", "$$ aS + bS_x = S_y $$\n", "$$ aS_x + bS_{xx} = S_{xy} $$\n", "
Solving these two equations for $a$ and $b$ gives:\n", "$$ \Delta = SS_{xx} - (S_x)^2 $$
\n", "$$ a = \\frac{S_{xx}S_y - S_xS_{xy}}{\\Delta} $$\n", "$$ b = \\frac{SS_{xy} - S_xS_y}{\\Delta} $$\n", "
2. Find parameters $a, b$ which minimize the chi-square function and plot the best-fit line on top of the data.
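\n", "\n", "The sums above translate almost line-by-line into numpy (a sketch with illustrative names, assuming x, y, and sig_y are the loaded arrays):\n", "\n", "```python\n", "# Sketch: weighted least-squares estimates of a and b from the S-sums\n", "import numpy as np\n", "\n", "w = 1.0 / sig_y**2  # weights 1/sigma_i^2\n", "S, Sx, Sy = w.sum(), (w * x).sum(), (w * y).sum()\n", "Sxx, Sxy = (w * x**2).sum(), (w * x * y).sum()\n", "Delta = S * Sxx - Sx**2\n", "a_fit = (Sxx * Sy - Sx * Sxy) / Delta\n", "b_fit = (S * Sxy - Sx * Sy) / Delta\n", "```\n", "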
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "S = ...\n", "Sx = ...\n", "Sy = ...\n", "Sxx = ...\n", "Sxy = ...\n", "Delta = ..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "a = ...\n", "b = ...\n", "\n", "..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Make plot\n", "plt.figure(figsize = (10, 7))\n", "...\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we must estimate the probable uncertainties in the estimates of $a$ and $b$, since obviously the measurement errors in the data must introduce some uncertainty in the determination of those parameters. If the data are independent, then each contributes its own bit of uncertainty to the parameters. Consideration of propagation of errors show that the variance $\\sigma_f^2$ in the value of any function will be \n", "$$ \\sigma_f^2 = \\sum \\sigma_i^2 (\\frac{\\partial f}{\\partial y_i})^2 $$\n", "
For the straight line, the derivatives of $a$ and $b$ with respect to $y_i$ can be directly evaluated from the solution:\n", "$$ \frac{\partial a}{\partial y_i} = \frac{S_{xx}-S_x x_i}{\sigma_i^2 \Delta} $$\n", "$$ \frac{\partial b}{\partial y_i} = \frac{S x_i-S_x}{\sigma_i^2 \Delta} $$\n", "
Summing over the points, we get\n", "$$ \\sigma_a^2 = S_{xx}/\\Delta $$\n", "$$ \\sigma_b^2 = S/\\Delta $$\n", "\n", " 3. Compute the errors ($\\sigma_a, \\sigma_b$) on the two parameters $a, b$ and plot lines where the two are changed by $\\pm 1\\sigma$.
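\n", "\n", "In code these are one line each once the S-sums and Delta are available (a sketch, reusing the names from the sketch above):\n", "\n", "```python\n", "# Sketch: 1-sigma uncertainties on the fitted intercept and slope\n", "import numpy as np\n", "\n", "sigma_a_fit = np.sqrt(Sxx / Delta)  # sigma_a^2 = S_xx / Delta\n", "sigma_b_fit = np.sqrt(S / Delta)    # sigma_b^2 = S / Delta\n", "```\n", "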
\n", "(Hint - Try to plot the 1$\sigma$ confidence band as in http://astropython.blogspot.com/2011/12/. You can use plt.fill_between to shade the region between plots.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Calculate sigma_a, sigma_b\n", "\n", "sigma_a = ...\n", "sigma_b = ...\n", "\n", "print('We estimate that a =', a, \"±\", sigma_a, \"and b =\", b, \"±\", sigma_b)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "plt.figure(figsize = (10, 7))\n", "...\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## To Submit\n", "Execute the following cell to submit.\n", "If you make changes, execute the cell again to resubmit the final copy of the notebook; submissions do not get updated automatically.
\n", "__We recommend that all of the above cells be executed (with their output visible) in the notebook at the time of submission.__
\n", "Only the final submission before the deadline will be graded. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "_ = ok.submit()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 2 }