{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Course 4: Probabilities and sampling\n", "\n", "In this worksheet, we will play with various probability measures introduced in the [Lecture 3 of the course of J. Nzabanita](https://docs.google.com/viewer?a=v&pid=sites&srcid=YWltcy5hYy5yd3xhY2FkZW1pY3xneDo2MWNmODczNWM5MjcwMDU1).\n", "\n", "- Bernoulli $Ber(p)$\n", "- binomial $B(n, p)$\n", "- geometric $Geom(p)$\n", "- Poisson $Pois(\\lambda)$\n", "- uniform $\\mathcal{U}(a, b)$\n", "- normal or Gaussian $N(\\mu, \\sigma^2)$\n", "\n", "We will also experiment and illustrate the limit theorems of [Lecture 4](https://docs.google.com/viewer?a=v&pid=sites&srcid=YWltcy5hYy5yd3xhY2FkZW1pY3xneDo2YmUyODgxZmIyZDM5YWJm)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plotting discrete probability measures\n", "--------------------------------------\n", "\n", "We will mainly be using the Python libraries `numpy`, `matplotlib.pyplot` and `scipy.stats`. The conventional way to import them is given by the code below\n", "```python\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from scipy import stats\n", "```\n", "**Exercise:**\n", "- Copy the above cell and execute it.\n", "\n", "**We recall that these import statements should be done only once in the whole worksheet!**\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "**Exercise:**\n", "- Do you remember how to access the documentation of a function or a module?\n", "- Have a look at the documentation of `stats`.\n", "- Could you find the name of the distributions we are interested in (Bernoulli, binomial, geometric, uniform, Poisson and normal)?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to construct a probability distribution the syntax is close to the mathematical one\n", "```python\n", "P1 = stats.binom(10, 0.3) # the distribution Binom(10, 0.3)\n", "P2 = stats.poisson(2.3) # the distribution Pois(2.3)\n", "```\n", "**Exercise:**\n", "- Construct the probabilities `P1` and `P2` given in the code above\n", "- Construct the probability `P3` that corresponds to $Geom(0.3)$.\n", "- Construct the probability `P4` that corresponds to $Ber(0.7)$.\n", "- For each of `P1`, `P2`, `P3` and `P4` print their mean and their variance (*hint: use tab-completion*)\n", "\n", "If you have doubts about the mean and variance of a random variable, they were defined in [Lecture 2 of J. Nzabanita](https://docs.google.com/viewer?a=v&pid=sites&srcid=YWltcy5hYy5yd3xhY2FkZW1pY3xneDo2YWU0N2E1NWRiZGI0NmYw)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will now be interested in graphical representation of the probabilities `P1`, `P2`, `P3` and `P4`. We will be using two methods:\n", " - `pmf`: for the probablity mass function (i.e. the function $k \\mapsto P(\\{k\\})$)\n", " - `cdf`: for the cumulative density function (i.e. 
the function $k \\mapsto P((-\\infty, k])$.\n", " \n", "**Exercise:**\n", "- Copy, paste and execute the code below to get a graphical representation of `P1`\n", "```python\n", "x1 = np.arange(-2, 10, 1.0)\n", "x2 = np.arange(-2, 10, 0.1)\n", "y = P1.pmf(x1)\n", "z = P1.cdf(x2)\n", "plt.plot(x1, y, 'or', label='pmf')\n", "plt.plot(x2, z, '-b', label='cdf')\n", "plt.legend(loc=5)\n", "plt.ylim(-0.1, 1.1)\n", "plt.show()\n", "```\n", "If there are some functions that you do not know in the above code, I recall that you can access their documentation with the question mark `?`.\n", "- Could you interpret what you see on the graphics?\n", "- Reproduce similar graphics for `P2`, `P3` and `P4`\n", "- Add the titles `\"Binom(10, 0.3)\"`, `\"Poisson(2.3)\"`, `\"Geom(0.4)\"` and `\"Ber(0.7)\"`.\n", "- Save these graphics to four files called respectively `binomial.pdf`, `poisson.pdf`, `geometric.pdf` and `bernoulli.pdf` (*hint: look at the function `plt.savefig`)\n", "- Do you know how you could include these file into a latex document?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plotting continuous probability measure\n", "---------------------------------------\n", "\n", "For continuous probability measures the main methods are\n", "- `pdf`: probability density function\n", "- `cdf`: cumulative density function (also called cumulative distribution function in Lecture 2 of J. Nzabanita)\n", "\n", "**Exercise:**\n", "- Set `U` to be the uniform probability measure on $[0,5]$. Plot on the same graphics its density (in red) and its cumulative density function (in lue).\n", "- Let `N` be the standard normal distribution ($\\mu = 0$ and $\\sigma^2 = 1$). Plot on the same graphics its density (in red) and its cumulative density function (in blue).\n", "- On both graphics add a black vertical dashed line at the position of the mean." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sampling\n", "--------\n", "\n", "The `math`, `numpy` and `scipy` libraries contain functions to generate random variates. That is realization of random samples. For example the following command generates a random variates of size $10$ of a probability distribution $Ber(1/2)$\n", "```python\n", "P = stats.bernoulli(0.5)\n", "s = P.rvs(10)\n", "```\n", "\n", "**Exercise:**\n", "- Execute the code given above.\n", "- Print the value of `s`. What kind of Python object is it?\n", "- Do you know the mathematical definition of a *random sample*? and a *random variates*?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**WARNING:** In many situation there is an abuse of language between a random sample and a random variates. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "Sampling\n", "--------\n", "\n", "The `math`, `numpy` and `scipy` libraries contain functions to generate random variates, that is, realizations of random samples. For example, the following commands generate a random variate of size $10$ of the probability distribution $Ber(1/2)$:\n", "```python\n", "P = stats.bernoulli(0.5)\n", "s = P.rvs(10)\n", "```\n", "\n", "**Exercise:**\n", "- Execute the code given above.\n", "- Print the value of `s`. What kind of Python object is it?\n", "- Do you know the mathematical definition of a *random sample*? And of a *random variate*?" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "**WARNING:** In many situations there is an abuse of language between a random sample and a random variate. Computers only deal with random variates.\n", "\n", "To a finite set of real numbers $A = \\{x_1, x_2, \\ldots, x_n\\}$ we associate the following (discrete) probability measure\n", "$$\n", "\\mu_A = \\frac{1}{\\# A} \\sum_{a \\in A} \\delta_a = \\frac{1}{n} \\sum_{i = 1}^n \\delta_{x_i}.\n", "$$\n", "This measure is called the *empirical distribution*.\n", "\n", "**Exercise:** For each of the probability measures `P1`, `P2`, `P3`, `P4`, `U`, `N` do the following:\n", "- Construct a random sample of size 50.\n", "- Using `plt.hist`, plot the empirical distribution of the sample.\n", "- Add the graph of the probability mass function (discrete case) or the density function (continuous case) to the graphics.\n", "- Repeat these operations with samples of size 100, 500, 1000, 5000.\n", "\n", "By analyzing the graphics, can you say what happens to the empirical distribution as the size of the sample gets bigger? Do you know how to prove it?" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Toss a coin\n", "-----------\n", "\n", "We now turn to coin tossing, that is, a sequence of independent random variables with probability distribution $Ber(1/2)$. Random variables were introduced in [Lecture 2 of J. Nzabanita](https://docs.google.com/viewer?a=v&pid=sites&srcid=YWltcy5hYy5yd3xhY2FkZW1pY3xneDo2YWU0N2E1NWRiZGI0NmYw).\n", "\n", "**Exercise:**\n", "- What does the following code do?\n", "```python\n", "P = stats.bernoulli(0.5)\n", "t = [np.count_nonzero(P.rvs(10)) for i in range(20)]\n", "```\n", "- Explain how it is related to the probability distribution $Binom(10, 1/2)$." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise:**\n", "- If `x` is a `numpy` array, what does `x.cumsum()` do? (*hint: look at the documentation and make some small examples*)\n", "- Construct a random variate of size 500 of the distribution $Ber(1/2)$.\n", "- Let `cs` be the cumulative sum of this random variate. Plot the sequence of points `(0, cs[0])`, `(1, cs[1])`, `(2, cs[2])`, ..., `(499, cs[499])`.\n", "- Now plot the sequence of points `(1, cs[1]/1)`, `(2, cs[2]/2)`, ..., `(499, cs[499]/499)`.\n", "- Now do a figure that contains 6 times the same construction as above but with 6 different random variates. You will need to use the command `plt.subplot` as described in [this chapter of the SciPy lectures](http://www.scipy-lectures.org/intro/matplotlib/matplotlib.html).\n", "- How do those two plots relate to one of the two fundamental theorems of probability?\n", "\n", "A possible sketch for a single random variate is given right below; the figure with six subplots is left to you." ] },
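{ "cell_type": "markdown", "metadata": {}, "source": [ "One possible sketch for a single random variate (assuming the imports above; since the variate is random, your picture will look slightly different):\n", "```python\n", "P = stats.bernoulli(0.5)\n", "v = P.rvs(500)                # a random variate of size 500\n", "cs = v.cumsum()               # cs[k] = v[0] + v[1] + ... + v[k]\n", "k = np.arange(500)\n", "\n", "plt.subplot(1, 2, 1)\n", "plt.plot(k, cs, '.b')                         # the points (k, cs[k])\n", "plt.subplot(1, 2, 2)\n", "plt.plot(k[1:], cs[1:] / k[1:], '-g')         # the points (k, cs[k]/k) for k >= 1\n", "plt.axhline(0.5, color='k', linestyle='--')   # the mean of Ber(1/2)\n", "plt.show()\n", "```" ] },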
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "**Exercise:** Let us consider the following code\n", "```python\n", "P = stats.bernoulli(0.5)\n", "v = P.rvs(size=1000)\n", "s = v.nonzero()[0]\n", "t = s[1:] - s[:-1]\n", "plt.hist(t, bins=np.arange(0,10), normed=True)\n", "plt.show()\n", "```\n", "- What does this code is doing (this question needs a detailed answer)?\n", "- Give an interpretation of the graphics obtained." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The central limit theorem\n", "=========================\n", "\n", "We will study the central limit theorem for coin tossing in two different ways:\n", "- first through random sampling,\n", "- then by considering the pmf of binomial distributions for large values of the parameter $n$.\n", "\n", "**Exercise:**\n", "- Let us fix `n = 1000`. For `5000` distinct variates of size $n$ of $Ber(1/2)$, compute the number of ones in that variates. Put these `5000` numbers into an array.\n", "- Plot the empirical distribution of the array you constructed.\n", "- Could you renormalize the values in this array so that the distribution is close to $N(0, 1)$?\n", "- On the same graphics draw the empirical distribution in blue (an histogram) and the density function of $N(0,1)$ in red." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise:** Plot on the same graphics various values of the distribution $Binom(50, p)$ as p varies in $[0, 1]$." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise:** We consider $X_1$, $X_2$, ... a sequence of independent variable with distribution $Ber(0.5)$.\n", "- What is the probability distribution of $S_n = X_1 + X_2 + \\ldots + X_n$?\n", "- What about $$Y_n = \\frac{S_n - n / 2}{\\sqrt{n}}?$$\n", "- Plot the distribution of $Y_n$ for $n \\in \\{5,10,15,20,25,30\\}$.\n", "- On the same graphics add a plot of the density function of the limiting Gaussian distribution." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rare events\n", "-----------\n", "\n", "We now study another behavior of binomial distribution for large parameters $n$.\n", "\n", "**Exercise:** Let us fix $\\lambda = 2$. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "Rare events\n", "-----------\n", "\n", "We now study another behavior of the binomial distribution for large parameter $n$.\n", "\n", "**Exercise:** Let us fix $\\lambda = 2$. Put on the same graphics:\n", "- the pmf of $Binom(n, \\lambda / n)$ for $n \\in \\{3,5,7,9,11,13\\}$,\n", "- the pmf of $Pois(\\lambda)$,\n", "- a legend." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Copyright (C) 2016 Vincent Delecroix <vincent.delecroix@u-bordeaux.fr>\n", "\n", "This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0\n", "International License (CC BY-NC-SA 4.0). You can either read the\n", "[Creative Commons Deed](https://creativecommons.org/licenses/by-nc-sa/4.0/)\n", "(Summary) or the [Legal Code](https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode)\n", "(Full licence)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 1 }