{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lecture 8: Randomness Part I\n", "\n", "## Last time\n", "Slicing of numpy arrays, messing around with images of shape `(M, N, 3)`, histograms. \n", "\n", "## Today\n", "* Review of histograms\n", "* Randomness and scatter plots" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lambda function in Python\n", "Recall that when we reviewed eigenvalues in linear algebra, we avoided naming a variable `lambda`. This is because `lambda` is a reserved keyword in Python. It can be used to define a *function handle* or *anonymous function*, similar to `@` used in Matlab (`y = @(x) x^2 + 1`)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y = lambda x: x**2 + 1 # avoid using lambda in ordinary programming in Python" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# this can be applied to ndarray as well" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Randomness\n", "\n", "Randomness is used a lot both in mathematics and the real world.\n", "\n", "Generally, a random number comes from a probability distribution. \n", "\n", "The distribution might be discrete, i.e., \n", "it comes from a set \n", "\n", "$$ \\big\\{ (x_1, p_1), \\dots, (x_n, p_n) \\big\\},$$\n", "\n", "where you get outcome $x_i$ with probability $p_i$, i.e., \n", "\n", "$$P(X = x_i) = p_i.$$\n", "\n", "\n", "It is assumed that $\\sum_i p_i = 1$ (if not, you can normalize the $p$'s so their sum is 1). The function $x_i \\mapsto p_i$ is called the *probability mass function*.\n", "\n", "For continuous random numbers, one normally uses a *probability density function* (pdf). 
For example, the normal distribution $\\mathcal{N}(\\mu, \\sigma^2)$ has the density\n", "\n", "$$p(x; \\mu,\\sigma) = \\frac{1}{\\sqrt{2\\pi \\sigma^2}} e^{-\\frac{(x-\\mu )^2}{2\\sigma^2} },$$\n", "\n", "where $\\mu$ and $\\sigma$ are parameters (mean and standard deviation).\n", "\n", "The probability of a random number from this distribution being in the interval $[a,b]$ is then:\n", "\n", "$$P\\big(X\\in [a,b]\\big) = \\int_a^b p(x; \\mu,\\sigma)\\,dx.$$\n", "\n", "The most well-known distributions are the uniform distribution (where the pdf is constant) and the normal distribution. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Remark:\n", "\n", "A histogram is an estimate of the probability density of a (continuous) random variable." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "# let us graph the density function of the normal distribution.\n", "from math import pi, sqrt, e\n", "xs = np.linspace(-5,5,300)\n", "pdf = lambda x: 1/sqrt(2*pi)*e**(-0.5*x**2) # pdf for N(0,1) standard normal dist\n", "ys = pdf(xs)\n", "plt.plot(xs, ys)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# numpy.random module\n", "\n", "A \"pseudo\" random number generator." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from numpy import random # random submodule in numpy, natively vectorized" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Random integers" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# random.randint()\n", "# simulate a sequence of die rolls\n", "N = 2000\n", "X = np.zeros(N)\n", "for i in range(N):\n", "    X[i] = random.randint(1, 7) # from 1 (inclusive) to 7 (exclusive)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# what is the mean of the dice rolling?" 
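, "\n",
"# A minimal sketch, not necessarily the intended answer: the sample mean of the rolls\n",
"# stored in X by the previous cell. For a fair die the expected value is\n",
"# (1 + 2 + ... + 6)/6 = 3.5, so the sample mean should be close to that.\n",
"print(np.mean(X))\n",
"# vectorized alternative (assumption: same fair die, fresh rolls), no loop needed\n",
"rolls = random.randint(1, 7, size=N)\n",
"print(rolls.mean())"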
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Uniform distribution\n", "\n", "The simplest distribution is the uniform distribution, in which all numbers in a given interval are equally likely. The function `random.random()` produces a uniformly distributed random number in $[0,1)$.\n", "Furthermore, we can turn a uniform random number $r$ from $[0,1)$ into a random number from $a$ to $b$ by computing $a + (b-a)r$." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "random.seed(42)\n", "# the seed will initialize the random number generator\n", "# fixing the seed will fix the \"random\" numbers generated\n", "for i in range(5):\n", "    r = random.random()\n", "    print(r)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def rnum(a,b):\n", "    return a + (b-a)*random.random()\n", "\n", "for i in range(5):\n", "    print(rnum(-3,6))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "N = 300\n", "x = np.random.uniform(0,1,N) # this syntax is okay as well\n", "y = np.random.uniform(low=0,high=1,size=N)\n", "plt.scatter(x,y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adding scattered noise to a linear function" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X = np.linspace(0,1,100)\n", "Y = 3 * X + 1\n", "plt.plot(X,Y)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# let's add some noise\n", "Z = 3 * X + 1 + np.random.normal(loc=0,scale=1, size= X.shape[0])\n", "# np.random.normal(0,1, X.shape[0]) is an equivalent call\n", "# loc is mean\n", "# scale is standard dev\n", "# size is the number of samples we draw from this distribution\n", "# we'll see much more about randomness later\n", "plt.scatter(X,Z) # we use a scatter plot\n", "plt.plot(X,Y, color = \"red\", linewidth= 
2.0)\n", "plt.grid(True, linestyle = 'dashed')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 1:\n", "Write a function `rand_linear` that takes as input the slope `m` and the intercept `b`, the strength of the normal random noise (mean 0 and standard deviation `sigma`), and a numpy array `x`, and returns the values of the linear function $y = mx + b$ with random noise added." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Normal distribution\n", "\n", "Best way to view a probability distribution? Histogram." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "N = 50 # no. of samples\n", "mu = 0.0\n", "sigma = 1.0\n", "X = np.random.normal(loc=mu, scale=sigma, size=N)\n", "plt.hist(X, bins=10, edgecolor='k')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "N = 500000 # no. of samples\n", "mu = 0.0\n", "sigma = 1.0\n", "X = np.random.normal(loc=mu, scale=sigma, size=N)\n", "plt.axis([-6, 6, 0, 0.45]) # fix our axes view\n", "plt.hist(X, bins=20, density=True, edgecolor= 'k')\n", "# plt.hist() with density=True scales the bars so that their total area is 1\n", "# bin width = (max(X) - min(X))/(no. of bins)\n", "plt.grid(True, linestyle = 'dashed')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$\\sigma$ is the standard deviation, which measures how spread out the normal distribution is. 
For example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "N = 500000\n", "mu = 0.0\n", "sigma = 2.0 # higher standard dev\n", "X = np.random.normal(loc=mu, scale=sigma, size=N)\n", "plt.axis([-6, 6, 0, 0.45])\n", "plt.hist(X, bins=20, density=True, edgecolor ='k')\n", "plt.grid(True, linestyle = 'dashed')\n", "plt.show()\n", "\n", "# with the axes fixed, the histogram is wider and flatter than the one for sigma = 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 2:\n", "\n", "* Change `sigma` and `bins` (no. of bins) while keeping the axes fixed with `plt.axis([-6, 6, 0, 0.45])` as in the plots above, and see what happens.\n", "* When plotting the histogram, switch the option `density=True` to `density=False` (the default), and see what happens." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Histogram of the uniform distribution. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "N = 50000\n", "X = np.random.uniform(low=0, high=1, size=N)\n", "plt.hist(X, 50)\n", "\n", "plt.grid(True, linestyle = 'dashed')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can compute the mean and standard deviation of any data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "np.mean(X)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "np.std(X)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In general, if `X` is our dataset, then the normal distribution with `mu = np.mean(X)` and `sigma = np.std(X)` will fit the dataset's \"empirical distribution\" best.\n", "\n", "If a dataset's distribution is normal, then **about 68 percent of the data values are within one standard deviation of the mean**:\n", "$$\n", "P(\\mu - \\sigma < X < \\mu+\\sigma) \\approx 68\\%\n", 
"$$\n", "\n", "\n", "Reference: [68–95–99.7 rule](https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": true, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }