{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lecture 9: Sampling from a distribution\n", "\n", "Today we will continue to learn about about the randomness, which can come from either\n", "* Simulations that use randomness.\n", "* Real life data that have randomness.\n", "\n", "Simulations are samples obtained from an underlying probability distribution.\n", "\n", "Ansatz: real life data follow certain underlying (unknown) probability distribution.\n", "\n", "Two most common descriptive statistics for a dataset are: \n", "* Mean (average);\n", "* Standard deviation (how spread the data are). \n", "* Min and max\n", "\n", "If time allows, we will learn a bit about normalization.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# homework 2 problem 3 distance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## numpy.random module" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## In-class exercise 1: Uniform distribution\n", "Use [`np.random.uniform`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.uniform.html) function to obtain 100 samples (`size=` parameter) from the uniform distribution between 0 and 1. Then use `pyplot`'s `hist` function to plot the histogram using various numbers of bins (let the `bins` parameter to be 2, 10, 100)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Normal distribution\n", "\n", "There are two functions for the normal distribution: `np.random.normal()` and `np.random.randn()` in `numpy.random` submodule." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# numpy array with Gaussian (normal) random numbers\n", "np.random.normal(loc=0.0, scale=1.0, size=(2,3))\n", "# below are samples drawn from this distrubtion" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# np.random.randn() returns r.v. ~ N(0,1)\n", "np.random.seed(42)\n", "np.random.randn(2,3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What if we want to use `random.randn()` in numpy to generate normal distribution with mean $\\mu$ and standard dev $\\sigma$?\n", "

\n", "*HINT* : $ (X -\\mu)/\\sigma \\sim \\mathcal{N}(0,1)$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# answer\n", "X = 2*np.random.randn(3,3) + 5 # samples drawn from N(5,2)\n", "print(X)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "N = 5000 # sample size\n", "mu = 0.0 # mean\n", "sigma = 1.0 # standard dev\n", "\n", "X = np.random.normal(loc=mu, scale=sigma, size=N)\n", "plt.axis([-6, 6, 0, N/100]) # set up the axis\n", "plt.hist(X, bins = 500) # bin size = (total sample)/(bins)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## random.choice: chooses a random element of an ndarray" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "words = [\"love\", \"hate\", \"tender\", \"care\", \"deep\"]\n", "words = np.array(words)\n", "np.random.choice(words)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "poem = np.empty([4, 3], dtype=object)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for i in range(4): # rows\n", " for j in range(3): # columns\n", " poem[i,j] = random.choice(words) \n", "\n", "print(poem)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# if we want to format better\n", "for row in poem:\n", " print()\n", " for words in row:\n", " print(''.join(str(words)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Mean and Standard Deviation of a Data-set" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Say I am given a numpy array `X` full of numbers (e.g. grades).\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X = np.array([67, 62, 78, 67, 64, 52, 50, 80, 50, 94, \n", " 77, 62, 78, 67, 44, 52, 70, 80, 50, 94, \n", " 100, 61, 59, 56, 30, 91, 60, 54, 34, 98])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.min(X), np.max(X)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.hist(X, bins=8,edgecolor='black')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question:** If we knew that these numbers came from a normal distribution $N(\\mu, \\sigma)$, what is the most likely normal distribution that this data would have come from?\n", "\n", "\n", "**Answer:** It's the normal distribution $N(\\mu, \\sigma)$ where $\\mu$ is the mean of the data, and $\\sigma$ is the standard deviation of the data. Mean is the average of the data, standard deviation is the square root of the average of the square of the distance to the mean of the data... better to write down the formula:\n", "\n", "$$\\mu = \\frac{1}{N}\\sum_{i=1}^N x_i$$\n", "\n", "$$\\sigma = \\sqrt{\\frac{1}{N}\\sum_{i=1}^N (x_i - \\mu)^2}$$\n", "\n", "In python, we can use `np.mean` and `np.std` to compute these." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Mean of a dataset vs Expected value of a random variable\n", "If $X$ is a random variable:\n", "$$\\operatorname {E} [X]=\\sum _{i=1}^{\\infty }x_{i}\\,p_{i},$$\n", "or\n", "$$\\operatorname {E} [X] = \\int _{-\\infty }^{+\\infty }x p(x)\\,dx.$$\n", "\n", "When the *mean* is discussed, we mean the *sample mean* (funny word play there). We compute the sample mean on a given set of samples, that is a set of **outcomes** of a probability distribution. This mean may yield different properties with regards to the estimation of the \"actual average\" of the underlying probability distribution. \n", "

\n", "For instance you may consider how the mathematical definition of the sample mean behaves when passing to the limit (taking the sample size to infinity), etc. However the expected value above is functionally associated to distribution with a given parameter, a distribution that can further generate samples with different sample means.\n", "

\n", "If $\\{x_i\\}$ are samples drawn from the distribution for randome variable $X$, in general\n", "$$\n", "\\mu = \\frac{\\sum_{k=1}^N x_i}{N}\\neq \\operatorname {E} [X] .\n", "$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "mu, sigma = np.mean(X), np.std(X)\n", "print(\"mean of the samples is: \", mu)\n", "print(\"standard deviation of the samples is: \", sigma)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall that the Gaussian distribution $N(\\mu, \\sigma)$ has density function:\n", "\n", "$$ p(x) = \\frac{1}{\\sqrt{2\\pi}\\sigma}\\exp\\left({-\\frac{(x-\\mu)^2}{2\\sigma^2}}\\right)$$\n", "\n", "We will now draw the bell curve:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from math import sqrt, pi, e" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "N = 300 # grid size, not sample size\n", "Z = np.linspace(0,100,N)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gaussian_func = lambda mu, sigma: (lambda x: 1/(sqrt(2*pi)*sigma) * e**(-0.5*(x - mu)*(x-mu)/(sigma**2)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.plot(Z, guassian_func(mu, sigma)(Z)) # blue curve\n", "plt.hist(X, bins= 8, density=True, edgecolor='black') \n", "# orange histogram, \"density = True\" means we plot the density histogram instead of the absolute numbers\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Conclusion:** For any data-set, the mean and standard deviation of the best-fitting Gaussian distribution are the mean and standard deviation of the data-set." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Importing real life data (optional)\n", "\n", "In this week's lab practice, we have seen how to import real life data on [UCI machine learning dataset repository](https://www.kaggle.com/uciml). Today we will try `numpy`'s built-in loading.\n", "\n", "* Download `winequality-red.csv` from [https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009/](https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009) or from Canvas and put it in the same directory with this notebook.\n", "\n", "* Check the csv file using Excel on the lab computer. Import the data using the following command." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wine_data = np.loadtxt('winequality-red.csv', delimiter=',', skiprows=1) # Kaggle version." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## In-class exercise 2:\n", "Repeat above procedure for each column of wine_data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# plt.hist first, then creat a linspace with max and min of each column using certain number of points\n", "print(\"max is: \", max(wine_data[:,0]))\n", "print(\"min is: \",min(wine_data[:,0]))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mu = np.mean(wine_data[:,0])\n", "sigma = np.std(wine_data[:,0])\n", "print(\"mean is:\", mu)\n", "print(\"stanard dev is:\", sigma)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "z = np.linspace(4.6, 15.9, 300)\n", "plt.plot(z,gaussian_func(mu,sigma)(z))\n", "plt.hist(wine_data[:,0], bins= 100, density=True, edgecolor='black')\n", "plt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": true, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }