{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Definition of Statistical Measures – Central Tendency and Spread\n", "### Dr. Tirthajyoti Sarkar, Fremont, CA 94536\n", "---\n", "This notebook discusses fundamentals concepts of descriptive statistics such as central tendency and dispersion (spread) measures - mean/median/mode and variance. \n", "\n", "We show how one can compute such descriptive statistics using basic Python code (without using any library) as well as using `NumPy` functions.\n", "\n", "### Central tendency\n", "A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. They are also categorized as summary statistics:\n", "\n", "* **Mean**: Mean is the sum of all values divided by the total number of values.\n", "\n", "$$ \\mu = \\frac{\\sum{n_i}}{N} \\\\ \\text{where } N = \\sum{i} \\text{ : total number of observations}$$\n", "\n", "* **Median**: The median is the middle value. It is the value that splits the dataset in half. To find the median, order your data from smallest to largest, and then find the data point that has an equal amount of values above it and below it.\n", "* **Mode**: The mode is the value that occurs the most frequently in your dataset. On a bar chart, the mode is the highest bar.\n", "\n", "Generally, the mean is a better measure to use for symmetric data and median is a better measure for data with a skewed (left or right heavy) distribution. For categorical data, you have to use the mode.\n", "\n", "### Spread\n", "The spread of the data is a measure of by how much the values in the dataset are likely to differ from the mean of the values. If all the values are close together then the spread is low; on the other hand, if some or all of the values differ by a large\n", "amount from the mean (and each other), then there is a large spread in the data.\n", "\n", "* **Variance**: This is the most common measure of spread. Variance is the average of the squares of the deviations from the mean. Squaring the deviations ensures that negative and positive deviations do not cancel each other out.\n", "\n", "$$V = \\frac{\\sum{(n_i-\\mu)^2}}{N}$$\n", "\n", "* **Standard Deviation**: Because variance is produced by squaring the distance from the mean, its unit does not match that of the original data. Standard deviation is a mathematical trick to bring back the parity. It is the positive square root of the variance.\n", "\n", "$$\\sigma = \\sqrt{\\frac{\\sum{(n_i-\\mu)^2}}{N}}$$\n", "\n", "> **NOTE**: When we later build regression models, we will revisit these definitions in the conext of statistical estimation. There, the sample variance will be given by a slightly different formula (the denominator will change),\n", "\n", "$$V = \\frac{\\sum{(n_i-\\mu)^2}}{N-2}$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Let's measure central tendency of an array of numbers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Somewhat naive way to do it\n", "We can simply write a 'for' loop, add the numbers, and divide by the length of the array" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "array = [3,4,4,7,5,6,5.5,8,5,6.5,9,7.5,6]" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Arithmetic Mean: 5.884615384615385\n" ] } ], "source": [ "sum = 0\n", "for num in array:\n", " sum+=num\n", "mean = sum/len(array)\n", "print(\"Arithmetic Mean: \",mean)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from time import time" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean: 5.884615384615385\n", "Average time taken for computing the mean using for loop: 1.1469221115112304e-06 seconds \n" ] } ], "source": [ "t1 = time()\n", "for _ in range(100000):\n", " sum = 0\n", " for num in array:\n", " sum+=num\n", " mean = sum/len(array)\n", "t2 = time()\n", "\n", "print(\"Mean: {}\\nAverage time taken for computing the mean using for loop: {} seconds \".format(mean,(t2-t1)/100000))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using NumPy with `ndarray.mean()` method\n", "\n", "**What is Numpy**? - NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.\n", "\n", "https://docs.scipy.org/doc/numpy-1.13.0/user/whatisnumpy.html" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean: 5.884615384615385\n" ] } ], "source": [ "import numpy as np\n", "np_array = np.array(array)\n", "print(\"Mean: \",np_array.mean())" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean: 5.884615384615385\n", "Average time taken for computing the mean using NumPy: 3.880062103271484e-06 seconds \n" ] } ], "source": [ "t1 = time()\n", "np_array = np.array(array)\n", "for _ in range(100000):\n", " mean = np_array.mean()\n", "t2 = time()\n", "\n", "print(\"Mean: {}\\nAverage time taken for computing the mean using NumPy: {} seconds \".format(mean,(t2-t1)/100000))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### So, the `NumPy` method does not offer significant boost in performance. But what happens when the array is large?" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "from random import randint\n", "lst = []\n", "for _ in range(1000000):\n", " lst.append(randint(1,100))" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1000000" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(lst)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean: 50.539717\n", "Average time taken for computing the mean using for loop: 0.06872276782989502 seconds \n" ] } ], "source": [ "t1 = time()\n", "for _ in range(100):\n", " sum = 0\n", " for num in lst:\n", " sum+=num\n", " mean = sum/len(lst)\n", "t2 = time()\n", "\n", "print(\"Mean: {}\\nAverage time taken for computing the mean using for loop: {} seconds \".format(mean,(t2-t1)/100))" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean: 50.539717\n", "Average time taken for computing the mean using NumPy: 0.001326603889465332 seconds \n" ] } ], "source": [ "t1 = time()\n", "np_lst = np.array(lst)\n", "for _ in range(100):\n", " mean = np_lst.mean()\n", "t2 = time()\n", "\n", "print(\"Mean: {}\\nAverage time taken for computing the mean using NumPy: {} seconds \".format(mean,(t2-t1)/100))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### An utility function to generate random arrays" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "def random_array(num_elements,lower=1,upper=100):\n", " \"\"\"\n", " \"\"\"\n", " from random import randint\n", " lst = []\n", " for _ in range(num_elements):\n", " lst.append(randint(lower,upper))\n", " return lst" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[99, 68, 65, 99, 87]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random_array(5)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[-6, -19, -17, 9, 15, -9, -4, 15, 14, -5]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random_array(10,-20,20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Computing median using both naive method and Numpy function `np.median()`" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "array_2 = random_array(15,10,30)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[25, 13, 11, 19, 13, 24, 27, 18, 30, 14, 30, 20, 18, 19, 23]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "array_2" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# Using the built-in Python 'sorted' method \n", "array_sorted = sorted(array_2)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[11, 13, 13, 14, 18, 18, 19, 19, 20, 23, 24, 25, 27, 30, 30]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "array_sorted" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "def median (array):\n", " \"\"\"\n", " Computes median of a given numeric array\n", " \"\"\"\n", " num_elements = len(array)\n", " array_sorted = sorted(array)\n", " if num_elements%2==1:\n", " median = array_sorted[int(((num_elements+1)/2)-1)]\n", " else:\n", " median = (array_sorted[int(((num_elements+1)/2)-1)]+array_sorted[int(((num_elements+1)/2))])/2.0\n", " return median" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "19" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "median(array_2)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "19.0" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.median(np.array(array_2))" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "array_3 = random_array(16,100,200)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[183, 199, 136, 101, 196, 107, 153, 173, 122, 157, 117, 118, 125, 161, 171, 169]\n" ] } ], "source": [ "print(array_3)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "155.0" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "median(array_3)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "155.0" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.median(np.array(array_3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**NOTE**: Unlike `mean()`, an Numpy array does not have `median()` method. We have to use `np.median()` and pass on the array as the argument." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Variance and standard deviation\n", "* `ndarray.var()`\n", "* `ndarray.std()`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### We will still practice one naive way to compute variance" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "def mean(array):\n", " \"\"\"\n", " Computes mean\n", " \"\"\"\n", " length = len(array)\n", " sum = 0\n", " for i in range(length):\n", " sum+=array[i]\n", " mean = sum/length\n", " return mean" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "def variance(array):\n", " \"\"\"\n", " Computes variance\n", " \"\"\"\n", " length = len(array)\n", " avg = mean(array)\n", " sumsq = 0\n", " for i in range(length):\n", " sumsq+=(array[i]-avg)**2\n", " variance = sumsq/length\n", " return variance" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "def std_dev(array):\n", " \"\"\"\n", " Computes std. deviation\n", " \"\"\"\n", " from math import sqrt\n", " return (sqrt(variance(array)))" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "array_4 = random_array(100,1,100)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[12, 14, 9, 89, 31, 38, 60, 45, 18, 48, 61, 21, 80, 47, 91, 83, 57, 92, 85, 60, 43, 61, 76, 71, 100, 18, 35, 77, 27, 18, 95, 15, 71, 50, 92, 78, 64, 58, 49, 5, 9, 55, 19, 20, 36, 27, 62, 81, 42, 64, 95, 89, 40, 66, 75, 44, 54, 57, 41, 39, 34, 87, 33, 64, 61, 84, 51, 6, 1, 69, 5, 14, 54, 42, 94, 24, 34, 78, 56, 98, 35, 40, 11, 90, 7, 16, 7, 60, 32, 16, 64, 68, 85, 48, 91, 38, 34, 9, 95, 1]\n" ] } ], "source": [ "print(array_4)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "790.0474999999996" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "variance(array_4)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "28.10778361948874" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "std_dev(array_4)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "790.0474999999999" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.var(np.array(array_4))" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "28.107783619488746" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.std(np.array(array_4))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What if there are `NaN` values in the array\n", "* `nanmean()`\n", "* `nanmedian()`\n", "* `nanstd()`\n", "* `nanvar()`" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "array = random_array(20,1,50)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[33, 1, 7, 34, 47, 34, 2, 24, 27, 9, 23, 45, 26, 46, 1, 7, 2, 47, 49, 19]\n" ] } ], "source": [ "print(array)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "array[2]=np.nan\n", "array[6]=np.nan" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[33, 1, nan, 34, 47, 34, nan, 24, 27, 9, 23, 45, 26, 46, 1, 7, 2, 47, 49, 19]\n" ] } ], "source": [ "print(array)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "array = np.array(array)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean: nan\n", "Var: nan\n" ] } ], "source": [ "print(\"Mean:\",array.mean())\n", "print(\"Var:\",array.var())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using special functions which ignore `NaN`. \n", "Notice they are methods of the base Numpy (`np`) class, and not of an individual array" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean ignoring NaN: 26.333333333333332\n", "Var ignoring NaN: 271.44444444444446\n", "Std. dev ignoring NaN: 16.47557114167653\n", "Median ignoring NaN: 26.5\n" ] } ], "source": [ "print(\"Mean ignoring NaN:\",np.nanmean(array))\n", "print(\"Var ignoring NaN:\",np.nanvar(array))\n", "print(\"Std. dev ignoring NaN:\",np.nanstd(array))\n", "print(\"Median ignoring NaN:\",np.nanmedian(array))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Other descriptive statistics measures\n", "* Min and max\n", "* Range\n", "* Quantile\n", "* Percentile\n", "\n", "\n", "\n", "" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[29 59 16 95 64 3 40 7 61 4 99 32 6 38 59 26 84 53 51 69]\n" ] } ], "source": [ "array = random_array(20,1,100)\n", "array = np.array(array)\n", "sorted_array = sorted(array)\n", "print(array)" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Max of the array: 99\n", "Max of the array: 99\n" ] } ], "source": [ "# Using np.amax()\n", "print(\"Max of the array:\",np.amax(array))\n", "# Using array.max()\n", "print(\"Max of the array:\",array.max())" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Min of the array: 3\n", "Min of the array: 3\n" ] } ], "source": [ "# Using np.amin()\n", "print(\"Min of the array:\",np.amin(array))\n", "# Using array.max()\n", "print(\"Min of the array:\",array.min())" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Range of the array: 96\n", "Range of the array: 96\n" ] } ], "source": [ "# Compute range by using max() and min() functions\n", "print(\"Range of the array: \", array.max()-array.min())\n", "# Compute range by using ptp() function\n", "print(\"Range of the array: \", np.ptp(array))" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "20th percentile of the array: 14.200000000000003\n" ] } ], "source": [ "# Percentile\n", "print(\"20th percentile of the array: \", np.percentile(array,20))" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.25-th quantile of the array: 23.5\n", "0.5-th quantile of the array: 45.5\n", "0.75-th quantile of the array: 61.75\n" ] } ], "source": [ "# Quantile\n", "print(\"0.25-th quantile of the array: \", np.quantile(array,0.25))\n", "print(\"0.5-th quantile of the array: \", np.quantile(array,0.5))\n", "print(\"0.75-th quantile of the array: \", np.quantile(array,0.75))" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[3, 4, 6, 7, 16, 'HERE', 29, 32, 38, 40, 'HERE', 53, 59, 59, 61, 'HERE', 69, 84, 95, 99]\n" ] } ], "source": [ "sorted_array[5]='HERE'\n", "sorted_array[10]='HERE'\n", "sorted_array[15]='HERE'\n", "print(sorted_array)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" } }, "nbformat": 4, "nbformat_minor": 4 }