{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Measures of Dispersion\n",
    "By Evgenia \"Jenny\" Nitishinskaya, Maxwell Margenot, and Delaney Mackenzie.\n",
    "\n",
    "Part of the Quantopian Lecture Series:\n",
    "\n",
    "* [www.quantopian.com/lectures](https://www.quantopian.com/lectures)\n",
    "* [github.com/quantopian/research_public](https://github.com/quantopian/research_public)\n",
    "\n",
    "\n",
    "---\n",
    "\n",
    "Dispersion measures how spread out a set of data is. This is especially important in finance because one of the main ways risk is measured is in how spread out returns have been historically. If returns have been very tight around a central value, then we have less reason to worry. If returns have been all over the place, that is risky.\n",
    "\n",
    "Data with low dispersion is heavily clustered around the mean, while data with high dispersion indicates many very large and very small values.\n",
    "\n",
    "Let's generate an array of random integers to work with."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Import libraries\n",
    "import numpy as np\n",
    "\n",
    "np.random.seed(121)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "X: [ 3  8 34 39 46 52 52 52 54 57 60 65 66 75 83 85 88 94 95 96]\n",
      "Mean of X: 60.2\n"
     ]
    }
   ],
   "source": [
    "# Generate 20 random integers < 100\n",
    "X = np.random.randint(100, size=20)\n",
    "\n",
    "# Sort them\n",
    "X = np.sort(X)\n",
    "print 'X: %s' %(X)\n",
    "\n",
    "mu = np.mean(X)\n",
    "print 'Mean of X:', mu"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Range\n",
    "\n",
    "Range is simply the difference between the maximum and minimum values in a dataset. Not surprisingly, it is very sensitive to outliers. We'll use `numpy`'s peak to peak (ptp) function for this.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Range of X: 93\n"
     ]
    }
   ],
   "source": [
    "print 'Range of X: %s' %(np.ptp(X))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Mean Absolute Deviation (MAD)\n",
    "\n",
    "The mean absolute deviation is the average of the distances of observations from the arithmetic mean. We use the absolute value of the deviation, so that 5 above the mean and 5 below the mean both contribute 5, because otherwise the deviations always sum to 0.\n",
    "\n",
    "$$ MAD = \\frac{\\sum_{i=1}^n |X_i - \\mu|}{n} $$\n",
    "\n",
    "where $n$ is the number of observations and $\\mu$ is their mean."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Mean absolute deviation of X: 20.52\n"
     ]
    }
   ],
   "source": [
    "abs_dispersion = [np.abs(mu - x) for x in X]\n",
    "MAD = np.sum(abs_dispersion)/len(abs_dispersion)\n",
    "print 'Mean absolute deviation of X:', MAD"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Variance and standard deviation\n",
    "\n",
    "The variance $\\sigma^2$ is defined as the average of the squared deviations around the mean:\n",
    "$$ \\sigma^2 = \\frac{\\sum_{i=1}^n (X_i - \\mu)^2}{n} $$\n",
    "\n",
    "This is sometimes more convenient than the mean absolute deviation because absolute value is not differentiable, while squaring is smooth, and some optimization algorithms rely on differentiability.\n",
    "\n",
    "Standard deviation is defined as the square root of the variance, $\\sigma$, and it is the easier of the two to interpret because it is in the same units as the observations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Variance of X: 670.16\n",
      "Standard deviation of X: 25.8874486962\n"
     ]
    }
   ],
   "source": [
    "print 'Variance of X:', np.var(X)\n",
    "print 'Standard deviation of X:', np.std(X)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One way to interpret standard deviation is by referring to Chebyshev's inequality. This tells us that the proportion of samples within $k$ standard deviations (that is, within a distance of $k \\cdot$ standard deviation) of the mean is at least $1 - 1/k^2$ for all $k>1$.\n",
    "\n",
    "Let's check that this is true for our data set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Observations within 1.25 stds of mean: [34, 39, 46, 52, 52, 52, 54, 57, 60, 65, 66, 75, 83, 85, 88]\n",
      "Confirming that 0.75 > 0.36\n"
     ]
    }
   ],
   "source": [
    "k = 1.25\n",
    "dist = k*np.std(X)\n",
    "l = [x for x in X if abs(x - mu) <= dist]\n",
    "print 'Observations within', k, 'stds of mean:', l\n",
    "print 'Confirming that', float(len(l))/len(X), '>', 1 - 1/k**2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The bound given by Chebyshev's inequality seems fairly loose in this case. This bound is rarely strict, but it is useful because it holds for all data sets and distributions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Semivariance and semideviation\n",
    "\n",
    "Although variance and standard deviation tell us how volatile a quantity is, they do not differentiate between deviations upward and deviations downward. Often, such as in the case of returns on an asset, we are more worried about deviations downward. This is addressed by semivariance and semideviation, which only count the observations that fall below the mean. Semivariance is defined as\n",
    "$$ \\frac{\\sum_{X_i < \\mu} (X_i - \\mu)^2}{n_<} $$\n",
    "where $n_<$ is the number of observations which are smaller than the mean. Semideviation is the square root of the semivariance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Semivariance of X: 689.512727273\n",
      "Semideviation of X: 26.2585743572\n"
     ]
    }
   ],
   "source": [
    "# Because there is no built-in semideviation, we'll compute it ourselves\n",
    "lows = [e for e in X if e <= mu]\n",
    "\n",
    "semivar = np.sum( (lows - mu) ** 2 ) / len(lows)\n",
    "\n",
    "print 'Semivariance of X:', semivar\n",
    "print 'Semideviation of X:', np.sqrt(semivar)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A related notion is target semivariance (and target semideviation), where we average the distance from a target of values which fall below that target:\n",
    "$$ \\frac{\\sum_{X_i < B} (X_i - B)^2}{n_{<B}} $$\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Target semivariance of X: 188\n",
      "Target semideviation of X: 13.7113092008\n"
     ]
    }
   ],
   "source": [
    "B = 19\n",
    "lows_B = [e for e in X if e <= B]\n",
    "semivar_B = sum(map(lambda x: (x - B)**2,lows_B))/len(lows_B)\n",
    "\n",
    "print 'Target semivariance of X:', semivar_B\n",
    "print 'Target semideviation of X:', np.sqrt(semivar_B)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# These are Only Estimates\n",
    "\n",
    "All of these computations will give you sample statistics, that is standard deviation of a sample of data. Whether or not this reflects the current true population standard deviation is not always obvious, and more effort has to be put into determining that. This is especially problematic in finance because all data are time series and the mean and variance may change over time. There are many different techniques and subtleties here, some of which are addressed in other lectures in the [Quantopian Lecture Series](https://www.quantopian.com/lectures).\n",
    "\n",
    "In general do not assume that because something is true of your sample, it will remain true going forward."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References\n",
    "* \"Quantitative Investment Analysis\", by DeFusco, McLeavey, Pinto, and Runkle"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*This presentation is for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation for any security; nor does it constitute an offer to provide investment advisory or other services by Quantopian, Inc. (\"Quantopian\"). Nothing contained herein constitutes investment advice or offers any opinion with respect to the suitability of any security, and any views expressed herein should not be taken as advice to buy, sell, or hold any security or as an endorsement of any security or company.  In preparing the information contained herein, Quantopian, Inc. has not taken into account the investment needs, objectives, and financial circumstances of any particular investor. Any views expressed and data illustrated herein were prepared based upon information, believed to be reliable, available to Quantopian, Inc. at the time of publication. Quantopian makes no guarantees as to their accuracy or completeness. All information is subject to change and may quickly become unreliable for various reasons, including changes in market conditions or economic circumstances.*"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}