{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Online mean and variance estimates\n",
    "\n",
    "The first technique we'll introduce isn't a probabilistic structure at all, but it will serve as a warm-up to introduce some of the more involved concepts we'll look at later.  We'll look at [Chan's formula](https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm) for online mean and variance estimates, so that we can calculate estimated mean and variance in a single pass over a large data set.  As we'll see, this technique will also let us combine estimates for several data sets (i.e., for processing a partitioned collection in parallel)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class StreamMV(object):\n",
    "    from sys import float_info\n",
    "\n",
    "    def __init__(self, count=0, min=float_info.max,\n",
    "                 max=-float_info.max, m1=0.0, m2=0.0):\n",
    "        (self.count, self.min, self.max) = (count, min, max)\n",
    "        (self.m1, self.m2) = (m1, m2)\n",
    "\n",
    "    def __lshift__(self, sample):\n",
    "        (self.max, self.min) = (max(self.max, sample), min(self.min, sample))\n",
    "        dev = sample - self.m1\n",
    "        self.m1 = self.m1 + (dev / (self.count + 1))\n",
    "        self.m2 = self.m2 + (dev * dev) * self.count / (self.count + 1)\n",
    "        self.count += 1\n",
    "        return self\n",
    "\n",
    "    def mean(self): \n",
    "        return self.m1\n",
    "\n",
    "    def variance(self): \n",
    "        return self.m2 / self.count\n",
    "\n",
    "    def stddev(self): \n",
    "        return math.sqrt(self.variance)\n",
    "    \n",
    "    def merge_from(self, other):\n",
    "        if other.count == 0:\n",
    "            return self\n",
    "        if self.count == 0:\n",
    "            (self.m1, self.m2) = (other.m1, other.m2)\n",
    "            self.count = other.count\n",
    "            (self.min, self.max) = (other.min, other.max)\n",
    "            return self\n",
    "        else:\n",
    "            dev = other.m1 - self.m1\n",
    "            new_count = other.count + self.count\n",
    "            self.m1 = (self.count * self.m1 + other.count * other.m1) / new_count\n",
    "            self.m2 = self.m2 + other.m2 + (dev * dev) * self.count * other.count / new_count\n",
    "            self.count = new_count\n",
    "            self.max = max(self.max, other.max)\n",
    "            self.min = min(self.min, other.min)\n",
    "            return self"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can test this code by sampling from a random distribution with known mean and variance.  (We're using the Poisson distribution with a $\\lambda$ parameter of 7, which should have a mean and variance of 7, but you could try with any other distribution if you wanted.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.stats import poisson\n",
    "sink = StreamMV()\n",
    "\n",
    "for p in poisson.rvs(7, size=10000):\n",
    "    sink << p\n",
    "\n",
    "print (sink.mean(), sink.variance())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that we can also parallelize this work:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.stats import poisson\n",
    "s1, s2 = StreamMV(), StreamMV()\n",
    "\n",
    "for p in poisson.rvs(7, size=10000):\n",
    "    s1 << p\n",
    "\n",
    "\n",
    "for p in poisson.rvs(7, size=10000):\n",
    "    s2 << p\n",
    "\n",
    "print(\"s1 mean %f, variance %f, count %d\" % (s1.mean(), s1.variance(), s1.count))\n",
    "print(\"s2 mean %f, variance %f, count %d\" % (s2.mean(), s2.variance(), s2.count))\n",
    "\n",
    "s1.merge_from(s2)\n",
    "\n",
    "print(\"s1+s2 mean %f, variance %f, count %d\" % (s1.mean(), s1.variance(), s1.count))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The mean and variance estimate technique we've just shown has a few things in common with the other techniques we'll look at:\n",
    "\n",
    "1.  It's *incremental*, meaning that it is possible to update an estimate with a single sample at a time (this also implies that it's *single-pass*, meaning that you only need to see each sample once).\n",
    "2.  It's *parallel*, meaning that it is possible to combine estimates for subsets of the population of interest and get an estimate for their union, and\n",
    "3.  It's *scalable*, meaning that it requires a constant amount of space no matter how many samples it processes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.6",
   "language": "python",
   "name": "jupyter"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}