{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Online mean and variance estimates\n", "\n", "The first technique we'll introduce isn't a probabilistic structure at all, but it will serve as a warm-up to introduce some of the more involved concepts we'll look at later. We'll look at [Chan's formula](https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm) for online mean and variance estimates, so that we can calculate estimated mean and variance in a single pass over a large data set. As we'll see, this technique will also let us combine estimates for several data sets (i.e., for processing a partitioned collection in parallel)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class StreamMV(object):\n", " from sys import float_info\n", "\n", " def __init__(self, count=0, min=float_info.max,\n", " max=-float_info.max, m1=0.0, m2=0.0):\n", " (self.count, self.min, self.max) = (count, min, max)\n", " (self.m1, self.m2) = (m1, m2)\n", "\n", " def __lshift__(self, sample):\n", " (self.max, self.min) = (max(self.max, sample), min(self.min, sample))\n", " dev = sample - self.m1\n", " self.m1 = self.m1 + (dev / (self.count + 1))\n", " self.m2 = self.m2 + (dev * dev) * self.count / (self.count + 1)\n", " self.count += 1\n", " return self\n", "\n", " def mean(self): \n", " return self.m1\n", "\n", " def variance(self): \n", " return self.m2 / self.count\n", "\n", " def stddev(self): \n", " return math.sqrt(self.variance)\n", " \n", " def merge_from(self, other):\n", " if other.count == 0:\n", " return self\n", " if self.count == 0:\n", " (self.m1, self.m2) = (other.m1, other.m2)\n", " self.count = other.count\n", " (self.min, self.max) = (other.min, other.max)\n", " return self\n", " else:\n", " dev = other.m1 - self.m1\n", " new_count = other.count + self.count\n", " self.m1 = (self.count * self.m1 + other.count * other.m1) / new_count\n", " self.m2 = self.m2 + other.m2 + (dev * dev) * self.count * other.count / new_count\n", " self.count = new_count\n", " self.max = max(self.max, other.max)\n", " self.min = min(self.min, other.min)\n", " return self" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can test this code by sampling from a random distribution with known mean and variance. (We're using the Poisson distribution with a $\\lambda$ parameter of 7, which should have a mean and variance of 7, but you could try with any other distribution if you wanted.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import poisson\n", "sink = StreamMV()\n", "\n", "for p in poisson.rvs(7, size=10000):\n", " sink << p\n", "\n", "print (sink.mean(), sink.variance())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that we can also parallelize this work:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import poisson\n", "s1, s2 = StreamMV(), StreamMV()\n", "\n", "for p in poisson.rvs(7, size=10000):\n", " s1 << p\n", "\n", "\n", "for p in poisson.rvs(7, size=10000):\n", " s2 << p\n", "\n", "print(\"s1 mean %f, variance %f, count %d\" % (s1.mean(), s1.variance(), s1.count))\n", "print(\"s2 mean %f, variance %f, count %d\" % (s2.mean(), s2.variance(), s2.count))\n", "\n", "s1.merge_from(s2)\n", "\n", "print(\"s1+s2 mean %f, variance %f, count %d\" % (s1.mean(), s1.variance(), s1.count))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The mean and variance estimate technique we've just shown has a few things in common with the other techniques we'll look at:\n", "\n", "1. It's *incremental*, meaning that it is possible to update an estimate with a single sample at a time (this also implies that it's *single-pass*, meaning that you only need to see each sample once).\n", "2. It's *parallel*, meaning that it is possible to combine estimates for subsets of the population of interest and get an estimate for their union, and\n", "3. It's *scalable*, meaning that it requires a constant amount of space no matter how many samples it processes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3.6", "language": "python", "name": "jupyter" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }