{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# HyperLogLog\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "HyperLogLog is trickier to understand than Bloom filters or count-min sketch, so let's start with some intuitions.\n", "\n", "If we have a source from which we can sample uniformly-distributed _n_-bit integers, we can also see it as a source for drawing _n_ coin flips -- each bit in an integer sampled from the population of uniformly-distributed _n_-bit integers is independent of the others and is equally likely to be true or false.\n", "\n", "Because each bit is independent and equally likely to be true or false, runs of consecutive bits with the same value become increasingly unlikely with length. The probability of seeing _n_ consecutive zeros, for example, is $1$ in $2^n$. Similarly, if the largest number of leading zeros we've seen in a stream of random numbers is _n_, we can estimate that we've seen $2^n$ numbers.\n", "\n", "To see this in action, let's sample some random numbers and plot the distribution of leading-zero counts. We'll start by importing the libraries we need (`numpy` and plotting support):\n", "\n", "\n", "\n", "We'll start with a function to count leading zeros:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def leading_zeros(bs):\n", " \"\"\" Return the index of the leftmost one in an \n", " integer represented as an array of bytes \"\"\"\n", " first = 0\n", " for b in bs:\n", " if b == 0:\n", " first += 8\n", " else:\n", " for bit in range(7, -1, -1):\n", " if ((1 << bit) & b) > 0:\n", " return first\n", " else:\n", " first += 1\n", " return first" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll then generate some 32-bit random integers and plot the distribution of leading-zero counts." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def lz_experiment(ct):\n", " from numpy.random import randint as ri\n", " result = []\n", " for _ in range(ct):\n", " result.append(leading_zeros(bytes([ri(255), ri(255), ri(255), ri(255)])))\n", "\n", " return result\n", "\n", "lz = lz_experiment(4096)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import altair as alt\n", "alt.renderers.enable(\"notebook\")\n", "\n", "from statsmodels.distributions.empirical_distribution import ECDF\n", "import pandas as pd\n", "\n", "df = pd.DataFrame()\n", "\n", "ecdf = ECDF(lz)\n", "df[\"leading zeroes\"] = lz\n", "df[\"percentage of numbers with at most\"] = ecdf(lz)\n", "\n", "alt.Chart(df.drop_duplicates()).mark_line().encode(x=\"leading zeroes\", y=\"percentage of numbers with at most\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see from inspecting the cumulative distribution plot, about 50% of the samples have no leading zeros, about 75% have one or fewer leading zeros, about 87.5% of samples have two or fewer leading zeros, and so on." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from hashlib import sha1\n", "import pickle\n", "\n", "def h64(v):\n", " bvalue = type(v) == bytes and v or pickle.dumps(v)\n", " return int.from_bytes(sha1(bvalue).digest()[:8], 'little')\n", "\n", "def get_alpha(p):\n", " return {\n", " 4: 0.673,\n", " 5: 0.697,\n", " 6: 0.709,\n", " }.get(p, 0.7213 / (1.0 + 1.079 / (1 << p)))\n", "\n", "def first_set_bit(i, isize):\n", " return isize - i.bit_length() + 1\n", "\n", "class HLL(object):\n", " def __init__(self, p=4):\n", " self.p = min(max(p, 4), 12)\n", " self.m = int(2 ** self.p)\n", " self.alpha = get_alpha(self.p)\n", " self._registers = np.zeros(self.m, np.uint8)\n", " self._zeros = self.m\n", " \n", " def add(self, v):\n", " h = h64(v)\n", " idx = h & (self.m - 1)\n", " h >>= self.p\n", " fsb = first_set_bit(h, 64 - self.p)\n", " if self._zeros > 0 and self._registers[idx] == 0 and fsb > 0:\n", " self._zeros -= 1\n", " self._registers[idx] = max(self._registers[idx], fsb)\n", " \n", " def approx_count(self):\n", " from math import log\n", " from scipy.stats import hmean\n", " \n", " if self._zeros > 0:\n", " # if we have empty registers (and thus probably a small set),\n", " # use a different approximation that will be more precise\n", " return self.m * math.log(float(self.m) / self._zeros)\n", " else:\n", " # return the harmonic mean of 2 to the power of every register, \n", " # scaled by the number of registers\n", " return self.alpha * self.m * hmean(np.power(2.0, self._registers))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hll = HLL()\n", "\n", "import random\n", "\n", "for i in range(20000):\n", " hll.add(random.getrandbits(64).to_bytes(8, \"big\"))\n", "\n", "hll.approx_count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Like Bloom filters and count-min sketches, HyperLogLog estimates can also be added together so that you can summarize large data sets in parallel. To combine two HyperLogLog estimates with the same number of registers, simply take the maximum of each pair of registers with the same index. (As an easy exercise, implement this above and convince yourself that it works the same as using a single estimate for a large stream.)\n", "\n", "If you're interested in learning more about HyperLogLog, a great place to start is [\"HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm\"](https://research.google.com/pubs/pub40671.html). As an exercise, try implementing some of their techniques to improve the performance of the code above!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }