{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Bloom filter" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A conventional hash table (or hash table-backed set structure) consists of a series of _buckets_. Hash table insert looks like this:\n", "\n", "1. First, use the hash value of the key to identify the index of the bucket that should contain it. \n", "2. If the bucket is empty, update the bucket to contain the key and value (with a trivial value in the case of a hashed set). \n", "3. If the bucket is not empty and the key stored in it is not the one you've hashed, handle this _hash collision_. There are several strategies to handle hash collisions precisely; most involve extra lookups (e.g., having a second hash function or going to the next available bucket) or extra space (e.g., having a linked list of keys and values in each bucket).\n", "\n", "Hash table lookup proceeds similarly: \n", "\n", "1. Looking up the index of the bucket that should contain a key (as above).\n", "2. Check to see if that bucket contains the key. \n", " - If the bucket contains the key, return the value (or \"true\" in the case of a hash-backed set).\n", " - If the bucket contains nothing, then the key is not in the table. \n", " - If the bucket contains something else, follow the strategy for resolving collisions until finding a bucket that contains the key or exhausting all possible buckets.\n", " \n", "### Exercise: implement an array-backed hash set \n", "\n", "As a warm-up exercise, let's implement a precise array-backed hash set. We'll store lists in each hash bucket and handle collisions by appending to a list. Since we're storing a set, we'll just store keys (not keys and values). You'll implement the `add`, `intersection`, and `union` methods on the `HashSet` class. \n", "\n", "Here are some hints to keep in mind:\n", "\n", "- the `index_for` method tells you which bucket to put an item in\n", "- if you have a reference to a `HashSet`, you can iterate over all of its items with a `for` loop, like `for it in hs:`\n", "- if you're new to Python, remember that you can type `help(obj)` in a cell to get documentation for `obj`, no matter what `obj` is!\n", "\n", "If you get stuck, check the [solution](solutions/hashset.ipynb)!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class HashSet(object):\n", " def __init__(self, sz=256):\n", " \n", " # initialize elements to empty lists\n", " self.items = [[] for _ in range(sz)]\n", " self.size = sz\n", " \n", " \n", " def __len__(self):\n", " \"\"\" Returns the number of elements in this set \"\"\"\n", " return sum([len(it) for it in self.items])\n", " \n", " \n", " def __iter__(self):\n", " import itertools\n", " return itertools.chain(*self.items)\n", " \n", " \n", " def index_for(self, item):\n", " \"\"\" Returns the index of the hash bucket for _item_ \"\"\"\n", " return hash(item) % self.size\n", " \n", " \n", " def contains(self, item):\n", " \"\"\" Returns True if this set contains _item_ and False otherwise \"\"\"\n", " for i in self.items[self.index_for(item)]:\n", " if i == item:\n", " return True\n", " \n", " return False\n", " \n", " \n", " def add(self, item):\n", " \"\"\" If _item_ is not already in the set, add it to the appropriate\n", " bucket. If _item_ is already in the set, do nothing. 
\"\"\"\n", " \n", " # FIXME: implement this!\n", " \n", " pass\n", " \n", " \n", " def add_all(self, items):\n", " for item in items:\n", " self.add(item)\n", "\n", " \n", " def intersection(self, other):\n", " \"\"\" Returns a new set containing all the items that are members of \n", " both this set and _other_ \"\"\"\n", " result = HashSet()\n", " \n", " # FIXME: implement this! \n", " \n", " return result\n", " \n", " \n", " def union(self, other):\n", " \"\"\" Returns a new set containing all the items that are members of \n", " either this set or _other_ \"\"\"\n", " result = HashSet()\n", " \n", " # FIXME: implement this! \n", " \n", " return result\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once you've finished implementing a precise hash set structure, you can run some ad-hoc tests to make sure it behaves the way you'd expect." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test1 = HashSet()\n", "for item in [\"a\", \"b\", \"c\", \"d\", \"e\", \"a\", \"b\", \"f\"]:\n", " pre_insert = test1.contains(item)\n", " test1.add(item)\n", " post_insert = test1.contains(item)\n", " print(item, len(test1), sorted(test1), pre_insert, post_insert)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We expect the previous cell to print out\n", "```\n", "a 1 ['a'] False True\n", "b 2 ['a', 'b'] False True\n", "c 3 ['a', 'b', 'c'] False True\n", "d 4 ['a', 'b', 'c', 'd'] False True\n", "e 5 ['a', 'b', 'c', 'd', 'e'] False True\n", "a 5 ['a', 'b', 'c', 'd', 'e'] True True\n", "b 5 ['a', 'b', 'c', 'd', 'e'] True True\n", "f 6 ['a', 'b', 'c', 'd', 'e', 'f'] False True```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we're confident that adding an element to a set works, we can also run some tests to ensure that intersection and union work the way we'd expect. For these tests, we'll check to make sure that our set works the same way as Python's built-in `set` type for given inputs." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "from itertools import combinations\n", "\n", "failures = 0\n", "\n", "for t in combinations(combinations(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'], 4), 2):\n", " left = HashSet()\n", " right = HashSet()\n", " \n", " left.add_all(t[0])\n", " right.add_all(t[1])\n", "\n", " lr = (repr(sorted(left)), repr(sorted(right)))\n", "\n", " if sorted(left.union(right)) != sorted(right.union(left)):\n", " failures += 1\n", " print(\"uh oh, union isn't commutative for %s and %s\" % lr)\n", "\n", " if sorted(left.intersection(right)) != sorted(right.intersection(left)):\n", " failures += 1\n", " print(\"uh oh, intersection isn't commutative for %s and %s\" % lr)\n", "\n", " if sorted(left.union(right)) != sorted(set(t[0]).union(set(t[1]))):\n", " failures += 1\n", " print(\"union wasn't what we expected for %s and %s\" % lr)\n", "\n", " if sorted(left.intersection(right)) != sorted(set(t[0]).intersection(set(t[1]))):\n", " failures += 1\n", " print(\"intersection wasn't what we expected for %s and %s\" % lr)\n", " \n", "print(\"finished tests with %d failures\" % failures)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our last test checks that we handle hash collisions appropriately." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hs = HashSet()\n", "\n", "for i in range(1024):\n", " if len(hs) != i:\n", " print(\"len(hs) was %d; expected %d\" % (len(hs), i))\n", " hs.add(i)\n", " \n", " if not hs.contains(i):\n", " print(\"hs didn't contain %d; expected it to\" % i)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Precise sets must use space proportional to the number of elements in the set. In our implementation, we handle collisions by appending an entry to a list. This has a performance impact as the number of elements in the hash table continues to grow beyond the number of buckets, since we're no longer looking up an entry in an array keyed by a hash value (which takes *constant time*); we're now looking up an entry in a list (which takes time proportional to the number of elements in the list).\n", "\n", "To see this performance impact, let's plot the average time it takes to do one insert and a corresponding lookup as the number of elements grows." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from datasketching import plot\n", "plot.hash_experiment(HashSet(256), 5, 16)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Bloom filters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Think of a [Bloom filter](https://en.wikipedia.org/wiki/Bloom_filter) as a hashed set structure that has no precise way to handle collisions. Instead, the Bloom filter ameliorates the impact of hash collisions by using _multiple hash functions_. The buckets in the Bloom filter are merely bits: they do not have the identities of keys. When a value is inserted into the Bloom filter, multiple hash functions are used to select which buckets should be set to true (buckets that are already true are not changed). This means that if _all_ of the buckets for a given key are true, then the Bloom filter _may_ contain it, but that if _any_ of the buckets for a given key are false, then the Bloom filter _must not_ contain it.\n", "\n", "Let's see an implementation. We'll start with a [basic bit vector class](./bit-vector.ipynb) so that we can efficiently store values." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from datasketching.BitVector import BitVector" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now implement the Bloom filter using the bit vector to store values." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class Bloom(object):\n", " def __init__(self, size, hashes):\n", " \"\"\" Initializes a Bloom filter with the\n", " given size and a collection of hashes, \n", " which are functions taking arbitrary \n", " values and returning integers. \n", " \n", " hashes can be either a function taking \n", " a value and returning a list of results\n", " or a list of functions. 
In the latter \n", "            case, this constructor will synthesize \n", "            the former \"\"\"\n", "        self.__buckets = BitVector(size)\n", "        self.__size = len(self.__buckets)\n", "        \n", "        if hasattr(hashes, '__call__'):\n", "            self.__hashes = hashes\n", "        else:\n", "            funs = hashes[:]\n", "            def h(value):\n", "                return [f(value) for f in funs]\n", "            self.__hashes = h\n", "    \n", "    def size(self):\n", "        return self.__size\n", "    \n", "    def insert(self, value):\n", "        \"\"\" Inserts a value into this set \"\"\"\n", "        for h in self.__hashes(value):\n", "            self.__buckets[h % self.__size] = True\n", "    \n", "    def lookup(self, value):\n", "        \"\"\" Returns true if value may be in this set\n", "            (i.e., may return false positives) \"\"\"\n", "        for h in self.__hashes(value):\n", "            if self.__buckets[h % self.__size] == False:\n", "                return False\n", "        return True" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we'll need some different hash functions to use in our Bloom filter. [We can simulate multiple hashes](./hashing.ipynb) by using one of the hashes supplied in `hashlib` and simply masking out parts of the digest." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from datasketching.hashing import hashes_for" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's construct a Bloom filter using our three hashes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Make a Bloom filter with three hashes, \n", "# each of which is 32 bits (8 hex digits)\n", "bloom = Bloom(1024, hashes_for(3, 8))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bloom.insert(\"foobar\")\n", "bloom.lookup(\"foobar\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bloom.lookup(\"absent\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So far, so good! \n", "\n", "We can tell that the Bloom filter uses constant time for inserts no matter how many elements are in the set by running the same experiment we ran with our array-backed hash set against our Bloom filter. To be directly comparable with the HashSet experiment, we'll use different hash functions (based on Python's `hash` builtin) -- in our other Bloom filter experiments, we're serializing input data and using a slower but better hash function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from datasketching import plot\n", "from datasketching.hashing import fast_hashes_for\n", "\n", "plot.hash_experiment(Bloom(256, fast_hashes_for()), 5, 18)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The tradeoff, of course, is that as the Bloom filter fills up, the false positive rate gets worse. Let's run an experiment to see how our false positive rate changes over time. We're going to construct a random stream of values and insert them into a Bloom filter -- but we're going to look them up first. Since it is extremely improbable that we'll draw the same 64-bit random value twice in a short simulation, we can be fairly certain that any values for which `lookup` returns true before we've inserted them are false positives. We'll record the false positive rate every 100 samples."
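, "\n", "\n", "(How improbable? With $2^{18}$ draws from a space of $2^{64}$ possible values, the birthday bound puts the chance of any repeated value at roughly $(2^{18})^2 / 2^{65} = 2^{-29}$, so treating every pre-insertion hit as a false positive is safe.)"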
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def bloom_experiment(sample_count, size, hashes, seed=0x15300625):\n", " import random\n", " from collections import namedtuple\n", " \n", " random.seed(seed)\n", " bloom = Bloom(size, hashes)\n", " \n", " result = []\n", " false_positives = 0\n", " \n", " for i in range(sample_count):\n", " bits = random.getrandbits(64)\n", " if bloom.lookup(bits):\n", " false_positives = false_positives + 1\n", " bloom.insert(bits)\n", " \n", " if i % 100 == 0:\n", " result.append((i + 1, false_positives / float(i + 1)))\n", " result.append((i + 1, false_positives / float(i + 1)))\n", " return result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's set up plotting (using the [Altair API](https://altair-viz.github.io) for the [Vega-Lite](https://vega.github.io) visualization grammar):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import altair as alt\n", "alt.renderers.enable('notebook')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we can run an experiment and plot the results:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from pandas import DataFrame\n", "\n", "results = bloom_experiment(1 << 18, 4096, hashes_for(3, 8))\n", "df = DataFrame.from_records(results)\n", "df.rename(columns={0: \"cardinality\", 1: \"FPR\"}, inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "alt.Chart(df).mark_line().encode(alt.X(\"cardinality\", scale=alt.Scale(type=\"log\", base=2)), y=\"FPR\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see how increasing the size of the filter changes our results:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "results = bloom_experiment(1 << 18, 16384, hashes_for(3, 8))\n", "df = DataFrame.from_records(results )\n", "df.rename(columns={0: \"cardinality\", 1: \"FPR\"}, inplace=True)\n", "\n", "alt.Chart(df).mark_line().encode(alt.X(\"cardinality\", scale=alt.Scale(type=\"log\", base=2)), y=\"FPR\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analytic properties\n", "\n", "We can analytically predict a false positive rate for a given Bloom filter. If $k$ is the number of hash functions, $m$ is the size of the Bloom filter in bits, and $n$ is the number of elements in the set, we can expect a false positive rate of $ ( 1 - e^{- kn / m} )^k $. 
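For example, with $k = 3$ hash functions, a filter of $m = 16384$ bits, and $n = 4096$ inserted elements, $kn/m = 0.75$, so the expected false positive rate is $(1 - e^{-0.75})^3$, or roughly 0.15. 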
Let's plot that function for our previous example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "results = []\n", "import math\n", "hash_count = 3\n", "filter_size = 16384\n", "\n", "entries = 0\n", "while entries < 1 << 18:\n", " results.append((entries + 1, math.pow(1 - math.pow(math.e, -((hash_count * (entries + 1)) / filter_size)), hash_count)))\n", " entries = entries + 100\n", "\n", "df = DataFrame.from_records(results)\n", "df.rename(columns={0: \"cardinality\", 1: \"FPR\"}, inplace=True)\n", "\n", "alt.Chart(df).mark_line().encode(alt.X(\"cardinality\", scale=alt.Scale(type=\"log\", base=2)), y=\"FPR\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, our expected false positive rate lines up very closely to our actual false positive rate.\n", "\n", "## Other useful properties\n", "\n", "Since it is possible to incrementally update a Bloom filter by adding a single element, the Bloom filter is suitable for stream processing.\n", "\n", "However, it is also possible to find the _union_ of two Bloom filters if they have the same size and were constructed with the same hash functions, which means it is possible to use the Bloom filter for parallel batch processing (i.e., approximating a very large set by combining the Bloom filters approximating its subsets). The union of Bloom filters approximating sets $A$ and $B$ is the bucketwise OR of $A$ and $B$. The union of Bloom filters approximating sets $A$ and $B$ will produce the same result as the Bloom filter approximating the set $A \\cup B$.\n", "\n", "It is also possible to find the _intersection_ of two Bloom filters by taking their bucketwise AND. $ \\mathrm{Bloom}(A) \\cap \\mathrm{Bloom}(B) $ may be less precise than $ \\mathrm{Bloom}(A \\cap B) $; the upper bound on the false positive rate for $ \\mathrm{Bloom}(A) \\cap \\mathrm{Bloom}(B) $ will be the greater of the false positive rates for $ \\mathrm{Bloom}(A) $ and $ \\mathrm{Bloom}(B) $." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class Bloom(object):\n", " def __init__(self, size, hashes):\n", " \"\"\" Initializes a Bloom filter with the\n", " given size and a collection of hashes, \n", " which are functions taking arbitrary \n", " values and returning integers. \n", " \n", " hashes can be either a function taking \n", " a value and returning a list of results\n", " or a list of functions. In the latter \n", " case, this constructor will synthesize \n", " the former \"\"\"\n", " self.__buckets = BitVector(size)\n", " self.__size = len(self.__buckets)\n", " \n", " if hasattr(hashes, '__call__'):\n", " self.__hashes = hashes\n", " else:\n", " funs = hashes[:]\n", " def h(value):\n", " return [int(f(value)) for f in funs]\n", " self.__hashes = h\n", " \n", " def size(self):\n", " return self.__size\n", " \n", " def insert(self, value):\n", " \"\"\" Inserts a value into this set \"\"\"\n", " for h in self.__hashes(value):\n", " self.__buckets[h % self.__size] = True\n", " \n", " def lookup(self, value):\n", " \"\"\" Returns true if value may be in this set\n", " (i.e., may return false positives) \"\"\"\n", " for h in self.__hashes(value):\n", " if self.__buckets[h % self.__size] == False:\n", " return False\n", " return True\n", " \n", " def merge_from(self, other):\n", " \"\"\" Merges other in to this filter by \n", " taking the bitwise OR of this and \n", " other. Updates this filter in place. 
\"\"\"\n", " self.__buckets.merge_from(other.__buckets)\n", " \n", " def intersect(self, other):\n", " \"\"\" Takes the approximate intersection of \n", " this and other, returning a new filter \n", " approximating the membership of the \n", " intersection of the set approximated \n", " by self and the set approximated by other.\n", " \n", " The upper bound on the false positive rate \n", " of the resulting filter is the greater of \n", " the false positive rates of self and other \n", " (but the FPR may be worse than the FPR of \n", " a Bloom filter constructed only from the \n", " values in the intersection of the sets \n", " approximated by self and other). \"\"\"\n", " \n", " b = Bloom(self.size(), self.__hashes)\n", " b.__buckets.merge_from(self.__buckets)\n", " b.__buckets.intersect_from(other.__buckets)\n", " return b\n", " \n", " def union(self, other):\n", " \"\"\" Generates a Bloom filter approximating the \n", " membership of the union of the set approximated\n", " by self and the set approximated by other.\n", " \n", " Unlike intersect, this does not affect the \n", " precision of the filter (i.e., its precision\n", " will be identical to that of a Bloom filter \n", " built up from the union of the two sets). \"\"\"\n", " \n", " b = Bloom(self.size(), self.__hashes)\n", " b.__buckets.merge_from(self.__buckets)\n", " b.__buckets.merge_from(other.__buckets)\n", " return b\n", " \n", " \n", " def dup(self):\n", " b = Bloom(self.size(), self.__hashes)\n", " b.merge_from(self)\n", " return b" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see these in action:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "b1 = Bloom(1024, hashes_for(3, 8))\n", "b2 = Bloom(1024, hashes_for(3, 8))\n", "\n", "b1.insert(\"foo\")\n", "b1.insert(\"bar\")\n", "b2.insert(\"foo\")\n", "b2.insert(\"blah\")\n", "\n", "b_intersect = b1.intersect(b2)\n", "b_intersect.lookup(\"foo\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "b_intersect.lookup(\"blah\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "b_union = b1.union(b2) \n", "b_union.lookup(\"blah\"), b_union.lookup(\"bar\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Partitioned Bloom Filters\n", "\n", "The _partitioned Bloom filter_ simply divides the set of buckets into several partitions (one for each hash function) so that, e.g., a bit in partition 0 can only be set by hash 0, and so on. A major advantage of the partitioned Bloom filter is that it has a better false positive rate under intersection (see the reference to Jeffrey and Steffan below), which can be better used to identify potential conflicts between very large sets.\n", "\n", "Because we track the count of hash functions explicitly (in the count of partitions), we can also easily adapt the cardinality estimation technique of [Swamidass and Baldi](http://www.igb.uci.edu/~pfbaldi/publications/journals/2007/ci600526a.pdf)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class PartitionedBloom(object):\n", " def __init__(self, size, hashes):\n", " \"\"\" Initializes a Bloom filter with the\n", " given per-partition size and a collection \n", " of hashes, which are functions taking \n", " arbitrary values and returning integers. 
\n", " The partition count is the number of hashes.\n", " \n", " hashes can be either a function taking \n", " a value and returning a list of results\n", " or a list of functions. In the latter \n", " case, this constructor will synthesize \n", " the former \"\"\"\n", " if hasattr(hashes, '__call__'):\n", " self.__hashes = hashes\n", " # inspect the tuple returned by the hash function to get a depth\n", " self.__depth = len(hashes(bytes()))\n", " else:\n", " funs = hashes[:]\n", " self.__depth = len(hashes)\n", " def h(value):\n", " return [int(f(value)) for f in funs]\n", " self.__hashes = h\n", " \n", " self.__buckets = BitVector(size * self.__depth)\n", " self.__size = size\n", "\n", " \n", " def size(self):\n", " return self.__size\n", " \n", " def partitions(self):\n", " return self.__depth\n", " \n", " def insert(self, value):\n", " \"\"\" Inserts a value into this set \"\"\"\n", " for (p, row) in enumerate(self.__hashes(value)):\n", " self.__buckets[(p * self.__size) + (row % self.__size)] = True\n", " \n", " def lookup(self, value):\n", " \"\"\" Returns true if value may be in this set\n", " (i.e., may return false positives) \"\"\"\n", " for (p, row) in enumerate(self.__hashes(value)):\n", " if not self.__buckets[(p * self.__size) + (row % self.__size)]:\n", " return False\n", " return True\n", " \n", " def merge_from(self, other):\n", " \"\"\" Merges other in to this filter by \n", " taking the bitwise OR of this and \n", " other. Updates this filter in place. \"\"\"\n", " self.__buckets.merge_from(other.__buckets)\n", " \n", " def intersect(self, other):\n", " \"\"\" Takes the approximate intersection of \n", " this and other, returning a new filter \n", " approximating the membership of the \n", " intersection of the set approximated \n", " by self and the set approximated by other.\n", " \n", " The upper bound on the false positive rate \n", " of the resulting filter is the greater of \n", " the false positive rates of self and other \n", " (but the FPR may be worse than the FPR of \n", " a Bloom filter constructed only from the \n", " values in the intersection of the sets \n", " approximated by self and other). \"\"\"\n", " \n", " b = PartitionedBloom(self.size(), self.__hashes)\n", " b.__buckets.merge_from(self.__buckets)\n", " b.__buckets.intersect_from(other.__buckets)\n", " return b\n", " \n", " def union(self, other):\n", " \"\"\" Generates a Bloom filter approximating the \n", " membership of the union of the set approximated\n", " by self and the set approximated by other.\n", " \n", " Unlike intersect, this does not affect the \n", " precision of the filter (i.e., its precision\n", " will be identical to that of a Bloom filter \n", " built up from the union of the two sets). \"\"\"\n", " \n", " b = PartitionedBloom(self.size(), self.__hashes)\n", " b.__buckets.merge_from(self.__buckets)\n", " b.__buckets.merge_from(other.__buckets)\n", " return b\n", " \n", " \n", " def dup(self):\n", " b = PartitionedBloom(self.size(), self.__hashes)\n", " b.merge_from(self)\n", " return b\n", " \n", " def approx_cardinality(self):\n", " \"\"\" Returns an estimate of the cardinality of\n", " the set modeled by this filter. Uses\n", " a technique due to Swamidass and Baldi. 
\"\"\"\n", " from math import log\n", " m, k = self.size() * self.partitions(), self.partitions()\n", " X = self.__buckets.count_set_bits()\n", " print(m, k, X)\n", " return -(m / k) * log(1 - (X / m))\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def pbloom_experiment(sample_count, size, hashes, mod1=3, mod2=7, seed=0x15300625):\n", " import random\n", " from collections import namedtuple\n", " \n", " random.seed(seed)\n", " pb1 = PartitionedBloom(size, hashes)\n", " pb2 = PartitionedBloom(size, hashes)\n", " \n", " b1 = Bloom(pb1.size() * pb1.partitions(), hashes)\n", " b2 = Bloom(pb1.size() * pb1.partitions(), hashes)\n", " result = []\n", " pb_fp, b_fp = 0, 0\n", " \n", " count = 0\n", " \n", " for i in range(sample_count):\n", " bits = random.getrandbits(64)\n", " if i % mod1 == 0:\n", " pb1.insert(bits)\n", " b1.insert(bits)\n", " if i % mod2 == 0:\n", " pb2.insert(bits)\n", " b2.insert(bits)\n", " if i % mod1 == 0:\n", " count += 1\n", " \n", " pb = pb1.intersect(pb2)\n", " b = b1.intersect(b2)\n", " \n", " random.seed(seed)\n", " \n", " for i in range(sample_count):\n", " bits = random.getrandbits(64)\n", " if pb.lookup(bits) and ((i % mod1 != 0) or (i % mod2 != 0)):\n", " pb_fp += 1\n", " if b.lookup(bits) and ((i % mod1 != 0) or (i % mod2 != 0)):\n", " b_fp += 1\n", " return (count, b_fp, pb_fp)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "results = []\n", "\n", "for pwr in range(10, 17):\n", " for count in [1 << pwr, (1 << pwr) + (1 << (pwr - 1))]:\n", " tp, bfp, pbfp = pbloom_experiment(count, 16384, hashes_for(8, 4))\n", " results.append((\"Bloom\", count, bfp / (float(tp) + bfp)))\n", " results.append((\"partitioned Bloom\", count, pbfp / (float(tp) + pbfp)))\n", " \n", "df = DataFrame.from_records(results )\n", "df.rename(columns={0: \"kind\", 1: \"cardinality\", 2: \"FPR\"}, inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import altair as alt\n", "\n", "alt.renderers.enable('notebook')\n", "base = alt.Chart(df).encode(alt.X(\"cardinality\", scale=alt.Scale(type=\"log\", base=2)), y=\"FPR\", color=\"kind\")\n", "base.mark_point() + base.mark_line()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Applications\n", "\n", "* The application Bloom used as a case study in [his paper introducing the structure](http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.2080) was a hyphenation program, in which roughly 90% of words could be hyphenated by simple rules but 10% required a dictionary lookup -- and the dictionary was too large to hold in core. By using a small Bloom filter to record the words that required dictionary lookup, it would be possible possible to dramatically reduce disk accesses without impacting the correctness of the application.\n", "* [Bloom join](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.134.5196) is a classic technique to optimize joins in distributed databases. The result of a join is, logically, the subset of the Cartesian product of two relations that satisfies some predicate (typically an equality relation on a field). There are many strategies for implementing joins in conventional databases, but these may prove prohibitive in a distributed database, where different relations may reside on different machines. Bloom join optimizes these joins by allowing local filtering of the relations involved. 
For example, consider the SQL statement `SELECT * FROM A, B WHERE A.x = B.x`: by broadcasting Bloom filters of the sets of values for `x` in both `A` and `B`, it is possible to filter out many tuples that would never appear in the result of the join.\n", "* Bloom filters are often implemented in hardware, since a range of microarchitectural features can benefit from fast approximate set membership queries. For one example application, see [Jeffrey and Steffan](http://www.eecg.toronto.edu/~steffan/papers/jeffrey_spaa11.pdf), in which the motivating example involves using Bloom filters to show that two hardware transactions do not interfere before allowing them to commit. (This technique is not their innovation; rather, the focus of Jeffrey and Steffan's work is to show that _partitioned Bloom filters_ admit a smaller false positive rate for the intersection of Bloom filters, and thus for testing set disjointness.)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }