{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Basic Bloom Filter\n", "\n", "Conventional bloom filter with two hash function optimisation based on [Less Hashing, Same Performance:\n", "Building a Better Bloom Filter](https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf)\n", "Adam Kirsch, Michael Mitzenmacher\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "struct BloomFilter\n", " bits::BitArray\n", " k::Int\n", "end" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.004836 seconds (4 allocations: 11.426 MiB)\n" ] }, { "data": { "text/plain": [ "BloomFilter: ~11701 KiB, 7 hashes, fpp=0.0078125" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "function BloomFilter(n, fpp)\n", " k = -log(fpp)/log(2)\n", " m = n*k/log(2)\n", " BloomFilter(falses(Int(ceil(m))), Int(ceil(k)))\n", "end\n", "\n", "fpp(bf::BloomFilter) = .5^bf.k\n", "\n", "function Base.show(io::IO, bf::BloomFilter)\n", " print(io, \"BloomFilter: ~\", Int(ceil(length(bf.bits)/8/1024)), \" KiB, \", bf.k, \" hashes, fpp=\", fpp(bf))\n", "end\n", "\n", "@time bf = BloomFilter(10_000_000, 0.01)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "function add!(bf::BloomFilter, s)\n", " m=length(bf.bits)\n", " hash1 = hash(s, UInt(9)) \n", " hash2 = hash(s, UInt(17))\n", " for i in 1:bf.k\n", " bf.bits[(hash1+i*hash2)%m] = 1\n", " end\n", "end\n", "\n", "add!(bf, \"hello\")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "function in(s, bf::BloomFilter)\n", " m=length(bf.bits)\n", " hash1 = hash(s, UInt(9)) \n", " hash2 = hash(s, UInt(17))\n", " !any(b->b, (!bf.bits[(hash1+i*hash2)%m] for i in 1:bf.k))\n", "end\n", "\n", "@assert \"hello\" in bf\n", "@assert !(\"not hello\" in bf)\n", "@assert !(\"hello not\" in bf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Counting Bloom Filter\n", "\n", "See [Analysis of Counting Bloom Filters Used for Count Thresholding](https://www.mdpi.com/2079-9292/8/7/779) - by Kibeom Kim, Yongjo Jeong, Youngjoo Lee and Sunggu Lee" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "struct CountingBloomFilter{T <: Integer}\n", " counters::Array{T}\n", " k::Int\n", "end" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.007179 seconds (3 allocations: 18.282 MiB)\n" ] }, { "data": { "text/plain": [ "CountingBloomFilter: ~18721 KiB, 7 hashes, fpp=0.0078125" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "function CountingBloomFilter{T}(n::Int, fpp::Real) where T<:Integer\n", " k = -log(fpp)/log(2) # See paper for a better way of estimating optimal value\n", " m = n*k/log(2)\n", " CountingBloomFilter(zeros(T, Int(ceil(m))), Int(ceil(k)))\n", "end\n", "\n", "fpp(cbf::CountingBloomFilter) = .5^cbf.k\n", "\n", "function Base.show(io::IO, bf::CountingBloomFilter{T}) where T\n", " print(io, \"CountingBloomFilter: ~\", Int(ceil(length(bf.counters)*sizeof(T)/1024)), \" KiB, \", bf.k, \" hashes, fpp=\", fpp(bf))\n", "end\n", "\n", "@time nsbf = CountingBloomFilter{Int16}(1_000_000, 0.01)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "function add!(bf::CountingBloomFilter, s)\n", " m=length(bf.counters)\n", " hash1 = hash(s, UInt(9)) \n", " hash2 = hash(s, UInt(17))\n", " for i in 1:bf.k\n", " bf.counters[(hash1+i*hash2)%m] += 1\n", " end\n", "end\n", "\n", "add!(nsbf, \"hello\")\n", "add!(nsbf, \"hello\")" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "function count(bf::CountingBloomFilter, s)\n", " m=length(bf.counters)\n", " hash1 = hash(s, UInt(9)) \n", " hash2 = hash(s, UInt(17))\n", " minimum(bf.counters[(hash1+i*hash2)%m] for i in 1:bf.k)\n", "end\n", "\n", "@assert count(nsbf, \"hello\") == 2\n", "@assert count(nsbf, \"hellooo\") == 0\n", "\n", "add!(nsbf, \"hello\")\n", "@assert count(nsbf, \"hello\") == 3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Spectral Bloom Filters\n", "\n", "See [Spectral Bloom Filters](http://theory.stanford.edu/~matias/papers/sbf-sigmod-03.pdf) Saar Cohen, Yossi Matias\n", "and [MSc thesis](http://theory.stanford.edu/~matias/papers/sbf_thesis.pdf) for fuller treatment." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Dynamic Count (Bloom) Filters\n", "\n", "Based on [Dynamic Count Filters](https://dl.acm.org/doi/pdf/10.1145/1121995.1122000) - J. Aguilar-Saborit, P. Trancoso, V. Muntes-Mulero, J.L. Larriba-Pey\n", "\n", "(the below is not fully complete and could be tidyied up considerably but it shows the idea)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# Increment a range inside a bit array by one, returns carry bit\n", "function incr!(ba::BitArray, r::UnitRange)\n", " carry = true\n", " i = r.start\n", " while carry && i < r.stop\n", " if ba[i] == 1\n", " ba[i] = 0\n", " else\n", " ba[i] = 1\n", " carry = false\n", " end\n", " i += 1\n", " end\n", " carry \n", "end\n", "\n", "# Returns the value of a range inside a bit array\n", "function value(ba::BitArray, r::UnitRange)\n", " bar = ba[r]\n", " length(bar) >0 ? sum(bar[i+1]*(2^i) for i in 0:length(bar)-1) : 0\n", "end\n", "\n", "testArray = falses(10)\n", "@assert value(testArray, 3:5) == 0\n", "@assert incr!(testArray, 3:5) == false\n", "@assert value(testArray, 3:5) == 1\n", "@assert incr!(testArray, 3:5) == false\n", "@assert value(testArray, 3:5) == 2\n", "@assert incr!(testArray, 3:5) == false\n", "@assert value(testArray, 3:5) == 3\n", "@assert incr!(testArray, 3:5) == true\n", "@assert value(testArray, 3:5) == 0\n", "@assert incr!(testArray, 3:5) == false\n", "@assert value(testArray, 3:5) == 1" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "mutable struct DynamicCountFilter{T <: Integer}\n", " CBFV::Array{T}\n", " OFV::BitArray\n", " k::Int\n", " y::Int\n", "end" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.000002 seconds (4 allocations: 304 bytes)\n" ] }, { "data": { "text/plain": [ "DynamicCountFilter: ≈1 KiB, 7 hashes, fpp=0.0078125" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "function DynamicCountFilter{T}(n::Int, fpp::Real) where T<:Integer\n", " k = -log(fpp)/log(2)\n", " m = Int(ceil(n*k/log(2)))\n", " y = 0\n", " DynamicCountFilter(zeros(T, m), falses(y*m), Int(ceil(k)), y)\n", "end\n", "\n", "fpp(dcf::DynamicCountFilter) = .5^dcf.k\n", "\n", "function Base.show(io::IO, dcf::DynamicCountFilter{T}) where T\n", " cbfv_bytes = length(dcf.CBFV)*sizeof(T)\n", " ofv_bytes = length(dcf.OFV)/8\n", " total_mem_kb = Int(ceil((cbfv_bytes+ofv_bytes)/1024))\n", " print(io, \"DynamicCountFilter: ≈\", total_mem_kb, \" KiB, \", dcf.k, \" hashes, fpp=\", fpp(dcf))\n", "end\n", "\n", "@time dcf = DynamicCountFilter{UInt8}(5, 0.01)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(dcf.OFV, dcf.CBFV) = (Bool[], UInt8[0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00])\n" ] } ], "source": [ "function add!(dcf::DynamicCountFilter{T}, s) where T\n", " m=length(dcf.CBFV)\n", " hash1 = hash(s, UInt(9)) \n", " hash2 = hash(s, UInt(17))\n", " for i in 1:dcf.k\n", " j = (hash1+i*hash2)%m\n", " if dcf.CBFV[j+1] == typemax(T)\n", " y = dcf.y\n", " carry = incr!(dcf.OFV, (y*j)+1:y*(j+1))\n", " if (carry)\n", " # Rebuild\n", " ny = y+1\n", " newOFV = falses(m*ny)\n", " for l in 0:m-1\n", " newOFV[(ny*l)+1:(ny*l)+y] = dcf.OFV[(y*l)+1:y*(l+1)]\n", " end\n", " dcf.OFV = newOFV\n", " dcf.y = ny\n", " \n", " # set new carry bit\n", " dcf.OFV[ny*(j+1)] = 1\n", " dcf.CBFV[j+1] = 0\n", " end\n", " else\n", " dcf.CBFV[j+1] += 1\n", " end\n", " end\n", "end\n", "\n", "@show dcf.OFV, dcf.CBFV;" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "function count(dcf::DynamicCountFilter, s)\n", " m=length(dcf.CBFV)\n", " hash1 = hash(s, UInt(9)) \n", " hash2 = hash(s, UInt(17))\n", " mₓ=typemax(Int64)\n", " for i in 1:dcf.k\n", " b = (hash1+i*hash2)%m\n", " cₓ = (value(dcf.OFV, (dcf.y*b)+1:dcf.y*(b+1))<<8) + dcf.CBFV[b]\n", " if cₓ < mₓ\n", " mₓ = cₓ\n", " end\n", " end\n", " mₓ\n", "end\n", "\n", "count(dcf, \"hello\")" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "for i in 1:256\n", " add!(dcf, \"hello\")\n", "end" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(dcf.OFV, dcf.CBFV) = (Bool[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0], UInt8[0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff, 0x00, 0x00, 0x00, 0xff])\n" ] } ], "source": [ "@show dcf.OFV, dcf.CBFV;" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "256" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "count(dcf, \"hello\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Julia 1.4.1", "language": "julia", "name": "julia-1.4" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.4.1" } }, "nbformat": 4, "nbformat_minor": 4 }