{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Minhash\n", "\n", "To calculate the similarity of two sets, we can use the [Jaccard index](https://en.wikipedia.org/wiki/Jaccard_index), which divides the size of the sets' intersection by the size of their union. As with the other problems we've discussed so far, keeping explicit representations of sets around is intractable for very large sets, but it is also intractable if we have very many sets, for example, if we're building a search engine. We would like a way to construct _signatures_ of sets in such a way that we can calculate their approximate similarity scalably.\n", "\n", "Minhash is a technique for constructing signatures of sets that will allow us to estimate their approximate similarity. Here's the basic technique, which tracks document signatures by keeping track of the _minimum_ value seen for multiple hash functions across every element in the set." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import random\n", "import pickle" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.utils.murmurhash import murmurhash3_bytes_u32 as mhash\n", "\n", "\n", "def murmurmaker(seed):\n", " \"\"\" \n", " return a function to calculate a 32-bit murmurhash of v \n", " (an object or bytes), using the given seed\n", " \"\"\"\n", " def m(v):\n", " bvalue = None\n", " \n", " if type(v) == str:\n", " bvalue = v.encode(\"utf-8\")\n", " elif hasattr(v, \"to_bytes\"):\n", " bvalue = v.to_bytes(128, \"big\")\n", " elif type(v) == bytes:\n", " bvalue = v\n", " else:\n", " bvalue = pickle.dumps(v)\n", "\n", " return mhash(bvalue, seed=seed)\n", " \n", " return m\n", "\n", "\n", "class SimpleMinhash(object):\n", " \"\"\" This is a very basic implementation of minhash \"\"\"\n", " def __init__(self, hashes):\n", " rng = np.random.RandomState(seed=int.from_bytes(b\"rad!\", \"big\"))\n", " self.buckets = np.full(hashes, (1 << 32) - 1)\n", " self.hashes = [murmurmaker(seed) for seed in rng.randint(0, (1<<32) - 1, hashes)]\n", " \n", " def add(self, obj):\n", " self.buckets = np.minimum(self.buckets, [h(obj) for h in self.hashes])\n", " \n", " def similarity(self, other):\n", " \"\"\" \"\"\"\n", " return np.count_nonzero(self.buckets==other.buckets) / float(len(self.buckets))\n", " \n", " def merge(self, other):\n", " \"\"\" returns a newly-allocated minhash structure containing \n", " the merge of this hash and another \"\"\"\n", " result = SimpleMinhash(0)\n", " result.buckets = np.minimum(self.buckets, other.buckets)\n", " result.hashes = self.hashes\n", " return result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can test a small Minhash with random values to see how well the approximate Jaccard index implementation works." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def test_minhash(count=50000, expected_percentage=.20):\n", " m1 = SimpleMinhash(1024)\n", " m2 = SimpleMinhash(1024)\n", " for i in range(count):\n", " bits = random.getrandbits(64).to_bytes(8, \"big\")\n", " if i % 1000 < (1000 * expected_percentage):\n", " m1.add(bits)\n", " m2.add(bits)\n", " elif i % 2 == 0:\n", " m1.add(bits)\n", " else:\n", " m2.add(bits)\n", " return m1.similarity(m2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_minhash()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A very common application for these kinds of document signatures is identifying similar documents based on the words that they contain -- this is useful, e.g., for detecting plagiarized prose or grouping similar web pages or news articles together. Unfortunately, even having an efficient way to calculate pairwise similarities is insufficient for this application: it doesn't matter how cheap it is to do a pairwise comparison if we have to compare every pair in a large document collection! We can use _locality-sensitive hashing_ to quickly identify similar documents without explicit pairwise comparisons. The basic idea is that we'll return a set of keys, each corresponding to the hash of a subset of the signature." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class LSHMinhash(object):\n", " \"\"\" This is a very basic implementation of minhash with locality-sensitive hashing \"\"\"\n", " def __init__(self, rows, bands):\n", " rng = np.random.RandomState(seed=int.from_bytes(b\"rad!\", \"big\"))\n", " hashes = rows * bands\n", " self.rows = rows\n", " self.bands = bands\n", " self.buckets = np.full(hashes, (1 << 32) - 1)\n", " self.hashes = [murmurmaker(seed) for seed in rng.randint(0, (1<<32) - 1, hashes)]\n", " \n", " def add(self, obj):\n", " self.buckets = np.minimum(self.buckets, [h(obj) for h in self.hashes])\n", " \n", " def similarity(self, other):\n", " \"\"\" \"\"\"\n", " return np.count_nonzero(self.buckets==other.buckets) / float(len(self.buckets))\n", " \n", " def merge(self, other):\n", " \"\"\" returns a newly-allocated minhash structure containing \n", " the merge of this hash and another \"\"\"\n", " result = LSHMinhash(self.rows, self.bands)\n", " numpy.minimum(self.buckets, other.buckets, result.buckets)\n", " result.hashes = self.hashes\n", " return result\n", " \n", " def lsh_keys(self):\n", " return [self.hashes[0]([b for b in band]) for band in self.buckets.copy().reshape((self.rows, self.bands))]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def test_lsh_minhash(count=50000, expected_percentage=.20):\n", " m1 = LSHMinhash(64, 16)\n", " m2 = LSHMinhash(64, 16)\n", " for i in range(count):\n", " bits = random.getrandbits(64).to_bytes(8, \"big\")\n", " if i % 1000 < (1000 * expected_percentage):\n", " m1.add(bits)\n", " m2.add(bits)\n", " elif i % 2 == 0:\n", " m1.add(bits)\n", " else:\n", " m2.add(bits)\n", " return (m1.similarity(m2), m1.lsh_keys(), m2.lsh_keys())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tup = test_lsh_minhash(expected_percentage=.95)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can then group cells by keys (or even by parts of their keys) to identify candidate matches, which lets us only check a subset of all potential matches for similarity:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for t in zip(tup[1], tup[2]):\n", " if t[0] == t[1]:\n", " print(t)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def lsh_minhash_many(count=5000, expected_percentage = 0.2):\n", " lsh_minhashes = [LSHMinhash(64, 16) for i in range(32)]\n", " \n", " for i in range(count):\n", " bits = random.getrandbits(64).to_bytes(8, \"big\")\n", " if i % 1000 < (1000 * expected_percentage):\n", " [lshm.add(bits) for lshm in lsh_minhashes]\n", " else:\n", " lsh_minhashes[i % 32].add(bits)\n", " \n", " return lsh_minhashes" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lshm_results = lsh_minhash_many(5000,0.5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from collections import defaultdict\n", "\n", "bands = [defaultdict(lambda: list()) for i in range(64)]\n", "\n", "for ind, lshm in enumerate(lshm_results):\n", " for idx, key in enumerate(lshm.lsh_keys()):\n", " bands[idx][key % (1 << 14)].append(ind)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bands" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To learn more about Minhash, locality-sensitive hashing, and similar techniques, see [Chapter 3](http://infolab.stanford.edu/~ullman/mmds/ch3.pdf) of [_Mining of Massive Datasets_](http://infolab.stanford.edu/~ullman/mmds/book.pdf) by Leskovec, Rajaraman, and Ullman." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercises\n", "\n", "- (very easy) Implement the Jaccard index for explicit sets\n", "- Explain in your own words why Minhash approximates the Jaccard index\n", "- What are the tradeoffs in designing a technique for locality-sensitive hashing? How would you evaluate them?" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }