{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Min hashing and Jaccard similarity\n", "\n", "#### Introduction\n", "\n", "There are many ways to measure how similar two strings are: [Hamming distance](http://nbviewer.ipython.org/github/BenLangmead/comp-genomics-class/blob/master/notebooks/CG_DP_EditDist.ipynb), [edit distance](http://nbviewer.ipython.org/github/BenLangmead/comp-genomics-class/blob/master/notebooks/CG_DP_EditDist.ipynb) and [global alignment value](http://nbviewer.ipython.org/github/BenLangmead/comp-genomics-class/blob/master/notebooks/CG_DP_Global.ipynb) for example. Another way is to turn each string into a set, e.g. the set of its constituent $k$-mers, then consider how similar the sets are.\n", "\n", "We define a function that, given a string, returns the set of its constituent $k$-mers." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "def string_to_kmer_set(Astr, k):\n", " return set([Astr[i:i+k] for i in range(len(Astr)-k+1)])" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'ell', 'hel', 'llo'}" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string_to_kmer_set(\"hello\", 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Jaccard similarity coefficient $J(A, B)$ of non-empty sets $A$ and $B$ is:\n", "\n", "$$\\frac{|A \\cap B|}{|A \\cup B|}$$\n", "\n", "It equals 1 when the sets are identical and 0 when they are disjoint. Otherwise it is between 0 and 1." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def jaccard(Aset, Bset):\n", " # return Jaccard similarity coefficient between two sets\n", " isz = len(Aset.intersection(Bset))\n", " return float(isz) / (len(Aset) + len(Bset) - isz)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def jaccard_kmer(Astr, Bstr, k):\n", " # turn two strings into sets, then return Jaccard similarity coefficient of those sets\n", " return jaccard(string_to_kmer_set(Astr, k),\n", " string_to_kmer_set(Bstr, k))" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.3333333333333333" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "jaccard_kmer(\"ABC\", \"ABD\", 2)\n", "# intersection: {AB}, union: {AB, BC, BD}\n", "# so answer = 1/3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Evaluating use of sets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Explicitly building and intersecting sets of strings seems inefficient. Let's see how long it takes to run on many randomly generated pairs of similar strings." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "import random\n", "\n", "def add_mutations(string, num_muts):\n", " \"\"\" Add num_muts random substitution mutations to string \"\"\"\n", " for _ in range(num_muts):\n", " rndi = random.randint(0, len(string)-1)\n", " string = string[:rndi] + random.choice('ACGT') + string[rndi+1:]\n", " return string\n", "\n", "def random_jaccard_kmer(length, k):\n", " \"\"\" Make a random string and a second string separated from the\n", " first by a few mutations, then return the two strings and\n", " their jaccard similarity coefficient. \"\"\"\n", " str1 = ''.join([random.choice('ACGT') for _ in range(length)])\n", " str2 = add_mutations(str1, random.randint(0, float(length)/20))\n", " return str1, str2, jaccard_kmer(str1, str2, k)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'ACGCGCGT'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random.seed(77)\n", "add_mutations('ACGTACGT', 2)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('GTTCGATCGGTTCAGGCGAA', 'GTTCGATCGGTTCAGGCGTA', 0.7777777777777778)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random.seed(76)\n", "random_jaccard_kmer(20, 4)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "16.9908575999998" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import timeit\n", "timeit.timeit('random_jaccard_kmer(1000, 10)',\n", " setup='''\n", "from __main__ import random_jaccard_kmer;\n", "import random;\n", "random.seed(223)''',\n", " number=10000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It takes >10 seconds to find Jaccard similarities between 10,000 random pairs of 100-long DNA strings, using $k$-mer length of 10." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Min hashing\n", "\n", "#### Introduction\n", "\n", "How about: instead of using the set of all $k$-mers from each string, we pick one representative $k$-mer from each string. Let's pick the *minimum alphabetically*. For example:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "def string_to_min_kmer(Astr, k):\n", " return min([Astr[i:i+k] for i in range(len(Astr)-k+1)])" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'ell'" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string_to_min_kmer(\"hello\", 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can compare two strings by comparing their minimal $k$-mers:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "def jaccard_min_kmer(Astr, Bstr, k):\n", " return 1 if string_to_min_kmer(Astr, k) == string_to_min_kmer(Bstr, k) else 0" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "jaccard_min_kmer(\"ABC\", \"ABD\", 2)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "jaccard_min_kmer(\"ABC\", \"ACB\", 2)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "jaccard_min_kmer(\"DBC\", \"ABC\", 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This can yield a Jaccard similarity of 0 or 1; we cannot distinguish intermedaite amounts of similarity. On the other hand, we avoided building any sets.\n", "\n", "#### Adding a hash function\n", "\n", "We'll use the [mmh3 library], which contains an implementation of [MurmurHash3], a fast and widely used non-cryptographic hash function. Instead of taking our representative as being the minimal $k$-mer, we'll first *hash* the $k$-mers, then take the $k$-mer with minimal *hash value*:\n", "\n", "[MurmurHash3]: https://code.google.com/p/smhasher/wiki/MurmurHash3\n", "[mmh3 library]: https://pypi.python.org/pypi/mmh3" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# you might need to 'pip install mmh3' first\n", "import mmh3" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "def string_to_min_hash(Astr, k):\n", " return min([mmh3.hash (Astr[i:i+k]) for i in range(len(Astr)-k+1)])" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-173395898" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string_to_min_hash(\"hello\", 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's the minimum among the hash values of the 3-mers of \"hello\"." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "def jaccard_min_kmer_hash(Astr, Bstr, k):\n", " return 1 if string_to_min_hash(Astr, k) == string_to_min_hash(Bstr, k) else 0" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "jaccard_min_kmer_hash(\"ABC\", \"ABD\", 2)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "jaccard_min_kmer_hash(\"ABC\", \"ACB\", 2)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "jaccard_min_kmer_hash(\"DBC\", \"ABC\", 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`jaccard_min_kmer_hash`'s return value won't necessarily match `jaccard_min_kmer`'s, since the function permutes the alphabetical order of the $k$-mers." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Multiple hash functions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`jaccard_min_kmer` and `jaccard_min_kmer_hash` return 0 or 1 -- not very precise! We can get a better estimate by calling `jaccard_min_kmer_hash` multiple times, each time using a different hash function.\n", "\n", "Let's rewrite `string_to_min_hash` to include a `seed` parameter that [\"salts\"] the hash function.\n", "\n", "[\"salts\"]: http://en.wikipedia.org/wiki/Salt_(cryptography)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "def string_to_min_hash(Astr, k, seed=0):\n", " return min([mmh3.hash(Astr[i:i+k], seed) for i in range(len(Astr)-k+1)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can call `string_to_min_hash` with various hash functions, or various saltings of the same function." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[-1948827108,\n", " -1610908706,\n", " -1823680268,\n", " -1885168061,\n", " -1068521670,\n", " -1692363780,\n", " -1923178236,\n", " -85412340,\n", " -1121674942,\n", " -2094403364]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[string_to_min_hash(\"hello\", 3, k) for k in range(10, 20)]" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "def jaccard_min_kmer_hashes(Astr, Bstr, k, seeds=[0]):\n", " tot = sum(string_to_min_hash(Astr, k, seed) == string_to_min_hash(Bstr, k, seed) for seed in seeds)\n", " return float(tot) / len(seeds)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.3" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "jaccard_min_kmer_hashes(\"ABC\", \"ABD\", 2, seeds=range(10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not a terrible estimate." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.38" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "jaccard_min_kmer_hashes(\"ABC\", \"ABD\", 2, seeds=range(100))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, not terrible." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.3299" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "jaccard_min_kmer_hashes(\"ABC\", \"ABD\", 2, seeds=range(10000))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A very good estimate: off by only about 1%." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Why does this function give an estimate of Jaccard similarity? Each hash function *permutes* the ordering of the $k$-mers differently. For each permutation, some $k$-mer from the union of all $k$-mers becomes the minimal one. By calculating the fraction of the hash functions for which this minimal $k$-mer is present in both sets, we're estimating the size of the intersection divided by the size of the union: the Jaccard similarity." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "91.05219180000131" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def random_jaccard_kmer(length, k):\n", " str1 = ''.join([random.choice('ACGT') for _ in range(length)])\n", " str2 = add_mutations(str1, random.randint(0, length/20))\n", " return str1, str2, jaccard_min_kmer_hashes(str1, str2, k, seeds=range(10))\n", "\n", "import timeit\n", "timeit.timeit('random_jaccard_kmer(1000, 10)',\n", " setup='''\n", "from __main__ import random_jaccard_kmer;\n", "import random;\n", "random.seed(223)''',\n", " number=10000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's slower than what we had before, but for large enough sets it could be faster." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 1 }