{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Min hashing and Jaccard similarity\n",
    "\n",
    "#### Introduction\n",
    "\n",
    "There are many ways to measure how similar two strings are: [Hamming distance](http://nbviewer.ipython.org/github/BenLangmead/comp-genomics-class/blob/master/notebooks/CG_DP_EditDist.ipynb), [edit distance](http://nbviewer.ipython.org/github/BenLangmead/comp-genomics-class/blob/master/notebooks/CG_DP_EditDist.ipynb) and [global alignment value](http://nbviewer.ipython.org/github/BenLangmead/comp-genomics-class/blob/master/notebooks/CG_DP_Global.ipynb) for example.  Another way is to turn each string into a set, e.g. the set of its constituent $k$-mers, then consider how similar the sets are.\n",
    "\n",
    "We define a function that, given a string, returns the set of its constituent $k$-mers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "def string_to_kmer_set(Astr, k):\n",
    "    return set([Astr[i:i+k] for i in range(len(Astr)-k+1)])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'ell', 'hel', 'llo'}"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "string_to_kmer_set(\"hello\", 3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Jaccard similarity coefficient $J(A, B)$ of non-empty sets $A$ and $B$ is:\n",
    "\n",
    "$$\\frac{|A \\cap B|}{|A \\cup B|}$$\n",
    "\n",
    "It equals 1 when the sets are identical and 0 when they are disjoint.  Otherwise it is between 0 and 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def jaccard(Aset, Bset):\n",
    "    # return Jaccard similarity coefficient between two sets\n",
    "    isz = len(Aset.intersection(Bset))\n",
    "    return float(isz) / (len(Aset) + len(Bset) - isz)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def jaccard_kmer(Astr, Bstr, k):\n",
    "    # turn two strings into sets, then return Jaccard similarity coefficient of those sets\n",
    "    return jaccard(string_to_kmer_set(Astr, k),\n",
    "                   string_to_kmer_set(Bstr, k))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.3333333333333333"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "jaccard_kmer(\"ABC\", \"ABD\", 2)\n",
    "# intersection: {AB}, union: {AB, BC, BD}\n",
    "# so answer = 1/3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Evaluating use of sets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Explicitly building and intersecting sets of strings seems inefficient.  Let's see how long it takes to run on many randomly generated pairs of similar strings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "\n",
    "def add_mutations(string, num_muts):\n",
    "    \"\"\" Add num_muts random substitution mutations to string \"\"\"\n",
    "    for _ in range(num_muts):\n",
    "        rndi = random.randint(0, len(string)-1)\n",
    "        string = string[:rndi] + random.choice('ACGT') + string[rndi+1:]\n",
    "    return string\n",
    "\n",
    "def random_jaccard_kmer(length, k):\n",
    "    \"\"\" Make a random string and a second string separated from the\n",
    "        first by a few mutations, then return the two strings and\n",
    "        their jaccard similarity coefficient. \"\"\"\n",
    "    str1 = ''.join([random.choice('ACGT') for _ in range(length)])\n",
    "    str2 = add_mutations(str1, random.randint(0, float(length)/20))\n",
    "    return str1, str2, jaccard_kmer(str1, str2, k)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'ACGCGCGT'"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "random.seed(77)\n",
    "add_mutations('ACGTACGT', 2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('GTTCGATCGGTTCAGGCGAA', 'GTTCGATCGGTTCAGGCGTA', 0.7777777777777778)"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "random.seed(76)\n",
    "random_jaccard_kmer(20, 4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "16.9908575999998"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import timeit\n",
    "timeit.timeit('random_jaccard_kmer(1000, 10)',\n",
    "              setup='''\n",
    "from __main__ import random_jaccard_kmer;\n",
    "import random;\n",
    "random.seed(223)''',\n",
    "              number=10000)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It takes >10 seconds to find Jaccard similarities between 10,000 random pairs of 100-long DNA strings, using $k$-mer length of 10."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Min hashing\n",
    "\n",
    "#### Introduction\n",
    "\n",
    "How about: instead of using the set of all $k$-mers from each string, we pick one representative $k$-mer from each string.  Let's pick the *minimum alphabetically*.  For example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "def string_to_min_kmer(Astr, k):\n",
    "    return min([Astr[i:i+k] for i in range(len(Astr)-k+1)])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'ell'"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "string_to_min_kmer(\"hello\", 3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can compare two strings by comparing their minimal $k$-mers:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "def jaccard_min_kmer(Astr, Bstr, k):\n",
    "    return 1 if string_to_min_kmer(Astr, k) == string_to_min_kmer(Bstr, k) else 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "jaccard_min_kmer(\"ABC\", \"ABD\", 2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "jaccard_min_kmer(\"ABC\", \"ACB\", 2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "jaccard_min_kmer(\"DBC\", \"ABC\", 2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This can yield a Jaccard similarity of 0 or 1; we cannot distinguish intermedaite amounts of similarity.  On the other hand, we avoided building any sets.\n",
    "\n",
    "#### Adding a hash function\n",
    "\n",
    "We'll use the [mmh3 library], which contains an implementation of [MurmurHash3], a fast and widely used non-cryptographic hash function.  Instead of taking our representative as being the minimal $k$-mer, we'll first *hash* the $k$-mers, then take the $k$-mer with minimal *hash value*:\n",
    "\n",
    "[MurmurHash3]: https://code.google.com/p/smhasher/wiki/MurmurHash3\n",
    "[mmh3 library]: https://pypi.python.org/pypi/mmh3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "# you might need to 'pip install mmh3' first\n",
    "import mmh3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "def string_to_min_hash(Astr, k):\n",
    "    return min([mmh3.hash (Astr[i:i+k]) for i in range(len(Astr)-k+1)])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "-173395898"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "string_to_min_hash(\"hello\", 3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's the minimum among the hash values of the 3-mers of \"hello\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "def jaccard_min_kmer_hash(Astr, Bstr, k):\n",
    "    return 1 if string_to_min_hash(Astr, k) == string_to_min_hash(Bstr, k) else 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "jaccard_min_kmer_hash(\"ABC\", \"ABD\", 2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "jaccard_min_kmer_hash(\"ABC\", \"ACB\", 2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "jaccard_min_kmer_hash(\"DBC\", \"ABC\", 2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`jaccard_min_kmer_hash`'s return value won't necessarily match `jaccard_min_kmer`'s, since the function permutes the alphabetical order of the $k$-mers."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Multiple hash functions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`jaccard_min_kmer` and `jaccard_min_kmer_hash` return 0 or 1 -- not very precise!  We can get a better estimate by calling `jaccard_min_kmer_hash` multiple times, each time using a different hash function.\n",
    "\n",
    "Let's rewrite `string_to_min_hash` to include a `seed` parameter that [\"salts\"] the hash function.\n",
    "\n",
    "[\"salts\"]: http://en.wikipedia.org/wiki/Salt_(cryptography)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "def string_to_min_hash(Astr, k, seed=0):\n",
    "    return min([mmh3.hash(Astr[i:i+k], seed) for i in range(len(Astr)-k+1)])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can call `string_to_min_hash` with various hash functions, or various saltings of the same function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[-1948827108,\n",
       " -1610908706,\n",
       " -1823680268,\n",
       " -1885168061,\n",
       " -1068521670,\n",
       " -1692363780,\n",
       " -1923178236,\n",
       " -85412340,\n",
       " -1121674942,\n",
       " -2094403364]"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[string_to_min_hash(\"hello\", 3, k) for k in range(10, 20)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "def jaccard_min_kmer_hashes(Astr, Bstr, k, seeds=[0]):\n",
    "    tot = sum(string_to_min_hash(Astr, k, seed) == string_to_min_hash(Bstr, k, seed) for seed in seeds)\n",
    "    return float(tot) / len(seeds)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.3"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "jaccard_min_kmer_hashes(\"ABC\", \"ABD\", 2, seeds=range(10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Not a terrible estimate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.38"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "jaccard_min_kmer_hashes(\"ABC\", \"ABD\", 2, seeds=range(100))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again, not terrible."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.3299"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "jaccard_min_kmer_hashes(\"ABC\", \"ABD\", 2, seeds=range(10000))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A very good estimate: off by only about 1%."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Why does this function give an estimate of Jaccard similarity?  Each hash function *permutes* the ordering of the $k$-mers differently.  For each permutation, some $k$-mer from the union of all $k$-mers becomes the minimal one.  By calculating the fraction of the hash functions for which this minimal $k$-mer is present in both sets, we're estimating the size of the intersection divided by the size of the union: the Jaccard similarity."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "91.05219180000131"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def random_jaccard_kmer(length, k):\n",
    "    str1 = ''.join([random.choice('ACGT') for _ in range(length)])\n",
    "    str2 = add_mutations(str1, random.randint(0, length/20))\n",
    "    return str1, str2, jaccard_min_kmer_hashes(str1, str2, k, seeds=range(10))\n",
    "\n",
    "import timeit\n",
    "timeit.timeit('random_jaccard_kmer(1000, 10)',\n",
    "              setup='''\n",
    "from __main__ import random_jaccard_kmer;\n",
    "import random;\n",
    "random.seed(223)''',\n",
    "              number=10000)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It's slower than what we had before, but for large enough sets it could be faster."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}