{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\nWord Mover's Distance\n=====================\n\nDemonstrates using Gensim's implemenation of the WMD.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Word Mover's Distance (WMD) is a promising new tool in machine learning that\nallows us to submit a query and return the most relevant documents. This\ntutorial introduces WMD and shows how you can compute the WMD distance\nbetween two documents using ``wmdistance``.\n\nWMD Basics\n----------\n\nWMD enables us to assess the \"distance\" between two documents in a meaningful\nway even when they have no words in common. It uses `word2vec\n<http://rare-technologies.com/word2vec-tutorial/>`_ [4] vector embeddings of\nwords. It been shown to outperform many of the state-of-the-art methods in\nk-nearest neighbors classification [3].\n\nWMD is illustrated below for two very similar sentences (illustration taken\nfrom `Vlad Niculae's blog\n<http://vene.ro/blog/word-movers-distance-in-python.html>`_). The sentences\nhave no words in common, but by matching the relevant words, WMD is able to\naccurately measure the (dis)similarity between the two sentences. The method\nalso uses the bag-of-words representation of the documents (simply put, the\nword's frequencies in the documents), noted as $d$ in the figure below. The\nintuition behind the method is that we find the minimum \"traveling distance\"\nbetween documents, in other words the most efficient way to \"move\" the\ndistribution of document 1 to the distribution of document 2.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Image from https://vene.ro/images/wmd-obama.png\nimport matplotlib.pyplot as plt\nimport matplotlib.image as mpimg\nimg = mpimg.imread('wmd-obama.png')\nimgplot = plt.imshow(img)\nplt.axis('off')\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "This method was introduced in the article \"From Word Embeddings To Document\nDistances\" by Matt Kusner et al. (\\ `link to PDF\n<http://jmlr.org/proceedings/papers/v37/kusnerb15.pdf>`_\\ ). It is inspired\nby the \"Earth Mover's Distance\", and employs a solver of the \"transportation\nproblem\".\n\nIn this tutorial, we will learn how to use Gensim's WMD functionality, which\nconsists of the ``wmdistance`` method for distance computation, and the\n``WmdSimilarity`` class for corpus based similarity queries.\n\n.. Important::\n   If you use Gensim's WMD functionality, please consider citing [1], [2] and [3].\n\nComputing the Word Mover's Distance\n-----------------------------------\n\nTo use WMD, you need some existing word embeddings.\nYou could train your own Word2Vec model, but that is beyond the scope of this tutorial\n(check out `sphx_glr_auto_examples_tutorials_run_word2vec.py` if you're interested).\nFor this tutorial, we'll be using an existing Word2Vec model.\n\nLet's take some sentences to compute the distance between.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Initialize logging.\nimport logging\nlogging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)\n\nsentence_obama = 'Obama speaks to the media in Illinois'\nsentence_president = 'The president greets the press in Chicago'"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "These sentences have very similar content, and as such the WMD should be low.\nBefore we compute the WMD, we want to remove stopwords (\"the\", \"to\", etc.),\nas these do not contribute a lot to the information in the sentences.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Import and download stopwords from NLTK.\nfrom nltk.corpus import stopwords\nfrom nltk import download\ndownload('stopwords')  # Download stopwords list.\nstop_words = stopwords.words('english')\n\ndef preprocess(sentence):\n    return [w for w in sentence.lower().split() if w not in stop_words]\n\nsentence_obama = preprocess(sentence_obama)\nsentence_president = preprocess(sentence_president)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now, as mentioned earlier, we will be using some downloaded pre-trained\nembeddings. We load these into a Gensim Word2Vec model class.\n\n.. Important::\n  The embeddings we have chosen here require a lot of memory.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import gensim.downloader as api\nmodel = api.load('word2vec-google-news-300')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "So let's compute WMD using the ``wmdistance`` method.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "distance = model.wmdistance(sentence_obama, sentence_president)\nprint('distance = %.4f' % distance)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let's try the same thing with two completely unrelated sentences. Notice that the distance is larger.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "sentence_orange = preprocess('Oranges are my favorite fruit')\ndistance = model.wmdistance(sentence_obama, sentence_orange)\nprint('distance = %.4f' % distance)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "References\n----------\n\n1. Ofir Pele and Michael Werman, *A linear time histogram metric for improved SIFT matching*, 2008.\n2. Ofir Pele and Michael Werman, *Fast and robust earth mover's distances*, 2009.\n3. Matt Kusner et al. *From Embeddings To Document Distances*, 2015.\n4. Tom\u00e1\u0161 Mikolov et al. *Efficient Estimation of Word Representations in Vector Space*, 2013.\n\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.7.3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}