{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\nWord Mover's Distance\n=====================\n\nDemonstrates using Gensim's implemenation of the WMD.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Word Mover's Distance (WMD) is a promising new tool in machine learning that\nallows us to submit a query and return the most relevant documents. This\ntutorial introduces WMD and shows how you can compute the WMD distance\nbetween two documents using ``wmdistance``.\n\nWMD Basics\n----------\n\nWMD enables us to assess the \"distance\" between two documents in a meaningful\nway even when they have no words in common. It uses `word2vec\n`_ [4] vector embeddings of\nwords. It been shown to outperform many of the state-of-the-art methods in\nk-nearest neighbors classification [3].\n\nWMD is illustrated below for two very similar sentences (illustration taken\nfrom `Vlad Niculae's blog\n`_). The sentences\nhave no words in common, but by matching the relevant words, WMD is able to\naccurately measure the (dis)similarity between the two sentences. The method\nalso uses the bag-of-words representation of the documents (simply put, the\nword's frequencies in the documents), noted as $d$ in the figure below. The\nintuition behind the method is that we find the minimum \"traveling distance\"\nbetween documents, in other words the most efficient way to \"move\" the\ndistribution of document 1 to the distribution of document 2.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Image from https://vene.ro/images/wmd-obama.png\nimport matplotlib.pyplot as plt\nimport matplotlib.image as mpimg\nimg = mpimg.imread('wmd-obama.png')\nimgplot = plt.imshow(img)\nplt.axis('off')\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This method was introduced in the article \"From Word Embeddings To Document\nDistances\" by Matt Kusner et al. (\\ `link to PDF\n`_\\ ). It is inspired\nby the \"Earth Mover's Distance\", and employs a solver of the \"transportation\nproblem\".\n\nIn this tutorial, we will learn how to use Gensim's WMD functionality, which\nconsists of the ``wmdistance`` method for distance computation, and the\n``WmdSimilarity`` class for corpus based similarity queries.\n\n.. Important::\n If you use Gensim's WMD functionality, please consider citing [1], [2] and [3].\n\nComputing the Word Mover's Distance\n-----------------------------------\n\nTo use WMD, you need some existing word embeddings.\nYou could train your own Word2Vec model, but that is beyond the scope of this tutorial\n(check out `sphx_glr_auto_examples_tutorials_run_word2vec.py` if you're interested).\nFor this tutorial, we'll be using an existing Word2Vec model.\n\nLet's take some sentences to compute the distance between.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Initialize logging.\nimport logging\nlogging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)\n\nsentence_obama = 'Obama speaks to the media in Illinois'\nsentence_president = 'The president greets the press in Chicago'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These sentences have very similar content, and as such the WMD should be low.\nBefore we compute the WMD, we want to remove stopwords (\"the\", \"to\", etc.),\nas these do not contribute a lot to the information in the sentences.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Import and download stopwords from NLTK.\nfrom nltk.corpus import stopwords\nfrom nltk import download\ndownload('stopwords') # Download stopwords list.\nstop_words = stopwords.words('english')\n\ndef preprocess(sentence):\n return [w for w in sentence.lower().split() if w not in stop_words]\n\nsentence_obama = preprocess(sentence_obama)\nsentence_president = preprocess(sentence_president)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, as mentioned earlier, we will be using some downloaded pre-trained\nembeddings. We load these into a Gensim Word2Vec model class.\n\n.. Important::\n The embeddings we have chosen here require a lot of memory.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import gensim.downloader as api\nmodel = api.load('word2vec-google-news-300')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So let's compute WMD using the ``wmdistance`` method.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "distance = model.wmdistance(sentence_obama, sentence_president)\nprint('distance = %.4f' % distance)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try the same thing with two completely unrelated sentences. Notice that the distance is larger.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sentence_orange = preprocess('Oranges are my favorite fruit')\ndistance = model.wmdistance(sentence_obama, sentence_orange)\nprint('distance = %.4f' % distance)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "References\n----------\n\n1. Ofir Pele and Michael Werman, *A linear time histogram metric for improved SIFT matching*, 2008.\n2. Ofir Pele and Michael Werman, *Fast and robust earth mover's distances*, 2009.\n3. Matt Kusner et al. *From Embeddings To Document Distances*, 2015.\n4. Tom\u00e1\u0161 Mikolov et al. *Efficient Estimation of Word Representations in Vector Space*, 2013.\n\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 0 }