{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\nSoft Cosine Measure\n===================\n\nDemonstrates using Gensim's implementation of the SCM.\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Soft Cosine Measure (SCM) is a promising new tool in machine learning that\nallows us to submit a query and return the most relevant documents. This\ntutorial introduces SCM and shows how you can compute the SCM similarities\nbetween two documents using the ``inner_product`` method.\n\nSoft Cosine Measure basics\n--------------------------\n\nSoft Cosine Measure (SCM) is a method that allows us to assess the similarity\nbetween two documents in a meaningful way, even when they have no words in\ncommon. It uses a measure of similarity between words, which can be derived [2]\nusing word2vec [4] vector embeddings of words. It has been shown to\noutperform many of the state-of-the-art methods in the semantic text\nsimilarity task in the context of community question answering [2].\n\n\nSCM is illustrated below for two very similar sentences. The sentences have\nno words in common, but by modeling synonymy, SCM is able to accurately\nmeasure the similarity between the two sentences. The method also uses the\nbag-of-words vector representation of the documents (simply put, the word\nfrequencies in the documents). The intuition behind the method is that we\ncompute standard cosine similarity assuming that the document vectors are\nexpressed in a non-orthogonal basis, where the angle between two basis\nvectors is derived from the angle between the word2vec embeddings of the\ncorresponding words.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\nimport matplotlib.image as mpimg\nimg = mpimg.imread('scm-hello.png')\nimgplot = plt.imshow(img)\nplt.axis('off')\nplt.show()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "This method was perhaps first introduced in the article \u201cSoft Similarity and\nSoft Cosine Measure: Similarity of Features in Vector Space Model\u201d by Grigori\nSidorov, Alexander Gelbukh, Helena Gomez-Adorno, and David Pinto.\n\nIn this tutorial, we will learn how to use Gensim's SCM functionality, which\nconsists of the ``inner_product`` method for one-off computation, and the\n``SoftCosineSimilarity`` class for corpus-based similarity queries.\n" ] }
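, { "cell_type": "markdown", "metadata": {}, "source": [ "To make the intuition above concrete: given bag-of-words vectors ``x`` and ``y`` and a matrix ``S`` of word-to-word similarities, the soft cosine measure is ``x^T S y / (sqrt(x^T S x) * sqrt(y^T S y))``. The following minimal NumPy sketch computes this quantity for two toy vectors; the similarity matrix here is made up purely for illustration, whereas the rest of this tutorial lets Gensim derive it from word embeddings.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n\n# Toy bag-of-words vectors for two short documents over a three-word vocabulary.\nx = np.array([1.0, 1.0, 0.0])\ny = np.array([0.0, 1.0, 1.0])\n\n# Made-up word-to-word similarity matrix: ones on the diagonal, with\n# off-diagonal entries expressing how related two words are.\nS = np.array([\n    [1.0, 0.2, 0.1],\n    [0.2, 1.0, 0.3],\n    [0.1, 0.3, 1.0],\n])\n\ndef soft_cosine(x, y, S):\n    # x^T S y / (sqrt(x^T S x) * sqrt(y^T S y))\n    return (x @ S @ y) / (np.sqrt(x @ S @ x) * np.sqrt(y @ S @ y))\n\nprint('soft cosine = %.4f' % soft_cosine(x, y, S))" ] }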
, { "cell_type": "markdown", "metadata": {}, "source": [ ".. Important::\n   If you use Gensim's SCM functionality, please consider citing [1], [2] and [3].\n\nComputing the Soft Cosine Measure\n---------------------------------\nTo use SCM, you need some existing word embeddings.\nYou could train your own Word2Vec model, but that is beyond the scope of this tutorial\n(check out `sphx_glr_auto_examples_tutorials_run_word2vec.py` if you're interested).\nFor this tutorial, we'll be using an existing Word2Vec model.\n\nLet's take some sentences to compute the similarity between.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Initialize logging.\nimport logging\nlogging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)\n\nsentence_obama = 'Obama speaks to the media in Illinois'\nsentence_president = 'The president greets the press in Chicago'\nsentence_orange = 'Oranges are my favorite fruit'" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The first two sentences have very similar content, and as such the\nSCM should be high. By contrast, the third sentence is unrelated to the first\ntwo and the SCM should be low.\n\nBefore we compute the SCM, we want to remove stopwords (\"the\", \"to\", etc.),\nas these do not contribute a lot to the information in the sentences.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Import and download stopwords from NLTK.\nfrom nltk.corpus import stopwords\nfrom nltk import download\ndownload('stopwords')  # Download stopwords list.\nstop_words = stopwords.words('english')\n\ndef preprocess(sentence):\n    return [w for w in sentence.lower().split() if w not in stop_words]\n\nsentence_obama = preprocess(sentence_obama)\nsentence_president = preprocess(sentence_president)\nsentence_orange = preprocess(sentence_orange)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Next, we will build a dictionary and a TF-IDF model, and we will convert the\nsentences to the bag-of-words format.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from gensim.corpora import Dictionary\ndocuments = [sentence_obama, sentence_president, sentence_orange]\ndictionary = Dictionary(documents)\n\nsentence_obama = dictionary.doc2bow(sentence_obama)\nsentence_president = dictionary.doc2bow(sentence_president)\nsentence_orange = dictionary.doc2bow(sentence_orange)\n\nfrom gensim.models import TfidfModel\ndocuments = [sentence_obama, sentence_president, sentence_orange]\ntfidf = TfidfModel(documents)\n\nsentence_obama = tfidf[sentence_obama]\nsentence_president = tfidf[sentence_president]\nsentence_orange = tfidf[sentence_orange]" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now, as mentioned earlier, we will be using some downloaded pre-trained\nembeddings. We load these into a Gensim word embedding model and we build\na term similarity matrix using the embeddings.\n\n.. Important::\n   The embeddings we have chosen here require a lot of memory.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import gensim.downloader as api\nmodel = api.load('word2vec-google-news-300')\n\nfrom gensim.similarities import SparseTermSimilarityMatrix, WordEmbeddingSimilarityIndex\ntermsim_index = WordEmbeddingSimilarityIndex(model)\ntermsim_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary, tfidf)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's compute the SCM using the ``inner_product`` method.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "similarity = termsim_matrix.inner_product(sentence_obama, sentence_president, normalized=(True, True))\nprint('similarity = %.4f' % similarity)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's try the same thing with two completely unrelated sentences.\nNotice that the similarity is smaller.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "similarity = termsim_matrix.inner_product(sentence_obama, sentence_orange, normalized=(True, True))\nprint('similarity = %.4f' % similarity)" ] }
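, { "cell_type": "markdown", "metadata": {}, "source": [ "Earlier we mentioned the ``SoftCosineSimilarity`` class for corpus-based similarity queries. As a minimal sketch of how it fits together with the objects built above, the following cell indexes our three TF-IDF sentences and queries the index with the first one; the choice of corpus and query is purely illustrative.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from gensim.similarities import SoftCosineSimilarity\n\n# Index the three TF-IDF sentences from above (an illustrative toy corpus).\ncorpus = [sentence_obama, sentence_president, sentence_orange]\ndocsim_index = SoftCosineSimilarity(corpus, termsim_matrix)\n\n# Query the index with the first sentence; the result contains one soft\n# cosine similarity per indexed document.\nsims = docsim_index[sentence_obama]\nprint(sims)" ] }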
, { "cell_type": "markdown", "metadata": {}, "source": [ "References\n----------\n\n1. Grigori Sidorov et al. *Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model*, 2014.\n2. Delphine Charlet and Geraldine Damnati. *SimBow at SemEval-2017 Task 3: Soft-Cosine Semantic Similarity between Questions for Community Question Answering*, 2017.\n3. V\u00edt Novotn\u00fd. *Implementation Notes for the Soft Cosine Measure*, 2018.\n4. Tom\u00e1\u0161 Mikolov et al. *Efficient Estimation of Word Representations in Vector Space*, 2013.\n\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 0 }