{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\nFast Similarity Queries with Annoy and Word2Vec\n===============================================\n\nIntroduces the Annoy library for similarity queries on top of vectors learned by Word2Vec.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "LOGS = False # Set to True if you want to see progress in logs.\nif LOGS:\n import logging\n logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `Annoy \"Approximate Nearest Neighbors Oh Yeah\"\n`_ library enables similarity queries with\na Word2Vec model. The current implementation for finding k nearest neighbors\nin a vector space in Gensim has linear complexity via brute force in the\nnumber of indexed documents, although with extremely low constant factors.\nThe retrieved results are exact, which is an overkill in many applications:\napproximate results retrieved in sub-linear time may be enough. Annoy can\nfind approximate nearest neighbors much faster.\n\nOutline\n-------\n\n1. Download Text8 Corpus\n2. Train the Word2Vec model\n3. Construct AnnoyIndex with model & make a similarity query\n4. Compare to the traditional indexer\n5. Persist indices to disk\n6. Save memory by via memory-mapping indices saved to disk\n7. Evaluate relationship of ``num_trees`` to initialization time and accuracy\n8. Work with Google's word2vec C formats\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Download Text8 corpus\n------------------------\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import gensim.downloader as api\ntext8_path = api.load('text8', return_path=True)\nprint(\"Using corpus from\", text8_path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. Train the Word2Vec model\n---------------------------\n\nFor more details, see `sphx_glr_auto_examples_tutorials_run_word2vec.py`.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from gensim.models import Word2Vec, KeyedVectors\nfrom gensim.models.word2vec import Text8Corpus\n\n# Using params from Word2Vec_FastText_Comparison\nparams = {\n 'alpha': 0.05,\n 'vector_size': 100,\n 'window': 5,\n 'epochs': 5,\n 'min_count': 5,\n 'sample': 1e-4,\n 'sg': 1,\n 'hs': 0,\n 'negative': 5,\n}\nmodel = Word2Vec(Text8Corpus(text8_path), **params)\nwv = model.wv\nprint(\"Using trained model\", wv)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. Construct AnnoyIndex with model & make a similarity query\n------------------------------------------------------------\n\nAn instance of ``AnnoyIndexer`` needs to be created in order to use Annoy in Gensim.\nThe ``AnnoyIndexer`` class is located in ``gensim.similarities.annoy``.\n\n``AnnoyIndexer()`` takes two parameters:\n\n* **model**: A ``Word2Vec`` or ``Doc2Vec`` model.\n* **num_trees**: A positive integer. ``num_trees`` effects the build\n time and the index size. **A larger value will give more accurate results,\n but larger indexes**. More information on what trees in Annoy do can be found\n `here `__. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "3. Construct AnnoyIndex with model & make a similarity query\n------------------------------------------------------------\n\nAn instance of ``AnnoyIndexer`` needs to be created in order to use Annoy in Gensim.\nThe ``AnnoyIndexer`` class is located in ``gensim.similarities.annoy``.\n\n``AnnoyIndexer()`` takes two parameters:\n\n* **model**: A ``Word2Vec`` or ``Doc2Vec`` model.\n* **num_trees**: A positive integer. ``num_trees`` affects the build\n  time and the index size. **A larger value gives more accurate results,\n  but a larger index**. More information on what trees in Annoy do can be found\n  in the `Annoy README <https://github.com/spotify/annoy>`__. The relationship\n  between ``num_trees``, build time, and accuracy will be investigated later\n  in the tutorial.\n\nNow that we are ready to make a query, let's find the words most similar to\n\"science\" in the Text8 corpus. To make a similarity query we call\n``Word2Vec.most_similar`` like we would traditionally, but with an added\nparameter, ``indexer``. Because we query with the raw vector rather than the\nword itself, the query word comes back as its own nearest neighbor, so we ask\nfor ``topn=11`` to get 10 genuine neighbors.\n\nApart from Annoy, Gensim also supports the NMSLIB indexer. NMSLIB is a similar library to\nAnnoy \u2013 both support fast, approximate searches for similar vectors.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from gensim.similarities.annoy import AnnoyIndexer\n\n# 100 trees are being used in this example\nannoy_index = AnnoyIndexer(model, 100)\n# Derive the vector for the word \"science\" in our model\nvector = wv[\"science\"]\n# The instance of AnnoyIndexer we just created is passed\napproximate_neighbors = wv.most_similar([vector], topn=11, indexer=annoy_index)\n# Neatly print the approximate_neighbors and their corresponding cosine similarity values\nprint(\"Approximate Neighbors\")\nfor neighbor in approximate_neighbors:\n    print(neighbor)\n\nnormal_neighbors = wv.most_similar([vector], topn=11)\nprint(\"\\nExact Neighbors\")\nfor neighbor in normal_neighbors:\n    print(neighbor)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The closer the cosine similarity of a vector is to 1, the more similar that\nword is to our query, which was the vector for \"science\". Note that the\napproximate and exact results differ slightly, both in the ranking and in\nthe set of words included among the 10 most similar.\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. Compare to the traditional indexer\n-------------------------------------\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Set up the model and vector that we are using in the comparison\nannoy_index = AnnoyIndexer(model, 100)\n\n# Dry run to make sure both indexes are fully in RAM\nnormed_vectors = wv.get_normed_vectors()\nvector = normed_vectors[0]\nwv.most_similar([vector], topn=5, indexer=annoy_index)\nwv.most_similar([vector], topn=5)\n\nimport time\nimport numpy as np\n\ndef avg_query_time(annoy_index=None, queries=1000):\n    \"\"\"Average query time of a most_similar call over ``queries`` random queries.\"\"\"\n    total_time = 0\n    for _ in range(queries):\n        rand_vec = normed_vectors[np.random.randint(0, len(wv))]\n        start_time = time.process_time()\n        wv.most_similar([rand_vec], topn=5, indexer=annoy_index)\n        total_time += time.process_time() - start_time\n    return total_time / queries\n\nqueries = 1000\n\ngensim_time = avg_query_time(queries=queries)\nannoy_time = avg_query_time(annoy_index, queries=queries)\nprint(\"Gensim (s/query):\\t{0:.5f}\".format(gensim_time))\nprint(\"Annoy (s/query):\\t{0:.5f}\".format(annoy_time))\nspeed_improvement = gensim_time / annoy_time\nprint(\"\\nAnnoy is {0:.2f} times faster on average on this particular run\".format(speed_improvement))" ] },
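{ "cell_type": "markdown", "metadata": {}, "source": [ "Annoy only pays off once enough queries have amortized its one-off build\ncost. The cell below is a rough back-of-the-envelope sketch of that\nbreak-even point, reusing ``gensim_time`` and ``annoy_time`` from above; the\nnumbers will vary from run to run.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Rough break-even estimate (illustrative): time the one-off index build,\n# then ask how many queries it takes for the per-query savings to repay it.\nstart_time = time.process_time()\nAnnoyIndexer(model, 100)\nannoy_init_time = time.process_time() - start_time\nprint(\"Annoy index build time (s): {0:.2f}\".format(annoy_init_time))\nif gensim_time > annoy_time:\n    breakeven_queries = annoy_init_time / (gensim_time - annoy_time)\n    print(\"Break-even after roughly {0:.0f} queries\".format(breakeven_queries))" ] },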
{ "cell_type": "markdown", "metadata": {}, "source": [ "**This speedup factor is by no means constant** and will vary greatly from\nrun to run: it depends on this particular data set, the BLAS setup, the Annoy\nparameters (as tree size increases, the speedup factor decreases), machine\nspecifications, and other factors.\n\n.. Important::\n   Initialization time for the annoy indexer was not included in the times.\n   The optimal k-NN algorithm for you to use will depend on how many queries\n   you need to make and the size of the corpus. If you are making very few\n   similarity queries, the time taken to initialize the annoy indexer will be\n   longer than the time it would take the brute force method to retrieve\n   results. If you are making many queries however, the time it takes to\n   initialize the annoy indexer will be made up for by the incredibly fast\n   retrieval times for queries once the indexer has been initialized.\n\n.. Important::\n   Gensim's ``most_similar`` method uses numpy operations in the form of a\n   dot product, whereas Annoy's method does not. If numpy on your machine is\n   using one of the BLAS libraries like ATLAS or LAPACK, it'll run on\n   multiple cores (only if your machine has multicore support). Check the\n   SciPy Cookbook for more details.\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "5. Persist indices to disk\n--------------------------\n\nYou can save your indexes to disk and load them back to avoid having to\nconstruct them each time. Saving creates two files on disk, *fname* and\n*fname.d*. Both files are needed to correctly restore all attributes. Before\nloading an index, you will have to create an empty AnnoyIndexer object.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fname = '/tmp/mymodel.index'\n\n# Persist index to disk\nannoy_index.save(fname)\n\n# Load index back\nimport os.path\nif os.path.exists(fname):\n    annoy_index2 = AnnoyIndexer()\n    annoy_index2.load(fname)\n    annoy_index2.model = model\n\n# Results should be identical to above\nvector = wv[\"science\"]\napproximate_neighbors2 = wv.most_similar([vector], topn=11, indexer=annoy_index2)\nfor neighbor in approximate_neighbors2:\n    print(neighbor)\n\nassert approximate_neighbors == approximate_neighbors2" ] },
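{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick check, we can confirm that both files mentioned above were\nactually written to disk (a minimal sketch using the ``fname`` defined in the\nprevious cell):\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import os\n\n# Both fname and fname.d must exist for the index to be restored correctly.\nfor path in (fname, fname + '.d'):\n    print(path, os.path.getsize(path), 'bytes')" ] },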
{ "cell_type": "markdown", "metadata": {}, "source": [ "Be sure to use the same model at load time that was used originally;\notherwise you will get unexpected behavior.\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "6. Save memory via memory-mapping indexes saved to disk\n-------------------------------------------------------\n\nThe Annoy library has a useful feature: indices can be memory-mapped from\ndisk. This saves memory when the same index is used by several processes.\n\nBelow are two snippets of code. The first one creates a separate index in\neach process. The second one shares the index between two processes via\nmemory-mapping, so it uses less total RAM.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Remove verbosity from code below (if logging active)\nif LOGS:\n    logging.disable(logging.CRITICAL)\n\nfrom multiprocessing import Process\nimport os\nimport psutil" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Bad example: two processes load the Word2vec model from disk and create their\nown Annoy index from that model.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "model.save('/tmp/mymodel.pkl')\n\ndef f(process_id):\n    print('Process Id: {}'.format(os.getpid()))\n    process = psutil.Process(os.getpid())\n    new_model = Word2Vec.load('/tmp/mymodel.pkl')\n    vector = new_model.wv[\"science\"]\n    annoy_index = AnnoyIndexer(new_model, 100)\n    approximate_neighbors = new_model.wv.most_similar([vector], topn=5, indexer=annoy_index)\n    print('\\nMemory used by process {}: {}\\n---'.format(os.getpid(), process.memory_info()))\n\n# Create and run two parallel processes; each builds its own index.\np1 = Process(target=f, args=('1',))\np1.start()\np1.join()\np2 = Process(target=f, args=('2',))\np2.start()\np2.join()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Good example: two processes load both the Word2vec model and index from disk\nand memory-map the index.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "model.save('/tmp/mymodel.pkl')\n\ndef f(process_id):\n    print('Process Id: {}'.format(os.getpid()))\n    process = psutil.Process(os.getpid())\n    new_model = Word2Vec.load('/tmp/mymodel.pkl')\n    vector = new_model.wv[\"science\"]\n    annoy_index = AnnoyIndexer()\n    annoy_index.load('/tmp/mymodel.index')\n    annoy_index.model = new_model\n    approximate_neighbors = new_model.wv.most_similar([vector], topn=5, indexer=annoy_index)\n    print('\\nMemory used by process {}: {}\\n---'.format(os.getpid(), process.memory_info()))\n\n# Create and run two parallel processes to share the same index file.\np1 = Process(target=f, args=('1',))\np1.start()\np1.join()\np2 = Process(target=f, args=('2',))\np2.start()\np2.join()" ] },
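{ "cell_type": "markdown", "metadata": {}, "source": [ "The same effect can be observed in a single process. The sketch below (an\nillustration, reusing the index saved earlier) measures how little resident\nmemory grows when an index is loaded via memory-mapping:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Measure resident memory before and after loading the memory-mapped index.\n# Memory-mapped pages are only paged in as queries actually touch them.\nprocess = psutil.Process(os.getpid())\nrss_before = process.memory_info().rss\nmmap_index = AnnoyIndexer()\nmmap_index.load('/tmp/mymodel.index')\nmmap_index.model = model\nprint('RSS growth after load: {} bytes'.format(process.memory_info().rss - rss_before))" ] },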
{ "cell_type": "markdown", "metadata": {}, "source": [ "7. Evaluate relationship of ``num_trees`` to initialization time and accuracy\n------------------------------------------------------------------------------\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Build dataset of initialization times and accuracy measures:\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Accuracy here is the overlap between the approximate top-100 and the exact\n# top-100 neighbors; with 100 entries per list, the overlap is also a percentage.\nexact_results = [element[0] for element in wv.most_similar([normed_vectors[0]], topn=100)]\n\nx_values = []\ny_values_init = []\ny_values_accuracy = []\n\nfor x in range(1, 300, 10):\n    x_values.append(x)\n    start_time = time.time()\n    annoy_index = AnnoyIndexer(model, x)\n    y_values_init.append(time.time() - start_time)\n    approximate_results = wv.most_similar([normed_vectors[0]], topn=100, indexer=annoy_index)\n    top_words = [result[0] for result in approximate_results]\n    y_values_accuracy.append(len(set(top_words).intersection(exact_results)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot results:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plt.figure(1, figsize=(12, 6))\nplt.subplot(121)\nplt.plot(x_values, y_values_init)\nplt.title(\"num_trees vs initialization time\")\nplt.ylabel(\"Initialization time (s)\")\nplt.xlabel(\"num_trees\")\nplt.subplot(122)\nplt.plot(x_values, y_values_accuracy)\nplt.title(\"num_trees vs accuracy\")\nplt.ylabel(\"% accuracy\")\nplt.xlabel(\"num_trees\")\nplt.tight_layout()\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the above, we can see that the initialization time of the annoy indexer\nincreases in a linear fashion with num_trees. Initialization time will vary\nfrom corpus to corpus; in the graph above we used the Text8 corpus.\n\nFurthermore, in this dataset, the accuracy seems logarithmically related to\nthe number of trees. We see an improvement in accuracy with more trees, but\nthe relationship is nonlinear.\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "8. Work with Google's word2vec files\n------------------------------------\n\nOur model can be exported to a word2vec C format. There is a binary and a\nplain text word2vec format. Both can be read with a variety of other\nsoftware, or imported back into Gensim as a ``KeyedVectors`` object.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# To export our model as text\nwv.save_word2vec_format('/tmp/vectors.txt', binary=False)\n\nfrom smart_open import open\n# View the first 3 lines of the exported file\n# The first line has the total number of entries and the vector dimension count.\n# The next lines have a key (a string) followed by its vector.\nwith open('/tmp/vectors.txt', encoding='utf8') as myfile:\n    for i in range(3):\n        print(myfile.readline().strip())\n\n# To import a word2vec text model\nwv = KeyedVectors.load_word2vec_format('/tmp/vectors.txt', binary=False)\n\n# To export a model as binary\nwv.save_word2vec_format('/tmp/vectors.bin', binary=True)\n\n# To import a word2vec binary model\nwv = KeyedVectors.load_word2vec_format('/tmp/vectors.bin', binary=True)\n\n# To create and save Annoy Index from a loaded `KeyedVectors` object (with 100 trees)\nannoy_index = AnnoyIndexer(wv, 100)\nannoy_index.save('/tmp/mymodel.index')\n\n# Load and test the saved word vectors and saved Annoy index\nwv = KeyedVectors.load_word2vec_format('/tmp/vectors.bin', binary=True)\nannoy_index = AnnoyIndexer()\nannoy_index.load('/tmp/mymodel.index')\nannoy_index.model = wv\n\nvector = wv[\"cat\"]\napproximate_neighbors = wv.most_similar([vector], topn=11, indexer=annoy_index)\n# Neatly print the approximate_neighbors and their corresponding cosine similarity values\nprint(\"Approximate Neighbors\")\nfor neighbor in approximate_neighbors:\n    print(neighbor)\n\nnormal_neighbors = wv.most_similar([vector], topn=11)\nprint(\"\\nExact Neighbors\")\nfor neighbor in normal_neighbors:\n    print(neighbor)" ] },
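{ "cell_type": "markdown", "metadata": {}, "source": [ "To put a number on how close the approximate results are to the exact ones,\nwe can measure the overlap between the two neighbor sets returned above (a\nsmall sketch reusing the variables from the previous cell):\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Overlap between the approximate and exact neighbor sets for \"cat\"\napprox_words = {word for word, similarity in approximate_neighbors}\nexact_words = {word for word, similarity in normal_neighbors}\nprint('Overlap: {} of {} words'.format(len(approx_words & exact_words), len(exact_words)))" ] },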
{ "cell_type": "markdown", "metadata": {}, "source": [ "Recap\n-----\n\nIn this notebook we used the Annoy module to build an indexed approximation\nof our word embeddings. To do so, we did the following steps:\n\n1. Download Text8 Corpus\n2. Train Word2Vec Model\n3. Construct AnnoyIndex with model & make a similarity query\n4. Compare to the traditional indexer\n5. Persist indices to disk\n6. Save memory via memory-mapping indices saved to disk\n7. Evaluate relationship of ``num_trees`` to initialization time and accuracy\n8. Work with Google's word2vec C formats\n\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 0 }