{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# *2Vec File-based Training: API Tutorial\n", "\n", "This tutorial introduces a new file-based training mode for **`gensim.models.{Word2Vec, FastText, Doc2Vec}`** which leads to (much) faster training on machines with many cores. Below we demonstrate how to use this new mode, with Python examples." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## In this tutorial\n", "\n", "1. We will show how to use the new training mode on Word2Vec, FastText and Doc2Vec.\n", "2. Evaluate the performance of file-based training on the English Wikipedia and compare it to the existing queue-based training.\n", "3. Show that model quality (analogy accuracies on `question-words.txt`) are almost the same for both modes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Motivation\n", "\n", "The original implementation of Word2Vec training in Gensim is already super fast (covered in [this blog series](https://rare-technologies.com/word2vec-in-python-part-two-optimizing/), see also [benchmarks against other implementations in Tensorflow, DL4J, and C](https://rare-technologies.com/machine-learning-hardware-benchmarks/)) and flexible, allowing you to train on arbitrary Python streams. We had to jump through [some serious hoops](https://www.youtube.com/watch?v=vU4TlwZzTfU) to make it so, avoiding the Global Interpreter Lock (the dreaded GIL, the main bottleneck for any serious high performance computation in Python).\n", "\n", "The end result worked great for modest machines (< 8 cores), but for higher-end servers, the GIL reared its ugly head again. Simply managing the input stream iterators and worker queues, which has to be done in Python holding the GIL, was becoming the bottleneck. Simply put, the Python implementation didn't scale linearly with cores, as the original C implementation by Tomáš Mikolov did." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![scaling of word2vec file-based training](word2vec_file_scaling.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We decided to change that. After [much](https://github.com/RaRe-Technologies/gensim/pull/2127) [experimentation](https://github.com/RaRe-Technologies/gensim/pull/2048#issuecomment-401494412) and [benchmarking](https://persiyanov.github.io/jekyll/update/2018/05/28/gsoc-first-weeks.html), including some pretty [hardcore outlandish ideas](https://github.com/RaRe-Technologies/gensim/pull/2127#issuecomment-405937741), we figured there's no way around the GIL limitations—not at the level of fine-tuned performance needed here. Remember, we're talking >500k words (training instances) per second, using highly optimized C code. Way past the naive \"vectorize with NumPy arrays\" territory.\n", "\n", "So we decided to introduce a new code path, which has *less flexibility* in favour of *more performance*. We call this code path **`file-based training`**, and it's realized by passing a new `corpus_file` parameter to training. The existing `sentences` parameter (queue-based training) is still available, and you can continue using without any change: there's **full backward compatibility**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How it works\n", "\n", "\n", "\n", "| *code path* | *input parameter* | *advantages* | *disadvantages*\n", "| :-------- | :-------- | :--------- | :----------- |\n", "| queue-based training (existing) | `sentences` (Python iterable) | Input can be generated dynamically from any storage, or even on-the-fly. | Scaling plateaus after 8 cores. |\n", "| file-based training (new) | `corpus_file` (file on disk) | Scales linearly with CPU cores. | Training corpus must be serialized to disk in a specific format. 
|" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When you specify `corpus_file`, the model will read and process different portions of the file with different workers. The entire bulk of work is done outside of GIL, using no Python structures at all. The workers update the same weight matrix, but otherwise there's no communication, each worker munches on its data portion completely independently. This is the same approach the original C tool uses. \n", "\n", "Training with `corpus_file` yields a **significant performance boost**: for example, in the experiment belows training is 3.7x faster with 32 workers in comparison to training with `sentences` argument. It even outperforms the original Word2Vec C tool in terms of words/sec processing speed on high-core machines.\n", "\n", "The limitation of this approach is that `corpus_file` argument accepts a path to your corpus file, which must be stored on disk in a specific format. The format is simply the well-known [gensim.models.word2vec.LineSentence](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.LineSentence): one sentence per line, with words separated by spaces." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How to use it\n", "\n", "You only need to:\n", "\n", "1. Save your corpus in the LineSentence format to disk (you may use [gensim.utils.save_as_line_sentence(your_corpus, your_corpus_file)](https://radimrehurek.com/gensim/utils.html#gensim.utils.save_as_line_sentence) for convenience).\n", "2. 
Change the `sentences=your_corpus` argument to `corpus_file=your_corpus_file` in your `Word2Vec.__init__`, `Word2Vec.build_vocab` and `Word2Vec.train` calls.\n", "\n", "A short Word2Vec example:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1\n" ] } ], "source": [ "import gensim\n", "import gensim.downloader as api\n", "from gensim.utils import save_as_line_sentence\n", "from gensim.models.word2vec import Word2Vec\n", "\n", "print(gensim.models.word2vec.CORPUSFILE_VERSION)  # must be >= 0, i.e. the optimized compiled version is available\n", "\n", "corpus = api.load(\"text8\")\n", "save_as_line_sentence(corpus, \"my_corpus.txt\")\n", "\n", "model = Word2Vec(corpus_file=\"my_corpus.txt\", iter=5, size=300, workers=14)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Let's prepare the full Wikipedia dataset as training corpus\n", "\n", "We load the Wikipedia dump from `gensim-data`, preprocess the text with Gensim functions, and finally save the processed corpus in LineSentence format."
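, "\n", "The LineSentence format itself is just plain text: one sentence per line, with tokens separated by single spaces. A minimal plain-Python sketch of what such a file looks like (the file name `demo_corpus.txt` is hypothetical, used only for this illustration):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# each inner list is one tokenized sentence; on disk, one sentence per line\n", "demo_corpus = [['first', 'sentence'], ['second', 'short', 'sentence']]\n", "with open('demo_corpus.txt', 'w') as fout:\n", "    for sentence in demo_corpus:\n", "        fout.write(' '.join(sentence) + '\\n')\n", "\n", "print(open('demo_corpus.txt').read())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the same layout that `save_as_line_sentence` writes, so a file produced either way can be passed directly as `corpus_file`."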
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "CORPUS_FILE = 'wiki-en-20171001.txt'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import itertools\n", "from gensim.parsing.preprocessing import preprocess_string\n", "\n", "def processed_corpus():\n", " raw_corpus = api.load('wiki-english-20171001')\n", " for article in raw_corpus:\n", " # concatenate all section titles and texts of each Wikipedia article into a single \"sentence\"\n", " doc = '\\n'.join(itertools.chain.from_iterable(zip(article['section_titles'], article['section_texts'])))\n", " yield preprocess_string(doc)\n", "\n", "# serialize the preprocessed corpus into a single file on disk, using memory-efficient streaming\n", "save_as_line_sentence(processed_corpus(), CORPUS_FILE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Word2Vec\n", "\n", "We train two models:\n", "* With `sentences` argument\n", "* With `corpus_file` argument\n", "\n", "\n", "Then, we compare the timings and accuracy on `question-words.txt`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from gensim.models.word2vec import LineSentence\n", "import time\n", "\n", "start_time = time.time()\n", "model_sent = Word2Vec(sentences=LineSentence(CORPUS_FILE), iter=5, size=300, workers=32)\n", "sent_time = time.time() - start_time\n", "\n", "start_time = time.time()\n", "model_corp_file = Word2Vec(corpus_file=CORPUS_FILE, iter=5, size=300, workers=32)\n", "file_time = time.time() - start_time" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training model with `sentences` took 9494.237 seconds\n", "Training model with `corpus_file` took 2566.170 seconds\n" ] } ], "source": [ "print(\"Training model with `sentences` took {:.3f} seconds\".format(sent_time))\n", "print(\"Training model with `corpus_file` took {:.3f} seconds\".format(file_time))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Training with `corpus_file` took 3.7x less time!**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's compare the accuracies:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "from gensim.test.utils import datapath" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/persiyanov/gensim/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. 
In future, it will be treated as `np.int64 == np.dtype(int).type`.\n", " if np.issubdtype(vec.dtype, np.int):\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Word analogy accuracy with `sentences`: 75.4%\n", "Word analogy accuracy with `corpus_file`: 74.8%\n" ] } ], "source": [ "model_sent_accuracy = model_sent.wv.evaluate_word_analogies(datapath('questions-words.txt'))[0]\n", "print(\"Word analogy accuracy with `sentences`: {:.1f}%\".format(100.0 * model_sent_accuracy))\n", "\n", "model_corp_file_accuracy = model_corp_file.wv.evaluate_word_analogies(datapath('questions-words.txt'))[0]\n", "print(\"Word analogy accuracy with `corpus_file`: {:.1f}%\".format(100.0 * model_corp_file_accuracy))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The accuracies are approximately the same." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## FastText" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Short example:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "import gensim.downloader as api\n", "from gensim.utils import save_as_line_sentence\n", "from gensim.models.fasttext import FastText\n", "\n", "corpus = api.load(\"text8\")\n", "save_as_line_sentence(corpus, \"my_corpus.txt\")\n", "\n", "model = FastText(corpus_file=\"my_corpus.txt\", iter=5, size=300, workers=14)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Let's compare the timings" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from gensim.models.word2vec import LineSentence\n", "from gensim.models.fasttext import FastText\n", "import time\n", "\n", "start_time = time.time()\n", "model_corp_file = FastText(corpus_file=CORPUS_FILE, iter=5, size=300, workers=32)\n", "file_time = time.time() - start_time\n", "\n", "start_time = time.time()\n", "model_sent = FastText(sentences=LineSentence(CORPUS_FILE), iter=5, size=300, workers=32)\n", "sent_time = 
time.time() - start_time" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training model with `sentences` took 17963.283 seconds\n", "Training model with `corpus_file` took 10725.931 seconds\n" ] } ], "source": [ "print(\"Training model with `sentences` took {:.3f} seconds\".format(sent_time))\n", "print(\"Training model with `corpus_file` took {:.3f} seconds\".format(file_time))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**We see a 1.67x performance boost!**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Now, accuracies:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/persiyanov/gensim/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.\n", " if np.issubdtype(vec.dtype, np.int):\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Word analogy accuracy with `sentences`: 64.2%\n", "Word analogy accuracy with `corpus_file`: 66.2%\n" ] } ], "source": [ "from gensim.test.utils import datapath\n", "\n", "model_sent_accuracy = model_sent.wv.evaluate_word_analogies(datapath('questions-words.txt'))[0]\n", "print(\"Word analogy accuracy with `sentences`: {:.1f}%\".format(100.0 * model_sent_accuracy))\n", "\n", "model_corp_file_accuracy = model_corp_file.wv.evaluate_word_analogies(datapath('questions-words.txt'))[0]\n", "print(\"Word analogy accuracy with `corpus_file`: {:.1f}%\".format(100.0 * model_corp_file_accuracy))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Doc2Vec" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Short example:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "import gensim.downloader as api\n", 
"from gensim.utils import save_as_line_sentence\n", "from gensim.models.doc2vec import Doc2Vec\n", "\n", "corpus = api.load(\"text8\")\n", "save_as_line_sentence(corpus, \"my_corpus.txt\")\n", "\n", "model = Doc2Vec(corpus_file=\"my_corpus.txt\", epochs=5, vector_size=300, workers=14)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Let's compare the timings" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from gensim.models.doc2vec import Doc2Vec, TaggedLineDocument\n", "import time\n", "\n", "start_time = time.time()\n", "model_corp_file = Doc2Vec(corpus_file=CORPUS_FILE, epochs=5, vector_size=300, workers=32)\n", "file_time = time.time() - start_time\n", "\n", "start_time = time.time()\n", "model_sent = Doc2Vec(documents=TaggedLineDocument(CORPUS_FILE), epochs=5, vector_size=300, workers=32)\n", "sent_time = time.time() - start_time" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training model with `sentences` took 20427.949 seconds\n", "Training model with `corpus_file` took 3085.256 seconds\n" ] } ], "source": [ "print(\"Training model with `sentences` took {:.3f} seconds\".format(sent_time))\n", "print(\"Training model with `corpus_file` took {:.3f} seconds\".format(file_time))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**A 6.6x speedup!**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Accuracies:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/persiyanov/gensim/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. 
In future, it will be treated as `np.int64 == np.dtype(int).type`.\n", " if np.issubdtype(vec.dtype, np.int):\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Word analogy accuracy with `documents`: 71.7%\n", "Word analogy accuracy with `corpus_file`: 67.8%\n" ] } ], "source": [ "from gensim.test.utils import datapath\n", "\n", "model_sent_accuracy = model_sent.wv.evaluate_word_analogies(datapath('questions-words.txt'))[0]\n", "print(\"Word analogy accuracy with `documents`: {:.1f}%\".format(100.0 * model_sent_accuracy))\n", "\n", "model_corp_file_accuracy = model_corp_file.wv.evaluate_word_analogies(datapath('questions-words.txt'))[0]\n", "print(\"Word analogy accuracy with `corpus_file`: {:.1f}%\".format(100.0 * model_corp_file_accuracy))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TL;DR: Conclusion\n", "\n", "If your training corpus already lives on disk, you lose nothing by switching to the new `corpus_file` training mode: training will be much faster.\n", "\n", "If your corpus is generated dynamically, you can either serialize it to disk first with `gensim.utils.save_as_line_sentence` (and then use the fast `corpus_file` mode), or, if that's not possible, continue using the existing `sentences` training mode.\n", "\n", "------\n", "\n", "This new code branch was created by [@persiyanov](https://github.com/persiyanov) as a Google Summer of Code 2018 project in the [RARE Student Incubator](https://rare-technologies.com/incubator/).\n", "\n", "Questions, comments? Use our Gensim [mailing list](https://groups.google.com/g/gensim) and [Twitter](https://twitter.com/gensim_py). Happy training!" 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }