{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Comparison of FastText and Word2Vec " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Facebook Research open-sourced a great project yesterday - [fastText](https://github.com/facebookresearch/fastText), a fast (no surprise) and effective method to learn word representations and perform text classification. I was curious about comparing these embeddings to other commonly used embeddings, so word2vec seemed like the obvious choice, especially considering fastText embeddings are based upon word2vec." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Download data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import nltk\n", "nltk.download('brown')\n", "# Only the brown corpus is needed; the line above fetches it in case you don't have it.\n", "# alternatively, you can simply download the pretrained models below if you wish to avoid downloading and training\n", "\n", "# Generate brown corpus text file\n", "with open('brown_corp.txt', 'w+') as f:\n", "    for word in nltk.corpus.brown.words():\n", "        f.write('{word} '.format(word=word))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# download the text8 corpus (a 100 MB sample of cleaned Wikipedia text)\n", "# alternatively, you can simply download the pretrained models below if you wish to avoid downloading and training\n", "!wget http://mattmahoney.net/dc/text8.zip\n", "# unzip it to get the plain-text file 'text8' used for training below\n", "!unzip text8.zip" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# download the file questions-words.txt to be used for comparing word embeddings\n", "!wget https://raw.githubusercontent.com/arfon/word2vec/master/questions-words.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Train models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you wish to avoid training, you can download pre-trained models instead in the next section.\n", "For training the fastText models yourself, you'll have to follow the setup instructions for [fastText](https://github.com/facebookresearch/fastText) and run the training with -" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "!./fasttext skipgram -input brown_corp.txt -output brown_ft\n", "!./fasttext skipgram -input text8 -output text8_ft" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For training the gensim models -" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from nltk.corpus import brown\n", "from gensim.models import Word2Vec\n", "from gensim.models.word2vec import Text8Corpus\n", "import logging\n", "import os\n", "\n", "logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')\n", "logging.root.setLevel(level=logging.INFO)\n", "\n", "MODELS_DIR = 'models/'\n", "# make sure the output directory exists before saving the vectors\n", "if not os.path.exists(MODELS_DIR):\n", "    os.makedirs(MODELS_DIR)\n", "\n", "brown_gs = Word2Vec(brown.sents())\n", "brown_gs.save_word2vec_format(MODELS_DIR + 'brown_gs.vec')\n", "\n", "text8_gs = Word2Vec(Text8Corpus('text8'))\n", "text8_gs.save_word2vec_format(MODELS_DIR + 'text8_gs.vec')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Download models\n", "In case you wish to avoid downloading the corpus and training the models, you can download pretrained models with - " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# download the 
fastText and gensim models trained on the brown corpus and text8 corpus\n", "!wget https://www.dropbox.com/s/4kray3epy439gca/models.tar.gz?dl=1 -O models.tar.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once you have downloaded or trained the models (make sure they're in the `models/` directory, or that you've appropriately changed `MODELS_DIR`) and downloaded `questions-words.txt`, you're ready to run the comparison." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Comparisons" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Loading FastText embeddings\n", "Accuracy for FastText:\n", "Evaluating...\n", "\n", "0/1, 0.00%, Section: capital-common-countries\n", "0/1, 0.00%, Section: capital-world\n", "0/1, 0.00%, Section: currency\n", "0/1, 0.00%, Section: city-in-state\n", "27/182, 14.84%, Section: family\n", "539/702, 76.78%, Section: gram1-adjective-to-adverb\n", "106/132, 80.30%, Section: gram2-opposite\n", "656/1056, 62.12%, Section: gram3-comparative\n", "136/210, 64.76%, Section: gram4-superlative\n", "439/650, 67.54%, Section: gram5-present-participle\n", "0/1, 0.00%, Section: gram6-nationality-adjective\n", "165/1260, 13.10%, Section: gram7-past-tense\n", "327/552, 59.24%, Section: gram8-plural\n", "245/342, 71.64%, Section: gram9-plural-verbs\n", "2640/5086, 51.91%, Section: total\n", "\n", "Semantic: 27/182, Accuracy: 14.84%\n", "Syntactic: 2613/4904, Accuracy: 53.28%\n", "\n", "\n", "Loading Gensim embeddings\n", "Accuracy for word2vec:\n", "Evaluating...\n", "\n", "0/1, 0.00%, Section: capital-common-countries\n", "0/1, 0.00%, Section: capital-world\n", "0/1, 0.00%, Section: currency\n", "0/1, 0.00%, Section: city-in-state\n", "53/182, 29.12%, Section: family\n", "8/702, 1.14%, Section: gram1-adjective-to-adverb\n", "0/132, 0.00%, Section: gram2-opposite\n", "75/1056, 7.10%, Section: gram3-comparative\n", "0/210, 0.00%, Section: gram4-superlative\n", "16/650, 2.46%, Section: gram5-present-participle\n", "0/1, 0.00%, Section: gram6-nationality-adjective\n", "30/1260, 2.38%, Section: gram7-past-tense\n", "4/552, 0.72%, Section: gram8-plural\n", "8/342, 2.34%, Section: gram9-plural-verbs\n", "194/5086, 3.81%, Section: total\n", "\n", "Semantic: 53/182, Accuracy: 29.12%\n", "Syntactic: 141/4904, Accuracy: 2.88%\n", "\n" ] } ], "source": [ "from gensim.models import Word2Vec\n", "\n", "def print_accuracy(model, questions_file):\n", " print('Evaluating...\\n')\n", " acc = model.accuracy(questions_file)\n", " for section in acc:\n", " correct = len(section['correct'])\n", " total = len(section['correct']) + len(section['incorrect'])\n", " total = total if total else 1\n", " accuracy = 100*float(correct)/total\n", " print('{:d}/{:d}, {:.2f}%, Section: {:s}'.format(correct, total, accuracy, section['section']))\n", " sem_correct = sum((len(acc[i]['correct']) for i in range(5)))\n", " sem_total = sum((len(acc[i]['correct']) + len(acc[i]['incorrect'])) for i in range(5))\n", " print('\\nSemantic: {:d}/{:d}, Accuracy: {:.2f}%'.format(sem_correct, sem_total, 100*float(sem_correct)/sem_total))\n", " \n", " syn_correct = sum((len(acc[i]['correct']) for i in range(5, len(acc)-1)))\n", " syn_total = sum((len(acc[i]['correct']) + len(acc[i]['incorrect'])) for i in range(5,len(acc)-1))\n", " print('Syntactic: {:d}/{:d}, Accuracy: {:.2f}%\\n'.format(syn_correct, syn_total, 100*float(syn_correct)/syn_total))\n", "\n", "MODELS_DIR = 'models/'\n", "\n", 
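"\n", "# a note on the evaluation: 'questions-words.txt' contains analogy questions of the form 'a b c d'\n", "# (e.g. 'amazing amazingly calm calmly'); model.accuracy() marks a question correct when the vector\n", "# closest to vec(b) - vec(a) + vec(c) is that of word d. The first 5 sections of the file are semantic\n", "# analogies, the remaining gram* sections are syntactic (the final entry returned is the overall total).\n",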
"word_analogies_file = 'questions-words.txt'\n", "print('\\nLoading FastText embeddings')\n", "ft_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'brown_ft.vec')\n", "print('Accuracy for FastText:')\n", "print_accuracy(ft_model, word_analogies_file)\n", "\n", "print('\\nLoading Gensim embeddings')\n", "gs_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'brown_gs.vec')\n", "print('Accuracy for word2vec:')\n", "print_accuracy(gs_model, word_analogies_file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Word2vec embeddings seem to be slightly better than fastText embeddings at the semantic tasks, while the fastText embeddings do significantly better on the syntactic analogies. Makes sense, since fastText embeddings are trained for understanding morphological nuances, and most of the syntactic analogies are morphology based. \n", "\n", "Let me explain that better.\n", "\n", "According to the paper [[1]](https://arxiv.org/abs/1607.04606), embeddings for words are represented by the sum of their n-gram embeddings. This is meant to be useful for morphologically rich languages - so theoretically, the embedding for `apparently` would include information from both character n-grams `apparent` and `ly` (as well as other n-grams), and the n-grams would combine in a simple, linear manner. This is very similar to what most of our syntactic tasks look like.\n", "\n", "Example analogy:\n", "\n", "`amazing amazingly calm calmly`\n", "\n", "This analogy is marked correct if: \n", "\n", "`embedding(amazing)` - `embedding(amazingly)` = `embedding(calm)` - `embedding(calmly)`\n", "\n", "Both these subtractions would result in a very similar set of remaining ngrams.\n", "No surprise the fastText embeddings do extremely well on this.\n", "\n", "A brief note on hyperparameters - the Gensim word2vec implementation and the fastText word embedding implementation use largely the same defaults (dim_size = 100, window_size = 5, num_epochs = 5). Of course, they are two completely different models (albeit, with a few similarities).\n", "\n", "\n", "Let's try with a larger corpus now - text8 (collection of wiki articles). I'm especially curious about the impact on semantic accuracy - for models trained on the brown corpus, the difference in the semantic accuracy and the accuracy values themselves are too small to be conclusive. 
"\n", "A brief note on hyperparameters - the Gensim word2vec implementation and the fastText word embedding implementation use largely the same defaults (dim_size = 100, window_size = 5, num_epochs = 5). Of course, they are two completely different models (albeit, with a few similarities).\n", "\n", "Let's try with a larger corpus now - text8 (a collection of Wikipedia articles). I'm especially curious about the impact on semantic accuracy - for models trained on the brown corpus, the difference in semantic accuracy and the accuracy values themselves are too small to be conclusive. Hopefully a larger corpus helps, and the text8 corpus likely has a lot more information about capitals, currencies, cities etc., which should be relevant to the semantic tasks.\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading FastText embeddings\n", "Accuracy for FastText:\n", "Evaluating...\n", "\n", "298/506, 58.89%, Section: capital-common-countries\n", "625/1452, 43.04%, Section: capital-world\n", "37/268, 13.81%, Section: currency\n", "291/1511, 19.26%, Section: city-in-state\n", "151/306, 49.35%, Section: family\n", "567/756, 75.00%, Section: gram1-adjective-to-adverb\n", "188/306, 61.44%, Section: gram2-opposite\n", "809/1260, 64.21%, Section: gram3-comparative\n", "303/506, 59.88%, Section: gram4-superlative\n", "528/992, 53.23%, Section: gram5-present-participle\n", "1291/1371, 94.16%, Section: gram6-nationality-adjective\n", "451/1332, 33.86%, Section: gram7-past-tense\n", "853/992, 85.99%, Section: gram8-plural\n", "360/650, 55.38%, Section: gram9-plural-verbs\n", "6752/12208, 55.31%, Section: total\n", "\n", "Semantic: 1402/4043, Accuracy: 34.68%\n", "Syntactic: 5350/8165, Accuracy: 65.52%\n", "\n", "Loading Gensim embeddings\n", "Accuracy for word2vec:\n", "Evaluating...\n", "\n", "138/506, 27.27%, Section: capital-common-countries\n", "248/1452, 17.08%, Section: capital-world\n", "28/268, 10.45%, Section: currency\n", "158/1571, 10.06%, Section: city-in-state\n", "227/306, 74.18%, Section: family\n", "85/756, 11.24%, Section: gram1-adjective-to-adverb\n", "54/306, 17.65%, Section: gram2-opposite\n", "739/1260, 58.65%, Section: gram3-comparative\n", "178/506, 35.18%, Section: gram4-superlative\n", "297/992, 29.94%, Section: gram5-present-participle\n", "718/1371, 52.37%, Section: gram6-nationality-adjective\n", "325/1332, 24.40%, Section: gram7-past-tense\n", "389/992, 39.21%, Section: gram8-plural\n", "200/650, 30.77%, Section: gram9-plural-verbs\n", "3784/12268, 30.84%, Section: total\n", "\n", "Semantic: 799/4103, Accuracy: 19.47%\n", "Syntactic: 2985/8165, Accuracy: 36.56%\n", "\n" ] } ], "source": [ "print('Loading FastText embeddings')\n", "ft_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'text8_ft.vec')\n", "print('Accuracy for FastText:')\n", "print_accuracy(ft_model, word_analogies_file)\n", "\n", "print('Loading Gensim embeddings')\n", "gs_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'text8_gs.vec')\n", "print('Accuracy for word2vec:')\n", "print_accuracy(gs_model, word_analogies_file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the text8 corpus, the semantic accuracy for the fastText model increases significantly, and it surpasses word2vec in accuracy on both the semantic and the syntactic analogies. However, the increase in syntactic accuracy from the larger corpus is much higher for word2vec.\n", "\n", "These preliminary results seem to indicate fastText embeddings might be better than word2vec at encoding semantic and especially syntactic information. 
It'd be interesting to see how transferable these embeddings are by comparing their performance in a downstream supervised task.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# References" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[1] [Enriching Word Vectors with Subword Information](https://arxiv.org/pdf/1607.04606v1.pdf)\n", "\n", "[2] [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781v3.pdf)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }