{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Comparison of FastText and Word2Vec " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Facebook Research open sourced a great project recently - [fastText](https://github.com/facebookresearch/fastText), a fast (no surprise) and effective method to learn word representations and perform text classification. I was curious about comparing these embeddings to other commonly used embeddings, so word2vec seemed like the obvious choice, especially considering fastText embeddings are an extension of word2vec. \n", "\n", "I've used gensim to train the word2vec models, and the analogical reasoning task (described in Section 4.1 of [[2]](https://arxiv.org/pdf/1301.3781v3.pdf)) for comparing the word2vec and fastText models. I've compared embeddings trained using the skipgram architecture." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Download data" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package brown to /home/misha/nltk_data...\n", "[nltk_data] Package brown is already up-to-date!\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "--2019-05-12 19:40:14-- https://mattmahoney.net/dc/enwik9.zip\n", "Resolving mattmahoney.net (mattmahoney.net)... 67.195.197.75\n", "Connecting to mattmahoney.net (mattmahoney.net)|67.195.197.75|:80... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 322592222 (308M) [application/zip]\n", "Saving to: ‘enwik9.zip’\n", "\n", "enwik9.zip 47%[========> ] 145.49M 218KB/s in 10m 2s \n", "\n", "2019-05-12 19:50:17 (247 KB/s) - Connection closed at byte 152553031. Retrying.\n", "\n", "--2019-05-12 19:50:18-- (try: 2) https://mattmahoney.net/dc/enwik9.zip\n", "Connecting to mattmahoney.net (mattmahoney.net)|67.195.197.75|:80... connected.\n", "HTTP request sent, awaiting response... 
206 Partial Content\n", "Length: 322592222 (308M), 170039191 (162M) remaining [application/zip]\n", "Saving to: ‘enwik9.zip’\n", "\n", "enwik9.zip 100%[+++++++++==========>] 307.65M 344KB/s in 8m 38s \n", "\n", "2019-05-12 19:58:57 (320 KB/s) - ‘enwik9.zip’ saved [322592222/322592222]\n", "\n", "Archive: enwik9.zip\n", " inflating: enwik9 \n", "Can't open perl script \"fastText/wikifil.pl\": No such file or directory\n" ] } ], "source": [ "import nltk\n", "from smart_open import smart_open\n", "nltk.download('brown') \n", "# Only the brown corpus is needed in case you don't have it.\n", "\n", "# Generate brown corpus text file\n", "with smart_open('brown_corp.txt', 'w+') as f:\n", " for word in nltk.corpus.brown.words():\n", " f.write('{word} '.format(word=word))\n", "\n", "# Make sure you set FT_HOME to your fastText directory root\n", "FT_HOME = 'fastText/'\n", "# download the text8 corpus (a 100 MB sample of cleaned wikipedia text)\n", "import os.path\n", "if not os.path.isfile('text8'):\n", " !wget -c https://mattmahoney.net/dc/text8.zip\n", " !unzip text8.zip\n", "# download and preprocess the text9 corpus\n", "if not os.path.isfile('text9'):\n", " !wget -c https://mattmahoney.net/dc/enwik9.zip\n", " !unzip enwik9.zip\n", " !perl {FT_HOME}wikifil.pl enwik9 > text9" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Train models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For training the models yourself, you'll need to have both [Gensim](https://github.com/RaRe-Technologies/gensim) and [FastText](https://github.com/facebookresearch/fastText) set up on your machine." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training fasttext on brown_corp.txt corpus..\n", "/bin/sh: 1: fastText/fasttext: not found\n", "CPU times: user 6.02 ms, sys: 314 µs, total: 6.33 ms\n", "Wall time: 109 ms\n", "\n", "Training fasttext on brown_corp.txt corpus (without char n-grams)..\n", "/bin/sh: 1: fastText/fasttext: not found\n", "CPU times: user 2.12 ms, sys: 12.9 ms, total: 15 ms\n", "Wall time: 124 ms\n", "\n", "Training word2vec on brown_corp.txt corpus..\n", "CPU times: user 19.2 s, sys: 0 ns, total: 19.2 s\n", "Wall time: 6.71 s\n", "\n", "Saved gensim model as brown_gs.vec\n" ] } ], "source": [ "MODELS_DIR = 'models/'\n", "!mkdir -p {MODELS_DIR}\n", "\n", "lr = 0.05\n", "dim = 100\n", "ws = 5\n", "epoch = 5\n", "minCount = 5\n", "neg = 5\n", "loss = 'ns'\n", "t = 1e-4\n", "\n", "from gensim.models import Word2Vec, KeyedVectors\n", "from gensim.models.word2vec import Text8Corpus\n", "\n", "# Same values as used for fastText training above\n", "params = {\n", " 'alpha': lr,\n", " 'size': dim,\n", " 'window': ws,\n", " 'iter': epoch,\n", " 'min_count': minCount,\n", " 'sample': t,\n", " 'sg': 1,\n", " 'hs': 0,\n", " 'negative': neg\n", "}\n", "\n", "def train_models(corpus_file, output_name):\n", " output_file = '{:s}_ft'.format(output_name)\n", " if not os.path.isfile(os.path.join(MODELS_DIR, '{:s}.vec'.format(output_file))):\n", " print('Training fasttext on {:s} corpus..'.format(corpus_file))\n", " %time !{FT_HOME}fasttext skipgram -input {corpus_file} -output {MODELS_DIR+output_file} -lr {lr} -dim {dim} -ws {ws} -epoch {epoch} -minCount {minCount} -neg {neg} -loss {loss} -t {t}\n", " else:\n", " print('\\nUsing existing model file {:s}.vec'.format(output_file))\n", " \n", " output_file = '{:s}_ft_no_ng'.format(output_name)\n", " if not os.path.isfile(os.path.join(MODELS_DIR, 
'{:s}.vec'.format(output_file))):\n", " print('\\nTraining fasttext on {:s} corpus (without char n-grams)..'.format(corpus_file))\n", " %time !{FT_HOME}fasttext skipgram -input {corpus_file} -output {MODELS_DIR+output_file} -lr {lr} -dim {dim} -ws {ws} -epoch {epoch} -minCount {minCount} -neg {neg} -loss {loss} -t {t} -maxn 0\n", " else:\n", " print('\\nUsing existing model file {:s}.vec'.format(output_file))\n", " \n", " output_file = '{:s}_gs'.format(output_name)\n", " if not os.path.isfile(os.path.join(MODELS_DIR, '{:s}.vec'.format(output_file))):\n", " print('\\nTraining word2vec on {:s} corpus..'.format(corpus_file))\n", " \n", " # Text8Corpus class for reading space-separated words file\n", " %time gs_model = Word2Vec(Text8Corpus(corpus_file), **params); gs_model\n", " # Direct local variable lookup doesn't work properly with magic statements (%time)\n", " locals()['gs_model'].wv.save_word2vec_format(os.path.join(MODELS_DIR, '{:s}.vec'.format(output_file)))\n", " print('\\nSaved gensim model as {:s}.vec'.format(output_file))\n", " else:\n", " print('\\nUsing existing model file {:s}.vec'.format(output_file))\n", "\n", "evaluation_data = {}\n", "train_models('brown_corp.txt', 'brown')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training fasttext on text8 corpus..\n", "/bin/sh: 1: fastText/fasttext: not found\n", "CPU times: user 7.37 ms, sys: 0 ns, total: 7.37 ms\n", "Wall time: 109 ms\n", "\n", "Training fasttext on text8 corpus (without char n-grams)..\n", "/bin/sh: 1: fastText/fasttext: not found\n", "CPU times: user 12.3 ms, sys: 0 ns, total: 12.3 ms\n", "Wall time: 115 ms\n", "\n", "Training word2vec on text8 corpus..\n", "CPU times: user 7min 12s, sys: 0 ns, total: 7min 12s\n", "Wall time: 2min 26s\n", "\n", "Saved gensim model as text8_gs.vec\n" ] } ], "source": [ "train_models(corpus_file='text8', output_name='text8')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training fasttext on text9 corpus..\n", "/bin/sh: 1: fastText/fasttext: not found\n", "CPU times: user 8.81 ms, sys: 0 ns, total: 8.81 ms\n", "Wall time: 111 ms\n", "\n", "Training fasttext on text9 corpus (without char n-grams)..\n", "/bin/sh: 1: fastText/fasttext: not found\n", "CPU times: user 10.7 ms, sys: 0 ns, total: 10.7 ms\n", "Wall time: 115 ms\n", "\n", "Training word2vec on text9 corpus..\n" ] }, { "ename": "RuntimeError", "evalue": "you must first build vocabulary before training the model", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mRuntimeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n", "\u001b[0;32m~/git/gensim/gensim/models/word2vec.py\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, sentences, corpus_file, size, alpha, window, min_count, max_vocab_size, sample, seed, workers, min_alpha, sg, hs, negative, ns_exponent, cbow_mean, hashfxn, iter, null_word, trim_rule, sorted_vocab, batch_words, compute_loss, callbacks, max_final_vocab)\u001b[0m\n\u001b[1;32m 781\u001b[0m \u001b[0mcallbacks\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mcallbacks\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mbatch_words\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mbatch_words\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mtrim_rule\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mtrim_rule\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msg\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0msg\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0malpha\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0malpha\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mwindow\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mwindow\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 782\u001b[0m \u001b[0mseed\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mseed\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mhs\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mhs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnegative\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mnegative\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcbow_mean\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mcbow_mean\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmin_alpha\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mmin_alpha\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcompute_loss\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mcompute_loss\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 783\u001b[0;31m fast_version=FAST_VERSION)\n\u001b[0m\u001b[1;32m 784\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 785\u001b[0m def _do_train_epoch(self, corpus_file, thread_id, offset, cython_vocab, thread_private_mem, cur_epoch,\n", "\u001b[0;32m~/git/gensim/gensim/models/base_any2vec.py\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, sentences, corpus_file, workers, vector_size, epochs, callbacks, batch_words, trim_rule, sg, alpha, window, seed, hs, negative, ns_exponent, cbow_mean, min_alpha, compute_loss, fast_version, **kwargs)\u001b[0m\n\u001b[1;32m 761\u001b[0m \u001b[0msentences\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0msentences\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcorpus_file\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mcorpus_file\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtotal_examples\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcorpus_count\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 762\u001b[0m \u001b[0mtotal_words\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcorpus_total_words\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mepochs\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mepochs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstart_alpha\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0malpha\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 763\u001b[0;31m end_alpha=self.min_alpha, compute_loss=compute_loss)\n\u001b[0m\u001b[1;32m 764\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 765\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mtrim_rule\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/git/gensim/gensim/models/word2vec.py\u001b[0m in \u001b[0;36mtrain\u001b[0;34m(self, sentences, corpus_file, total_examples, total_words, epochs, start_alpha, end_alpha, word_count, queue_factor, report_delay, compute_loss, callbacks)\u001b[0m\n\u001b[1;32m 908\u001b[0m \u001b[0msentences\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0msentences\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcorpus_file\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mcorpus_file\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mtotal_examples\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mtotal_examples\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtotal_words\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mtotal_words\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 909\u001b[0m \u001b[0mepochs\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mepochs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstart_alpha\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mstart_alpha\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mend_alpha\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mend_alpha\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mword_count\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mword_count\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 910\u001b[0;31m queue_factor=queue_factor, report_delay=report_delay, compute_loss=compute_loss, callbacks=callbacks)\n\u001b[0m\u001b[1;32m 911\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 912\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mscore\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msentences\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtotal_sentences\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m1e6\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mchunksize\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m100\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mqueue_factor\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mreport_delay\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/git/gensim/gensim/models/base_any2vec.py\u001b[0m in \u001b[0;36mtrain\u001b[0;34m(self, sentences, corpus_file, total_examples, total_words, epochs, start_alpha, end_alpha, word_count, queue_factor, report_delay, compute_loss, callbacks, **kwargs)\u001b[0m\n\u001b[1;32m 1079\u001b[0m \u001b[0mtotal_words\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mtotal_words\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mepochs\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mepochs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstart_alpha\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mstart_alpha\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mend_alpha\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mend_alpha\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mword_count\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mword_count\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1080\u001b[0m \u001b[0mqueue_factor\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mqueue_factor\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mreport_delay\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mreport_delay\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcompute_loss\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mcompute_loss\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcallbacks\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mcallbacks\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1081\u001b[0;31m **kwargs)\n\u001b[0m\u001b[1;32m 1082\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1083\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_get_job_params\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcur_epoch\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/git/gensim/gensim/models/base_any2vec.py\u001b[0m in \u001b[0;36mtrain\u001b[0;34m(self, data_iterable, corpus_file, epochs, 
total_examples, total_words, queue_factor, report_delay, callbacks, **kwargs)\u001b[0m\n\u001b[1;32m 534\u001b[0m \u001b[0mepochs\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mepochs\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 535\u001b[0m \u001b[0mtotal_examples\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mtotal_examples\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 536\u001b[0;31m total_words=total_words, **kwargs)\n\u001b[0m\u001b[1;32m 537\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 538\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mcallback\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcallbacks\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/git/gensim/gensim/models/base_any2vec.py\u001b[0m in \u001b[0;36m_check_training_sanity\u001b[0;34m(self, epochs, total_examples, total_words, **kwargs)\u001b[0m\n\u001b[1;32m 1185\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1186\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwv\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mvocab\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;31m# should be set by `build_vocab`\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1187\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mRuntimeError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"you must first build vocabulary before training the model\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1188\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwv\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mvectors\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1189\u001b[0m \u001b[0;32mraise\u001b[0m \u001b[0mRuntimeError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"you must initialize vectors before training the model\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mRuntimeError\u001b[0m: you must first build vocabulary before training the model" ] }, { "ename": "KeyError", "evalue": "'gs_model'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mtrain_models\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcorpus_file\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'text9'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0moutput_name\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'text9'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m\u001b[0m in \u001b[0;36mtrain_models\u001b[0;34m(corpus_file, output_name)\u001b[0m\n\u001b[1;32m 49\u001b[0m \u001b[0mget_ipython\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun_line_magic\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'time'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'gs_model = Word2Vec(Text8Corpus(corpus_file), **params); gs_model'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 50\u001b[0m \u001b[0;31m# Direct local 
variable lookup doesn't work properly with magic statements (%time)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 51\u001b[0;31m \u001b[0mlocals\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'gs_model'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwv\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msave_word2vec_format\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mos\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mjoin\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mMODELS_DIR\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'{:s}.vec'\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0moutput_file\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 52\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'\\nSaved gensim model as {:s}.vec'\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0moutput_file\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 53\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mKeyError\u001b[0m: 'gs_model'" ] } ], "source": [ "train_models(corpus_file='text9', output_name='text9')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Comparisons" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# download the file questions-words.txt to be used for comparing word embeddings\n", "!wget https://raw.githubusercontent.com/tmikolov/word2vec/master/questions-words.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once you have downloaded or trained the models and downloaded `questions-words.txt`, you're ready to run the comparison." 
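 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each analogy question `a b c d` in `questions-words.txt` is answered by finding the word whose vector is nearest to `vector(b) - vector(a) + vector(c)` and checking whether that word is `d`. The next cell is a minimal sketch of a single such query (assuming the `brown_gs.vec` file trained above exists and that the query words survived the `min_count` cutoff) - the `accuracy` call used in the comparison below essentially runs this lookup for every question in the file." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A single analogy query, e.g. 'man : king :: woman : ?'\n", "# MODELS_DIR is defined in the training cell above.\n", "from gensim.models import KeyedVectors\n", "\n", "vectors = KeyedVectors.load_word2vec_format(MODELS_DIR + 'brown_gs.vec')\n", "print(vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))"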
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import logging\n", "logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)\n", "\n", "# Training times in seconds\n", "evaluation_data['brown'] = [(18, 54.3, 32.5)]\n", "evaluation_data['text8'] = [(402, 942, 496)]\n", "evaluation_data['text9'] = [(3218, 6589, 3550)]\n", "\n", "def print_accuracy(model, questions_file):\n", " print('Evaluating...\\n')\n", " acc = model.accuracy(questions_file)\n", "\n", " sem_correct = sum((len(acc[i]['correct']) for i in range(5)))\n", " sem_total = sum((len(acc[i]['correct']) + len(acc[i]['incorrect'])) for i in range(5))\n", " sem_acc = 100*float(sem_correct)/sem_total\n", " print('\\nSemantic: {:d}/{:d}, Accuracy: {:.2f}%'.format(sem_correct, sem_total, sem_acc))\n", " \n", " syn_correct = sum((len(acc[i]['correct']) for i in range(5, len(acc)-1)))\n", " syn_total = sum((len(acc[i]['correct']) + len(acc[i]['incorrect'])) for i in range(5,len(acc)-1))\n", " syn_acc = 100*float(syn_correct)/syn_total\n", " print('Syntactic: {:d}/{:d}, Accuracy: {:.2f}%\\n'.format(syn_correct, syn_total, syn_acc))\n", " return (sem_acc, syn_acc)\n", "\n", "word_analogies_file = 'questions-words.txt'\n", "accuracies = []\n", "print('\\nLoading Gensim embeddings')\n", "brown_gs = KeyedVectors.load_word2vec_format(MODELS_DIR + 'brown_gs.vec')\n", "print('Accuracy for Word2Vec:')\n", "accuracies.append(print_accuracy(brown_gs, word_analogies_file))\n", "\n", "print('\\nLoading FastText embeddings')\n", "brown_ft = KeyedVectors.load_word2vec_format(MODELS_DIR + 'brown_ft.vec')\n", "print('Accuracy for FastText (with n-grams):')\n", "accuracies.append(print_accuracy(brown_ft, word_analogies_file))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `accuracy` method takes an optional parameter `restrict_vocab`, which limits the vocabulary of the model considered for fast approximate evaluation (the default is 30000).\n", "\n", "Word2Vec embeddings seem to be slightly better than fastText embeddings at the semantic tasks, while the fastText embeddings do significantly better on the syntactic analogies. This makes sense, since fastText embeddings are trained to capture morphological nuances, and most of the syntactic analogies are morphology-based.\n", "\n", "Let me explain that better.\n", "\n", "According to the paper [[1]](https://arxiv.org/abs/1607.04606), word embeddings are represented by the sum of their character n-gram embeddings. This is meant to be useful for morphologically rich languages - so theoretically, the embedding for `apparently` would include information from both character n-grams `apparent` and `ly` (as well as other n-grams), and the n-grams would combine in a simple, linear manner. This is very similar to what most of our syntactic tasks look like.\n", "\n", "Example analogy:\n", "\n", "`amazing amazingly calm calmly`\n", "\n", "This analogy is marked correct if the word whose embedding is closest to `embedding(amazingly) - embedding(amazing) + embedding(calm)` is `calmly` - in other words, if\n", "\n", "`embedding(amazing)` - `embedding(amazingly)` ≈ `embedding(calm)` - `embedding(calmly)`\n", "\n", "Both these subtractions would result in a very similar set of remaining n-grams.\n", "No surprise, then, that the fastText embeddings do extremely well on this.\n", "\n", "Let's do a small test to validate this hypothesis - fastText differs from word2vec only in that it uses char n-gram embeddings as well as the actual word embedding in the scoring function to calculate scores and then likelihoods for each word, given a context word. 
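For intuition, here is a rough sketch of the character n-grams involved (fastText pads each word with `<` and `>` and by default uses n-grams of length 3 to 6; the real implementation also hashes the n-grams into a fixed number of buckets):\n", "\n", "```python\n", "def char_ngrams(word, minn=3, maxn=6):\n", "    padded = '<' + word + '>'\n", "    return {padded[i:i + n]\n", "            for n in range(minn, maxn + 1)\n", "            for i in range(len(padded) - n + 1)}\n", "\n", "# Most n-grams of 'amazing' also appear in 'amazingly', so the difference between\n", "# the two summed n-gram vectors is dominated by the n-grams around the '-ly' ending.\n", "print(sorted(char_ngrams('amazing') & char_ngrams('amazingly')))  # shared n-grams\n", "print(sorted(char_ngrams('amazingly') - char_ngrams('amazing')))  # unique to 'amazingly'\n", "```\n", "\n", "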
In case char n-gram embeddings are not present, this reduces (at least theoretically) to the original word2vec model. This can be implemented by setting 0 for the max length of char n-grams for fastText.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('Loading FastText embeddings')\n", "brown_ft_no_ng = KeyedVectors.load_word2vec_format(MODELS_DIR + 'brown_ft_no_ng.vec')\n", "print('Accuracy for FastText (without n-grams):')\n", "accuracies.append(print_accuracy(brown_ft_no_ng, word_analogies_file))\n", "evaluation_data['brown'] += [[acc[0] for acc in accuracies], [acc[1] for acc in accuracies]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A-ha! The results for FastText with no n-grams and Word2Vec look a lot more similar (as they should) - the differences could easily result from differences in implementation between fastText and Gensim, and randomization. Especially telling is that the semantic accuracy for FastText has improved slightly after removing n-grams, while the syntactic accuracy has taken a giant dive. Our hypothesis that the char n-grams result in better performance on syntactic analogies seems fair. It also seems possible that char n-grams hurt semantic accuracy a little. However, the brown corpus is too small to be able to draw any definite conclusions - the accuracies seem to vary significantly over different runs." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try with a larger corpus now - text8 (collection of wiki articles). I'm also curious about the impact on semantic accuracy - for models trained on the brown corpus, the difference in the semantic accuracy and the accuracy values themselves are too small to be conclusive. Hopefully a larger corpus helps, and the text8 corpus likely has a lot more information about capitals, currencies, cities etc, which should be relevant to the semantic tasks." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "accuracies = []\n", "print('Loading Gensim embeddings')\n", "text8_gs = KeyedVectors.load_word2vec_format(MODELS_DIR + 'text8_gs.vec')\n", "print('Accuracy for word2vec:')\n", "accuracies.append(print_accuracy(text8_gs, word_analogies_file))\n", "\n", "print('Loading FastText embeddings (with n-grams)')\n", "text8_ft = KeyedVectors.load_word2vec_format(MODELS_DIR + 'text8_ft.vec')\n", "print('Accuracy for FastText (with n-grams):')\n", "accuracies.append(print_accuracy(text8_ft, word_analogies_file))\n", "\n", "print('Loading FastText embeddings')\n", "text8_ft_no_ng = KeyedVectors.load_word2vec_format(MODELS_DIR + 'text8_ft_no_ng.vec')\n", "print('Accuracy for FastText (without n-grams):')\n", "accuracies.append(print_accuracy(text8_ft_no_ng, word_analogies_file))\n", "\n", "evaluation_data['text8'] += [[acc[0] for acc in accuracies], [acc[1] for acc in accuracies]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the text8 corpus, we observe a similar pattern. Semantic accuracy falls by a small but significant amount when n-grams are included in FastText, while FastText with n-grams performs far better on the syntactic analogies. 
The results for FastText without n-grams are largely similar to those for Word2Vec.\n", "\n", "My hypothesis for the semantic accuracy being lower for the FastText-with-n-grams model is that most of the words in the semantic analogies are standalone words and are unrelated to their morphemes (e.g. father, mother, France, Paris), so including the char n-grams in the scoring function actually makes the embeddings worse.\n", "\n", "This trend is also observed in the original paper, where the performance of embeddings with n-grams is worse on semantic tasks than both the word2vec cbow and skipgram models.\n", "\n", "Let's do a quick comparison on an even larger corpus - text9." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "accuracies = []\n", "print('Loading Gensim embeddings')\n", "text9_gs = KeyedVectors.load_word2vec_format(MODELS_DIR + 'text9_gs.vec')\n", "print('Accuracy for word2vec:')\n", "accuracies.append(print_accuracy(text9_gs, word_analogies_file))\n", "\n", "print('Loading FastText embeddings (with n-grams)')\n", "text9_ft = KeyedVectors.load_word2vec_format(MODELS_DIR + 'text9_ft.vec')\n", "print('Accuracy for FastText (with n-grams):')\n", "accuracies.append(print_accuracy(text9_ft, word_analogies_file))\n", "\n", "print('Loading FastText embeddings')\n", "text9_ft_no_ng = KeyedVectors.load_word2vec_format(MODELS_DIR + 'text9_ft_no_ng.vec')\n", "print('Accuracy for FastText (without n-grams):')\n", "accuracies.append(print_accuracy(text9_ft_no_ng, word_analogies_file))\n", "\n", "evaluation_data['text9'] += [[acc[0] for acc in accuracies], [acc[1] for acc in accuracies]]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "\n", "def plot(ax, data, corpus_name='brown'):\n", " width = 0.25\n", " pos = [(i, i + width, i + 2*width) for i in range(len(data))]\n", " colors = ['#EE3224', '#F78F1E', '#FFC222']\n", " acc_ax = ax.twinx()\n", " # Training time\n", " ax.bar(pos[0],\n", " data[0],\n", " width,\n", " alpha=0.5,\n", " color=colors\n", " )\n", " # Semantic accuracy\n", " acc_ax.bar(pos[1],\n", " data[1],\n", " width,\n", " alpha=0.5,\n", " color=colors\n", " )\n", "\n", " # Syntactic accuracy\n", " acc_ax.bar(pos[2],\n", " data[2],\n", " width,\n", " alpha=0.5,\n", " color=colors\n", " )\n", "\n", " ax.set_ylabel('Training time (s)')\n", " acc_ax.set_ylabel('Accuracy (%)')\n", " ax.set_title(corpus_name)\n", "\n", " acc_ax.set_xticks([p[0] + 1.5 * width for p in pos])\n", " acc_ax.set_xticklabels(['Training Time', 'Semantic Accuracy', 'Syntactic Accuracy'])\n", "\n", " # Proxy plots for adding legend correctly\n", " proxies = [ax.bar([0], [0], width=0, color=c, alpha=0.5)[0] for c in colors]\n", " models = ('Gensim', 'FastText', 'FastText (no-ngrams)')\n", " ax.legend((proxies), models, loc='upper left')\n", " \n", " ax.set_xlim(pos[0][0]-width, pos[-1][0]+width*4)\n", " ax.set_ylim([0, max(data[0])*1.1] )\n", " acc_ax.set_ylim([0, max(data[1] + data[2])*1.1] )\n", "\n", " plt.grid()\n", "\n", "# Plotting the bars\n", "fig = plt.figure(figsize=(10,15))\n", "for corpus, subplot in zip(sorted(evaluation_data.keys()), [311, 312, 313]):\n", " ax = fig.add_subplot(subplot)\n", " plot(ax, evaluation_data[corpus], corpus)\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results from text9 seem to confirm our hypotheses so far. Briefly summarising the main points:\n", "\n", "1. 
FastText models with n-grams do significantly better on the syntactic tasks, because most of the syntactic questions are related to the morphology of the words.\n", "2. Both Gensim word2vec and the fastText model with no n-grams do slightly better on the semantic tasks, presumably because words from the semantic questions are standalone words, unrelated to their char n-grams.\n", "3. In general, the performance of the models seems to get closer with increasing corpus size. However, this might be because the embedding dimensionality stays constant at 100, and a larger dimensionality for the larger corpora might result in higher performance gains.\n", "4. The semantic accuracy for all models increases significantly with the increase in corpus size.\n", "5. However, the increase in syntactic accuracy with corpus size is lower for the n-gram FastText model (in both relative and absolute terms). This could indicate that the advantages gained by incorporating morphological information become less significant for larger corpus sizes (the corpora used in the original paper seem to indicate this too).\n", "6. Training times for Gensim are slightly lower than those of the fastText no-n-gram model, and significantly lower than those of the n-gram variant. This is quite impressive considering fastText is implemented in C++ and Gensim in Python (with calls to low-level BLAS routines for much of the heavy lifting). You can read [this post](https://rare-technologies.com/word2vec-in-python-part-two-optimizing/) for more details on word2vec optimisation in Gensim. Note that these times include importing any dependencies and serializing the models to disk, not just the training itself." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Conclusions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These preliminary results seem to indicate that fastText embeddings are significantly better than word2vec at encoding syntactic information. This is expected, since most syntactic analogies are morphology-based, and the char n-gram approach of fastText takes such information into account. The original word2vec model seems to perform better on semantic tasks, since words in semantic analogies are unrelated to their char n-grams, and the added information from irrelevant char n-grams worsens the embeddings. It'd be interesting to see how transferable these embeddings are across different kinds of tasks, by comparing their performance in a downstream supervised task." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# References" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[1] [Enriching Word Vectors with Subword Information](https://arxiv.org/pdf/1607.04606v1.pdf)\n", "\n", "[2] [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781v3.pdf)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 1 }