{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Comparison of WordRank, Word2Vec and FastText\n", "\n", "[WordRank](https://arxiv.org/pdf/1506.02761v3.pdf) is a recent approach to word embeddings that formulates embedding learning as a ranking problem. That is, given a word w, it aims to output an ordered list (c1, c2, · · ·) of context words such that words that co-occur with w appear at the top of the list. This formulation fits popular word embedding tasks such as word similarity and analogy naturally, since in those tasks we are interested in finding the most relevant words in a given context rather than the likelihood of each word[1].\n", "\n", "This notebook accompanies a more theoretical blog post [here](https://rare-technologies.com/wordrank-embedding-crowned-is-most-similar-to-king-not-word2vecs-canute/).\n", "\n", "Gensim is used to train Word2Vec (and WordRank, through its wrapper) and to evaluate all of the models. Analogical reasoning and Word Similarity tasks are used for comparing the models. Word2Vec and FastText embeddings are trained using the skipgram architecture here." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Download and preprocess data" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package brown to /home/misha/nltk_data...\n", "[nltk_data] Package brown is already up-to-date!\n" ] } ], "source": [ "import nltk\n", "from smart_open import smart_open\n", "from gensim.parsing.preprocessing import strip_punctuation, strip_multiple_whitespaces\n", "\n", "# Only the brown corpus is needed in case you don't have it.\n", "nltk.download('brown') \n", "\n", "# Generate brown corpus text file\n", "with smart_open('brown_corp.txt', 'w+') as f:\n", " for word in nltk.corpus.brown.words():\n", " f.write('{word} '.format(word=word))\n", " f.seek(0)\n", " brown = f.read()\n", "\n", "# Preprocess brown corpus\n", "with smart_open('proc_brown_corp.txt', 'w') as f:\n", " proc_brown = strip_punctuation(brown)\n", " proc_brown = strip_multiple_whitespaces(proc_brown).lower()\n", " f.write(proc_brown)\n", "\n", "# Set WR_HOME and FT_HOME to respective directory root\n", "WR_HOME = 'wordrank/'\n", "FT_HOME = 'fastText/'\n", "\n", "# download the text8 corpus (a 100 MB sample of preprocessed wikipedia text)\n", "import os.path\n", "if not os.path.isfile('text8'):\n", " !wget -c https://mattmahoney.net/dc/text8.zip\n", " !unzip text8.zip" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Train Models\n", "For training the models yourself, you'll need to have Gensim, FastText and Wordrank set up on your machine." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Training word2vec on proc_brown_corp.txt corpus..\n", "CPU times: user 44.6 s, sys: 85.5 ms, total: 44.7 s\n", "Wall time: 15.2 s\n" ] }, { "ename": "DeprecationWarning", "evalue": "Deprecated. 
Use model.wv.save_word2vec_format instead.", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mDeprecationWarning\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 76\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'\\nUsing existing model file {:s}.vec'\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0moutput_file\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 77\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 78\u001b[0;31m \u001b[0mtrain_models\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcorpus_file\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'proc_brown_corp.txt'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0moutput_name\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'brown'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m\u001b[0m in \u001b[0;36mtrain_models\u001b[0;34m(corpus_file, output_name)\u001b[0m\n\u001b[1;32m 42\u001b[0m \u001b[0;31m# Text8Corpus class for reading space-separated words file\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 43\u001b[0m \u001b[0mget_ipython\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun_line_magic\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'time'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'gs_model = Word2Vec(Text8Corpus(corpus_file), **w2v_params); gs_model'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 44\u001b[0;31m 
\u001b[0mlocals\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'gs_model'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msave_word2vec_format\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mos\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mjoin\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mMODELS_DIR\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'{:s}.vec'\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0moutput_file\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 45\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'\\nSaved gensim model as {:s}.vec'\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0moutput_file\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 46\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/git/gensim/gensim/models/word2vec.py\u001b[0m in \u001b[0;36msave_word2vec_format\u001b[0;34m(self, fname, fvocab, binary)\u001b[0m\n\u001b[1;32m 1305\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1306\u001b[0m \"\"\"\n\u001b[0;32m-> 1307\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mDeprecationWarning\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"Deprecated. Use model.wv.save_word2vec_format instead.\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1308\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1309\u001b[0m \u001b[0;34m@\u001b[0m\u001b[0mclassmethod\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mDeprecationWarning\u001b[0m: Deprecated. Use model.wv.save_word2vec_format instead." 
] } ], "source": [ "MODELS_DIR = 'models/'\n", "!mkdir -p {MODELS_DIR}\n", "\n", "from gensim.models import Word2Vec\n", "from gensim.models.wrappers import Wordrank\n", "from gensim.models.word2vec import Text8Corpus\n", "\n", "# fasttext params\n", "lr = 0.05\n", "dim = 100\n", "ws = 5\n", "epoch = 5\n", "minCount = 5\n", "neg = 5\n", "loss = 'ns'\n", "t = 1e-4\n", "\n", "w2v_params = {\n", " 'alpha': 0.025,\n", " 'size': 100,\n", " 'window': 15,\n", " 'iter': 5,\n", " 'min_count': 5,\n", " 'sample': t,\n", " 'sg': 1,\n", " 'hs': 0,\n", " 'negative': 5\n", "}\n", "\n", "wr_params = {\n", " 'size': 100,\n", " 'window': 15,\n", " 'iter': 91,\n", " 'min_count': 5\n", "}\n", "\n", "def train_models(corpus_file, output_name):\n", " # Train using word2vec\n", " output_file = '{:s}_gs'.format(output_name)\n", " if not os.path.isfile(os.path.join(MODELS_DIR, '{:s}.vec'.format(output_file))):\n", " print('\\nTraining word2vec on {:s} corpus..'.format(corpus_file))\n", " # Text8Corpus class for reading space-separated words file\n", " %time gs_model = Word2Vec(Text8Corpus(corpus_file), **w2v_params); gs_model\n", " # save through .wv: Word2Vec.save_word2vec_format is deprecated and raises DeprecationWarning\n", " locals()['gs_model'].wv.save_word2vec_format(os.path.join(MODELS_DIR, '{:s}.vec'.format(output_file)))\n", " print('\\nSaved gensim model as {:s}.vec'.format(output_file))\n", " else:\n", " print('\\nUsing existing model file {:s}.vec'.format(output_file))\n", "\n", " # Train using fasttext\n", " output_file = '{:s}_ft'.format(output_name)\n", " if not os.path.isfile(os.path.join(MODELS_DIR, '{:s}.vec'.format(output_file))):\n", " print('Training fasttext on {:s} corpus..'.format(corpus_file))\n", " %time !{FT_HOME}fasttext skipgram -input {corpus_file} -output {MODELS_DIR+output_file} -lr {lr} -dim {dim} -ws {ws} -epoch {epoch} -minCount {minCount} -neg {neg} -loss {loss} -t {t}\n", " else:\n", " print('\\nUsing existing model file {:s}.vec'.format(output_file))\n", " \n", " # Train using wordrank\n", " output_file = '{:s}_wr'.format(output_name)\n", " output_dir 
= 'model' # directory to save embeddings and metadata to\n", " if not os.path.isfile(os.path.join(MODELS_DIR, '{:s}.vec'.format(output_file))):\n", " print('\\nTraining wordrank on {:s} corpus..'.format(corpus_file))\n", " %time wr_model = Wordrank.train(WR_HOME, corpus_file, output_dir, **wr_params); wr_model\n", " locals()['wr_model'].save_word2vec_format(os.path.join(MODELS_DIR, '{:s}.vec'.format(output_file)))\n", " print('\\nSaved wordrank model as {:s}.vec'.format(output_file))\n", " else:\n", " print('\\nUsing existing model file {:s}.vec'.format(output_file))\n", " \n", " # Loading ensemble embeddings\n", " output_file = '{:s}_wr_ensemble'.format(output_name)\n", " if not os.path.isfile(os.path.join(MODELS_DIR, '{:s}.vec'.format(output_file))):\n", " print('\\nLoading ensemble embeddings (vector combination of word and context embeddings)..')\n", " %time wr_model = Wordrank.load_wordrank_model(os.path.join(WR_HOME, 'model/wordrank.words'), os.path.join(WR_HOME, 'model/meta/vocab.txt'), os.path.join(WR_HOME, 'model/wordrank.contexts'), sorted_vocab=1, ensemble=1); wr_model\n", " locals()['wr_model'].wv.save_word2vec_format(os.path.join(MODELS_DIR, '{:s}.vec'.format(output_file)))\n", " print('\\nSaved wordrank (ensemble) model as {:s}.vec'.format(output_file))\n", " else:\n", " print('\\nUsing existing model file {:s}.vec'.format(output_file))\n", " \n", "train_models(corpus_file='proc_brown_corp.txt', output_name='brown')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_models(corpus_file='text8', output_name='text8')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Besides the plain WordRank embeddings, we also save the ensemble embeddings (a vector combination of word and context embeddings), as the ensemble is known to give a small performance boost in some cases. We'll test accuracy for both variants." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Comparisons" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import logging\n", "logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)\n", "\n", "def print_analogy_accuracy(model, questions_file):\n", " # acc holds one entry per section of the questions file, plus a final 'total' entry\n", " acc = model.accuracy(questions_file)\n", "\n", " # the first five sections are the semantic analogies\n", " sem_correct = sum((len(acc[i]['correct']) for i in range(5)))\n", " sem_total = sum((len(acc[i]['correct']) + len(acc[i]['incorrect'])) for i in range(5))\n", " sem_acc = 100*float(sem_correct)/sem_total\n", " print('\\nSemantic: {:d}/{:d}, Accuracy: {:.2f}%'.format(sem_correct, sem_total, sem_acc))\n", " \n", " # the remaining sections are syntactic; the last ('total') entry is skipped\n", " syn_correct = sum((len(acc[i]['correct']) for i in range(5, len(acc)-1)))\n", " syn_total = sum((len(acc[i]['correct']) + len(acc[i]['incorrect'])) for i in range(5,len(acc)-1))\n", " syn_acc = 100*float(syn_correct)/syn_total\n", " print('Syntactic: {:d}/{:d}, Accuracy: {:.2f}%\\n'.format(syn_correct, syn_total, syn_acc))\n", " \n", "def print_similarity_accuracy(model, similarity_file):\n", " acc = model.evaluate_word_pairs(similarity_file)\n", " print('Pearson correlation coefficient: {:.2f}'.format(acc[0][0]))\n", " print('Spearman rank correlation coefficient: {:.2f}'.format(acc[1][0]))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from gensim.models import KeyedVectors\n", "\n", "MODELS_DIR = 'models/'\n", "word_analogies_file = './datasets/questions-words.txt'\n", "simlex_file = '../../gensim/test/test_data/simlex999.txt'\n", "wordsim_file = '../../gensim/test/test_data/wordsim353.tsv'\n", "\n", "print('\\nLoading Gensim embeddings')\n", "brown_gs = KeyedVectors.load_word2vec_format(MODELS_DIR + 'brown_gs.vec')\n", "print('Accuracy for Word2Vec:')\n", "print_analogy_accuracy(brown_gs, word_analogies_file)\n", "print('SimLex-999 similarity')\n", "print_similarity_accuracy(brown_gs, simlex_file)\n", "print('\\nWordSim-353 similarity')\n", 
"print_similarity_accuracy(brown_gs, wordsim_file)\n", "\n", "print('\\nLoading FastText embeddings')\n", "brown_ft = KeyedVectors.load_word2vec_format(MODELS_DIR + 'brown_ft.vec')\n", "print('Accuracy for FastText:')\n", "print_analogy_accuracy(brown_ft, word_analogies_file)\n", "print('SimLex-999 similarity')\n", "print_similarity_accuracy(brown_ft, simlex_file)\n", "print('\\nWordSim-353 similarity')\n", "print_similarity_accuracy(brown_ft, wordsim_file)\n", "\n", "print('\\nLoading Wordrank embeddings')\n", "brown_wr = KeyedVectors.load_word2vec_format(MODELS_DIR + 'brown_wr.vec')\n", "print('Accuracy for Wordrank:')\n", "print_analogy_accuracy(brown_wr, word_analogies_file)\n", "print('SimLex-999 similarity')\n", "print_similarity_accuracy(brown_wr, simlex_file)\n", "print('\\nWordSim-353 similarity')\n", "print_similarity_accuracy(brown_wr, wordsim_file)\n", "\n", "print('\\nLoading Wordrank ensemble embeddings')\n", "brown_wr_ensemble = KeyedVectors.load_word2vec_format(MODELS_DIR + 'brown_wr_ensemble.vec')\n", "print('Accuracy for Wordrank:')\n", "print_analogy_accuracy(brown_wr_ensemble, word_analogies_file)\n", "print('SimLex-999 similarity')\n", "print_similarity_accuracy(brown_wr_ensemble, simlex_file)\n", "print('\\nWordSim-353 similarity')\n", "print_similarity_accuracy(brown_wr_ensemble, wordsim_file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As evident from the above outputs, WordRank performs significantly better in Semantic analogies, whereas, FastText on Syntactic analogies. 
Ensemble embeddings also give a small performance boost in WordRank's case.\n", "\n", "WordRank's effectiveness on Semantic analogies is possibly due to its focus on getting the most relevant words right at the top of the list through the ranking approach.\n", "FastText, in turn, is designed to incorporate morphological information about words, which explains its performance boost on Syntactic analogies, as most of the Syntactic analogies are morphology based[2].\n", "\n", "For Word Similarity, Word2Vec performed better on the SimLex-999 test data, whereas WordRank performed better on WS-353. This is probably due to the different types of similarity these datasets address. SimLex-999 measures how well two words are interchangeable in similar contexts, while WS-353 tries to estimate the relatedness or co-occurrence of two words. Ensemble embeddings also don't help in the Word Similarity task[1], which is evident from the results above, so we'll use just the word embeddings for it.\n", "\n", "Now let's evaluate on a larger corpus, text8, and see how it affects the performance of the different embedding models. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('Loading Gensim embeddings')\n", "text8_gs = KeyedVectors.load_word2vec_format(MODELS_DIR + 'text8_gs.vec')\n", "print('Accuracy for word2vec:')\n", "print_analogy_accuracy(text8_gs, word_analogies_file)\n", "print('SimLex-999 similarity')\n", "print_similarity_accuracy(text8_gs, simlex_file)\n", "print('\\nWordSim-353 similarity')\n", "print_similarity_accuracy(text8_gs, wordsim_file)\n", "\n", "print('Loading FastText embeddings')\n", "text8_ft = KeyedVectors.load_word2vec_format(MODELS_DIR + 'text8_ft.vec')\n", "print('Accuracy for FastText (with n-grams):')\n", "print_analogy_accuracy(text8_ft, word_analogies_file)\n", "print('SimLex-999 similarity')\n", "print_similarity_accuracy(text8_ft, simlex_file)\n", "print('\\nWordSim-353 similarity')\n", "print_similarity_accuracy(text8_ft, wordsim_file)\n", "\n", "print('\\nLoading Wordrank embeddings')\n", "text8_wr = KeyedVectors.load_word2vec_format(MODELS_DIR + 'text8_wr.vec')\n", "print('Accuracy for Wordrank:')\n", "print_analogy_accuracy(text8_wr, word_analogies_file)\n", "print('SimLex-999 similarity')\n", "print_similarity_accuracy(text8_wr, simlex_file)\n", "print('\\nWordSim-353 similarity')\n", "print_similarity_accuracy(text8_wr, wordsim_file)\n", "\n", "print('\\nLoading Wordrank ensemble embeddings')\n", "text8_wr_ensemble = KeyedVectors.load_word2vec_format(MODELS_DIR + 'text8_wr_ensemble.vec')\n", "print('Accuracy for Wordrank:')\n", "print_analogy_accuracy(text8_wr_ensemble, word_analogies_file)\n", "print('SimLex-999 similarity')\n", "print_similarity_accuracy(text8_wr_ensemble, simlex_file)\n", "print('\\nWordSim-353 similarity')\n", "print_similarity_accuracy(text8_wr_ensemble, wordsim_file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With a larger corpus, we observe similar patterns in the accuracies. 
Here too, WordRank dominates the Semantic analogies and FastText the Syntactic ones. Word2Vec again performs better on the SimLex-999 dataset and WordRank on WordSim-353.\n", "However, the ensemble embeddings slightly decrease WordRank's performance here, so it's worth trying both variants in evaluations.\n", "\n", "# Word Frequency and Model Performance\n", "\n", "In this section, we'll see whether the frequency of a word has any effect on an embedding model's performance in the Analogy task. An Accuracy vs. Frequency graph is used to analyze this effect. The mean frequency of the four words involved in each analogy is computed, and the analogy is then bucketed with other analogies having similar mean frequencies. Each bucket holds six percent of the total analogies involved in the particular task. You can go to this [repo](https://github.com/parulsethi/EmbeddingVisData/tree/master/WordAnalogyFreq) if you want to inspect which analogies (with their sorted frequencies) were used for each of the plots." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from __future__ import division\n", "import matplotlib.pyplot as plt\n", "import copy\n", "import multiprocessing\n", "import numpy as np\n", "from smart_open import smart_open\n", "\n", "\n", "def compute_accuracies(model, freq):\n", " # mean_freq will contain analogies together with the mean frequency of 4 words involved\n", " mean_freq = {}\n", " with smart_open(word_analogies_file, 'r') as r:\n", " for i, line in enumerate(r):\n", " if ':' not in line:\n", " analogy = tuple(line.split())\n", " else:\n", " continue\n", " try:\n", " mfreq = sum([int(freq[x.lower()]) for x in analogy])/4\n", " mean_freq['a%d'%i] = [analogy, mfreq]\n", " except KeyError:\n", " continue\n", " \n", " # compute model's accuracy; the .vec files were saved under MODELS_DIR\n", " model = KeyedVectors.load_word2vec_format(os.path.join(MODELS_DIR, model))\n", " acc = model.accuracy(word_analogies_file)\n", " \n", " sem_correct = [acc[i]['correct'] for i in range(5)]\n", " sem_total = 
[acc[i]['correct'] + acc[i]['incorrect'] for i in range(5)]\n", " syn_correct = [acc[i]['correct'] for i in range(5, len(acc)-1)]\n", " syn_total = [acc[i]['correct'] + acc[i]['incorrect'] for i in range(5, len(acc)-1)]\n", " total_correct = sem_correct + syn_correct\n", " total_total = sem_total + syn_total\n", "\n", " sem_x, sem_y = calc_axis(sem_correct, sem_total, mean_freq)\n", " syn_x, syn_y = calc_axis(syn_correct, syn_total, mean_freq)\n", " total_x, total_y = calc_axis(total_correct, total_total, mean_freq)\n", " return ((sem_x, sem_y), (syn_x, syn_y), (total_x, total_y))\n", "\n", "def calc_axis(correct, total, mean_freq):\n", " # make flat lists\n", " correct_analogies = []\n", " for i in range(len(correct)):\n", " for analogy in correct[i]:\n", " correct_analogies.append(analogy) \n", " total_analogies = []\n", " for i in range(len(total)):\n", " for analogy in total[i]:\n", " total_analogies.append(analogy)\n", "\n", " copy_mean_freq = copy.deepcopy(mean_freq)\n", " # delete other case's analogy from total analogies \n", " # iterate over a list snapshot: deleting from a dict while iterating it raises RuntimeError in Python 3\n", " for key, value in list(copy_mean_freq.items()):\n", " value[0] = tuple(x.upper() for x in value[0])\n", " if value[0] not in total_analogies:\n", " del copy_mean_freq[key]\n", "\n", " # append 0 or 1 for incorrect or correct analogy\n", " # dict.iteritems() does not exist in Python 3; items() works in both versions\n", " for key, value in copy_mean_freq.items():\n", " value[0] = tuple(x.upper() for x in value[0])\n", " if value[0] in correct_analogies:\n", " copy_mean_freq[key].append(1)\n", " else:\n", " copy_mean_freq[key].append(0)\n", "\n", " x = []\n", " y = []\n", " bucket_size = int(len(copy_mean_freq) * 0.06)\n", " # sort analogies according to their mean frequencies \n", " copy_mean_freq = sorted(copy_mean_freq.items(), key=lambda x: x[1][1])\n", " # prepare analogies buckets according to given size\n", " for centre_p in range(bucket_size//2, len(copy_mean_freq), bucket_size):\n", " bucket = copy_mean_freq[centre_p-bucket_size//2:centre_p+bucket_size//2]\n", " b_acc = 0\n", " # calculate current bucket 
accuracy with b_acc count\n", " for analogy in bucket:\n", " if analogy[1][2]==1:\n", " b_acc+=1\n", " y.append(b_acc/bucket_size)\n", " x.append(np.log(copy_mean_freq[centre_p][1][1]))\n", " return x, y\n", "\n", "# a sample model using gensim's Word2Vec for getting vocab counts\n", "corpus = Text8Corpus('proc_brown_corp.txt')\n", "model = Word2Vec(min_count=5)\n", "model.build_vocab(corpus)\n", "freq = {}\n", "for word in model.wv.index2word:\n", " freq[word] = model.wv.vocab[word].count\n", "\n", "# plot results\n", "word2vec = compute_accuracies('brown_gs.vec', freq)\n", "wordrank = compute_accuracies('brown_wr_ensemble.vec', freq)\n", "fasttext = compute_accuracies('brown_ft.vec', freq)\n", "\n", "fig = plt.figure(figsize=(7,15))\n", "\n", "for i, subplot, title in zip([0, 1, 2], ['311', '312', '313'], ['Semantic Analogies', 'Syntactic Analogies', 'Total Analogy']):\n", " ax = fig.add_subplot(subplot)\n", " ax.plot(word2vec[i][0], word2vec[i][1], 'r-', label='Word2Vec')\n", " ax.plot(wordrank[i][0], wordrank[i][1], 'g--', label='WordRank')\n", " ax.plot(fasttext[i][0], fasttext[i][1], 'b:', label='FastText')\n", " ax.set_ylabel('Average accuracy')\n", " ax.set_xlabel('Log mean frequency')\n", " ax.set_title(title)\n", " ax.legend(loc='upper right', prop={'size':10})\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These graphs show the results for models trained on the Brown corpus (1 million tokens).\n", "\n", "The main observations that can be drawn here are:\n", "1. In Semantic Analogies, all the models perform poorly on rare words compared to their performance on more frequent words.\n", "2. In Syntactic Analogies, FastText performs much better than Word2Vec and WordRank.\n", "3. Across the frequency range in the Syntactic Analogies plot, FastText's performance drops significantly at highly frequent words, whereas for Word2Vec and WordRank there is no significant difference over the whole frequency range.\n", "4. 
The last plot shows the results for Semantic and Syntactic Analogies combined. It resembles the Syntactic Analogies plot because the total number of Syntactic Analogies (5461) is much greater than the total number of Semantic ones (852). It is therefore bound to follow the Syntactic results, as they carry more weight in the total analogies considered.\n", "\n", "Now, let's see if a larger corpus makes any difference to this pattern of model performance over different frequencies." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# a sample model using gensim's Word2Vec for getting vocab counts\n", "corpus = Text8Corpus('text8')\n", "model = Word2Vec(min_count=5)\n", "model.build_vocab(corpus)\n", "freq = {}\n", "for word in model.wv.index2word:\n", " freq[word] = model.wv.vocab[word].count\n", " \n", "word2vec = compute_accuracies('text8_gs.vec', freq)\n", "wordrank = compute_accuracies('text8_wr.vec', freq)\n", "fasttext = compute_accuracies('text8_ft.vec', freq)\n", "\n", "fig = plt.figure(figsize=(7,15))\n", "\n", "for i, subplot, title in zip([0, 1, 2], ['311', '312', '313'], ['Semantic Analogies', 'Syntactic Analogies', 'Total Analogy']):\n", " ax = fig.add_subplot(subplot)\n", " ax.plot(word2vec[i][0], word2vec[i][1], 'r-', label='Word2Vec')\n", " ax.plot(wordrank[i][0], wordrank[i][1], 'g--', label='WordRank')\n", " ax.plot(fasttext[i][0], fasttext[i][1], 'b:', label='FastText')\n", " ax.set_ylabel('Average accuracy')\n", " ax.set_xlabel('Log mean frequency')\n", " ax.set_title(title)\n", " ax.legend(loc='upper right', prop={'size':10})\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This shows the results for text8 (17 million tokens). The following points can be observed in this case:\n", "\n", "1. For Semantic analogies, all the models perform comparatively poorly on rare words and also at the high end of the frequency range.\n", "2. 
For Syntactic Analogies, FastText performs fairly well on rare words but then falls steeply at highly frequent words.\n", "3. WordRank and Word2Vec perform very similarly, with low accuracy for rare and highly frequent words in Syntactic Analogies.\n", "4. FastText is again better in the total analogies case, for the same reason described previously. Here the total number of Semantic analogies is 7416 and of Syntactic Analogies is 10411.\n", "\n", "These graphs also suggest that WordRank is the best suited method for Semantic Analogies, and FastText for Syntactic Analogies, across all the frequency ranges and over different corpus sizes, though all the embedding methods could become very competitive as the corpus size increases substantially[2]. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Conclusions\n", "\n", "\n", "The experiments here lead to two main conclusions about comparing word embeddings. Firstly, there is no single embedding model we can rely on for all types of NLP applications. For example, in Word Similarity, WordRank performed better than the other two algorithms on the WS-353 test data, whereas Word2Vec performed better on SimLex-999. This is probably due to the different types of similarity these datasets address[3]. And in the Word Analogy task, WordRank performed better for Semantic Analogies and FastText for Syntactic Analogies. This tells us that we need to choose the embedding method carefully according to our final use case.\n", "\n", "Secondly, our query words matter in addition to the general model performance. As we observed in the Accuracy vs. Frequency graphs, models perform differently depending on the frequency of the analogy question words in the training corpus. 
For example, we are likely to get poor results if our query words are all highly frequent.\n", "\n", "*__Note__:* WordRank can sometimes produce NaN values during model evaluation when the embedding vector values diverge too far at some iterations. However, it dumps the embedding vectors every few iterations, so you can simply load the embeddings from a different iteration's text file." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# References\n", "1. [WordRank: Learning Word Embeddings via Robust Ranking](https://arxiv.org/pdf/1506.02761v3.pdf)\n", "2. [Word2Vec and FastText comparison notebook](Word2Vec_FastText_Comparison.ipynb)\n", "3. [Similarity test data](https://www.cl.cam.ac.uk/~fh295/simlex.html)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 2 }