{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# WordRank wrapper tutorial on Lee Corpus\n",
    "\n",
    "WordRank is a new word embedding algorithm which captures the semantic similarities in a text data well. See this [notebook](Wordrank_comparisons.ipynb) for it's comparisons to other popular embedding models. This tutorial will serve as a guide to use the WordRank wrapper in gensim. You need to install [WordRank](https://bitbucket.org/shihaoji/wordrank) before proceeding with this tutorial.\n",
    "\n",
    "\n",
    "# Train model\n",
    "\n",
    "We'll use [Lee corpus](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_data/lee.cor) for training which is already available in gensim. Now for Wordrank, two parameters `dump_period` and `iter` needs to be in sync as it dumps the embedding file with the start of next iteration. For example, if you want results after 10 iterations, you need to use `iter=11` and `dump_period` can be anything that gives mod 0 with resulting iteration, in this case 2 or 5.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "ename": "FileNotFoundError",
     "evalue": "[Errno 2] No such file or directory: 'wordrank/glove/vocab_count': 'wordrank/glove/vocab_count'",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mFileNotFoundError\u001b[0m                         Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-1-a9079683a958>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mdata\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m'../../gensim/test/test_data/lee.cor'\u001b[0m \u001b[0;31m# sample corpus\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      6\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 7\u001b[0;31m \u001b[0mmodel\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mWordrank\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtrain\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mwr_path\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdata\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mout_dir\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0miter\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m11\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdump_period\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m5\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;32m~/git/gensim/gensim/models/wrappers/wordrank.py\u001b[0m in \u001b[0;36mtrain\u001b[0;34m(cls, wr_path, corpus_file, out_name, size, window, symmetric, min_count, max_vocab_size, sgd_num, lrate, period, iter, epsilon, dump_period, reg, alpha, beta, loss, memory, np, cleanup_files, sorted_vocab, ensemble)\u001b[0m\n\u001b[1;32m    177\u001b[0m             \u001b[0;32mwith\u001b[0m \u001b[0msmart_open\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0minput_fname\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'rb'\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    178\u001b[0m                 \u001b[0;32mwith\u001b[0m \u001b[0msmart_open\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0moutput_fname\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'wb'\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mw\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 179\u001b[0;31m                     \u001b[0mutils\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcheck_output\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mw\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0margs\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mcommand\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstdin\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    180\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    181\u001b[0m         \u001b[0mlogger\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0minfo\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"Deleting frequencies from vocab file\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/git/gensim/gensim/utils.py\u001b[0m in \u001b[0;36mcheck_output\u001b[0;34m(stdout, *popenargs, **kwargs)\u001b[0m\n\u001b[1;32m   1907\u001b[0m     \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1908\u001b[0m         \u001b[0mlogger\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdebug\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"COMMAND: %s %s\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpopenargs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1909\u001b[0;31m         \u001b[0mprocess\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0msubprocess\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mPopen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mstdout\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mstdout\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0mpopenargs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   1910\u001b[0m         \u001b[0moutput\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0munused_err\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mprocess\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcommunicate\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1911\u001b[0m         \u001b[0mretcode\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mprocess\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpoll\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m/usr/lib/python3.7/subprocess.py\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, encoding, errors, text)\u001b[0m\n\u001b[1;32m    773\u001b[0m                                 \u001b[0mc2pread\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mc2pwrite\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    774\u001b[0m                                 \u001b[0merrread\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merrwrite\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 775\u001b[0;31m                                 restore_signals, start_new_session)\n\u001b[0m\u001b[1;32m    776\u001b[0m         \u001b[0;32mexcept\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    777\u001b[0m             \u001b[0;31m# Cleanup if the child failed starting.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m/usr/lib/python3.7/subprocess.py\u001b[0m in \u001b[0;36m_execute_child\u001b[0;34m(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, restore_signals, start_new_session)\u001b[0m\n\u001b[1;32m   1520\u001b[0m                         \u001b[0;32mif\u001b[0m \u001b[0merrno_num\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0merrno\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mENOENT\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1521\u001b[0m                             \u001b[0merr_msg\u001b[0m \u001b[0;34m+=\u001b[0m \u001b[0;34m': '\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0mrepr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0merr_filename\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1522\u001b[0;31m                     \u001b[0;32mraise\u001b[0m \u001b[0mchild_exception_type\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0merrno_num\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merr_msg\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merr_filename\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   1523\u001b[0m                 \u001b[0;32mraise\u001b[0m \u001b[0mchild_exception_type\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0merr_msg\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1524\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: 'wordrank/glove/vocab_count': 'wordrank/glove/vocab_count'"
     ]
    }
   ],
   "source": [
    "from gensim.models.wrappers import Wordrank\n",
    "\n",
    "wr_path = 'wordrank' # path to Wordrank directory\n",
    "out_dir = 'model' # name of output directory to save data to\n",
    "data = '../../gensim/test/test_data/lee.cor' # sample corpus\n",
    "\n",
    "model = Wordrank.train(wr_path, data, out_dir, iter=11, dump_period=5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, you can use any of the Keyed Vector function in gensim, on this model for further tasks. For example,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.most_similar('President')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.similarity('President', 'military')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "As Wordrank provides two sets of embeddings, the word and context embedding, you can obtain their addition by setting ensemble parameter to 1 in the train method.\n",
    "\n",
    "# Save and Load models\n",
    "In case, you have trained the model yourself using demo scripts in Wordrank, you can then simply load the embedding files in gensim. \n",
    "\n",
    "Also, Wordrank doesn't return the embeddings sorted according to the word frequency in corpus, so you can use the sorted_vocab parameter in the load method. But for that, you need to provide the vocabulary file generated in the 'matrix.toy' directory(if you used default names in demo) where all the metadata is stored."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "wr_word_embedding = 'wordrank.words'\n",
    "vocab_file = 'vocab.txt'\n",
    "\n",
    "model = Wordrank.load_wordrank_model(wr_word_embedding, vocab_file, sorted_vocab=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "If you want to load the ensemble embedding, you similarly need to provide the context embedding file and set ensemble to 1 in `load_wordrank_model` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "wr_context_file = 'wordrank.contexts'\n",
    "model = Wordrank.load_wordrank_model(wr_word_embedding, vocab_file, wr_context_file, sorted_vocab=1, ensemble=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can save these sorted embeddings using the standard gensim methods."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from tempfile import mkstemp\n",
    "\n",
    "fs, temp_path = mkstemp(\"gensim_temp\")  # creates a temp file\n",
    "model.save(temp_path)  # save the model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# Evaluating models\n",
    "Now that the embeddings are loaded in Word2Vec format and sorted according to the word frequencies in corpus, you can use the evaluations provided by gensim on this model.\n",
    "\n",
    "For example, it can be evaluated on following Word Analogies and Word Similarity benchmarks. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "word_analogies_file = 'datasets/questions-words.txt'\n",
    "model.accuracy(word_analogies_file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "word_similarity_file = 'datasets/ws-353.txt'\n",
    "model.evaluate_word_pairs(word_similarity_file)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These methods take an [optional parameter](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.accuracy) restrict_vocab which limits which test examples are to be considered.\n",
    "\n",
    "The results here don't look good because the training corpus is very small. To get meaningful results one needs to train on 500k+ words.\n",
    "\n",
    "# Conclusion\n",
    "We learned to use Wordrank wrapper on a sample corpus and also how to directly load the Wordrank embedding files in gensim. Once loaded, you can use the standard gensim methods on this embedding."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}