{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tutorial for using Gensim's API for downloading corpuses/models\n", "Let's start by importing the api module." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import logging\n", "import gensim.downloader as api\n", "\n", "logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, lets download the text8 corpus and load it to memory (automatically)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[==================================================] 100.0% 31.6/31.6MB downloaded\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2017-11-10 14:49:45,787 : INFO : text8 downloaded\n" ] } ], "source": [ "corpus = api.load('text8')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As the corpus has been downloaded and loaded, let's create a word2vec model of our corpus." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2017-11-10 14:50:02,458 : INFO : collecting all words and their counts\n", "2017-11-10 14:50:02,461 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2017-11-10 14:50:08,402 : INFO : collected 253854 word types from a corpus of 17005207 raw words and 1701 sentences\n", "2017-11-10 14:50:08,403 : INFO : Loading a fresh vocabulary\n", "2017-11-10 14:50:08,693 : INFO : min_count=5 retains 71290 unique words (28% of original 253854, drops 182564)\n", "2017-11-10 14:50:08,694 : INFO : min_count=5 leaves 16718844 word corpus (98% of original 17005207, drops 286363)\n", "2017-11-10 14:50:08,870 : INFO : deleting the raw counts dictionary of 253854 items\n", "2017-11-10 14:50:08,898 : INFO : sample=0.001 downsamples 38 most-common words\n", "2017-11-10 14:50:08,899 : INFO : downsampling leaves estimated 12506280 word corpus (74.8% of prior 16718844)\n", "2017-11-10 14:50:08,900 : INFO : estimated required memory for 71290 words and 100 dimensions: 92677000 bytes\n", "2017-11-10 14:50:09,115 : INFO : resetting layer weights\n", "2017-11-10 14:50:09,703 : INFO : training model with 3 workers on 71290 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5\n", "2017-11-10 14:50:10,718 : INFO : PROGRESS: at 1.66% examples, 1020519 words/s, in_qsize 5, out_qsize 0\n", "2017-11-10 14:50:11,715 : INFO : PROGRESS: at 3.29% examples, 1017921 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:12,715 : INFO : PROGRESS: at 4.71% examples, 976739 words/s, in_qsize 4, out_qsize 0\n", "2017-11-10 14:50:13,729 : INFO : PROGRESS: at 6.35% examples, 989118 words/s, in_qsize 4, out_qsize 1\n", "2017-11-10 14:50:14,729 : INFO : PROGRESS: at 8.02% examples, 999982 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:15,734 : INFO : PROGRESS: at 9.65% examples, 1003821 words/s, in_qsize 1, out_qsize 1\n", "2017-11-10 14:50:16,740 : INFO : PROGRESS: at 11.41% examples, 1017517 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:17,738 : INFO : PROGRESS: at 13.17% examples, 1027943 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:18,740 : INFO : PROGRESS: at 14.80% examples, 1027654 words/s, in_qsize 4, out_qsize 0\n", "2017-11-10 14:50:19,744 : INFO : PROGRESS: at 16.53% examples, 1030328 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 
14:50:20,747 : INFO : PROGRESS: at 18.21% examples, 1032126 words/s, in_qsize 0, out_qsize 1\n", "2017-11-10 14:50:21,750 : INFO : PROGRESS: at 19.85% examples, 1030455 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:22,755 : INFO : PROGRESS: at 21.54% examples, 1031582 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:23,760 : INFO : PROGRESS: at 23.20% examples, 1031237 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:24,764 : INFO : PROGRESS: at 24.84% examples, 1031195 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:25,769 : INFO : PROGRESS: at 26.56% examples, 1034213 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:26,771 : INFO : PROGRESS: at 28.14% examples, 1031534 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:27,777 : INFO : PROGRESS: at 29.82% examples, 1032589 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:28,780 : INFO : PROGRESS: at 31.42% examples, 1030998 words/s, in_qsize 1, out_qsize 0\n", "2017-11-10 14:50:29,783 : INFO : PROGRESS: at 33.15% examples, 1033447 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:30,783 : INFO : PROGRESS: at 34.85% examples, 1035303 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:31,789 : INFO : PROGRESS: at 36.50% examples, 1033770 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:32,795 : INFO : PROGRESS: at 38.17% examples, 1034073 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:33,798 : INFO : PROGRESS: at 39.81% examples, 1033387 words/s, in_qsize 2, out_qsize 0\n", "2017-11-10 14:50:34,800 : INFO : PROGRESS: at 41.33% examples, 1029575 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:35,801 : INFO : PROGRESS: at 43.03% examples, 1030736 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:36,801 : INFO : PROGRESS: at 44.70% examples, 1031367 words/s, in_qsize 0, out_qsize 1\n", "2017-11-10 14:50:37,802 : INFO : PROGRESS: at 46.41% examples, 1032986 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:38,805 : INFO : PROGRESS: at 48.09% examples, 1033731 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:39,807 : INFO : PROGRESS: at 49.82% examples, 1035440 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:40,817 : INFO : PROGRESS: at 51.49% examples, 1035681 words/s, in_qsize 3, out_qsize 0\n", "2017-11-10 14:50:41,811 : INFO : PROGRESS: at 53.16% examples, 1036024 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:42,817 : INFO : PROGRESS: at 54.86% examples, 1036910 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:43,820 : INFO : PROGRESS: at 56.51% examples, 1035966 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:44,822 : INFO : PROGRESS: at 58.07% examples, 1034360 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:45,822 : INFO : PROGRESS: at 59.54% examples, 1030906 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:46,823 : INFO : PROGRESS: at 61.12% examples, 1029543 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:47,827 : INFO : PROGRESS: at 62.77% examples, 1029390 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:48,833 : INFO : PROGRESS: at 64.50% examples, 1030528 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:49,833 : INFO : PROGRESS: at 66.15% examples, 1030820 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:50,836 : INFO : PROGRESS: at 67.83% examples, 1031459 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:51,850 : INFO : PROGRESS: at 69.47% examples, 1030985 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:52,857 : INFO : PROGRESS: at 71.18% examples, 1031954 
words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:53,862 : INFO : PROGRESS: at 72.83% examples, 1031823 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:54,864 : INFO : PROGRESS: at 74.46% examples, 1031628 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:55,866 : INFO : PROGRESS: at 76.17% examples, 1031962 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:56,870 : INFO : PROGRESS: at 77.77% examples, 1031167 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:57,875 : INFO : PROGRESS: at 79.37% examples, 1030337 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:58,880 : INFO : PROGRESS: at 80.99% examples, 1029831 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:59,881 : INFO : PROGRESS: at 82.67% examples, 1030029 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:51:00,881 : INFO : PROGRESS: at 84.39% examples, 1030874 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:51:01,886 : INFO : PROGRESS: at 86.03% examples, 1030988 words/s, in_qsize 2, out_qsize 0\n", "2017-11-10 14:51:02,892 : INFO : PROGRESS: at 87.72% examples, 1031570 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:51:03,895 : INFO : PROGRESS: at 89.41% examples, 1031964 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:51:04,902 : INFO : PROGRESS: at 91.09% examples, 1032271 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:51:05,910 : INFO : PROGRESS: at 92.53% examples, 1029888 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:51:06,912 : INFO : PROGRESS: at 94.03% examples, 1028192 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:51:07,916 : INFO : PROGRESS: at 95.74% examples, 1028660 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:51:08,919 : INFO : PROGRESS: at 97.47% examples, 1029434 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:51:09,923 : INFO : PROGRESS: at 99.18% examples, 1029952 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:51:10,409 : INFO : worker thread finished; awaiting finish of 2 more threads\n", "2017-11-10 14:51:10,409 : INFO : worker thread finished; awaiting finish of 1 more threads\n", "2017-11-10 14:51:10,415 : INFO : worker thread finished; awaiting finish of 0 more threads\n", "2017-11-10 14:51:10,416 : INFO : training on 85026035 raw words (62530433 effective words) took 60.7s, 1029968 effective words/s\n" ] } ], "source": [ "from gensim.models.word2vec import Word2Vec\n", "\n", "model = Word2Vec(corpus)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have our word2vec model, let's find words that are similar to 'tree'" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2017-11-10 14:51:10,422 : INFO : precomputing L2-norms of word weight vectors\n" ] }, { "data": { "text/plain": [ "[(u'trees', 0.7245415449142456),\n", " (u'leaf', 0.6882676482200623),\n", " (u'bark', 0.645646333694458),\n", " (u'avl', 0.6076173782348633),\n", " (u'cactus', 0.6019535064697266),\n", " (u'flower', 0.6010029315948486),\n", " (u'fruit', 0.5908031463623047),\n", " (u'bird', 0.5886812806129456),\n", " (u'leaves', 0.5771278142929077),\n", " (u'pond', 0.5627825856208801)]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.most_similar('tree')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can use the API to download many corpora and models. 
You can get the list of all the models and corpora that are provided, by using the code below:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"models\": {\n", " \"glove-twitter-25\": {\n", " \"description\": \"Pre-trained vectors, 2B tweets, 27B tokens, 1.2M vocab, uncased. https://nlp.stanford.edu/projects/glove/\", \n", " \"parameters\": \"dimensions = 25\", \n", " \"file_name\": \"glove-twitter-25.gz\", \n", " \"papers\": \"https://nlp.stanford.edu/pubs/glove.pdf\", \n", " \"parts\": 1, \n", " \"preprocessing\": \"Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i -o glove-twitter-25.txt`\", \n", " \"checksum\": \"50db0211d7e7a2dcd362c6b774762793\"\n", " }, \n", " \"glove-twitter-100\": {\n", " \"description\": \"Pre-trained vectors, 2B tweets, 27B tokens, 1.2M vocab, uncased. https://nlp.stanford.edu/projects/glove/\", \n", " \"parameters\": \"dimensions = 100\", \n", " \"file_name\": \"glove-twitter-100.gz\", \n", " \"papers\": \"https://nlp.stanford.edu/pubs/glove.pdf\", \n", " \"parts\": 1, \n", " \"preprocessing\": \"Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i -o glove-twitter-100.txt`\", \n", " \"checksum\": \"b04f7bed38756d64cf55b58ce7e97b15\"\n", " }, \n", " \"glove-wiki-gigaword-100\": {\n", " \"description\": \"Pre-trained vectors ,Wikipedia 2014 + Gigaword 5,6B tokens, 400K vocab, uncased. https://nlp.stanford.edu/projects/glove/\", \n", " \"parameters\": \"dimensions = 100\", \n", " \"file_name\": \"glove-wiki-gigaword-100.gz\", \n", " \"papers\": \"https://nlp.stanford.edu/pubs/glove.pdf\", \n", " \"parts\": 1, \n", " \"preprocessing\": \"Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i -o glove-wiki-gigaword-100.txt`\", \n", " \"checksum\": \"40ec481866001177b8cd4cb0df92924f\"\n", " }, \n", " \"glove-twitter-200\": {\n", " \"description\": \"Pre-trained vectors, 2B tweets, 27B tokens, 1.2M vocab, uncased. https://nlp.stanford.edu/projects/glove/\", \n", " \"parameters\": \"dimensions = 200\", \n", " \"file_name\": \"glove-twitter-200.gz\", \n", " \"papers\": \"https://nlp.stanford.edu/pubs/glove.pdf\", \n", " \"parts\": 1, \n", " \"preprocessing\": \"Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i -o glove-twitter-200.txt`\", \n", " \"checksum\": \"e52e8392d1860b95d5308a525817d8f9\"\n", " }, \n", " \"glove-wiki-gigaword-50\": {\n", " \"description\": \"Pre-trained vectors ,Wikipedia 2014 + Gigaword 5,6B tokens, 400K vocab, uncased. https://nlp.stanford.edu/projects/glove/\", \n", " \"parameters\": \"dimension = 50\", \n", " \"file_name\": \"glove-wiki-gigaword-50.gz\", \n", " \"papers\": \"https://nlp.stanford.edu/pubs/glove.pdf\", \n", " \"parts\": 1, \n", " \"preprocessing\": \"Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i -o glove-wiki-gigaword-50.txt`\", \n", " \"checksum\": \"c289bc5d7f2f02c6dc9f2f9b67641813\"\n", " }, \n", " \"glove-twitter-50\": {\n", " \"description\": \"Pre-trained vectors, 2B tweets, 27B tokens, 1.2M vocab, uncased. 
https://nlp.stanford.edu/projects/glove/\", \n", " \"parameters\": \"dimensions = 50\", \n", " \"file_name\": \"glove-twitter-50.gz\", \n", " \"papers\": \"https://nlp.stanford.edu/pubs/glove.pdf\", \n", " \"parts\": 1, \n", " \"preprocessing\": \"Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i -o glove-twitter-50.txt`\", \n", " \"checksum\": \"c168f18641f8c8a00fe30984c4799b2b\"\n", " }, \n", " \"__testing_word2vec-matrix-synopsis\": {\n", " \"description\": \"Word vecrors of the movie matrix\", \n", " \"parameters\": \"dimentions = 50\", \n", " \"file_name\": \"__testing_word2vec-matrix-synopsis.gz\", \n", " \"papers\": \"\", \n", " \"parts\": 1, \n", " \"preprocessing\": \"Converted to w2v using a preprocessed corpus. Converted to w2v format with `python3.5 -m gensim.models.word2vec -train -iter 50 -output `\", \n", " \"checksum\": \"534dcb8b56a360977a269b7bfc62d124\"\n", " }, \n", " \"glove-wiki-gigaword-200\": {\n", " \"description\": \"Pre-trained vectors ,Wikipedia 2014 + Gigaword 5,6B tokens, 400K vocab, uncased. https://nlp.stanford.edu/projects/glove/\", \n", " \"parameters\": \"dimentions = 200\", \n", " \"file_name\": \"glove-wiki-gigaword-200.gz\", \n", " \"papers\": \"https://nlp.stanford.edu/pubs/glove.pdf\", \n", " \"parts\": 1, \n", " \"preprocessing\": \"Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i -o glove-wiki-gigaword-200.txt`\", \n", " \"checksum\": \"59652db361b7a87ee73834a6c391dfc1\"\n", " }, \n", " \"word2vec-google-news-300\": {\n", " \"description\": \"Pre-trained vectors trained on part of Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases. The phrases were obtained using a simple data-driven approach described in 'Distributed Representations of Words and Phrases and their Compositionality', https://code.google.com/archive/p/word2vec/\", \n", " \"parameters\": \"dimension = 300\", \n", " \"file_name\": \"word2vec-google-news-300.gz\", \n", " \"papers\": \"https://arxiv.org/abs/1301.3781, https://arxiv.org/abs/1310.4546, https://www.microsoft.com/en-us/research/publication/linguistic-regularities-in-continuous-space-word-representations/?from=http%3A%2F%2Fresearch.microsoft.com%2Fpubs%2F189726%2Frvecs.pdf\", \n", " \"parts\": 1, \n", " \"checksum\": \"a5e5354d40acb95f9ec66d5977d140ef\"\n", " }, \n", " \"glove-wiki-gigaword-300\": {\n", " \"description\": \"Pre-trained vectors, Wikipedia 2014 + Gigaword 5, 6B tokens, 400K vocab, uncased. 
https://nlp.stanford.edu/projects/glove/\", \n", " \"parameters\": \"dimensions = 300\", \n", " \"file_name\": \"glove-wiki-gigaword-300.gz\", \n", " \"papers\": \"https://nlp.stanford.edu/pubs/glove.pdf\", \n", " \"parts\": 1, \n", " \"preprocessing\": \"Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i -o glove-wiki-gigaword-300.txt`\", \n", " \"checksum\": \"29e9329ac2241937d55b852e8284e89b\"\n", " }\n", " }, \n", " \"corpora\": {\n", " \"__testing_matrix-synopsis\": {\n", " \"source\": \"http://www.imdb.com/title/tt0133093/plotsummary?ref_=ttpl_pl_syn#synopsis\", \n", " \"checksum\": \"1767ac93a089b43899d54944b07d9dc5\", \n", " \"parts\": 1, \n", " \"description\": \"Synopsis of the movie matrix\", \n", " \"file_name\": \"__testing_matrix-synopsis.gz\"\n", " }, \n", " \"fake-news\": {\n", " \"source\": \"Kaggle\", \n", " \"checksum\": \"5e64e942df13219465927f92dcefd5fe\", \n", " \"parts\": 1, \n", " \"description\": \"It contains text and metadata scraped from 244 websites tagged as 'bullshit' here by the BS Detector Chrome Extension by Daniel Sieradski.\", \n", " \"file_name\": \"fake-news.gz\"\n", " }, \n", " \"__testing_multipart-matrix-synopsis\": {\n", " \"description\": \"Synopsis of the movie matrix\", \n", " \"source\": \"http://www.imdb.com/title/tt0133093/plotsummary?ref_=ttpl_pl_syn#synopsis\", \n", " \"file_name\": \"__testing_multipart-matrix-synopsis.gz\", \n", " \"checksum-0\": \"c8b0c7d8cf562b1b632c262a173ac338\", \n", " \"checksum-1\": \"5ff7fc6818e9a5d9bc1cf12c35ed8b96\", \n", " \"checksum-2\": \"966db9d274d125beaac7987202076cba\", \n", " \"parts\": 3\n", " }, \n", " \"text8\": {\n", " \"source\": \"https://mattmahoney.net/dc/text8.zip\", \n", " \"checksum\": \"68799af40b6bda07dfa47a32612e5364\", \n", " \"parts\": 1, \n", " \"description\": \"Cleaned small sample from wikipedia\", \n", " \"file_name\": \"text8.gz\"\n", " }, \n", " \"wiki-en\": {\n", " \"description\": \"Extracted Wikipedia dump from October 2017. 
Produced by `python -m gensim.scripts.segment_wiki -f enwiki-20171001-pages-articles.xml.bz2 -o wiki-en.gz`\", \n", " \"source\": \"https://dumps.wikimedia.org/enwiki/20171001/\", \n", " \"file_name\": \"wiki-en.gz\", \n", " \"parts\": 4, \n", " \"checksum-0\": \"a7d7d7fd41ea7e2d7fa32ec1bb640d71\", \n", " \"checksum-1\": \"b2683e3356ffbca3b6c2dca6e9801f9f\", \n", " \"checksum-2\": \"c5cde2a9ae77b3c4ebce804f6df542c2\", \n", " \"checksum-3\": \"00b71144ed5e3aeeb885de84f7452b81\"\n", " }, \n", " \"20-newsgroups\": {\n", " \"source\": \"http://qwone.com/~jason/20Newsgroups/\", \n", " \"checksum\": \"c92fd4f6640a86d5ba89eaad818a9891\", \n", " \"parts\": 1, \n", " \"description\": \"The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups\", \n", " \"file_name\": \"20-newsgroups.gz\"\n", " }\n", " }\n", "}\n" ] } ], "source": [ "import json\n", "data_list = api.info()\n", "print(json.dumps(data_list, indent=4))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to get detailed information about a particular model or corpus, use:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"source\": \"Kaggle\", \n", " \"checksum\": \"5e64e942df13219465927f92dcefd5fe\", \n", " \"parts\": 1, \n", " \"description\": \"It contains text and metadata scraped from 244 websites tagged as 'bullshit' here by the BS Detector Chrome Extension by Daniel Sieradski.\", \n", " \"file_name\": \"fake-news.gz\"\n", "}\n" ] } ], "source": [ "fake_news_info = api.info('fake-news')\n", "print(json.dumps(fake_news_info, indent=4))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes you do not want to load a model into memory; you just want the path to its file on disk. For that, pass `return_path=True` to `api.load`:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/home/ivan/gensim-data/glove-wiki-gigaword-50/glove-wiki-gigaword-50.gz\n" ] } ], "source": [ "print(api.load('glove-wiki-gigaword-50', return_path=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to load the model into memory, then:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2017-11-10 14:51:59,199 : INFO : loading projection weights from /home/ivan/gensim-data/glove-wiki-gigaword-50/glove-wiki-gigaword-50.gz\n", "2017-11-10 14:52:18,380 : INFO : loaded (400000, 50) matrix from /home/ivan/gensim-data/glove-wiki-gigaword-50/glove-wiki-gigaword-50.gz\n", "2017-11-10 14:52:18,405 : INFO : precomputing L2-norms of word weight vectors\n" ] }, { "data": { "text/plain": [ "[(u'plastic', 0.7942505478858948),\n", " (u'metal', 0.770871639251709),\n", " (u'walls', 0.7700636386871338),\n", " (u'marble', 0.7638524174690247),\n", " (u'wood', 0.7624281048774719),\n", " (u'ceramic', 0.7602593302726746),\n", " (u'pieces', 0.7589111924171448),\n", " (u'stained', 0.7528817057609558),\n", " (u'tile', 0.748193621635437),\n", " (u'furniture', 0.746385931968689)]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model = api.load(\"glove-wiki-gigaword-50\")\n", "model.most_similar(\"glass\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For corpora, the data is never loaded into memory all at once. Each corpus is wrapped in a special `Dataset` class that provides an `__iter__` method, so you can stream over its documents one at a time, as shown below." ] }
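, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, here is a minimal sketch of how you could stream the first document from the `text8` corpus we loaded above (each document is a list of string tokens) and peek at its first few tokens:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Stream documents one at a time; the full corpus is never read into memory up front.\n", "first_doc = next(iter(corpus))\n", "print(len(first_doc))  # number of tokens in the first document\n", "print(first_doc[:10])  # its first ten tokens" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.14" } }, "nbformat": 4, "nbformat_minor": 2 }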