{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tutorial for using Gensim's API for downloading corpuses/models\n", "Let's start by importing the api module." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import logging\n", "import gensim.downloader as api\n", "\n", "logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, lets download the text8 corpus and load it to memory (automatically)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[==================================================] 100.0% 31.6/31.6MB downloaded\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2017-11-10 14:49:45,787 : INFO : text8 downloaded\n" ] } ], "source": [ "corpus = api.load('text8')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As the corpus has been downloaded and loaded, let's create a word2vec model of our corpus." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2017-11-10 14:50:02,458 : INFO : collecting all words and their counts\n", "2017-11-10 14:50:02,461 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2017-11-10 14:50:08,402 : INFO : collected 253854 word types from a corpus of 17005207 raw words and 1701 sentences\n", "2017-11-10 14:50:08,403 : INFO : Loading a fresh vocabulary\n", "2017-11-10 14:50:08,693 : INFO : min_count=5 retains 71290 unique words (28% of original 253854, drops 182564)\n", "2017-11-10 14:50:08,694 : INFO : min_count=5 leaves 16718844 word corpus (98% of original 17005207, drops 286363)\n", "2017-11-10 14:50:08,870 : INFO : deleting the raw counts dictionary of 253854 items\n", "2017-11-10 14:50:08,898 : INFO : sample=0.001 downsamples 38 most-common words\n", "2017-11-10 14:50:08,899 : INFO : downsampling leaves estimated 12506280 word corpus (74.8% of prior 16718844)\n", "2017-11-10 14:50:08,900 : INFO : estimated required memory for 71290 words and 100 dimensions: 92677000 bytes\n", "2017-11-10 14:50:09,115 : INFO : resetting layer weights\n", "2017-11-10 14:50:09,703 : INFO : training model with 3 workers on 71290 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5\n", "2017-11-10 14:50:10,718 : INFO : PROGRESS: at 1.66% examples, 1020519 words/s, in_qsize 5, out_qsize 0\n", "2017-11-10 14:50:11,715 : INFO : PROGRESS: at 3.29% examples, 1017921 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:12,715 : INFO : PROGRESS: at 4.71% examples, 976739 words/s, in_qsize 4, out_qsize 0\n", "2017-11-10 14:50:13,729 : INFO : PROGRESS: at 6.35% examples, 989118 words/s, in_qsize 4, out_qsize 1\n", "2017-11-10 14:50:14,729 : INFO : PROGRESS: at 8.02% examples, 999982 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:15,734 : INFO : PROGRESS: at 9.65% examples, 1003821 words/s, in_qsize 1, out_qsize 1\n", "2017-11-10 14:50:16,740 : INFO : PROGRESS: at 11.41% examples, 1017517 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:17,738 : INFO : PROGRESS: at 13.17% examples, 1027943 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:18,740 : INFO : PROGRESS: at 14.80% examples, 1027654 words/s, in_qsize 4, out_qsize 0\n", "2017-11-10 14:50:19,744 : INFO : PROGRESS: at 16.53% examples, 1030328 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 
14:50:20,747 : INFO : PROGRESS: at 18.21% examples, 1032126 words/s, in_qsize 0, out_qsize 1\n", "2017-11-10 14:50:21,750 : INFO : PROGRESS: at 19.85% examples, 1030455 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:22,755 : INFO : PROGRESS: at 21.54% examples, 1031582 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:23,760 : INFO : PROGRESS: at 23.20% examples, 1031237 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:24,764 : INFO : PROGRESS: at 24.84% examples, 1031195 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:25,769 : INFO : PROGRESS: at 26.56% examples, 1034213 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:26,771 : INFO : PROGRESS: at 28.14% examples, 1031534 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:27,777 : INFO : PROGRESS: at 29.82% examples, 1032589 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:28,780 : INFO : PROGRESS: at 31.42% examples, 1030998 words/s, in_qsize 1, out_qsize 0\n", "2017-11-10 14:50:29,783 : INFO : PROGRESS: at 33.15% examples, 1033447 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:30,783 : INFO : PROGRESS: at 34.85% examples, 1035303 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:31,789 : INFO : PROGRESS: at 36.50% examples, 1033770 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:32,795 : INFO : PROGRESS: at 38.17% examples, 1034073 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:33,798 : INFO : PROGRESS: at 39.81% examples, 1033387 words/s, in_qsize 2, out_qsize 0\n", "2017-11-10 14:50:34,800 : INFO : PROGRESS: at 41.33% examples, 1029575 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:35,801 : INFO : PROGRESS: at 43.03% examples, 1030736 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:36,801 : INFO : PROGRESS: at 44.70% examples, 1031367 words/s, in_qsize 0, out_qsize 1\n", "2017-11-10 14:50:37,802 : INFO : PROGRESS: at 46.41% examples, 1032986 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:38,805 : INFO : PROGRESS: at 48.09% examples, 1033731 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:39,807 : INFO : PROGRESS: at 49.82% examples, 1035440 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:40,817 : INFO : PROGRESS: at 51.49% examples, 1035681 words/s, in_qsize 3, out_qsize 0\n", "2017-11-10 14:50:41,811 : INFO : PROGRESS: at 53.16% examples, 1036024 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:42,817 : INFO : PROGRESS: at 54.86% examples, 1036910 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:43,820 : INFO : PROGRESS: at 56.51% examples, 1035966 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:44,822 : INFO : PROGRESS: at 58.07% examples, 1034360 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:45,822 : INFO : PROGRESS: at 59.54% examples, 1030906 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:46,823 : INFO : PROGRESS: at 61.12% examples, 1029543 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:47,827 : INFO : PROGRESS: at 62.77% examples, 1029390 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:48,833 : INFO : PROGRESS: at 64.50% examples, 1030528 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:49,833 : INFO : PROGRESS: at 66.15% examples, 1030820 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:50,836 : INFO : PROGRESS: at 67.83% examples, 1031459 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:51,850 : INFO : PROGRESS: at 69.47% examples, 1030985 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:52,857 : INFO : PROGRESS: at 71.18% examples, 1031954 
words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:53,862 : INFO : PROGRESS: at 72.83% examples, 1031823 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:54,864 : INFO : PROGRESS: at 74.46% examples, 1031628 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:55,866 : INFO : PROGRESS: at 76.17% examples, 1031962 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:56,870 : INFO : PROGRESS: at 77.77% examples, 1031167 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:57,875 : INFO : PROGRESS: at 79.37% examples, 1030337 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:58,880 : INFO : PROGRESS: at 80.99% examples, 1029831 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:50:59,881 : INFO : PROGRESS: at 82.67% examples, 1030029 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:51:00,881 : INFO : PROGRESS: at 84.39% examples, 1030874 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:51:01,886 : INFO : PROGRESS: at 86.03% examples, 1030988 words/s, in_qsize 2, out_qsize 0\n", "2017-11-10 14:51:02,892 : INFO : PROGRESS: at 87.72% examples, 1031570 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:51:03,895 : INFO : PROGRESS: at 89.41% examples, 1031964 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:51:04,902 : INFO : PROGRESS: at 91.09% examples, 1032271 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:51:05,910 : INFO : PROGRESS: at 92.53% examples, 1029888 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:51:06,912 : INFO : PROGRESS: at 94.03% examples, 1028192 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:51:07,916 : INFO : PROGRESS: at 95.74% examples, 1028660 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:51:08,919 : INFO : PROGRESS: at 97.47% examples, 1029434 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:51:09,923 : INFO : PROGRESS: at 99.18% examples, 1029952 words/s, in_qsize 0, out_qsize 0\n", "2017-11-10 14:51:10,409 : INFO : worker thread finished; awaiting finish of 2 more threads\n", "2017-11-10 14:51:10,409 : INFO : worker thread finished; awaiting finish of 1 more threads\n", "2017-11-10 14:51:10,415 : INFO : worker thread finished; awaiting finish of 0 more threads\n", "2017-11-10 14:51:10,416 : INFO : training on 85026035 raw words (62530433 effective words) took 60.7s, 1029968 effective words/s\n" ] } ], "source": [ "from gensim.models.word2vec import Word2Vec\n", "\n", "model = Word2Vec(corpus)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have our word2vec model, let's find words that are similar to 'tree'" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2017-11-10 14:51:10,422 : INFO : precomputing L2-norms of word weight vectors\n" ] }, { "data": { "text/plain": [ "[(u'trees', 0.7245415449142456),\n", " (u'leaf', 0.6882676482200623),\n", " (u'bark', 0.645646333694458),\n", " (u'avl', 0.6076173782348633),\n", " (u'cactus', 0.6019535064697266),\n", " (u'flower', 0.6010029315948486),\n", " (u'fruit', 0.5908031463623047),\n", " (u'bird', 0.5886812806129456),\n", " (u'leaves', 0.5771278142929077),\n", " (u'pond', 0.5627825856208801)]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.most_similar('tree')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can use the API to download many corpora and models. 
You can get the list of all the models and corpora that are provided, by using the code below:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"models\": {\n", " \"glove-twitter-25\": {\n", " \"description\": \"Pre-trained vectors, 2B tweets, 27B tokens, 1.2M vocab, uncased. https://nlp.stanford.edu/projects/glove/\", \n", " \"parameters\": \"dimensions = 25\", \n", " \"file_name\": \"glove-twitter-25.gz\", \n", " \"papers\": \"https://nlp.stanford.edu/pubs/glove.pdf\", \n", " \"parts\": 1, \n", " \"preprocessing\": \"Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i -o glove-twitter-25.txt`\", \n", " \"checksum\": \"50db0211d7e7a2dcd362c6b774762793\"\n", " }, \n", " \"glove-twitter-100\": {\n", " \"description\": \"Pre-trained vectors, 2B tweets, 27B tokens, 1.2M vocab, uncased. https://nlp.stanford.edu/projects/glove/\", \n", " \"parameters\": \"dimensions = 100\", \n", " \"file_name\": \"glove-twitter-100.gz\", \n", " \"papers\": \"https://nlp.stanford.edu/pubs/glove.pdf\", \n", " \"parts\": 1, \n", " \"preprocessing\": \"Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i -o glove-twitter-100.txt`\", \n", " \"checksum\": \"b04f7bed38756d64cf55b58ce7e97b15\"\n", " }, \n", " \"glove-wiki-gigaword-100\": {\n", " \"description\": \"Pre-trained vectors ,Wikipedia 2014 + Gigaword 5,6B tokens, 400K vocab, uncased. https://nlp.stanford.edu/projects/glove/\", \n", " \"parameters\": \"dimensions = 100\", \n", " \"file_name\": \"glove-wiki-gigaword-100.gz\", \n", " \"papers\": \"https://nlp.stanford.edu/pubs/glove.pdf\", \n", " \"parts\": 1, \n", " \"preprocessing\": \"Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i -o glove-wiki-gigaword-100.txt`\", \n", " \"checksum\": \"40ec481866001177b8cd4cb0df92924f\"\n", " }, \n", " \"glove-twitter-200\": {\n", " \"description\": \"Pre-trained vectors, 2B tweets, 27B tokens, 1.2M vocab, uncased. https://nlp.stanford.edu/projects/glove/\", \n", " \"parameters\": \"dimensions = 200\", \n", " \"file_name\": \"glove-twitter-200.gz\", \n", " \"papers\": \"https://nlp.stanford.edu/pubs/glove.pdf\", \n", " \"parts\": 1, \n", " \"preprocessing\": \"Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i -o glove-twitter-200.txt`\", \n", " \"checksum\": \"e52e8392d1860b95d5308a525817d8f9\"\n", " }, \n", " \"glove-wiki-gigaword-50\": {\n", " \"description\": \"Pre-trained vectors ,Wikipedia 2014 + Gigaword 5,6B tokens, 400K vocab, uncased. https://nlp.stanford.edu/projects/glove/\", \n", " \"parameters\": \"dimension = 50\", \n", " \"file_name\": \"glove-wiki-gigaword-50.gz\", \n", " \"papers\": \"https://nlp.stanford.edu/pubs/glove.pdf\", \n", " \"parts\": 1, \n", " \"preprocessing\": \"Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i -o glove-wiki-gigaword-50.txt`\", \n", " \"checksum\": \"c289bc5d7f2f02c6dc9f2f9b67641813\"\n", " }, \n", " \"glove-twitter-50\": {\n", " \"description\": \"Pre-trained vectors, 2B tweets, 27B tokens, 1.2M vocab, uncased. 
https://nlp.stanford.edu/projects/glove/\", \n", " \"parameters\": \"dimensions = 50\", \n", " \"file_name\": \"glove-twitter-50.gz\", \n", " \"papers\": \"https://nlp.stanford.edu/pubs/glove.pdf\", \n", " \"parts\": 1, \n", " \"preprocessing\": \"Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i -o glove-twitter-50.txt`\", \n", " \"checksum\": \"c168f18641f8c8a00fe30984c4799b2b\"\n", " }, \n", " \"__testing_word2vec-matrix-synopsis\": {\n", " \"description\": \"Word vecrors of the movie matrix\", \n", " \"parameters\": \"dimentions = 50\", \n", " \"file_name\": \"__testing_word2vec-matrix-synopsis.gz\", \n", " \"papers\": \"\", \n", " \"parts\": 1, \n", " \"preprocessing\": \"Converted to w2v using a preprocessed corpus. Converted to w2v format with `python3.5 -m gensim.models.word2vec -train -iter 50 -output `\", \n", " \"checksum\": \"534dcb8b56a360977a269b7bfc62d124\"\n", " }, \n", " \"glove-wiki-gigaword-200\": {\n", " \"description\": \"Pre-trained vectors ,Wikipedia 2014 + Gigaword 5,6B tokens, 400K vocab, uncased. https://nlp.stanford.edu/projects/glove/\", \n", " \"parameters\": \"dimentions = 200\", \n", " \"file_name\": \"glove-wiki-gigaword-200.gz\", \n", " \"papers\": \"https://nlp.stanford.edu/pubs/glove.pdf\", \n", " \"parts\": 1, \n", " \"preprocessing\": \"Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i -o glove-wiki-gigaword-200.txt`\", \n", " \"checksum\": \"59652db361b7a87ee73834a6c391dfc1\"\n", " }, \n", " \"word2vec-google-news-300\": {\n", " \"description\": \"Pre-trained vectors trained on part of Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases. The phrases were obtained using a simple data-driven approach described in 'Distributed Representations of Words and Phrases and their Compositionality', https://code.google.com/archive/p/word2vec/\", \n", " \"parameters\": \"dimension = 300\", \n", " \"file_name\": \"word2vec-google-news-300.gz\", \n", " \"papers\": \"https://arxiv.org/abs/1301.3781, https://arxiv.org/abs/1310.4546, https://www.microsoft.com/en-us/research/publication/linguistic-regularities-in-continuous-space-word-representations/?from=http%3A%2F%2Fresearch.microsoft.com%2Fpubs%2F189726%2Frvecs.pdf\", \n", " \"parts\": 1, \n", " \"checksum\": \"a5e5354d40acb95f9ec66d5977d140ef\"\n", " }, \n", " \"glove-wiki-gigaword-300\": {\n", " \"description\": \"Pre-trained vectors, Wikipedia 2014 + Gigaword 5, 6B tokens, 400K vocab, uncased. 
https://nlp.stanford.edu/projects/glove/\", \n", " \"parameters\": \"dimensions = 300\", \n", " \"file_name\": \"glove-wiki-gigaword-300.gz\", \n", " \"papers\": \"https://nlp.stanford.edu/pubs/glove.pdf\", \n", " \"parts\": 1, \n", " \"preprocessing\": \"Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i -o glove-wiki-gigaword-300.txt`\", \n", " \"checksum\": \"29e9329ac2241937d55b852e8284e89b\"\n", " }\n", " }, \n", " \"corpora\": {\n", " \"__testing_matrix-synopsis\": {\n", " \"source\": \"http://www.imdb.com/title/tt0133093/plotsummary?ref_=ttpl_pl_syn#synopsis\", \n", " \"checksum\": \"1767ac93a089b43899d54944b07d9dc5\", \n", " \"parts\": 1, \n", " \"description\": \"Synopsis of the movie matrix\", \n", " \"file_name\": \"__testing_matrix-synopsis.gz\"\n", " }, \n", " \"fake-news\": {\n", " \"source\": \"Kaggle\", \n", " \"checksum\": \"5e64e942df13219465927f92dcefd5fe\", \n", " \"parts\": 1, \n", " \"description\": \"It contains text and metadata scraped from 244 websites tagged as 'bullshit' here by the BS Detector Chrome Extension by Daniel Sieradski.\", \n", " \"file_name\": \"fake-news.gz\"\n", " }, \n", " \"__testing_multipart-matrix-synopsis\": {\n", " \"description\": \"Synopsis of the movie matrix\", \n", " \"source\": \"http://www.imdb.com/title/tt0133093/plotsummary?ref_=ttpl_pl_syn#synopsis\", \n", " \"file_name\": \"__testing_multipart-matrix-synopsis.gz\", \n", " \"checksum-0\": \"c8b0c7d8cf562b1b632c262a173ac338\", \n", " \"checksum-1\": \"5ff7fc6818e9a5d9bc1cf12c35ed8b96\", \n", " \"checksum-2\": \"966db9d274d125beaac7987202076cba\", \n", " \"parts\": 3\n", " }, \n", " \"text8\": {\n", " \"source\": \"https://mattmahoney.net/dc/text8.zip\", \n", " \"checksum\": \"68799af40b6bda07dfa47a32612e5364\", \n", " \"parts\": 1, \n", " \"description\": \"Cleaned small sample from wikipedia\", \n", " \"file_name\": \"text8.gz\"\n", " }, \n", " \"wiki-en\": {\n", " \"description\": \"Extracted Wikipedia dump from October 2017. 
Produced by `python -m gensim.scripts.segment_wiki -f enwiki-20171001-pages-articles.xml.bz2 -o wiki-en.gz`\", \n", " \"source\": \"https://dumps.wikimedia.org/enwiki/20171001/\", \n", " \"file_name\": \"wiki-en.gz\", \n", " \"parts\": 4, \n", " \"checksum-0\": \"a7d7d7fd41ea7e2d7fa32ec1bb640d71\", \n", " \"checksum-1\": \"b2683e3356ffbca3b6c2dca6e9801f9f\", \n", " \"checksum-2\": \"c5cde2a9ae77b3c4ebce804f6df542c2\", \n", " \"checksum-3\": \"00b71144ed5e3aeeb885de84f7452b81\"\n", " }, \n", " \"20-newsgroups\": {\n", " \"source\": \"http://qwone.com/~jason/20Newsgroups/\", \n", " \"checksum\": \"c92fd4f6640a86d5ba89eaad818a9891\", \n", " \"parts\": 1, \n", " \"description\": \"The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups\", \n", " \"file_name\": \"20-newsgroups.gz\"\n", " }\n", " }\n", "}\n" ] } ], "source": [ "import json\n", "data_list = api.info()\n", "print(json.dumps(data_list, indent=4))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to get detailed information about a particular model or corpus, use:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"source\": \"Kaggle\", \n", " \"checksum\": \"5e64e942df13219465927f92dcefd5fe\", \n", " \"parts\": 1, \n", " \"description\": \"It contains text and metadata scraped from 244 websites tagged as 'bullshit' here by the BS Detector Chrome Extension by Daniel Sieradski.\", \n", " \"file_name\": \"fake-news.gz\"\n", "}\n" ] } ], "source": [ "fake_news_info = api.info('fake-news')\n", "print(json.dumps(fake_news_info, indent=4))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes you do not want to load a model into memory; you just want the path to its file on disk. For that, pass `return_path=True` to `api.load`:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/home/ivan/gensim-data/glove-wiki-gigaword-50/glove-wiki-gigaword-50.gz\n" ] } ], "source": [ "print(api.load('glove-wiki-gigaword-50', return_path=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to load the model into memory, then:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2017-11-10 14:51:59,199 : INFO : loading projection weights from /home/ivan/gensim-data/glove-wiki-gigaword-50/glove-wiki-gigaword-50.gz\n", "2017-11-10 14:52:18,380 : INFO : loaded (400000, 50) matrix from /home/ivan/gensim-data/glove-wiki-gigaword-50/glove-wiki-gigaword-50.gz\n", "2017-11-10 14:52:18,405 : INFO : precomputing L2-norms of word weight vectors\n" ] }, { "data": { "text/plain": [ "[(u'plastic', 0.7942505478858948),\n", " (u'metal', 0.770871639251709),\n", " (u'walls', 0.7700636386871338),\n", " (u'marble', 0.7638524174690247),\n", " (u'wood', 0.7624281048774719),\n", " (u'ceramic', 0.7602593302726746),\n", " (u'pieces', 0.7589111924171448),\n", " (u'stained', 0.7528817057609558),\n", " (u'tile', 0.748193621635437),\n", " (u'furniture', 0.746385931968689)]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model = api.load(\"glove-wiki-gigaword-50\")\n", "model.most_similar(\"glass\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For corpora, the data is never loaded into memory all at once. Each corpus is wrapped in a special `Dataset` class that provides an `__iter__` method, so you can stream over its documents one at a time, as shown below." ] }
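, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, here is a minimal sketch of how you could stream the first document from the `text8` corpus we loaded above (each document is a list of string tokens) and peek at its first few tokens:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Stream documents one at a time; the full corpus is never read into memory up front.\n", "first_doc = next(iter(corpus))\n", "print(len(first_doc))  # number of tokens in the first document\n", "print(first_doc[:10])  # its first ten tokens" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.14" } }, "nbformat": 4, "nbformat_minor": 2 }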