# Tutorial: using Gensim's API to download corpora and models
Let's start by importing the api module.

In [1]:
import logging
import gensim.downloader as api

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Now, let's download the text8 corpus and load it (both steps happen automatically):

In [2]:
corpus = api.load('text8')



2017-11-10 14:49:45,787 : INFO : text8 downloaded


Now that the corpus has been downloaded and loaded, let's train a word2vec model on it.

In [3]:
from gensim.models.word2vec import Word2Vec

model = Word2Vec(corpus)

2017-11-10 14:50:02,458 : INFO : collecting all words and their counts
2017-11-10 14:50:02,461 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-11-10 14:50:08,402 : INFO : collected 253854 word types from a corpus of 17005207 raw words and 1701 sentences
2017-11-10 14:50:08,403 : INFO : Loading a fresh vocabulary
2017-11-10 14:50:08,693 : INFO : min_count=5 retains 71290 unique words (28% of original 253854, drops 182564)
2017-11-10 14:50:08,694 : INFO : min_count=5 leaves 16718844 word corpus (98% of original 17005207, drops 286363)
2017-11-10 14:50:08,870 : INFO : deleting the raw counts dictionary of 253854 items
2017-11-10 14:50:08,898 : INFO : sample=0.001 downsamples 38 most-common words
2017-11-10 14:50:08,899 : INFO : downsampling leaves estimated 12506280 word corpus (74.8% of prior 16718844)
2017-11-10 14:50:08,900 : INFO : estimated required memory for 71290 words and 100 dimensions: 92677000 bytes
2017-11-10 14:50:09,115 : INFO : resetting lay

Now that we have our word2vec model, let's find the words most similar to 'tree':

In [4]:
# Similarity queries go through the model's word vectors (`wv`)
model.wv.most_similar('tree')

2017-11-10 14:51:10,422 : INFO : precomputing L2-norms of word weight vectors


[(u'trees', 0.7245415449142456),
 (u'leaf', 0.6882676482200623),
 (u'bark', 0.645646333694458),
 (u'avl', 0.6076173782348633),
 (u'cactus', 0.6019535064697266),
 (u'flower', 0.6010029315948486),
 (u'fruit', 0.5908031463623047),
 (u'bird', 0.5886812806129456),
 (u'leaves', 0.5771278142929077),
 (u'pond', 0.5627825856208801)]
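
The scores in the list above are cosine similarities between word vectors, so values closer to 1 mean the two vectors point in nearly the same direction. The sketch below shows the calculation on toy vectors; the helper `cosine_similarity` and the three-dimensional vectors are invented for illustration and are not taken from the trained model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors, made up for this example only.
tree = [1.0, 2.0, 0.0]
trees = [2.0, 4.0, 0.0]   # same direction as `tree`, so similarity is close to 1
pond = [0.0, 0.0, 3.0]    # orthogonal to `tree`, so similarity is 0

print(cosine_similarity(tree, trees))
print(cosine_similarity(tree, pond))
```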

The API can download many corpora and models. To list everything that is available, use the code below:

In [5]:
import json
data_list = api.info()
print(json.dumps(data_list, indent=4))

{
 "models": {
 "glove-twitter-25": {
 "description": "Pre-trained vectors, 2B tweets, 27B tokens, 1.2M vocab, uncased. https://nlp.stanford.edu/projects/glove/", 
 "parameters": "dimensions = 25", 
 "file_name": "glove-twitter-25.gz", 
 "papers": "https://nlp.stanford.edu/pubs/glove.pdf", 
 "parts": 1, 
 "preprocessing": "Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i -o glove-twitter-25.txt`", 
 "checksum": "50db0211d7e7a2dcd362c6b774762793"
 }, 
 "glove-twitter-100": {
 "description": "Pre-trained vectors, 2B tweets, 27B tokens, 1.2M vocab, uncased. https://nlp.stanford.edu/projects/glove/", 
 "parameters": "dimensions = 100", 
 "file_name": "glove-twitter-100.gz", 
 "papers": "https://nlp.stanford.edu/pubs/glove.pdf", 
 "parts": 1, 
 "preprocessing": "Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i -o glove-twitter-100.txt`", 
 "checksum": "b04f7bed38756d64cf55b58ce7e97b15"
 }, 
 "glove-wiki-gigaword-100": {
 "description": "Pre-tr

To get detailed information about a specific model or corpus, pass its name to `api.info`:

In [6]:
fake_news_info = api.info('fake-news')
print(json.dumps(fake_news_info, indent=4))

{
 "source": "Kaggle", 
 "checksum": "5e64e942df13219465927f92dcefd5fe", 
 "parts": 1, 
 "description": "It contains text and metadata scraped from 244 websites tagged as 'bullshit' here by the BS Detector Chrome Extension by Daniel Sieradski.", 
 "file_name": "fake-news.gz"
}
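
Because `api.info()` returns an ordinary dictionary, it can be walked like any other nested dict. The following is a minimal sketch, assuming the top-level keys are `"models"` and `"corpora"` and that each entry carries a `"file_name"` field, as in the output shown above. The helper `list_entries` and the hand-written `sample_catalog` are hypothetical stand-ins so that the example runs without a download; placing `fake-news` under `"corpora"` is also an assumption.

```python
def list_entries(catalog):
    """Return sorted (kind, name, file_name) tuples from an api.info()-style dict."""
    rows = []
    for kind in ("corpora", "models"):
        # .get() keeps the helper tolerant of a missing section
        for name, meta in catalog.get(kind, {}).items():
            rows.append((kind, name, meta.get("file_name")))
    return sorted(rows)

# Hand-written stand-in for the dictionary returned by api.info();
# field values are copied from the tutorial output and trimmed.
sample_catalog = {
    "models": {
        "glove-twitter-25": {"file_name": "glove-twitter-25.gz", "parts": 1},
        "glove-twitter-100": {"file_name": "glove-twitter-100.gz", "parts": 1},
    },
    "corpora": {
        "fake-news": {"file_name": "fake-news.gz", "parts": 1},
    },
}

for kind, name, file_name in list_entries(sample_catalog):
    print(kind, name, file_name)
```

With the real API, `list_entries(api.info())` should work the same way, provided the assumed keys are present.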


Sometimes you do not want to load the model into memory and only need the path to the downloaded file. For that, use:

In [7]:
print(api.load('glove-wiki-gigaword-50', return_path=True))

/home/ivan/gensim-data/glove-wiki-gigaword-50/glove-wiki-gigaword-50.gz
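
With only the path in hand, you can inspect the file without loading every vector. The sketch below assumes the downloaded file is a gzipped word2vec text file whose first line holds the vocabulary size and the vector dimensionality; the tutorial's preprocessing notes suggest this layout but do not confirm it. The helper `read_w2v_header` and the tiny demo file are hypothetical, created here so the example runs offline.

```python
import gzip
import os
import tempfile

def read_w2v_header(path):
    """Return (vocab_size, dimensions) from the first line of a gzipped
    word2vec text file, without reading the vectors themselves."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        vocab_size, dimensions = handle.readline().split()
    return int(vocab_size), int(dimensions)

# Build a tiny stand-in file so the sketch runs without downloading anything.
demo_path = os.path.join(tempfile.mkdtemp(), "demo-vectors.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
    handle.write("2 3\n")                 # header: 2 words, 3 dimensions
    handle.write("glass 0.1 0.2 0.3\n")
    handle.write("metal 0.4 0.5 0.6\n")

print(read_w2v_header(demo_path))
```

In practice you would pass the string returned by `api.load(..., return_path=True)` instead of `demo_path`.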


If you want to load the model into memory, use:

In [8]:
model = api.load("glove-wiki-gigaword-50")
model.most_similar("glass")

2017-11-10 14:51:59,199 : INFO : loading projection weights from /home/ivan/gensim-data/glove-wiki-gigaword-50/glove-wiki-gigaword-50.gz
2017-11-10 14:52:18,380 : INFO : loaded (400000, 50) matrix from /home/ivan/gensim-data/glove-wiki-gigaword-50/glove-wiki-gigaword-50.gz
2017-11-10 14:52:18,405 : INFO : precomputing L2-norms of word weight vectors


[(u'plastic', 0.7942505478858948),
 (u'metal', 0.770871639251709),
 (u'walls', 0.7700636386871338),
 (u'marble', 0.7638524174690247),
 (u'wood', 0.7624281048774719),
 (u'ceramic', 0.7602593302726746),
 (u'pieces', 0.7589111924171448),
 (u'stained', 0.7528817057609558),
 (u'tile', 0.748193621635437),
 (u'furniture', 0.746385931968689)]

Corpora, unlike models, are never loaded into memory: each corpus is wrapped in a special `Dataset` class that provides an `__iter__` method, so you read it by iterating over it.
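
The streaming pattern can be sketched with a small class that reads its data lazily inside `__iter__`. This is not Gensim's actual `Dataset` implementation; `StreamingCorpus`, the demo file, and its contents are hypothetical and only illustrate the idea of yielding one document at a time.

```python
import os
import tempfile

class StreamingCorpus:
    """Hypothetical stand-in: yields one tokenized line at a time from a text file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Reopen the file on every pass, so the corpus can be iterated repeatedly
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                yield line.split()

# Tiny demo file, made up for this example.
corpus_path = os.path.join(tempfile.mkdtemp(), "tiny-corpus.txt")
with open(corpus_path, "w", encoding="utf-8") as handle:
    handle.write("the tree has green leaves\n")
    handle.write("glass and metal\n")

tiny_corpus = StreamingCorpus(corpus_path)
for tokens in tiny_corpus:
    print(tokens)
```

Defining `__iter__` on a class, instead of using a one-shot generator, matters because training code such as `Word2Vec` passes over the corpus more than once (first to build the vocabulary, then to train).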