{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\nHow to download pre-trained models and corpora\n==============================================\n\nDemonstrates simple and quick access to common corpora and pretrained models.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import logging\nlogging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "One of Gensim's features is simple and easy access to common data.\nThe `gensim-data <https://github.com/RaRe-Technologies/gensim-data>`_ project stores a\nvariety of corpora and pretrained models.\nGensim has a :py:mod:`gensim.downloader` module for programmatically accessing this data.\nThis module leverages a local cache (in user's home folder, by default) that\nensures data is downloaded at most once.\n\nThis tutorial:\n\n* Downloads the text8 corpus, unless it is already on your local machine\n* Trains a Word2Vec model from the corpus (see `sphx_glr_auto_examples_tutorials_run_doc2vec_lee.py` for a detailed tutorial)\n* Leverages the model to calculate word similarity\n* Demonstrates using the API to load other models and corpora\n\nLet's start by importing the api module.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import gensim.downloader as api"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now, let's download the text8 corpus and load it as a Python object\nthat supports streamed access.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "corpus = api.load('text8')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In this case, our corpus is an iterable.\nIf you look under the covers, it has the following definition:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import inspect\nprint(inspect.getsource(corpus.__class__))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "For more details, look inside the file that defines the Dataset class for your particular resource.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(inspect.getfile(corpus.__class__))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "With the corpus has been downloaded and loaded, let's use it to train a word2vec model.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from gensim.models.word2vec import Word2Vec\nmodel = Word2Vec(corpus)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now that we have our word2vec model, let's find words that are similar to 'tree'.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(model.wv.most_similar('tree'))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "You can use the API to download several different corpora and pretrained models.\nHere's how to list all resources available in gensim-data:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import json\ninfo = api.info()\nprint(json.dumps(info, indent=4))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "There are two types of data resources: corpora and models.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(info.keys())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let's have a look at the available corpora:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "for corpus_name, corpus_data in sorted(info['corpora'].items()):\n    print(\n        '%s (%d records): %s' % (\n            corpus_name,\n            corpus_data.get('num_records', -1),\n            corpus_data['description'][:40] + '...',\n        )\n    )"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "... and the same for models:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "for model_name, model_data in sorted(info['models'].items()):\n    print(\n        '%s (%d records): %s' % (\n            model_name,\n            model_data.get('num_records', -1),\n            model_data['description'][:40] + '...',\n        )\n    )"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "If you want to get detailed information about a model/corpus, use:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "fake_news_info = api.info('fake-news')\nprint(json.dumps(fake_news_info, indent=4))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Sometimes, you do not want to load a model into memory. Instead, you can request\njust the filesystem path to the model. For that, use:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(api.load('glove-wiki-gigaword-50', return_path=True))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "If you want to load the model to memory, then:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "model = api.load(\"glove-wiki-gigaword-50\")\nmodel.most_similar(\"glass\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "For corpora, the corpus is never loaded to memory, all corpora are iterables wrapped in\na special class ``Dataset``, with an ``__iter__`` method.\n\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.5"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}