{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\nHow to download pre-trained models and corpora\n==============================================\n\nDemonstrates simple and quick access to common corpora and pretrained models.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import logging\nlogging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of Gensim's features is simple and easy access to common data.\nThe `gensim-data` project stores a\nvariety of corpora and pretrained models.\nGensim has a :py:mod:`gensim.downloader` module for programmatically accessing this data.\nThis module leverages a local cache (in the user's home folder, by default) that\nensures data is downloaded at most once.\n\nThis tutorial:\n\n* Downloads the text8 corpus, unless it is already on your local machine\n* Trains a Word2Vec model from the corpus (see `sphx_glr_auto_examples_tutorials_run_doc2vec_lee.py` for a detailed tutorial)\n* Leverages the model to calculate word similarity\n* Demonstrates using the API to load other models and corpora\n\nLet's start by importing the api module.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import gensim.downloader as api" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's download the text8 corpus and load it as a Python object\nthat supports streamed access.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "corpus = api.load('text8')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, our corpus is an iterable.\nIf you look 
under the covers, it has the following definition:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import inspect\nprint(inspect.getsource(corpus.__class__))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For more details, look inside the file that defines the Dataset class for your particular resource.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(inspect.getfile(corpus.__class__))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that the corpus has been downloaded and loaded, let's use it to train a word2vec model.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from gensim.models.word2vec import Word2Vec\nmodel = Word2Vec(corpus)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have our word2vec model, let's find words that are similar to 'tree'.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(model.wv.most_similar('tree'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can use the API to download several different corpora and pretrained models.\nHere's how to list all resources available in gensim-data:\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import json\ninfo = api.info()\nprint(json.dumps(info, indent=4))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are two types of data resources: corpora and models.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(info.keys())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's have a look at the available corpora:\n\n" ] }, { "cell_type": 
"code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "for corpus_name, corpus_data in sorted(info['corpora'].items()):\n print(\n '%s (%d records): %s' % (\n corpus_name,\n corpus_data.get('num_records', -1),\n corpus_data['description'][:40] + '...',\n )\n )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "... and the same for models:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "for model_name, model_data in sorted(info['models'].items()):\n print(\n '%s (%d records): %s' % (\n model_name,\n model_data.get('num_records', -1),\n model_data['description'][:40] + '...',\n )\n )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to get detailed information about a model/corpus, use:\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fake_news_info = api.info('fake-news')\nprint(json.dumps(fake_news_info, indent=4))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes, you do not want to load a model into memory. Instead, you can request\njust the filesystem path to the model. 
For that, use:\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(api.load('glove-wiki-gigaword-50', return_path=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to load the model into memory instead, use:\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "model = api.load(\"glove-wiki-gigaword-50\")\nmodel.most_similar(\"glass\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Corpora are never loaded into memory: each corpus is an iterable wrapped in\na special class ``Dataset``, with an ``__iter__`` method.\n\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 0 }