{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Visualizing a Gensim model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To illustrate how to use [`pyLDAvis`](https://github.com/bmabey/pyLDAvis)'s gensim [helper functions](https://pyldavis.readthedocs.org/en/latest/modules/API.html#module-pyLDAvis.gensim), we will create a model from the [20 Newsgroups corpus](http://qwone.com/~jason/20Newsgroups/). Only minimal preprocessing is done, so the model is not the best; the goal of this notebook is to demonstrate the helper functions.\n", "\n", "## Downloading the data" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "~/DataScience/GitHub/pyLDAvis/notebooks/data ~/DataScience/GitHub/pyLDAvis/notebooks\n", "The data has already been downloaded...\n", "Let's take a look at the groups...\n", "alt.atheism\n", "comp.graphics\n", "comp.os.ms-windows.misc\n", "comp.sys.ibm.pc.hardware\n", "comp.sys.mac.hardware\n", "comp.windows.x\n", "misc.forsale\n", "rec.autos\n", "rec.motorcycles\n", "rec.sport.baseball\n", "rec.sport.hockey\n", "sci.crypt\n", "sci.electronics\n", "sci.med\n", "sci.space\n", "soc.religion.christian\n", "talk.politics.guns\n", "talk.politics.mideast\n", "talk.politics.misc\n", "talk.religion.misc\n", "~/DataScience/GitHub/pyLDAvis/notebooks\n" ] } ], "source": [ "%%bash\n", "\n", "mkdir -p data\n", 
"pushd data\n", "if [ -d \"20news-bydate-train\" ]\n", "then\n", " echo \"The data has already been downloaded...\"\n", "else\n", " wget http://qwone.com/%7Ejason/20Newsgroups/20news-bydate.tar.gz\n", " tar xfv 20news-bydate.tar.gz\n", " rm 20news-bydate.tar.gz\n", "fi\n", "echo \"Let's take a look at the groups...\"\n", "ls 20news-bydate-train/\n", "popd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploring the dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each group directory contains a set of files:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-rw-r--r-- 1 marksusol staff 1.4K Mar 18 2003 61250\n", "-rw-r--r-- 1 marksusol staff 889B Mar 18 2003 61252\n", "-rw-r--r-- 1 marksusol staff 1.2K Mar 18 2003 61264\n", "-rw-r--r-- 1 marksusol staff 1.6K Mar 18 2003 61308\n", "-rw-r--r-- 1 marksusol staff 1.3K Mar 18 2003 61422\n" ] } ], "source": [ "!ls -lah data/20news-bydate-train/sci.space | tail -n 5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a peek at one message:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "From: ralph.buttigieg@f635.n713.z3.fido.zeta.org.au (Ralph Buttigieg)\n", "Subject: Why not give $1 billion to first year-lo\n", "Organization: Fidonet. 
Gate admin is fido@socs.uts.edu.au\n", "Lines: 34\n", "\n", "Original to: keithley@apple.com\n", "G'day keithley@apple.com\n", "\n", "21 Apr 93 22:25, keithley@apple.com wrote to All:\n", "\n" ] } ], "source": [ "!head data/20news-bydate-train/sci.space/61422" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading and tokenizing the corpus" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package stopwords to\n", "[nltk_data] /Users/marksusol/nltk_data...\n", "[nltk_data] Package stopwords is already up-to-date!\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from glob import glob\n", "import re\n", "import string\n", "import funcy as fp\n", "from gensim import models\n", "from gensim.corpora import Dictionary, MmCorpus\n", "import nltk\n", "import pandas as pd\n", "\n", "nltk.download('stopwords')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
"| group | id | doc | tokens |\n",
"|---|---|---|---|\n",
"| talk.politics.mideast | 75895 | [From: hm@cs.brown.edu (Harry Mamaysky)\\n, Sub... | [from, #email, harry, mamaysky, subject, heil,... |\n",
"| | 76248 | [From: waldo@cybernet.cse.fau.edu (Todd J. Dic... | [from, #email, todd, dicker, subject, israel's... |\n",
"| | 76277 | [From: C.L.Gannon@newcastle.ac.uk (Space Cadet... | [from, #email, space, cadet, subject, exact, m... |\n",
"| | 76045 | [From: shaig@Think.COM (Shai Guday)\\n, Subject... | [from, #email, shai, guday, subject, basil, op... |\n",
"| | 76283 | [From: koc@rize.ECE.ORST.EDU (Cetin Kaya Koc)\\... | [from, #email, cetin, kaya, koc, subject, seve... |\n",
"