{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Topic Modeling\n", "\n", "In this lecture, we'll work through an example of *topic modeling*. The idea of topic modeling is to find \"topics\" in documents that tie together many words. Here are some examples of hypothetical topics that you might find in a newspaper: \n", "\n", "1. **Finance**: \"dollar\", \"stock\", \"banks\"\n", "2. **Politics**: \"party\", \"vote\", \"election\"\n", "3. **Sports**: \"team\", \"win\", \"game\"\n", "\n", "In this lecture, we'll see how to use the term-document matrix from last time, in combination with some nice algorithms from `scikit-learn`, to perform topic modeling. Our overall aim is to get a coarse, topic-level summary of the plot of the short book *Alice’s Adventures in Wonderland* by Lewis Carroll. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from matplotlib import pyplot as plt\n", "\n", "import nltk\n", "from nltk.corpus import gutenberg\n", "# need to do this once to download the data\n", "# nltk.download('gutenberg')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's briefly review the steps that we took to construct our term-document matrix. First, we used the `gutenberg` module to read in the raw text of the book, and split it into chapters. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "s = gutenberg.raw(\"carroll-alice.txt\")\n", "chapters = s.split(\"CHAPTER\")[1:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, we created a nice, tidy data frame in which we stored the complete text of each chapter. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | chapter | text |
|---|---|---|
| 0 | 1 | I. Down the Rabbit-Hole\\n\\nAlice was beginnin... |
| 1 | 2 | II. The Pool of Tears\\n\\n'Curiouser and curio... |
| 2 | 3 | III. A Caucus-Race and a Long Tale\\n\\nThey we... |
| 3 | 4 | IV. The Rabbit Sends in a Little Bill\\n\\nIt w... |
| 4 | 5 | V. Advice from a Caterpillar\\n\\nThe Caterpill... |
| 5 | 6 | VI. Pig and Pepper\\n\\nFor a minute or two she... |
| 6 | 7 | VII. A Mad Tea-Party\\n\\nThere was a table set... |
| 7 | 8 | VIII. The Queen's Croquet-Ground\\n\\nA large r... |
| 8 | 9 | IX. The Mock Turtle's Story\\n\\n'You can't thi... |
| 9 | 10 | X. The Lobster Quadrille\\n\\nThe Mock Turtle s... |
| 10 | 11 | XI. Who Stole the Tarts?\\n\\nThe King and Quee... |
| 11 | 12 | XII\\n\\n Alice's Evidence\\n\\n\\n'Here... |
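
A data frame with this shape could be assembled roughly as follows. This is a minimal sketch rather than the lecture's own cell: the short string `s` is a hypothetical stand-in for the raw book text, used so the example runs without downloading the corpus, and the column names `chapter` and `text` are taken from the table above.

```python
import pandas as pd

# Hypothetical stand-in for the raw text; the lecture reads the real book
# with gutenberg.raw('carroll-alice.txt').
s = (
    'Front matter '
    'CHAPTER I. Down the Rabbit-Hole Alice was beginning to get very tired... '
    'CHAPTER II. The Pool of Tears Curiouser and curiouser!'
)

# Split on the chapter marker and drop everything before the first chapter.
chapters = s.split('CHAPTER')[1:]

# One row per chapter: a 1-based chapter number and the chapter's full text.
df = pd.DataFrame({
    'chapter': range(1, len(chapters) + 1),
    'text': chapters,
})

print(df)
```

Keeping one chapter per row means each row can later be treated as a single document when counting words.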
| | chapter | text | _i_ | abide | able | absence | absurd | acceptance | accident | accidentally | ... | year | years | yelled | yelp | yer | yesterday | young | youth | zealand | zigzag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | I. Down the Rabbit-Hole\\n\\nAlice was beginnin... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 2 | II. The Pool of Tears\\n\\n'Curiouser and curio... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 3 | III. A Caucus-Race and a Long Tale\\n\\nThey we... | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |

3 rows × 2148 columns