{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Week 5: Word-Level Text Analysis\n", "\n", "\n", "Topics to Cover\n", "---------------\n", "\n", "- comparing word frequency between authors\n", "\n", "- part-of-speech (POS) tagging\n", "\n", "- POS frequency comparison\n", "\n", "- sentiment analysis\n", "\n", "\n", "### Class Objective\n", "Use text analysis techniques introduced by Montfort to examine and compare small text corpora.\n", "\n", "#### Loading Corpora\n", "Today we will be analyzing and comparing two small text corpora, which we will download using `wget`.\n", "\n", "- [Works of Ralph Waldo Emerson](https://github.com/pcda17/pcda17.github.io/blob/master/week/5/Emerson.zip)\n", "- [Works of Oscar Wilde](https://github.com/pcda17/pcda17.github.io/blob/master/week/5/Wilde.zip)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## *> Review Exercises*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sample_text = \"A foolish consistency is the hobgoblin of little minds, adored by little statesmen and philosophers and divines. With consistency a great soul has simply nothing to do. He may as well concern himself with his shadow on the wall. Speak what you think now in hard words, and to-morrow speak what to-morrow thinks in hard words again, though it contradict every thing you said to-day. — 'Ah, so you shall be sure to be misunderstood.' — Is it so bad, then, to be misunderstood? Pythagoras was misunderstood, and Socrates, and Jesus, and Luther, and Copernicus, and Galileo, and Newton, and every pure and wise spirit that ever took flesh. To be great is to be misunderstood.\"\n", "\n", "print(sample_text)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Split the sample text above into a list of words. Don't worry about punctuation and other oddities.\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Create a function that takes a text string as an argument and returns the number of words in the text.\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Create a list that contains the first letter of each word in the sample text.\n", "### Hint: Use the `append()` function to add items to your list.\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a list that contains the length, in characters, of each word in the sample text.\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Create a function that takes a text string as an argument and returns the average word length in the text.\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## *> Assembling Text Corpora*\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# First, import the packages we will use below.\n", "\n", "import os\n", "import nltk\n", "import textblob\n", "import random\n", "from pprint import pprint" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Download zipped texts from GitHub, then unzip the directories.\n", "\n", "os.chdir('/sharedfolder/')\n", "\n", "!wget -N https://github.com/pcda17/pcda17.github.io/blob/master/week/5/Emerson.zip?raw=true -O Emerson.zip\n", "!unzip -o Emerson.zip\n", "\n", "!wget -N https://github.com/pcda17/pcda17.github.io/blob/master/week/5/Wilde.zip?raw=true -O Wilde.zip\n", "!unzip -o Wilde.zip" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## First, load each author’s works as a list of strings.\n", "\n", "corpus_1_dir = \"/sharedfolder/Emerson/\"\n", "corpus_2_dir = \"/sharedfolder/Wilde/\"\n", "\n", "##\n", "\n", "os.chdir(corpus_1_dir)\n", "\n", "corpus_1_filenames = os.listdir(\"./\")\n", "\n", "corpus_1_texts=[]\n", "\n", "for filename in corpus_1_filenames:\n", " text = open(filename).read().replace(\"\\n\",\" \") #replaces newline characters with spaces\n", " corpus_1_texts.append(text)\n", "\n", "##\n", " \n", "os.chdir(corpus_2_dir)\n", "\n", "corpus_2_filenames = os.listdir(\"./\")\n", "\n", "corpus_2_texts=[]\n", "\n", "for filename in corpus_2_filenames:\n", " text = open(filename).read().replace(\"\\n\",\" \") #replaces newline characters with spaces\n", " corpus_2_texts.append(text)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's check the number of texts in corpus 1, then view the first 2000 characters in a randomly chosen text:\n", "\n", "print('Number of texts:')\n", "print(len(corpus_1_texts))\n", "\n", "print()\n", "\n", "random_text = random.choice(corpus_1_texts)\n", "print(random_text[:2000])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's do the same for corpus 2:\n", "\n", "print('Number of texts:')\n", "print(len(corpus_2_texts))\n", "\n", "print()\n", "\n", "random_text = random.choice(corpus_2_texts)\n", "print(random_text[:2000])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## *> Using TextBlob*\n", "\n", "Let’s review the TextBlob package, introduced in this week’s reading by Nick Montfort. First, let’s load TextBlob and convert two texts to lists of words. \n", "\n", "Note that each is contained in a WordList object, which we can manipulate as if it were an ordinary list.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from textblob import TextBlob\n", "\n", "text_1 = TextBlob(corpus_1_texts[0]) # using the first text in the list corpus_1_texts\n", "print(text_1.words[:15])\n", "\n", "print()\n", "\n", "text_2 = TextBlob(corpus_2_texts[0]) # using the first text in the list corpus_2_texts\n", "print(text_2.words[:15])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We can also print sentences, contained in Sentence objects.\n", "\n", "print(text_1.sentences[:5])\n", "\n", "print()\n", "\n", "print(text_2.sentences[:5])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Note the following methods of manipulating your TextBlob results.\n", "\n", "print(sorted(text_1.words)[:500]) # prints first 500 words in alphabetized word list" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(sorted(list(set(text_1.words)))[:500]) # prints sorted list of unique words (first 500 items)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Each TextBlob object contains a dictionary with the number of times each word appears in a text.\n", "\n", "from pprint import pprint\n", "\n", "pprint(text_1.word_counts)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## *> Quick Exercise*\n", "\n", "Create a function that returns the top 20 most frequent words in a given TextBlob object. \n", "\n", "\n", "*Hint: Use the `itemgetter` module to sort a list of lists by a given index. We introduced `itemgetter` in the week 3 code tutorial.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## *> Word Frequency Sans Stopwords*\n", "\n", "Next we'll load the `nltk` module, which was installed as a dependency of TextBlob.\n", "\n", "In computational text analysis, the term “stopword” refers to words that appear\n", "frequently in most texts in a given language — e.g., “I,” “the,” “and,” “while,”\n", "and so on. NLTK provides a useful stopword list. Here we assign the English stopword \n", "list to the variable `stopwords_eng`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import nltk\n", "from nltk.corpus import stopwords\n", "\n", "stopwords_eng = stopwords.words('english')\n", "\n", "print(stopwords_eng)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Now let’s look at the most frequent words in a text, disregarding stopwords.\n", "\n", "from textblob import Word\n", "\n", "freq_dict = text_1.word_counts\n", "\n", "freq_sans_stopwords = []\n", "\n", "for key in freq_dict:\n", " lemma = Word(key).lemmatize()\n", " if lemma not in stopwords_eng:\n", " freq_sans_stopwords.append([key, freq_dict[key]])\n", "\n", "sorted_freq_sans_stopwords = sorted(freq_sans_stopwords, key = itemgetter(1))[::-1]\n", "\n", "pprint(sorted_freq_sans_stopwords[:20])\n", "\n", "# How do you interpret this list? Does it give you any insight into the text you’re looking at?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## *> Quick Exercise*\n", "\n", "Referencing the code above, create a function that returns a sorted list of stopword-free word frequency lists when passed a TextBlob object. Look at the top vocabulary for several texts by each of your authors. How similar or different are these frequency lists between texts and between authors?\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## POS Tagging\n", "\n", "We can also use TextBlob to create a list of part-of-speech tags for each word in a text.\n", "\n", "Let’s take a close look at our results. Examine two or three sentences a word at a time and check whether parts of speech were tagged correctly. If you find any mistakes, can you guess why the tagging algorithm slipped up?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pprint(text_1.words)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pprint(text_1.tags)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Following Montfort’s example, let’s create a function that counts the number of adjectives in a text.\n", "\n", "def adj_count(text):\n", " count = 0\n", " for word, tag in text.tags:\n", " if tag == 'JJ':\n", " count+=1\n", " return count\n", "\n", "print(adj_count(text_1))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def adj_percent(text):\n", " return float(adj_count(text))/len(text.words)\n", "\n", "print(adj_percent(text_1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## *> Exercise*\n", "\n", "Create a function called `POS_profile` that takes a TextBlob object and returns a list containing several parts of speech and their relative frequency within the text. Your POS profile should include the following parts of speech:\n", "\n", "- nouns\n", "- adjectives\n", "- verbs\n", "- adverbs\n", "- pronouns\n", "\n", "You can find a full list of POS tags used by TextBlob [here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html). Note that several parts of speech are split into multiple codes (e.g., NN, NNS, NNP, and NNPS for different classes of noun).\n", "\n", "Next, run your POS profile on each text in your two corpora. How much do these values vary between authors and among texts by the same author?\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## *> Sentiment Analysis with TextBlob*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Negative polarity example\n", "\n", "text = \"This is a very mean and nasty sentence.\"\n", "\n", "blob = TextBlob(text)\n", "\n", "# result between -1 and +1\n", "sentiment_score = blob.sentiment.polarity\n", "\n", "print(sentiment_score)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Positive polarity example\n", "\n", "text = \"This is a nice and positive sentence.\"\n", "\n", "blob = TextBlob(text)\n", "\n", "# result between -1 and +1\n", "sentiment_score = blob.sentiment.polarity\n", "\n", "print(sentiment_score)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## *> Exercise*\n", "\n", "1. Measure sentiment scores for each sentence in a text, then calculate an average sentiment value across the full text.\n", "\n", "\n", "2. Calculate average sentiment values for each text in our Emerson and Wilde corpora. Which author's writing appears to be more 'positive' on average? What are the most 'positive' and most 'negative' texts in the collection?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Naive Bayes Classification\n", "\n", "Review classification examples from Montfort text.\n", "\n", "## *> Exercise*\n", "\n", "Divide each of your corpora into two sets, one for training our classifier and one for testing. Split each text into a list of sentences and combine these to create four master lists: author 1 training, author 1 testing, author 2 training, author 2 testing.\n", "\n", "Create a Naive Bayes classifier using your two training sets. Run the classifier on each sentence in your test sets and calculate the accuracy of your model.\n", "\n", "Examine sentences that were misclassified. Why do you think the algorithm was misled?\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 2 }