## Week 5: Word-Level Text Analysis


Topics to Cover
---------------

-   comparing word frequency between authors

-   part-of-speech (POS) tagging

-   POS frequency comparison

-   sentiment analysis


### Class Objective
Use text analysis techniques introduced by Montfort to examine and compare small text corpora.

#### Loading Corpora
Today we will be analyzing and comparing two small text corpora, which we will download using `wget`.

- [Works of Ralph Waldo Emerson](https://github.com/pcda17/pcda17.github.io/blob/master/week/5/Emerson.zip)
- [Works of Oscar Wilde](https://github.com/pcda17/pcda17.github.io/blob/master/week/5/Wilde.zip)


## *> Review Exercises*

In [None]:
sample_text = "A foolish consistency is the hobgoblin of little minds, adored by little statesmen and philosophers and divines. With consistency a great soul has simply nothing to do. He may as well concern himself with his shadow on the wall. Speak what you think now in hard words, and to-morrow speak what to-morrow thinks in hard words again, though it contradict every thing you said to-day. — 'Ah, so you shall be sure to be misunderstood.' — Is it so bad, then, to be misunderstood? Pythagoras was misunderstood, and Socrates, and Jesus, and Luther, and Copernicus, and Galileo, and Newton, and every pure and wise spirit that ever took flesh. To be great is to be misunderstood."

print(sample_text)

In [None]:
## Split the sample text above into a list of words. Don't worry about punctuation and other oddities.





In [None]:
## Create a function that takes a text string as an argument and returns the number of words in the text.





In [None]:
## Create a list that contains the first letter of each word in the sample text.
### Hint: Use the list method `append()` to add items to your list.





In [None]:
# Create a list that contains the length, in characters, of each word in the sample text.





In [None]:
## Create a function that takes a text string as an argument and returns the average word length in the text.





## *> Assembling Text Corpora*


In [None]:
# First, import the packages we will use below.

import os
import nltk
import textblob
import random
from pprint import pprint

In [None]:
# Download zipped texts from GitHub, then unzip the directories.

os.chdir('/sharedfolder/')

!wget "https://github.com/pcda17/pcda17.github.io/blob/master/week/5/Emerson.zip?raw=true" -O Emerson.zip
!unzip -o Emerson.zip

!wget "https://github.com/pcda17/pcda17.github.io/blob/master/week/5/Wilde.zip?raw=true" -O Wilde.zip
!unzip -o Wilde.zip

In [None]:
## First, load each author’s works as a list of strings.

corpus_1_dir = "/sharedfolder/Emerson/"
corpus_2_dir = "/sharedfolder/Wilde/"

##

os.chdir(corpus_1_dir)

corpus_1_filenames = os.listdir("./")

corpus_1_texts=[]

for filename in corpus_1_filenames:
    with open(filename) as file_in:  # the with block closes the file after reading
        text = file_in.read().replace("\n", " ")  # replaces newline characters with spaces
    corpus_1_texts.append(text)

##
    
os.chdir(corpus_2_dir)

corpus_2_filenames = os.listdir("./")

corpus_2_texts=[]

for filename in corpus_2_filenames:
    with open(filename) as file_in:  # the with block closes the file after reading
        text = file_in.read().replace("\n", " ")  # replaces newline characters with spaces
    corpus_2_texts.append(text)

In [None]:
# Let's check the number of texts in corpus 1, then view the first 2000 characters in a randomly chosen text:

print('Number of texts:')
print(len(corpus_1_texts))

print()

random_text = random.choice(corpus_1_texts)
print(random_text[:2000])

In [None]:
# Let's do the same for corpus 2:

print('Number of texts:')
print(len(corpus_2_texts))

print()

random_text = random.choice(corpus_2_texts)
print(random_text[:2000])

## *> Using TextBlob*

Let’s review the TextBlob package, introduced in this week’s reading by Nick Montfort. First, let’s load TextBlob and convert two texts to lists of words. 

Note that each is contained in a WordList object, which we can manipulate as if it were an ordinary list.


In [None]:
from textblob import TextBlob

text_1 = TextBlob(corpus_1_texts[0])  # using the first text in the list corpus_1_texts
print(text_1.words[:15])

print()

text_2 = TextBlob(corpus_2_texts[0])  # using the first text in the list corpus_2_texts
print(text_2.words[:15])

In [None]:
# We can also print sentences, contained in Sentence objects.

print(text_1.sentences[:5])

print()

print(text_2.sentences[:5])

In [None]:
# Note the following methods of manipulating your TextBlob results.

print(sorted(text_1.words)[:500])  # prints first 500 words in alphabetized word list

In [None]:
print(sorted(list(set(text_1.words)))[:500]) # prints sorted list of unique words (first 500 items)

In [None]:
# Each TextBlob object contains a dictionary with the number of times each word appears in a text.

from pprint import pprint

pprint(text_1.word_counts)

## *> Quick Exercise*

Create a function that returns the top 20 most frequent words in a given TextBlob object. 


*Hint: Use the `itemgetter` function from the `operator` module to sort a list of lists by a given index. We introduced `itemgetter` in the week 3 code tutorial.*
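
As a reminder of how `itemgetter` sorts a list of lists, here is a minimal sketch on made-up counts. The `toy_counts` data is hypothetical and not drawn from either corpus; the sketch only illustrates the sorting step and leaves the rest of the exercise to you.

```python
from operator import itemgetter

# Hypothetical [word, count] pairs, invented for illustration.
toy_counts = [["nature", 12], ["soul", 30], ["truth", 7], ["mind", 19]]

# itemgetter(1) returns a function that picks out index 1 (the count) of each pair.
# reverse=True puts the largest counts first.
by_count = sorted(toy_counts, key=itemgetter(1), reverse=True)

print(by_count[:2])
```

Slicing the sorted list, as in `by_count[:2]`, is one way to keep only the top few entries.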

## *> Word Frequency Sans Stopwords*

Next we'll load the `nltk` module, which was installed as a dependency of TextBlob.

In computational text analysis, the term “stopword” refers to words that appear
frequently in most texts in a given language — e.g., “I,” “the,” “and,” “while,”
and so on. NLTK provides a useful stopword list. Here we assign the English stopword 
list to the variable `stopwords_eng`.
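
To make the idea concrete, the sketch below filters a sentence against a tiny hand-written stopword set. Both `toy_stopwords` and `remove_stopwords` are illustrative names, and the four-word set is an assumption standing in for the much longer NLTK list.

```python
# A tiny stand-in stopword set (an assumption); NLTK's English list is far longer.
toy_stopwords = {"i", "the", "and", "while"}

def remove_stopwords(words, stopword_set):
    """Return the words whose lowercase form is not in stopword_set."""
    return [word for word in words if word.lower() not in stopword_set]

words = "I read the essay and the poems while traveling".split()
content_words = remove_stopwords(words, toy_stopwords)
print(content_words)
```

Lowercasing each word before the lookup matters here because the stopword set is all lowercase, while the sentence begins with a capital "I".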

In [None]:
import nltk
from nltk.corpus import stopwords

# If the next line raises a LookupError, run nltk.download('stopwords') first.
stopwords_eng = stopwords.words('english')

print(stopwords_eng)

In [None]:
# Now let’s look at the most frequent words in a text, disregarding stopwords.

from operator import itemgetter
from textblob import Word

freq_dict = text_1.word_counts

freq_sans_stopwords = []

for key in freq_dict:
    lemma = Word(key).lemmatize()
    if lemma not in stopwords_eng:
        freq_sans_stopwords.append([key, freq_dict[key]])

sorted_freq_sans_stopwords = sorted(freq_sans_stopwords, key = itemgetter(1))[::-1]

pprint(sorted_freq_sans_stopwords[:20])

# How do you interpret this list? Does it give you any insight into the text you’re looking at?


## *> Quick Exercise*

Referencing the code above, create a function that takes a TextBlob object and returns a list of [word, count] pairs, with stopwords removed, sorted from most to least frequent. Look at the top vocabulary for several texts by each of your authors. How similar or different are these frequency lists between texts and between authors?


## POS Tagging

We can also use TextBlob to create a list of part-of-speech tags for each word in a text.

Let’s take a close look at our results. Examine two or three sentences a word at a time and check whether parts of speech were tagged correctly. If you find any mistakes, can you guess why the tagging algorithm slipped up?
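
When checking tags by hand, a tally of how often each tag occurs can suggest where to look first. The sketch below assumes the (word, tag) tuple format that `text_1.tags` returns; the `toy_tags` list is hand-written for illustration and is not real tagger output.

```python
from collections import Counter

# Hand-written (word, tag) pairs using Penn Treebank codes; not real tagger output.
toy_tags = [
    ("A", "DT"), ("foolish", "JJ"), ("consistency", "NN"),
    ("is", "VBZ"), ("the", "DT"), ("hobgoblin", "NN"),
    ("of", "IN"), ("little", "JJ"), ("minds", "NNS"),
]

# Count how many times each tag appears, ignoring the words themselves.
tag_tally = Counter(tag for word, tag in toy_tags)

print(tag_tally.most_common(3))
```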

In [None]:
pprint(text_1.words)

In [None]:
pprint(text_1.tags)

In [None]:
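# Following Montfort’s example, let’s create a function that counts the number of adjectives in a text.
# Note that 'JJ' matches base-form adjectives only; comparatives (JJR) and superlatives (JJS) are not counted.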

def adj_count(text):
    count = 0
    for word, tag in text.tags:
        if tag == 'JJ':
            count+=1
    return count

print(adj_count(text_1))

In [None]:
# Despite its name, this function returns a fraction between 0 and 1, not a percentage.
def adj_percent(text):
    return float(adj_count(text))/len(text.words)

print(adj_percent(text_1))


## *> Exercise*

Create a function called `POS_profile` that takes a TextBlob object and returns a list containing several parts of speech and their relative frequency within the text. Your POS profile should include the following parts of speech:

- nouns
- adjectives
- verbs
- adverbs
- pronouns

You can find a full list of POS tags used by TextBlob [here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html). Note that several parts of speech are split into multiple codes (e.g., NN, NNS, NNP, and NNPS for different classes of noun).
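
One way to handle the split codes is to map each Penn Treebank tag to a coarser class before counting. The grouping below is an assumption made for illustration (for example, it leaves the wh-pronouns WP and WP$ and the wh-adverb WRB out), and `POS_GROUPS` and `coarse_pos` are hypothetical names.

```python
# Assumed grouping of Penn Treebank tags into the five classes listed above.
POS_GROUPS = {
    "noun": {"NN", "NNS", "NNP", "NNPS"},
    "adjective": {"JJ", "JJR", "JJS"},
    "verb": {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"},
    "adverb": {"RB", "RBR", "RBS"},
    "pronoun": {"PRP", "PRP$"},
}

def coarse_pos(tag):
    """Return the coarse class for a Penn tag, or None if it is in no group."""
    for group, tags in POS_GROUPS.items():
        if tag in tags:
            return group
    return None

print(coarse_pos("NNS"))
print(coarse_pos("DT"))
```

Tags outside the five classes, such as determiners, come back as `None`, so you can skip them when building a profile.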

Next, run your POS profile on each text in your two corpora. How much do these values vary between authors and among texts by the same author?


## *> Sentiment Analysis with TextBlob*

In [None]:
# Negative polarity example

text = "This is a very mean and nasty sentence."

blob = TextBlob(text)

# result between -1 and +1
sentiment_score = blob.sentiment.polarity

print(sentiment_score)

In [None]:
# Positive polarity example

text = "This is a nice and positive sentence."

blob = TextBlob(text)

# result between -1 and +1
sentiment_score = blob.sentiment.polarity

print(sentiment_score)

## *> Exercise*

1. Measure sentiment scores for each sentence in a text, then calculate an average sentiment value across the full text.


2. Calculate average sentiment values for each text in our Emerson and Wilde corpora. Which author's writing appears to be more 'positive' on average? What are the most 'positive' and most 'negative' texts in the collection?
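
The averaging step can be sketched on its own, separate from TextBlob. The helper name `mean_polarity` and the scores in `toy_scores` are hypothetical, and returning 0.0 for a text with no sentences is a design choice rather than a rule.

```python
def mean_polarity(scores):
    """Return the arithmetic mean of the polarity scores, or 0.0 if there are none."""
    scores = list(scores)
    if not scores:
        return 0.0
    return sum(scores) / len(scores)

# Hypothetical per-sentence scores. With TextBlob these could come from
# [sentence.sentiment.polarity for sentence in blob.sentences].
toy_scores = [0.5, -0.25, 0.0, 0.75]

print(mean_polarity(toy_scores))
```

Keep in mind that a plain mean gives a three-word sentence the same weight as a fifty-word one.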


## Naive Bayes Classification

Review the classification examples from the Montfort text.

## *> Exercise*

Divide each of your corpora into two sets, one for training our classifier and one for testing. Split each text into a list of sentences and combine these to create four master lists: author 1 training, author 1 testing, author 2 training, author 2 testing.

Create a Naive Bayes classifier using your two training sets. Run the classifier on each sentence in your test sets and calculate the accuracy of your model.
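
The splitting and scoring steps can be sketched without any classifier at all. The helper names `split_train_test` and `accuracy`, the 80/20 split, and the fixed seed are all assumptions chosen for illustration; the labels in the last call are invented.

```python
import random

def split_train_test(items, train_fraction=0.8, seed=0):
    """Shuffle a copy of items and split it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # a fixed seed makes the split repeatable
    cutoff = int(len(shuffled) * train_fraction)
    return shuffled[:cutoff], shuffled[cutoff:]

def accuracy(predicted, actual):
    """Return the fraction of predictions that match the true labels."""
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)

# Stand-in data: ten placeholder strings instead of real texts.
toy_items = ["item %d" % i for i in range(10)]
train, test = split_train_test(toy_items)
print(len(train), len(test))

# Invented predictions and true labels for four test sentences.
print(accuracy(["emerson", "wilde", "wilde", "emerson"],
               ["emerson", "wilde", "emerson", "emerson"]))
```

Shuffling before the split avoids putting, say, all of an author's early works in the training set and all of the late works in the test set.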

Examine sentences that were misclassified. Why do you think the algorithm was misled?

