# Preprocessing

Like other data types, text data never comes clean. Moreover, most of our downstream methods only accept data structured in a particular way. Because of this, we will always need to perform some level of preprocessing before applying any computational text analysis technique. Text data calls for its own kind of preprocessing. In this notebook, we will cover the core preprocessing methods in preparation for our next two weeks:

- Reading in files
- Character encoding
- Tokenization
- Sentence segmentation
- Removing punctuation
- Stripping whitespace
- Text normalization
- Stop words
- Stemming/Lemmatizing
- POS tagging
- DTM/TF-IDF

### Time
- Teaching: 50 minutes
- Exercises: 60 minutes

## Reading in files

The first step is to read in the files containing the data. As we discussed last week, the most common file types for text data are: `.txt`, `.csv`, `.json`, `.html` and `.xml`.

#### Reading in `.txt` files

Python has built-in support for reading in `.txt` files.

- What type of object is `raw`?
- How many characters are in `raw`?
- Get the first 1000 characters of `raw`?

In [1]:
import os
DATA_DIR = 'data'
fname = 'pride-and-prejudice.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname, encoding='utf-8') as f:
    raw = f.read()

#### Reading in `.csv`

Python has a built-in module called `csv` for reading in csv files.

- What type is `tweets`?
- How many entries are in `tweets`?
- Which entry is the header row?
- How can we get the text of the first tweet?
- How can we get a list of the texts of all tweets?

In [2]:
import csv
fname = 'trump-tweets.csv'
fname = os.path.join(DATA_DIR, fname)
tweets = []
with open(fname, newline='', encoding='utf-8') as f:  # newline='' is what the csv docs recommend
    reader = csv.reader(f)
    tweets = list(reader)
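If hunting for the header row by position feels fragile, `csv.DictReader` is one alternative: it uses the first row as dictionary keys. The sketch below runs on a tiny in-memory sample rather than the real file. The rows and the `Retweets` values are made up for illustration; only the column names are borrowed from the dataset.

```python
import csv
import io

# Hypothetical two-row sample standing in for trump-tweets.csv.
sample = io.StringIO(
    "Date,Tweet_Text,Retweets\n"
    "16-11-11,Busy day planned in New York.,10\n"
    "16-11-10,\"Happy birthday, Marines!\",20\n"
)

reader = csv.DictReader(sample)   # the first row becomes the dictionary keys
rows = list(reader)               # a list of dictionaries, header excluded

first_text = rows[0]['Tweet_Text']
all_texts = [row['Tweet_Text'] for row in rows]
print(first_text)
print(all_texts)
```

Note that the quoted field containing a comma is kept as a single value, which is the main reason to use the `csv` module instead of splitting lines on commas by hand.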

#### Reading in `.csv` with `pandas`

`pandas` is a third-party library that makes working with tabular data much easier. This is the recommended way to read in a `.csv` file.

- How many tweets are there?
- What happened to the header row?

In [3]:
import pandas as pd
fname = 'trump-tweets.csv'
fname = os.path.join(DATA_DIR, fname)
tweets = pd.read_csv(fname)

In [4]:
tweets.head(3)

|   | Date | Time | Tweet_Text | Type | Media_Type | Hashtags | Tweet_Id | Tweet_Url | twt_favourites_IS_THIS_LIKE_QUESTION_MARK | Retweets | Unnamed: 10 | Unnamed: 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16-11-11 | 15:26:37 | Today we express our deepest gratitude to all ... | text | photo | ThankAVet | 7.97e+17 | https://twitter.com/realDonaldTrump/status/797... | 127213 | 41112 | | |
| 1 | 16-11-11 | 13:33:35 | Busy day planned in New York. Will soon be mak... | text | | | 7.97e+17 | https://twitter.com/realDonaldTrump/status/797... | 141527 | 28654 | | |
| 2 | 16-11-11 | 11:14:20 | Love the fact that the small groups of protest... | text | | | 7.97e+17 | https://twitter.com/realDonaldTrump/status/797... | 183729 | 50039 | | |


In [5]:
tweet_text = list(tweets['Tweet_Text'])
tweet_text[:4]

['Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet https://t.co/wPk7QWpK8Z',
 'Busy day planned in New York. Will soon be making some very important decisions on the people who will be running our government!',
 'Love the fact that the small groups of protesters last night have passion for our great country. We will all come together and be proud!',
 'Just had a very open and successful presidential election. Now professional protesters, incited by the media, are protesting. Very unfair!']

#### Reading in `.json` files

Python has built-in support for reading in `.json` files.

- How many questions are there in the dataset?
- What data type is each question?
- How can we access the question text of the first question?
- How can we get a list of the texts of all questions?

In [6]:
import json
fname = 'jeopardy.json'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    data = json.load(f)

In [7]:
data[:3]

[{'air_date': '2004-12-31',
  'answer': 'Copernicus',
  'category': 'HISTORY',
  'question': "'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'",
  'round': 'Jeopardy!',
  'show_number': '4680',
  'value': '$200'},
 {'air_date': '2004-12-31',
  'answer': 'Jim Thorpe',
  'category': "ESPN's TOP 10 ALL-TIME ATHLETES",
  'question': "'No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves'",
  'round': 'Jeopardy!',
  'show_number': '4680',
  'value': '$200'},
 {'air_date': '2004-12-31',
  'answer': 'Arizona',
  'category': 'EVERYBODY TALKS ABOUT IT...',
  'question': "'The city of Yuma in this state has a record average of 4,055 hours of sunshine each year'",
  'round': 'Jeopardy!',
  'show_number': '4680',
  'value': '$200'}]
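The questions above can be answered with ordinary list and dictionary operations. Here is a minimal sketch that uses two of the records shown above, embedded as a JSON string so it runs without the data file (hence `json.loads` rather than `json.load`).

```python
import json

# Two records copied from the output above, trimmed to a few fields.
sample = '''[
  {"category": "HISTORY", "answer": "Copernicus", "value": "$200",
   "question": "'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'"},
  {"category": "EVERYBODY TALKS ABOUT IT...", "answer": "Arizona", "value": "$200",
   "question": "'The city of Yuma in this state has a record average of 4,055 hours of sunshine each year'"}
]'''

data = json.loads(sample)            # parses a string; json.load reads from a file
print(type(data), type(data[0]))     # a list of dictionaries

first_question = data[0]['question']
questions = [entry['question'] for entry in data]
print(len(questions))
```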

#### Reading in `.html` files

The best way to read in `.html` files in Python is with the `BeautifulSoup` package.

In [8]:
from bs4 import BeautifulSoup
fname = 'time.html'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    html = f.read()
    # Name the parser explicitly ('lxml' must be installed); leaving it
    # out makes BeautifulSoup guess a parser and emit a warning.
    soup = BeautifulSoup(html, 'lxml')

In [9]:
texts = soup.find_all(string=True)
#texts = soup.get_text()
texts[:5]

['html', '\n', '\n', '\n', 'Time - Wikipedia']

#### Reading in `.xml` files

We read in `.xml` files using the `ElementTree` module of Python's standard library. We can think of `.xml` files as trees where each branch has a tag name. We can find all the branches with a certain name as follows:

In [10]:
from xml.etree import ElementTree as ET
fname = 'books.xml'
fname = os.path.join(DATA_DIR, fname)
e = ET.parse(fname)
root = e.getroot()

In [11]:
descriptions = root.findall('*/description')
text = [d.text for d in descriptions]
text[:3]

['An in-depth look at creating applications \n      with XML.',
 'A former architect battles corporate zombies, \n      an evil sorceress, and her own childhood to become queen \n      of the world.',
 'After the collapse of a nanotechnology \n      society in England, the young survivors lay the \n      foundation for a new society.']
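Notice the `\n` and runs of spaces inside the descriptions above: they come from line breaks and indentation in the XML source. The sketch below shows one way to tidy them up. The small catalog is a hypothetical stand-in shaped the way `books.xml` appears to be, parsed from a string so the example is self-contained.

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature catalog; only the first description is taken
# from the output above.
xml_source = """
<catalog>
  <book id="bk101">
    <title>XML Developer's Guide</title>
    <description>An in-depth look at creating applications
      with XML.</description>
  </book>
  <book id="bk102">
    <title>Midnight Rain</title>
    <description>A former architect battles corporate zombies.</description>
  </book>
</catalog>
""".strip()

root = ET.fromstring(xml_source)               # parse from a string instead of a file
descriptions = root.findall('*/description')   # <description> under any child of the root

# split() with no arguments breaks on any whitespace, so joining the
# pieces with single spaces removes the embedded line breaks.
text = [' '.join(d.text.split()) for d in descriptions]
print(text)
```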

#### Reading in multiple files

Often, our text data is split across multiple files in a folder. We want to be able to read them all into a single variable.

- What type is `austen`?
- What type is `fnames` after it is first assigned a value?
- What type is `fnames` after it is assigned a second value?
- How 

In [12]:
import glob
fnames = os.path.join(DATA_DIR, 'austen', '*.txt')
fnames = glob.glob(fnames)
austen = ''
for fname in fnames:
    with open(fname) as f:
        text = f.read()
        austen += text
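A common variation on the loop above is to collect each file's text in a list and join the pieces at the end, which also keeps the individual documents around if we need them later. This is a sketch under the assumption that the files are UTF-8; it creates two throwaway files in a temporary folder so that it runs anywhere.

```python
import glob
import os
import tempfile

# Create two small stand-in files so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
for name, content in [('a.txt', 'First file.\n'), ('b.txt', 'Second file.\n')]:
    with open(os.path.join(tmp_dir, name), 'w', encoding='utf-8') as f:
        f.write(content)

# sorted() makes the order predictable; glob itself promises no order.
fnames = sorted(glob.glob(os.path.join(tmp_dir, '*.txt')))

texts = []
for fname in fnames:
    with open(fname, encoding='utf-8') as f:
        texts.append(f.read())

combined = ''.join(texts)   # one long string, like `austen`
print(len(texts), repr(combined))
```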

### Challenge

Read in all the `.csv` files in the folder `amazon`. Extract out only the text column from each file and store them all in a list called `reviews`.

## Character encoding

Character encoding used to be a bigger problem, both in Python 2 and in earlier years generally. With Python 3 and most text files now encoded in `UTF-8`, we don't often need to think about it. If you're getting nonsense (or a `UnicodeDecodeError`) when reading in a file, try adding `encoding='utf-8'` to the `open` function; without it, Python falls back on a platform-dependent default.

In [13]:
fname = 'dante.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    text = f.read()

In [14]:
text[5000:6000]

'oglia.\n\n  Questi non ciberà terra né peltro,\n  ma sapïenza, amore e virtute,\n  e sua nazion sarà tra feltro e feltro.\n\n  Di quella umile Italia fia salute\n  per cui morì la vergine Cammilla,\n  Eurialo e Turno e Niso di ferute.\n\n  Questi la caccerà per ogne villa,\n  fin che l’avrà rimessa ne lo ’nferno,\n  là onde ’nvidia prima dipartilla.\n\n  Ond’ io per lo tuo me’ penso e discerno\n  che tu mi segui, e io sarò tua guida,\n  e trarrotti di qui per loco etterno;\n\n  ove udirai le disperate strida,\n  vedrai li antichi spiriti dolenti,\n  ch’a la seconda morte ciascun grida;\n\n  e vederai color che son contenti\n  nel foco, perché speran di venire\n  quando che sia a le beate genti.\n\n  A le quai poi se tu vorrai salire,\n  anima fia a ciò più di me degna:\n  con lei ti lascerò nel mio partire;\n\n  ché quello imperador che là sù regna,\n  perch’ i’ fu’ ribellante a la sua legge,\n  non vuol che ’n sua città per me si vegna.\n\n  In tutte parti impera e quivi regge;\n  qu

In [15]:
fname = 'akutagawa.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    text = f.read()

In [16]:
text[5000:6000]

'二人は屍骸の中で、暫、無言のまま、つかみ合った。しか\nし勝負は、はじめから、わかっている。下人はとうとう、老婆の腕をつかんで、無理に\nそこへねじ倒した。丁度、鶏（とり）の脚のような、骨と皮ばかりの腕である。\n\u3000「何をしていた。さあ何をしていた。云え。云わぬと\u3000これだぞよ。」\n\u3000下人は、老婆をつき放すと、いきなり、太刀の鞘を払って、白い鋼（はがね）の色を\nその眼の前へつきつけた。けれども、老婆は黙っている。両手をわなわなふるわせて、\n肩で息を切りながら、眼を、眼球がまぶたの外へ出そうになる程、見開いて、唖のよう\nに執拗（しゅうね）く黙っている。これを見ると、下人は始めて明白にこの老婆の生死\nが、全然、自分の意志に支配されていると云う事を意識した。そうして、この意識は、\n今まではげしく燃えていた憎悪の心を何時（いつ）の間にか冷ましてしまった。後に残っ\nたのは、唯、或仕事をして、それが円満に成就した時の、安らかな得意と満足とがある\nばかりである。そこで、下人は、老婆を、見下げながら、少し声を柔げてこう云った。\n\u3000「己は検非違使（けびいし）の庁の役人などではない。今し方この門の下を通りかかっ\nた旅の者だ。だからお前に縄をかけて、どうしようと云うような事はない。唯今時分、\nこの門の上で、何をしていたのだか、それを己に話さえすればいいのだ。」\n\u3000すると、老婆は、見開いた眼を、一層大きくして、じっとその下人の顔を見守った。\nまぶたの赤くなった、肉食鳥のような、鋭い眼で見たのである。それから、皺で、殆、\n鼻と一つになった唇を何か物でも噛んでいるように動かした。細い喉で、尖った喉仏の\n動いているのが見える。その時、その喉から、鴉（からす）の啼くような声が、喘ぎ喘\nぎ、下人の耳へ伝わって来た。\n\u3000「この髪を抜いてな、この女の髪を抜いてな、鬘（かつら）にしようと思うたの\nじゃ。」\n\u3000下人は、老婆の答が存外、平凡なのに失望した。そうして失望すると同時に、又前の\n憎悪が、冷な侮蔑と一しょに、心の中へはいって来た。すると\u3000その気色（けしき）が、\n先方へも通じたのであろう。老婆は、片手に、まだ屍骸の頭から奪（と）った長い抜け\n毛を持ったなり、蟇（ひき）のつぶやくよう
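To see what "nonsense" looks like, we can encode a word by hand and then decode it with the wrong encoding. This is only an illustration using one word from the Dante text; `latin-1` is just one example of a mismatched encoding.

```python
word = 'perché'

utf8_bytes = word.encode('utf-8')        # 'é' becomes two bytes in UTF-8
print(len(word), len(utf8_bytes))        # characters vs. bytes

# Decoding with the wrong encoding does not fail here; it silently
# produces the wrong characters.
garbled = utf8_bytes.decode('latin-1')
print(garbled)

# Decoding with the encoding the bytes were written in recovers the text.
restored = utf8_bytes.decode('utf-8')
print(restored == word)
```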

## Tokenization

Once we've read in the data, our next step is often to split it into words. This step is referred to as "tokenization". That's because each occurrence of a word is called a "token". Each distinct word used is called a word "type". So the word type "the" may correspond to multiple tokens of "the" in a text.
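The difference between tokens and types is easy to check on a made-up sentence. The sketch below uses plain whitespace splitting, which is good enough here because the sentence has no punctuation.

```python
sentence = 'the cat sat on the mat because the mat was warm'

tokens = sentence.split()   # every occurrence counts
types = set(tokens)         # each distinct word counts once

n_tokens = len(tokens)
n_types = len(types)
n_the = tokens.count('the')   # tokens of the type "the"
print(n_tokens, n_types, n_the)
```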

#### Tokenizing by whitespace

- What problems do you notice with tokenizing by whitespace?
- What type is `text`?
- What type is `tokens`?
- What type is each element of `tokens`?

In [65]:
import os
fname = 'example1.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    text = f.read()

In [18]:
text.split()[:10]

['In',
 'this',
 'little',
 'example,',
 "we're",
 'going',
 'to',
 'see',
 'some',
 'of']

#### Tokenizing with regular expressions

In [19]:
import re
word_pattern = r'\w+'
tokens = re.findall(word_pattern, text)
tokens[:10]

['In', 'this', 'little', 'example', 'we', 're', 'going', 'to', 'see', 'some']
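Notice that `\w+` splits "we're" into "we" and "re". One possible refinement, shown here on a made-up sentence, is to let a word optionally continue across an apostrophe. It is a sketch, not a complete tokenizer: it still drops all other punctuation.

```python
import re

sample = "In this little example, we're going to see why it's hard."

# \w+ matches a run of word characters; (?:'\w+)* optionally keeps
# apostrophe-joined pieces such as "'re" attached to the word.
word_pattern = r"\w+(?:'\w+)*"
tokens = re.findall(word_pattern, sample)
print(tokens)
```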

#### Tokenizing with `nltk`

[Just a bunch of regular expressions under the hood](https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py)

In [20]:
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
tokens[:10]

['In', 'this', 'little', 'example', ',', 'we', "'re", 'going', 'to', 'see']

#### Challenge

A while ago you read a bunch of Jane Austen books into a variable called `austen`. Tokenize that using a method of your choice. Find all the unique word types (you might want the `set` function). Sort the resulting set object to create a vocabulary (you might want to use the `sorted` function).

In [61]:
tokens = word_tokenize(austen)
tokens[0]

'\ufeffThe'

In [62]:
tokens[:10]

['\ufeffThe',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'Emma',
 ',',
 'by',
 'Jane',
 'Austen']
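The `'\ufeff'` glued to the first token is a byte order mark, an invisible character that some files carry at their very start. Python's `utf-8-sig` codec drops it if present, and the same name can be passed as `encoding='utf-8-sig'` to `open`. The sketch below demonstrates this on a few hand-written bytes and then builds a small sorted vocabulary; whitespace splitting is used only to keep the example short.

```python
raw_bytes = b'\xef\xbb\xbfThe Project Gutenberg EBook of Emma'

with_bom = raw_bytes.decode('utf-8')         # keeps the marker as '\ufeff'
without_bom = raw_bytes.decode('utf-8-sig')  # drops a leading marker if present

print(repr(with_bom[:4]))
print(repr(without_bom[:4]))

# A sorted vocabulary from the cleaned text (capital letters sort first).
vocabulary = sorted(set(without_bom.split()))
print(vocabulary)
```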

## Sentence segmentation

Sentence segmentation involves identifying the boundaries of sentences.

#### Sentence segmentation by splitting on punctuation

In [66]:
text.split('.')

["In this little example, we're going to see some of the problems that regularly appear in tokenization",
 " Tokenization may seem simple, but it's harder than it first appears",
 " Why is it so hard? Punctuations, contractions (like don't, won't and would've) get in the way",
 ' \n']

We could improve on this by using regular expressions. They'll allow us to split strings on any one of several characters.

In [67]:
sent_boundary_pattern = r'[.?!]'
re.split(sent_boundary_pattern, text)

["In this little example, we're going to see some of the problems that regularly appear in tokenization",
 " Tokenization may seem simple, but it's harder than it first appears",
 ' Why is it so hard',
 " Punctuations, contractions (like don't, won't and would've) get in the way",
 ' \n']
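The split above throws away the punctuation and leaves a space at the start of each sentence. One way around both, sketched here on a made-up string, is to split on the whitespace that follows a sentence-final character instead of on the character itself.

```python
import re

sample = "Tokenization may seem simple. Why is it so hard? Punctuation gets in the way!"

# (?<=[.?!]) is a lookbehind: split on whitespace only when it follows
# sentence-final punctuation, so the punctuation stays with its sentence.
sent_boundary_pattern = r'(?<=[.?!])\s+'
sentences = re.split(sent_boundary_pattern, sample)
print(sentences)
```

This pattern would still break after abbreviations such as "Dr." because it only looks at the punctuation mark and the whitespace after it.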

### Challenge

The file `example2.txt` has more punctuation problems. Read it in and see what the problems are. Try your best to modify the code from above to work for as many cases as you can.

#### Sentence segmentation by `nltk`

In [68]:
from nltk.tokenize import sent_tokenize
fname = 'example2.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    text = f.read()
sent_tokenize(text)

["In this little example, we're going to see some of the problems that regularly appear in tokenization.",
 "Tokenization may seem simple, but it's harder than it first appears.",
 'Why is it so hard?',
 "Punctuations, contractions (like don't, won't and would've) get in the way.",
 "We can split text into sentences using punctuation, but unfortunately that's not always going to work.",
 "For example, if I wanted to tell you about Dr. Frankenstein, or Mrs. Doubtfire, we'd be in trouble.",
 'What if I wanted to write about U.C.',
 'Berkeley?',
 'When you think about it, URLs like www.google.com are troublesome too.',
 'How would we settle on a price of $10.50?',
 'The main point is that these punctuation characters serve a variety of purposes in writing.',
 'Moreover, the functions they serve change depending on the domain (medical vs forum text) and language.']

## Removing punctuation

Sometimes (although admittedly less frequently than tokenizing and sentence segmentation), you might want to keep only the alphanumeric characters (i.e. the letters and numbers) and ditch the punctuation. Here's how we can do that.

- What type is `punctuation`?

In [70]:
from string import punctuation
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [26]:
no_punct = ''.join([ch for ch in text if ch not in punctuation])
no_punct

'In this little example were going to see some of the problems that regularly appear in tokenization Tokenization may seem simple but its harder than it first appears Why is it so hard Punctuations contractions like dont wont and wouldve get in the way \n\nWe can split text into sentences using punctuation but unfortunately thats not always going to work For example if I wanted to tell you about Dr Frankenstein or Mrs Doubtfire wed be in trouble What if I wanted to write about UC Berkeley When you think about it URLs like wwwgooglecom are troublesome too How would we settle on a price of 1050 The main point is that these punctuation characters serve a variety of purposes in writing Moreover the functions they serve change depending on the domain medical vs forum text and language'
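An alternative to the list comprehension is the string method `translate`, which deletes characters according to a lookup table. The sketch below applies it to one sentence from the example text.

```python
from string import punctuation

sample = "Contractions (like don't, won't and would've) get in the way."

# str.maketrans('', '', chars) builds a table that deletes every
# character in chars.
table = str.maketrans('', '', punctuation)
no_punct = sample.translate(table)
print(no_punct)
```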

## Strip whitespace

This is an extremely common step. It's simple to perform and nicely pre-packaged in Python. It's particularly common for user-generated text (think survey forms).

In [71]:
fname = 'example3.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    text = f.read()

In [73]:
print(text)



This is a text file that has some extra whitespace at the start and end. Whitespace is a catch-all term for spaces, tabs, newlines, and a bunch of other things that computers distinguish but to us all look like spaces, tabs and newlines.


The Python method called "strip" only catches whitespace at the start and end of a string. But it won't catch it in       the middle,		for example,

in this sentence.		Once again, regular expressions will

help		us    with this.






In [74]:
stripped_text = text.strip()
print(stripped_text)

This is a text file that has some extra whitespace at the start and end. Whitespace is a catch-all term for spaces, tabs, newlines, and a bunch of other things that computers distinguish but to us all look like spaces, tabs and newlines.


The Python method called "strip" only catches whitespace at the start and end of a string. But it won't catch it in       the middle,		for example,

in this sentence.		Once again, regular expressions will

help		us    with this.


In [75]:
whitespace_pattern = r'\s+'
clean_text = re.sub(whitespace_pattern, ' ', text)
clean_text

' This is a text file that has some extra whitespace at the start and end. Whitespace is a catch-all term for spaces, tabs, newlines, and a bunch of other things that computers distinguish but to us all look like spaces, tabs and newlines. The Python method called "strip" only catches whitespace at the start and end of a string. But it won\'t catch it in the middle, for example, in this sentence. Once again, regular expressions will help us with this. '
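The result above still has a space at each end, because the regular expression turns the leading and trailing runs of whitespace into single spaces. Chaining the two steps fixes that. The string below is a shortened, hand-typed stand-in for the file.

```python
import re

messy = "\n\n  it won't catch it in       the middle,\t\tfor example,\n\nin this sentence.  \n"

# Collapse every run of whitespace to one space, then trim both ends.
clean_text = re.sub(r'\s+', ' ', messy).strip()
print(repr(clean_text))

# Splitting on whitespace and re-joining gives the same result without a regex.
same = ' '.join(messy.split())
print(clean_text == same)
```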

## Text normalization

Text normalization means making our text fit some standard patterns. Lots of steps come under this wide umbrella, but the most common are:

- case folding
- removing URLs, digits, hashtags
- OOV words (removing infrequent words)

#### Case folding

Case folding means dealing with upper and lower case characters. This is usually done by making all characters lower case.

In [31]:
fname = 'example4.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    text = f.read()
text

'Upper and lower case characters can be annoying. Characters are the individual letters and numbers that we see on the page. Case folding is the generic term we use for dealing with upper and lower case characters. Lower case is often what people want. Title Case refers to a multi-word expression with the first character of every word in upper case. '
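On a single string, case folding is one method call. Here is a quick sketch on a made-up sentence, showing `lower` next to a few related string methods. The German word is only there to show where `casefold` and `lower` differ.

```python
sample = 'Title Case refers to a Multi-Word Expression.'

lowered = sample.lower()    # lower-case every character
uppered = sample.upper()    # upper-case every character
print(lowered)
print(uppered)

# str.casefold is a more aggressive variant meant for caseless
# comparison; it differs from lower() for some non-English characters.
print('Straße'.lower(), 'Straße'.casefold())
```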

In [77]:
['One', 'Two'].lower()

AttributeError: 'list' object has no attribute 'lower'

### Challenge

The `lower` method is a string method, that is, it works on strings but not on lists, as the error above shows. But what if you want to lowercase every word in a list (say you've already tokenized the text)? Take the list of tokens below and make each one lower case.

In [86]:
tokens = word_tokenize(text)
lowercase_tokens = []
for token in tokens:
    lowercased_version = token.lower()
    lowercase_tokens.append(lowercased_version)

#### Removing URLs, digits and hashtags

We rarely care about the exact URL used in a tweet, or the exact number. We could remove them completely (think about how we'd do that), but it's often informative to know that there is a URL or a digit in the text. So we want to replace individual URLs and digits with a placeholder that preserves the fact that a URL or digit was there. It's standard to just use the strings "URL" and "DIGIT".

How do we do this? Once again, regular expressions save the day.

In [34]:
url_pattern = r'https?:\/\/.*[\r\n]*'
single_tweet = tweet_text[0]
single_tweet

'Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet https://t.co/wPk7QWpK8Z'

In [35]:
URL_SIGN = ' URL '
re.sub(url_pattern, URL_SIGN, single_tweet)

'Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet  URL '
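One thing to watch with the pattern above: `.*` matches everything up to the end of the line, so any words that come after the URL are replaced along with it. That makes no difference for this tweet, where the URL comes last. The sketch below compares it with a tighter pattern on a made-up tweet (the URL is invented); `\S+` stops at the first whitespace character.

```python
import re

tweet = 'Read this https://t.co/abc123 before the rally tonight!'

greedy_pattern = r'https?:\/\/.*[\r\n]*'   # the pattern used above
tight_pattern = r'https?://\S+'            # stops at the first whitespace

greedy_result = re.sub(greedy_pattern, ' URL ', tweet)
tight_result = re.sub(tight_pattern, ' URL ', tweet)
print(repr(greedy_result))
print(repr(tight_result))
```

Which behavior is right depends on the data, but with the tighter pattern the text after a URL survives.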

#### Challenge

Above we replaced the URL in a single tweet. Now replace all the URLs in all tweets in `tweet_text`.

In [91]:
url_pattern = r'https?:\/\/.*[\r\n]*'
URL_SIGN = ' URL '
list_of_url_less_tweets = []
for tweet in tweet_text:
    url_less_tweet = re.sub(url_pattern, URL_SIGN, tweet)
    list_of_url_less_tweets.append(url_less_tweet)

In [93]:
# The same thing written as a list comprehension
list_of_url_less_tweets = [re.sub(url_pattern, URL_SIGN, tweet) for tweet in tweet_text]

In [94]:
list_of_url_less_tweets

['Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet  URL ',
 'Busy day planned in New York. Will soon be making some very important decisions on the people who will be running our government!',
 'Love the fact that the small groups of protesters last night have passion for our great country. We will all come together and be proud!',
 'Just had a very open and successful presidential election. Now professional protesters, incited by the media, are protesting. Very unfair!',
 'A fantastic day in D.C. Met with President Obama for first time. Really good meeting, great chemistry. Melania liked Mrs. O a lot!',
 'Happy 241st birthday to the U.S. Marine Corps! Thank you for your service!!  URL ',
 'Such a beautiful and important evening! The forgotten man and woman will never be forgotten again. We will all come together as never before',
 'Watching the returns at 9:45pm.\n#ElectionNight #MAGA__  URL ',
 'RT @IvankaTrump: Such a surreal moment

#### Challenge

Use the regular expression for hashtags below to replace all hashtags in all tweets in `tweet_text`.

In [36]:
hashtag_pattern = r'(?:^|\s)[＃#]{1}(\w+)'
HASHTAG_SIGN = ' HASHTAG '
digit_pattern = r'\d+'
DIGIT_SIGN = ' DIGIT '

#### OOV words

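OOV stands for "out of vocabulary". Sometimes it's best for us to remove infrequent words (sometimes not!). When we do remove infrequent words, it's often because a downstream method (like classification) is sensitive to rare words. Below, we replace every word that occurs only once with the placeholder `OOV`.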

In [37]:
all_tweets = ' '.join(tweet_text)
clean = re.sub(url_pattern, URL_SIGN, all_tweets)
clean = re.sub(hashtag_pattern, HASHTAG_SIGN, clean)
clean = re.sub(digit_pattern, DIGIT_SIGN, clean)
tokens = word_tokenize(clean)
tokens = [token for token in tokens if token not in punctuation]
tokens[:20]

['Today',
 'we',
 'express',
 'our',
 'deepest',
 'gratitude',
 'to',
 'all',
 'those',
 'who',
 'have',
 'served',
 'in',
 'our',
 'armed',
 'forces',
 'HASHTAG',
 'URL',
 'HASHTAG',
 'HASHTAG']

We can count the frequency of each word type with `Counter` from Python's built-in `collections` module. Given a list of tokens, it builds a special Python dictionary whose keys are the word types (what we called the vocabulary above) and whose values are the number of times each type appears in the list. We can ask that dictionary for the most common words, or for the frequency of individual word types.

In [38]:
from collections import Counter
freq = Counter(tokens)
freq.most_common(10)

[('URL', 932),
 ('HASHTAG', 717),
 ('DIGIT', 258),
 ('the', 87),
 ('in', 76),
 ('to', 72),
 ('of', 61),
 ('you', 57),
 ('I', 56),
 ('is', 54)]

In [39]:
freq['unleashed']

1

In [40]:
OOV = 'OOV'
new_tokens = []
for token in tokens:
    if freq[token] == 1:
        new_tokens.append(OOV)
    else:
        new_tokens.append(token)

In [41]:
new_tokens[:20]

['OOV',
 'we',
 'OOV',
 'our',
 'OOV',
 'OOV',
 'to',
 'all',
 'those',
 'who',
 'have',
 'OOV',
 'in',
 'our',
 'OOV',
 'OOV',
 'HASHTAG',
 'URL',
 'HASHTAG',
 'HASHTAG']
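The loop above treats a word as rare if it appears exactly once. If we want to try other cut-offs, it may help to wrap the logic in a function. `replace_rare` and its `min_count` parameter are names made up for this sketch, and the token list is a toy example.

```python
from collections import Counter

def replace_rare(tokens, min_count=2, oov='OOV'):
    """Replace tokens seen fewer than min_count times with the oov marker."""
    freq = Counter(tokens)
    return [tok if freq[tok] >= min_count else oov for tok in tokens]

tokens = ['we', 'will', 'all', 'come', 'together', 'we', 'will', 'win']
new_tokens = replace_rare(tokens)
print(new_tokens)
```

With the default `min_count=2` this behaves like the loop above: only words that occur once are replaced.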

### Challenge

Below, I've read some of the Amazon reviews from earlier into a list called `reviews`. Each element of the list is a string, representing the text of a single review. Try to:
- Tokenize each review
- Separate each review into sentences
- Strip all whitespace
- Make all characters lower case
- Replace any URLs and digits

Then find the most common 50 words.

In [42]:
fnames = os.path.join(DATA_DIR, 'amazon', '*.csv')
fnames = glob.glob(fnames)
reviews = []
column_names = ['id', 'product_id', 'user_id', 'profile_name', 'helpfulness_num', 'helpfulness_denom',
               'score', 'time', 'summary', 'text']
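for fname in fnames[:2]:
    # Passing `names` without `header=0` keeps each file's own header row
    # as a data row, which is why 'Text' shows up in the output below.
    df = pd.read_csv(fname, names=column_names)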
    text = list(df['text'])
    reviews.extend(text)

In [43]:
reviews[:3]

['Text',
 'I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.',
 'Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".']

## Removing stop words

You might have noticed that the most common words above aren't terribly exciting. They're words like "am", "i", "the" and "a": stop words. These are rarely useful to us in computational text analysis, so it's very common to remove them completely.

- What other stop words do you think there are?

In [44]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
stop

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 'her',
 'hers',
 'herself',
 'it',
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 'few',
 'more',
 'most',
 'other',
 'some',
 'such',
 'no',
 'nor',
 '
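Filtering against a stop list is a one-line list comprehension. The sketch below uses a small hand-picked subset of the list above, so that it runs without NLTK, and the first few tokens of the first tweet. Two details are worth noticing: the stop list is stored as a set, which makes membership checks fast, and tokens are lower-cased before the check because the stop list is all lower case.

```python
# A hand-picked subset of the NLTK list; with NLTK available you could
# use set(stopwords.words('english')) instead.
stop = {'we', 'our', 'to', 'all', 'those', 'who', 'have', 'in', 'the'}

tokens = ['Today', 'we', 'express', 'our', 'deepest', 'gratitude',
          'to', 'all', 'those', 'who', 'have', 'served']

# Compare the lower-cased token so that 'We' or 'The' would be caught too.
content_tokens = [tok for tok in tokens if tok.lower() not in stop]
print(content_tokens)
```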

### Challenge

Use the list `stop` of English stopwords to remove stopwords from our dataset of Tweets.

In [45]:
all_tweets = ' '.join(tweet_text)
clean = re.sub(url_pattern, URL_SIGN, all_tweets)
clean = re.sub(hashtag_pattern, HASHTAG_SIGN, clean)
clean = re.sub(digit_pattern, DIGIT_SIGN, clean)
tokens = word_tokenize(clean)
tokens = [token for token in tokens if token not in punctuation]
tokens[:20]

['Today',
 'we',
 'express',
 'our',
 'deepest',
 'gratitude',
 'to',
 'all',
 'those',
 'who',
 'have',
 'served',
 'in',
 'our',
 'armed',
 'forces',
 'HASHTAG',
 'URL',
 'HASHTAG',
 'HASHTAG']

## Stemming/lemmatization

Stemming and lemmatization both refer to removing morphological affixes from words. For example, if we stem the word "grows", we get "grow". If we stem the word "running", we get "run". We do this because often we care more about the core content of the word (i.e. that it has something to do with growth or running, rather than the fact that it's a third person present tense verb, or progressive participle). The difference is that a stemmer strips affixes by rule, so its output need not be a real word ("leaves" becomes "leav" below), while a lemmatizer maps a word to its dictionary form ("leaf").

NLTK provides many algorithms for stemming. For English, a great baseline is the [Porter](https://github.com/nltk/nltk/blob/develop/nltk/stem/porter.py) algorithm, which in spirit isn't that far from a bunch of regular expressions.

In [46]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [47]:
stemmer.stem('grows')

'grow'

In [48]:
stemmer.stem('running')

'run'

In [49]:
stemmer.stem('leaves')

'leav'

In [50]:
from nltk.stem import SnowballStemmer, WordNetLemmatizer
snowballer_stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

In [51]:
print(snowballer_stemmer.stem('running'))
print(snowballer_stemmer.stem('leaves'))

run
leav


In [52]:
print(lemmatizer.lemmatize('leaves'))

leaf
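To get a feel for what "a bunch of regular expressions" means for a stemmer, here is a toy rule-based suffix stripper. It is emphatically not the Porter algorithm: the four rules are only loosely modeled on the kind of plural-stripping rules such stemmers begin with, and `toy_stem` is a name made up for this sketch.

```python
import re

# Ordered (pattern, replacement) rules; the first rule that matches wins.
RULES = [
    (r'sses$', 'ss'),   # caresses -> caress
    (r'ies$', 'i'),     # ponies   -> poni
    (r'ss$', 'ss'),     # caress   -> caress (unchanged)
    (r's$', ''),        # grows    -> grow
]

def toy_stem(word):
    """Apply the first matching suffix rule, or return the word unchanged."""
    for pattern, replacement in RULES:
        if re.search(pattern, word):
            return re.sub(pattern, replacement, word)
    return word

for word in ['caresses', 'ponies', 'caress', 'grows', 'leaves']:
    print(word, '->', toy_stem(word))
```

Real stemmers stack several rounds of rules like these, with extra conditions on what is left of the word, which is how they reach forms such as "leav".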


### Challenge

Use the Porter stemmer to stem each word in the tweet dataset after having removed stop words.

## POS tagging

POS tagging means assigning each token a part-of-speech (e.g. noun, verb, adjective, etc.). Again, there are many different [alternatives](https://github.com/nltk/nltk/tree/develop/nltk/tag), but NLTK keeps its recommended POS tagger available through the function `pos_tag`. The tagger expects a list of tokens as input. When doing POS tagging, it is advisable **not** to remove stop words beforehand (although you are free to do it afterwards).

In [53]:
from nltk import pos_tag
single_review = reviews[3]
single_review

'This is a confection that has been around a few centuries.  It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar.  And it is a tiny mouthful of heaven.  Not too chewy, and very flavorful.  I highly recommend this yummy treat.  If you are familiar with the story of C.S. Lewis\' "The Lion, The Witch, and The Wardrobe" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch.'

In [54]:
tokens = word_tokenize(single_review)
tagged_review = pos_tag(tokens)
tagged_review

[('This', 'DT'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('confection', 'NN'),
 ('that', 'WDT'),
 ('has', 'VBZ'),
 ('been', 'VBN'),
 ('around', 'IN'),
 ('a', 'DT'),
 ('few', 'JJ'),
 ('centuries', 'NNS'),
 ('.', '.'),
 ('It', 'PRP'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('light', 'JJ'),
 (',', ','),
 ('pillowy', 'JJ'),
 ('citrus', 'NN'),
 ('gelatin', 'NN'),
 ('with', 'IN'),
 ('nuts', 'NNS'),
 ('-', ':'),
 ('in', 'IN'),
 ('this', 'DT'),
 ('case', 'NN'),
 ('Filberts', 'NNP'),
 ('.', '.'),
 ('And', 'CC'),
 ('it', 'PRP'),
 ('is', 'VBZ'),
 ('cut', 'VBN'),
 ('into', 'IN'),
 ('tiny', 'JJ'),
 ('squares', 'NNS'),
 ('and', 'CC'),
 ('then', 'RB'),
 ('liberally', 'RB'),
 ('coated', 'VBN'),
 ('with', 'IN'),
 ('powdered', 'JJ'),
 ('sugar', 'NN'),
 ('.', '.'),
 ('And', 'CC'),
 ('it', 'PRP'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('tiny', 'JJ'),
 ('mouthful', 'NN'),
 ('of', 'IN'),
 ('heaven', 'NN'),
 ('.', '.'),
 ('Not', 'RB'),
 ('too', 'RB'),
 ('chewy', 'JJ'),
 (',', ','),
 ('and', 'CC'),
 ('very', 'RB'),
 ('flavorful', 'J
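The result is a list of `(token, tag)` tuples, so we can filter it with a list comprehension. The sketch below copies the first few pairs from the output above instead of calling the tagger, so it runs without NLTK. The tags follow the Penn Treebank convention, in which noun tags start with `NN` and verb tags with `VB`.

```python
# The first few (token, tag) pairs from the output above.
tagged_review = [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'),
                 ('confection', 'NN'), ('that', 'WDT'), ('has', 'VBZ'),
                 ('been', 'VBN'), ('around', 'IN'), ('a', 'DT'),
                 ('few', 'JJ'), ('centuries', 'NNS'), ('.', '.')]

# Unpack each tuple and keep the token when its tag has the right prefix.
nouns = [token for token, tag in tagged_review if tag.startswith('NN')]
verbs = [token for token, tag in tagged_review if tag.startswith('VB')]
print(nouns)
print(verbs)
```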

### Challenge

Below I've read in the text of Austen's _Pride and Prejudice_ into a variable called `pride`. Preprocess using the following steps:

- Strip whitespace
- Replace all numbers with '0'
- Tokenize
- Tag each token with a POS tag

Make sure you know:
- What type is the result?
- What type is each element of the result?
- What type are the elements of the elements of the result?

In [55]:
fname = 'pride-and-prejudice.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname, encoding='utf-8') as f:
    raw = f.read()
pride = raw[679:684814]

## DTM/TF-IDF

A document-term matrix (DTM) and term frequency-inverse document frequency (TF-IDF) weighting are common ways of taking tokenized texts and turning them into numerical features, ready for supervised machine learning models. Scikit-learn is the standard way to build both in Python. It has two main classes for this: [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer).

In [56]:
clean = [re.sub(url_pattern, URL_SIGN, t) for t in tweet_text]
clean = [re.sub(hashtag_pattern, HASHTAG_SIGN, t) for t in clean]
clean = [re.sub(digit_pattern, DIGIT_SIGN, t) for t in clean]
clean = [re.sub(whitespace_pattern, ' ', t) for t in clean]
clean[:4]

['Today we express our deepest gratitude to all those who have served in our armed forces. HASHTAG URL ',
 'Busy day planned in New York. Will soon be making some very important decisions on the people who will be running our government!',
 'Love the fact that the small groups of protesters last night have passion for our great country. We will all come together and be proud!',
 'Just had a very open and successful presidential election. Now professional protesters, incited by the media, are protesting. Very unfair!']

In [57]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
count = CountVectorizer()
X = count.fit_transform(clean)
X

<7375x10046 sparse matrix of type '<class 'numpy.int64'>'
	with 113679 stored elements in Compressed Sparse Row format>

In [58]:
X.toarray()[:5,:5]

array([[0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]], dtype=int64)

In [59]:
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(clean)
X

<7375x10046 sparse matrix of type '<class 'numpy.float64'>'
	with 113679 stored elements in Compressed Sparse Row format>

In [60]:
X.toarray()[:5,:5]

array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])
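The all-zero corners above are typical: each tweet uses only a handful of the roughly ten thousand word types, so almost every cell of the matrix is zero, which is why scikit-learn stores it in a sparse format. To make the idea of a document-term matrix concrete, here is a pure-Python sketch on three made-up documents. It is not how `CountVectorizer` is implemented, and it skips things that class does by default, such as lower-casing.

```python
from collections import Counter

docs = ['we will all come together',
        'we will win',
        'come together and win']

# One Counter per document, using plain whitespace tokenization.
counts = [Counter(doc.split()) for doc in docs]

# The columns of the matrix: every word type, in sorted order.
vocabulary = sorted(set(word for c in counts for word in c))

# One row per document, one column per word type.
dtm = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in dtm:
    print(row)
```

Each row is a document and each column a word type; TF-IDF keeps the same shape but reweights the counts so that words appearing in many documents count for less.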

## Things we didn't cover

- Named entity recognition
- Syntactic parsing
- Information extraction
- Removing markup from HTML
- Extracting numerical features
- spaCy