# Introduction to Natural Language Processing
> In this post, we will build a solid NLP foundation through basic concepts such as tokenization, stop word handling, and stemming. We will use scikit-learn together with the Natural Language Toolkit (NLTK) package, which is widely used in the NLP field.

- toc: true 
- badges: true
- comments: true
- author: Chanseok Kang
- categories: [Python, Machine_Learning, Natural_Language_Processing]
- image: images/ner_draw.png

## Introduction
Natural-language processing (NLP) is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to fruitfully process large amounts of natural language data (wikipedia).

This rapidly improving area of artificial intelligence covers tasks such as speech recognition, natural-language understanding, and natural language generation.

In the following projects, we're going to be building a strong NLP foundation by practicing:

- Tokenizing - Splitting sentences and words from the body of text.
- Part of Speech tagging
- Chunking

This foundation will open the door for machine learning in conjunction with NLP. We will cover:

- Machine learning in NLP
- How to tie in Scikit-learn (sklearn) with NLTK
- Training classifiers with a dataset (next project)

Let's dive right in! We are going to use the Natural Language Toolkit (NLTK), a suite of libraries and programs for symbolic and statistical natural language processing of English, written in the Python programming language.

## Required packages
As mentioned above, we need the `nltk` package. Installing the package itself is not enough, though: NLTK also needs additional data packages (corpora, models, and so on). We can fetch them with `nltk.download()`.

In [1]:
import nltk
import sys
import sklearn

In [2]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

After that, a GUI window opens where you can choose the packages to download. The simplest choice is the "popular" collection, which downloads and installs the commonly used packages and corpora.

![nltk_download](image/nltk_download.png)

We need a few more corpora for this post. Open the Corpora tab and install the following:

- `state_union`
- `udhr2`
- `udhr`
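
On a machine without a display, the GUI downloader is not an option. `nltk.download()` also accepts a package or collection identifier, so the same resources can be fetched from a script. The sketch below assumes the identifiers `popular`, `state_union`, `udhr`, and `udhr2` named above. The helper `download_resources` and its `downloader` parameter are hypothetical names introduced here so that the logic can be tried without network access.

```python
# Identifiers taken from the list above; "popular" is the collection
# offered in the GUI downloader.
RESOURCES = ["popular", "state_union", "udhr", "udhr2"]


def download_resources(names, downloader=None):
    """Download each resource and report which ones succeeded.

    `downloader` is a hypothetical hook for dry runs; by default
    nltk.download is used, which returns True on success.
    """
    if downloader is None:
        import nltk  # imported lazily so the helper can be tried offline
        downloader = nltk.download
    return {name: bool(downloader(name)) for name in names}


# Dry run with a stand-in downloader (no network access needed).
status = download_resources(RESOURCES, downloader=lambda name: True)
print(status)
```

Calling `download_resources(RESOURCES)` without the second argument performs the real download.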

## Notation

Before we begin, a few terms may be unfamiliar: corpus, lexicon, and token.
A **corpus** is a body of text; the plural is **corpora**. A **lexicon** is a set of words and their meanings. A **token** is each "entity" that results from splitting text according to specific rules. For example, we can tokenize a text by splitting it on spaces.

## Version check

In [3]:
print('Python: {}'.format(sys.version))
print('NLTK: {}'.format(nltk.__version__))
print('Scikit-learn: {}'.format(sklearn.__version__))

Python: 3.7.6 (default, Jan  8 2020, 20:23:39) [MSC v.1916 64 bit (AMD64)]
NLTK: 3.4.5
Scikit-learn: 0.22.1


## Tokenization
When using Natural Language Processing, our goal is to perform some analysis or processing so that a computer can respond to text appropriately.

The process of converting data to something a computer can understand is referred to as "pre-processing." The first pre-processing step is usually tokenization: splitting a body of text into sentences and words.

In [4]:
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Hello students, how are you doing today? The olympics are inspiring, and Python is awesome. You look nice today."

If we want to split this text into sentences,

In [5]:
sent_tokenize(text)

['Hello students, how are you doing today?',
 'The olympics are inspiring, and Python is awesome.',
 'You look nice today.']

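You can see 3 sentences after tokenizing. It may look as if the text is split at capital letters, but `sent_tokenize` relies on a pre-trained Punkt model that decides which punctuation marks end a sentence.

Next, if you want the words contained in the text,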

In [6]:
word_tokenize(text)

['Hello',
 'students',
 ',',
 'how',
 'are',
 'you',
 'doing',
 'today',
 '?',
 'The',
 'olympics',
 'are',
 'inspiring',
 ',',
 'and',
 'Python',
 'is',
 'awesome',
 '.',
 'You',
 'look',
 'nice',
 'today',
 '.']

You can see that nearly every word in the text becomes a token, but characters such as ",", "?", and "." are tokens too. These are **punctuation** marks. They do matter for understanding the intent of a sentence, but we'll cover that later.
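
To get a feel for why punctuation ends up as separate tokens, here is a minimal sketch of a word tokenizer built on the standard `re` module. It is not the algorithm NLTK uses, only an approximation for simple English text, and the regular expression is an assumption made for this illustration.

```python
import re
import string

text = "Hello students, how are you doing today?"

# A word is a run of letters, digits, or underscores; any other
# non-whitespace character becomes its own single-character token.
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)

# Drop the tokens that are pure punctuation.
punctuation = set(string.punctuation)
words_only = [t for t in tokens if t not in punctuation]
print(words_only)
```

Filtering against `string.punctuation` is one simple way to keep only the words when punctuation is not needed.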

## Stop words
A major form of pre-processing is filtering out useless data. In natural language processing, words that carry little information on their own are referred to as **stop words**. Stop words differ from language to language; in this post we use the English list.

In [7]:
from nltk.corpus import stopwords

stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

As mentioned above, such words can matter for understanding intent. Most of the time, though, stop words appear so often that they make it harder for a computer to pick out the meaning of a sentence, so it helps to remove them in advance. Let's see the difference.

In [8]:
example_sent = "This is some sample text, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)

filtered_sent = [w for w in word_tokens if w not in stop_words]

print(word_tokens)
print(filtered_sent)

['This', 'is', 'some', 'sample', 'text', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'text', ',', 'showing', 'stop', 'words', 'filtration', '.']
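
Notice that `'This'` survives the filter: the stop word list is lowercase and the membership test is case-sensitive. A small variation, sketched below, lowercases each token before the lookup. The `stop_words` set here is a hypothetical mini list so that the snippet runs on its own; in the post it would be `set(stopwords.words('english'))`.

```python
# Tokens copied from the output above.
word_tokens = ['This', 'is', 'some', 'sample', 'text', ',', 'showing',
               'off', 'the', 'stop', 'words', 'filtration', '.']

# Hypothetical mini stop list, used only to keep the example self-contained.
stop_words = {'this', 'is', 'some', 'off', 'the'}

# Compare the lowercased token, but keep the original spelling in the result.
filtered_sent = [w for w in word_tokens if w.lower() not in stop_words]
print(filtered_sent)
```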


## Stemming
Stemming, which attempts to normalize words, is another preprocessing step that we can perform. In the English language, different variations of a word often carry the same meaning. Stemming is a way to account for these variations; furthermore, it helps us shorten the sentences and shorten our lookup. For example, consider the following sentences:

- I was taking a ride on my horse.
- I was riding my horse.

These sentences mean the same thing, and both use an `-ing` form (`taking`, `riding`); however, that isn't intuitively understood by the computer. To account for the variations of words in the English language, we can use the `Porter` stemmer, which has been around since 1979. You can see the details on this [page](https://tartarus.org/martin/PorterStemmer/).

In [9]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()

example_words = ['ride', 'riding', 'rider', 'rides']

for w in example_words:
    print(ps.stem(w))

ride
ride
rider
ride


Usually we apply the stemmer to every token, so that we can analyze which words appear frequently.

In [10]:
text = "When riders are riding their horses, they often think of how cowboys rode horses."

words = word_tokenize(text)

for w in words:
    print(ps.stem(w))

when
rider
are
ride
their
hors
,
they
often
think
of
how
cowboy
rode
hors
.
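
Once the tokens are stemmed, counting them is straightforward. The sketch below copies the stems from the output above into a list and counts them with `collections.Counter`; the variable names are chosen for this illustration only.

```python
from collections import Counter

# Stems copied from the output above.
stems = ['when', 'rider', 'are', 'ride', 'their', 'hors', ',', 'they',
         'often', 'think', 'of', 'how', 'cowboy', 'rode', 'hors', '.']

stem_counts = Counter(stems)

# The most frequent stem and its count.
print(stem_counts.most_common(1))
```

Both `horses` tokens collapse into the single stem `hors`, so they are counted together. Note that `rode` is left unchanged, so the stemmer does not merge this irregular form with `ride`.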


## Part of Speech Tagging (POS)
Part of speech tagging means labeling words as nouns, verbs, adjectives, etc. Even better, NLTK can handle tenses! While we're at it, we are also going to import a new sentence tokenizer (`PunktSentenceTokenizer`). This tokenizer is capable of unsupervised learning, so it can be trained on any body of text.

In this section, we will use pre-downloaded "Universal declaration of human rights" (`udhr` for short).

In [11]:
from nltk.corpus import udhr
print(udhr.raw('English-Latin1'))

Universal Declaration of Human Rights
Preamble
Whereas recognition of the inherent dignity and of the equal and inalienable rights of all members of the human family is the foundation of freedom, justice and peace in the world, 

Whereas disregard and contempt for human rights have resulted in barbarous acts which have outraged the conscience of mankind, and the advent of a world in which human beings shall enjoy freedom of speech and belief and freedom from fear and want has been proclaimed as the highest aspiration of the common people, 

Whereas it is essential, if man is not to be compelled to have recourse, as a last resort, to rebellion against tyranny and oppression, that human rights should be protected by the rule of law, 

Whereas it is essential to promote the development of friendly relations between nations, 

Whereas the peoples of the United Nations have in the Charter reaffirmed their faith in fundamental human rights, in the dignity and worth of the human person and in

Here, we will also import some corpus examples: George W. Bush's 2005 and 2006 State of the Union addresses.

In [12]:
from nltk.corpus import state_union

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

In [13]:
print(train_text[:1000])

PRESIDENT GEORGE W. BUSH'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS ON THE STATE OF THE UNION
 
February 2, 2005


9:10 P.M. EST 

THE PRESIDENT: Mr. Speaker, Vice President Cheney, members of Congress, fellow citizens: 

As a new Congress gathers, all of us in the elected branches of government share a great privilege: We've been placed in office by the votes of the people we serve. And tonight that is a privilege we share with newly-elected leaders of Afghanistan, the Palestinian Territories, Ukraine, and a free and sovereign Iraq. (Applause.) 

Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all. This evening I will set forth policies to advance that ideal at home and around the world. 

Tonight, with a healthy, growing economy, with more Americans going back to work, with our nation an active force for good in the world -- the state of our union is confident and strong. (Applause.) 

Our generati

Now that we have some text, we can train the `PunktSentenceTokenizer`.

In [14]:
from nltk.tokenize import PunktSentenceTokenizer

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokens = custom_sent_tokenizer.tokenize(sample_text)

In [15]:
tokens

["PRESIDENT GEORGE W. BUSH'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS ON THE STATE OF THE UNION\n \nJanuary 31, 2006\n\nTHE PRESIDENT: Thank you all.",
 'Mr. Speaker, Vice President Cheney, members of Congress, members of the Supreme Court and diplomatic corps, distinguished guests, and fellow citizens: Today our nation lost a beloved, graceful, courageous woman who called America to its founding ideals and carried on a noble dream.',
 'Tonight we are comforted by the hope of a glad reunion with the husband who was taken so long ago, and we are grateful for the good life of Coretta Scott King.',
 '(Applause.)',
 'President George W. Bush reacts to applause during his State of the Union Address at the Capitol, Tuesday, Jan.',
 '31, 2006.',
 "White House photo by Eric DraperEvery time I'm invited to this rostrum, I'm humbled by the privilege, and mindful of the history we've seen together.",
 'We have gathered under this Capitol dome in moments of national mourning and national ach

After that, we can tokenize each sentence into words. Now we need to tag each word with its part of speech (POS).

In [16]:
def process_content():
    try:
        for t in tokens[:5]:
            words = nltk.word_tokenize(t)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))
        
process_content()

[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W.', 'NNP'), ('BUSH', 'NNP'), ("'S", 'POS'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('January', 'NNP'), ('31', 'CD'), (',', ','), ('2006', 'CD'), ('THE', 'NNP'), ('PRESIDENT', 'NNP'), (':', ':'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('.', '.')]
[('Mr.', 'NNP'), ('Speaker', 'NNP'), (',', ','), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'), ('diplomatic', 'JJ'), ('corps', 'NN'), (',', ','), ('distinguished', 'JJ'), ('guests', 'NNS'), (',', ','), ('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), (':', ':'), ('Today', 'VB'), ('our', 'PRP$'), ('nat

Here, you can see that a tag is attached to each word. The tags mean:

- POS: Possessive ending
- NNP: Proper noun, singular
- IN: Preposition or subordinating conjunction
- NN: Noun, singular or mass
- RB: Adverb
- VBP: Verb, non-3rd person singular present
- ...

(The detailed list can be found [here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html))

Or you can download the tagsets from `nltk.download()`. (All packages -> tagsets)

In [17]:
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or
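
The tagged output is a plain list of `(word, tag)` tuples, so it can be processed with ordinary Python. As a small sketch, the snippet below groups a few pairs copied from the tagged output above by their tag; the variable names are illustrative.

```python
from collections import defaultdict

# A few (word, tag) pairs copied from the tagged output above.
tagged = [('Mr.', 'NNP'), ('Speaker', 'NNP'), (',', ','), ('Vice', 'NNP'),
          ('President', 'NNP'), ('Cheney', 'NNP'), (',', ','),
          ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP')]

# Collect the words that share a tag, keeping their original order.
words_by_tag = defaultdict(list)
for word, tag in tagged:
    words_by_tag[tag].append(word)

proper_nouns = words_by_tag['NNP']
print(proper_nouns)
```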

## Chunking
Now that each word has been tagged with a part of speech, we can move on to chunking, that is, grouping the words into meaningful clusters. The main goal of chunking is to group words into "noun phrases": a noun together with any associated verbs, adjectives, or adverbs.

The part of speech tags that were generated in the previous step will be combined with regular expressions, such as the following:

- `+` = match 1 or more repetitions
- `?` = match 0 or 1 repetitions
- `*` = match 0 or more repetitions
- `.` = any character except a newline
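
These symbols behave the same way in Python's own `re` module, which makes it easy to check what a tag pattern covers. The sketch below tests a hypothetical handful of Penn Treebank tags against three patterns.

```python
import re

# A hand-picked subset of Penn Treebank tags, for illustration only.
tags = ['RB', 'RBR', 'RBS', 'VB', 'VBD', 'VBP', 'NN', 'NNP', 'NNS']

# "RB.?" = "RB" followed by zero or one extra character.
adverb_tags = [t for t in tags if re.fullmatch(r"RB.?", t)]

# "VB.?" = "VB" followed by zero or one extra character.
verb_tags = [t for t in tags if re.fullmatch(r"VB.?", t)]

# "NN.*" = "NN" followed by any number of extra characters.
noun_tags = [t for t in tags if re.fullmatch(r"NN.*", t)]

print(adverb_tags, verb_tags, noun_tags)
```

So a pattern such as `RB.?` covers the plain adverb tag as well as its comparative and superlative variants.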


In [18]:
def process_content():
    try:
        for i in tokens[:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            
            # combine the part-of-speech tag with a regular expression
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            
            # draw the chunks with nltk
            chunked.draw()     

    except Exception as e:
        print(str(e))

We build the chunk rule as follows:

`<RB.?>*` = "0 or more of any form of adverb (RB, RBR, RBS)," followed by:

`<VB.?>*` = "0 or more of any form of verb (VB, VBD, VBP, ...)," followed by:

`<NNP>+` = "one or more proper nouns," followed by:

`<NN>?` = "zero or one singular noun."

Let's see what happens.

In [19]:
process_content()

You should see tree diagrams like this one:

![nltk_draw](image/nltk_draw.png)

This diagram shows the hierarchical relationship between the words and which words are grouped into chunks.

Alternatively, we can print the chunks inline instead of drawing them in a GUI window.

In [22]:
def process_content():
    try:
        for i in tokens[:10]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            
            # combine the part-of-speech tag with a regular expression
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            
            # print the nltk tree
            for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
                print(subtree)

    except Exception as e:
        print(str(e))

In [23]:
process_content()

(Chunk PRESIDENT/NNP GEORGE/NNP W./NNP BUSH/NNP)
(Chunk ADDRESS/NNP)
(Chunk A/NNP JOINT/NNP SESSION/NNP)
(Chunk THE/NNP CONGRESS/NNP ON/NNP THE/NNP STATE/NNP)
(Chunk THE/NNP UNION/NNP January/NNP)
(Chunk THE/NNP PRESIDENT/NNP)
(Chunk Thank/NNP)
(Chunk Mr./NNP Speaker/NNP)
(Chunk Vice/NNP President/NNP Cheney/NNP)
(Chunk Congress/NNP)
(Chunk Supreme/NNP Court/NNP)
(Chunk called/VBD America/NNP)
(Chunk Coretta/NNP Scott/NNP King/NNP)
(Chunk Applause/NNP)
(Chunk President/NNP George/NNP W./NNP Bush/NNP)
(Chunk State/NNP)
(Chunk Union/NNP Address/NNP)
(Chunk Capitol/NNP)
(Chunk Tuesday/NNP)
(Chunk Jan/NNP)
(Chunk White/NNP House/NNP photo/NN)
(Chunk Eric/NNP DraperEvery/NNP time/NN)
(Chunk Capitol/NNP dome/NN)
(Chunk have/VBP served/VBN America/NNP)
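
To make the rule less of a black box, the idea can be imitated with the standard `re` module: write the tags of a sentence as one string and search it with an equivalent regular expression. This is only an approximation of the idea, not NLTK's implementation. The tags below come from outputs in this post, except `to/TO`, which is an assumption made for the example.

```python
import re

# (word, tag) pairs; "to/TO" is an assumed tag for illustration.
tagged = [('called', 'VBD'), ('America', 'NNP'), ('to', 'TO'),
          ('its', 'PRP$'), ('founding', 'NN'), ('ideals', 'NNS')]

# Encode the tag sequence as "<VBD><NNP><TO>...".
tag_string = "".join("<{}>".format(tag) for _, tag in tagged)

# Same rule as chunkGram, written as a plain regular expression.
# [^>]? plays the role of ".?" but cannot run past the end of a tag.
pattern = re.compile(r"(?:<RB[^>]?>)*(?:<VB[^>]?>)*(?:<NNP>)+(?:<NN>)?")

chunks = []
for m in pattern.finditer(tag_string):
    start = tag_string[:m.start()].count("<")  # index of the first token
    length = m.group().count("<")              # number of tokens matched
    chunks.append([word for word, _ in tagged[start:start + length]])

print(chunks)
```

The single match corresponds to the `(Chunk called/VBD America/NNP)` line in the output above.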


## Chinking
Another process in NLP is **chinking**, which removes the tokens that we don't want from a chunk.

In [28]:
def process_content():
    try:
        for i in tokens[:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            
            # The main difference here is the }{ (chink) vs. the {} (chunk). We first chunk everything,
            # then remove from the chunk one or more verbs, prepositions, determiners, or the word 'to'.
            chunkGram = r"""Chunk: {<.*>+}
                                    }<VB.?|IN|DT|TO>+{"""

            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            
            # print(chunked)
            for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
                print(subtree)

            # chunked.draw()

    except Exception as e:
        print(str(e))

In [29]:
process_content()

(Chunk PRESIDENT/NNP GEORGE/NNP W./NNP BUSH/NNP 'S/POS ADDRESS/NNP)
(Chunk A/NNP JOINT/NNP SESSION/NNP)
(Chunk THE/NNP CONGRESS/NNP ON/NNP THE/NNP STATE/NNP)
(Chunk
  THE/NNP
  UNION/NNP
  January/NNP
  31/CD
  ,/,
  2006/CD
  THE/NNP
  PRESIDENT/NNP
  :/:
  Thank/NNP
  you/PRP)
(Chunk ./.)
(Chunk
  Mr./NNP
  Speaker/NNP
  ,/,
  Vice/NNP
  President/NNP
  Cheney/NNP
  ,/,
  members/NNS)
(Chunk Congress/NNP ,/, members/NNS)
(Chunk
  Supreme/NNP
  Court/NNP
  and/CC
  diplomatic/JJ
  corps/NN
  ,/,
  distinguished/JJ
  guests/NNS
  ,/,
  and/CC
  fellow/JJ
  citizens/NNS
  :/:)
(Chunk our/PRP$ nation/NN)
(Chunk ,/, graceful/JJ ,/, courageous/JJ woman/NN who/WP)
(Chunk America/NNP)
(Chunk its/PRP$ founding/NN ideals/NNS and/CC)
(Chunk noble/JJ dream/NN ./.)
(Chunk Tonight/NN we/PRP)
(Chunk hope/NN)
(Chunk glad/JJ reunion/NN)
(Chunk husband/NN who/WP)
(Chunk so/RB long/RB ago/RB ,/, and/CC we/PRP)
(Chunk grateful/JJ)
(Chunk good/JJ life/NN)
(Chunk Coretta/NNP Scott/NNP King/NNP ./.)
(Chunk

In summary, using modified regular expressions, we can define **chunk patterns**. These are patterns of part-of-speech tags that define what kinds of words make up a **chunk**. We can also define patterns for what kinds of words should not be in a chunk. These unchunked words are known as **chinks**.
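
The effect of the chink rule can be imitated in a few lines of plain Python: start with everything in one chunk, then cut the sequence wherever a token's tag matches the chink pattern. This is a sketch of the idea rather than NLTK's implementation. The tags of the kept words come from the output above; the tags of the removed words (`are`, `comforted`, `by`, `the`, `of`, `a`) are assumptions for illustration.

```python
import re
from itertools import groupby

# Same tag pattern as in the chink rule above.
CHINK = re.compile(r"VB.?|IN|DT|TO")

tagged = [('Tonight', 'NN'), ('we', 'PRP'), ('are', 'VBP'),
          ('comforted', 'VBN'), ('by', 'IN'), ('the', 'DT'), ('hope', 'NN'),
          ('of', 'IN'), ('a', 'DT'), ('glad', 'JJ'), ('reunion', 'NN')]


def chink(tagged_tokens):
    """Split the tokens into chunks, dropping runs of chinked tokens."""
    chunks = []
    is_chinked = lambda pair: bool(CHINK.fullmatch(pair[1]))
    for chinked, group in groupby(tagged_tokens, key=is_chinked):
        if not chinked:
            chunks.append([word for word, _ in group])
    return chunks


chunks = chink(tagged)
print(chunks)
```

The three resulting groups match the `Tonight/NN we/PRP`, `hope/NN`, and `glad/JJ reunion/NN` chunks printed above.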

## Named Entity Recognition (NER)
One of the most common forms of chunking in natural language processing is called **Named Entity Recognition** (NER for short). NLTK is able to identify people, organizations, locations, monetary figures, and more.

There are two major options with NLTK's named entity recognition: either recognize all named entities without distinguishing them (`binary=True`), or recognize named entities by their respective type, such as people, organizations, and locations.

In [32]:
def process_content():
    try:
        for i in tokens[:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            namedEnt = nltk.ne_chunk(tagged, binary=True)
            
            # print(namedEnt)
            for subtree in namedEnt.subtrees(filter=lambda t: t.label() == 'NE'):
                print(subtree)

            # namedEnt.draw()
            
    except Exception as e:
        print(str(e))

In [33]:
process_content()

(NE GEORGE/NNP)
(NE ADDRESS/NNP)
(NE THE/NNP)
(NE CONGRESS/NNP)
(NE THE/NNP UNION/NNP)
(NE Mr./NNP Speaker/NNP)
(NE Cheney/NNP)
(NE Congress/NNP)
(NE Supreme/NNP Court/NNP)
(NE America/NNP)
(NE Coretta/NNP Scott/NNP King/NNP)
(NE Applause/NNP)
(NE George/NNP)
(NE Union/NNP Address/NNP)
(NE Capitol/NNP)
(NE Jan/NNP)


We can draw the same kind of visualization as above.

![ner_draw](image/ner_draw.png)

## Text Classification

Now it's time for text classification. Everything we have done so far (tokenization, stemming, POS tagging, chunking and chinking, and NER) is a form of text pre-processing.

In this part, we will use the movie review dataset in NLTK, one of the well-known NLP datasets. It is commonly used for sentiment analysis. Before classifying, we need to turn each review into a list of words.

In [37]:
from nltk.corpus import movie_reviews
import random

# Build documents
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories() 
             for fileid in movie_reviews.fileids(category)]

# Shuffle the documents
random.shuffle(documents)

print('Number of Documents: {}'.format(len(documents)))
print('First Review: {}'.format(documents[0]))

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

# Generate frequency distribution
all_words = nltk.FreqDist(all_words)

print('\nMost common words: {}'.format(all_words.most_common(15)))
print('\nThe word happy: {}'.format(all_words["happy"]))

Number of Documents: 2000
First Review: (['capsule', ':', 'gal', 'is', 'a', '50s', '-', 'ish', 'london', 'cockney', 'gangster', 'who', 'has', 'retired', 'to', 'spain', '.', 'his', 'old', 'associates', 'want', 'him', 'for', 'one', 'last', 'job', 'and', 'send', 'the', 'vicious', 'don', 'to', 'give', 'him', 'an', 'offer', 'he', 'can', "'", 't', 'refuse', '.', 'a', 'standout', 'performance', 'by', 'ben', 'kingsley', 'as', 'don', 'cannot', 'save', 'what', 'is', 'essentially', 'a', 'set', 'of', 'cliches', 'recycled', 'from', 'old', 'westerns', '.', ',', '0', '(', '-', '4', 'to', '+', '4', ')', 'roger', 'ebert', 'asks', 'in', 'his', 'review', 'of', 'sexy', 'beast', ',', '"', 'who', 'would', 'have', 'guessed', 'that', 'the', 'most', 'savage', 'mad', '-', 'dog', 'frothing', 'gangster', 'in', 'recent', 'movies', 'would', 'be', 'played', 'by', '.', '.', '.', 'ben', 'kingsley', '?', '"', 'my', 'response', 'would', 'be', 'that', 'anyone', 'who', 'has', 'seen', 'alan', 'arkin', 'in', 'wait', 'until'

The word "happy" appears 215 times in movie_reviews. Words like this can hint that a review is positive. You can also notice that the most common words are mostly punctuation and stop words.

Now we need to build features. In this post, we take 4000 words from the frequency distribution as features. Note that `list(all_words.keys())[:4000]` returns the first 4000 distinct words in order of first appearance, not the 4000 most frequent ones.

In [38]:
word_features = list(all_words.keys())[:4000]
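
If the most frequent words are wanted as features instead, `most_common` is the method to reach for. NLTK's `FreqDist` is a subclass of `collections.Counter`, so the sketch below uses a plain `Counter` with a hypothetical toy word list to show how slicing `keys()` differs from `most_common`.

```python
from collections import Counter

# Hypothetical toy corpus, for illustration only.
words = ['the', 'film', 'was', 'good', 'the', 'plot', 'was', 'thin',
         'the', 'end']
counts = Counter(words)

# Slicing the keys follows the order in which words were first seen.
first_seen = list(counts.keys())[:2]

# most_common sorts by count, highest first.
most_frequent = [word for word, _ in counts.most_common(2)]

print(first_seen)
print(most_frequent)
```

Applied to this post, the equivalent would be `[w for w, _ in all_words.most_common(4000)]`. That changes the feature set, so the outputs and the accuracy reported below would no longer match exactly.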

It will be helpful to define a function that finds these features in a document.

In [42]:
def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features
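
To see what this function returns, here is the same logic on a tiny scale. The four-word `word_features` list and the sample review are hypothetical values chosen for this illustration.

```python
# Hypothetical feature vocabulary, standing in for the 4000 words above.
word_features = ['plot', 'good', 'bad', 'boring']


def find_features(document):
    """Map every feature word to True/False: does it occur in the document?"""
    words = set(document)
    return {w: (w in words) for w in word_features}


review = ['the', 'plot', 'was', 'bad', ',', 'really', 'bad']
features = find_features(review)
print(features)

# Only the feature words that are present in the review.
present = [w for w, found in features.items() if found]
print(present)
```

Each review becomes a dictionary with one boolean per feature word, regardless of how often the word occurs.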

Then let's try it on an example from a negative review.

In [46]:
neg_features = find_features(movie_reviews.words('neg/cv000_29416.txt'))
for k, v in neg_features.items():
    if v:
        print(k)

plot
:
two
teen
couples
go
to
a
church
party
,
drink
and
then
drive
.
they
get
into
an
accident
one
of
the
guys
dies
but
his
girlfriend
continues
see
him
in
her
life
has
nightmares
what
'
s
deal
?
watch
movie
"
sorta
find
out
critique
mind
-
fuck
for
generation
that
touches
on
very
cool
idea
presents
it
bad
package
which
is
makes
this
review
even
harder
write
since
i
generally
applaud
films
attempt
break
mold
mess
with
your
head
such
(
lost
highway
&
memento
)
there
are
good
ways
making
all
types
these
folks
just
didn
t
snag
correctly
seem
have
taken
pretty
neat
concept
executed
terribly
so
problems
well
its
main
problem
simply
too
jumbled
starts
off
normal
downshifts
fantasy
world
you
as
audience
member
no
going
dreams
characters
coming
back
from
dead
others
who
look
like
strange
apparitions
disappearances
looooot
chase
scenes
tons
weird
things
happen
most
not
explained
now
personally
don
trying
unravel
film
every
when
does
give
me
same
clue
over
again
kind
fed
up
after
while
biggest


Now apply this to all documents.

In [47]:
featuresets = [(find_features(rev), category) for (rev, category) in documents]

After that, we will use a Support Vector Classifier for text classification. Before classification, we need to split the data into training and test sets, as usual.

In [48]:
from sklearn.model_selection import train_test_split

training, test = train_test_split(featuresets, test_size=0.25, random_state=1)

In [49]:
print(len(training), len(test))

1500 500


Then, we use `SVC` from sklearn and `SklearnClassifier` from nltk.

In [51]:
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import SVC

# Instantiate the model
model = SklearnClassifier(SVC(kernel='linear'))

# Train the model
model.train(training)

# Evaluate the model
accuracy = nltk.classify.accuracy(model, test)
print("SVC Accuracy: {}".format(accuracy))

SVC Accuracy: 0.802
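
Accuracy here is simply the fraction of test reviews whose predicted label equals the true label, so 0.802 on 500 test reviews corresponds to 401 correct predictions. The sketch below computes the same quantity by hand; the function and the toy labels are hypothetical and only illustrate the arithmetic.

```python
def accuracy(predicted, gold):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical labels for five reviews: four of five predictions are right.
gold = ['pos', 'neg', 'pos', 'neg', 'neg']
predicted = ['pos', 'neg', 'neg', 'neg', 'neg']

score = accuracy(predicted, gold)
print(score)
```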


As a result, we built a text classifier with about 80% accuracy.

## Summary
In this post, we covered tokenization, stemming, POS tagging, chunking/chinking, and NER for data preprocessing. After that, we built a classifier with a Support Vector Machine and trained it on the movie review data. As a result, we obtained a text classifier with about 80% accuracy.