### Author: [Pratik Sharma](https://github.com/sharmapratik88/)
## Project 10 - Natural Language Processing - Sentiment Analysis

* Generate word embeddings and retrieve the outputs of each layer with Keras, based on a classification task.
* Word embeddings are a type of word representation that allows words with similar meanings to have similar representations.
* They are a distributed representation of text and perhaps one of the key breakthroughs behind the impressive performance of deep learning methods on challenging natural language processing problems.
* We will use the IMDb dataset to learn word embeddings as we train our model.
* This dataset contains 50,000 movie reviews from IMDb (25,000 for training and 25,000 for testing), each labeled with a sentiment (positive or negative).
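
To make "similar meaning, similar representation" concrete, here is a minimal sketch that compares toy vectors with cosine similarity. The three-dimensional vectors and the words chosen are invented for illustration; they are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings (made up, not learned)
toy_embeddings = {
    'great':     [0.9, 0.8, 0.1],
    'excellent': [0.8, 0.9, 0.2],
    'terrible':  [-0.7, -0.9, 0.1],
}

sim_close = cosine_similarity(toy_embeddings['great'], toy_embeddings['excellent'])
sim_far = cosine_similarity(toy_embeddings['great'], toy_embeddings['terrible'])
print(f'great vs excellent: {sim_close:.3f}')
print(f'great vs terrible:  {sim_far:.3f}')
```

With these toy values, the two positive words point in nearly the same direction, while the negative word points the opposite way.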

**Data Description**

* The dataset consists of movie reviews from IMDb labeled by sentiment (positive/negative), with 25,000 reviews in the training set and 25,000 in the test set. Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers).
* For convenience, the words are indexed by their frequency in the dataset, meaning the word that has index 1 in the word index is the most frequent word.
* Each review is padded or truncated to 300 words (`maxlen = 300`) to speed up training, using a max vocab size of 10,000.
* As a convention, "0" does not stand for a specific word. With the default `imdb.load_data` settings, 0 is used for padding, 1 marks the start of a review, and 2 encodes any unknown (out-of-vocabulary) word.
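
The indexing scheme described above can be sketched on a toy corpus. This is only an illustration, assuming the defaults of `imdb.load_data` (`index_from = 3`, `oov_char = 2`); the helper names `build_index` and `encode` and the three-review corpus are hypothetical.

```python
from collections import Counter

def build_index(corpus, index_from=3):
    """Rank words by frequency; the most frequent word gets the lowest id."""
    counts = Counter(word for review in corpus for word in review.split())
    ranked = [word for word, _ in counts.most_common()]
    # Rank 1 is the most frequent word; ids are shifted by index_from
    return {word: rank + index_from for rank, word in enumerate(ranked, start=1)}

def encode(review, word_to_id, num_words, oov_id=2):
    """Replace unseen words and words at or above the num_words cap with oov_id."""
    ids = []
    for word in review.split():
        idx = word_to_id.get(word)
        ids.append(idx if idx is not None and idx < num_words else oov_id)
    return ids

toy_corpus = ['the film was great', 'the film was bad', 'the plot']
word_to_id = build_index(toy_corpus)
print(word_to_id)
print(encode('the film was great', word_to_id, num_words=7))
```

Here `great` receives id 7, which falls outside the cap of `num_words = 7`, so it is encoded as the out-of-vocabulary id 2.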

### Import Packages

In [1]:
# Mounting Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
# Setting the current working directory
import os; os.chdir('drive/My Drive/Great Learning/NLP')

In [3]:
# Import packages
import pandas as pd, numpy as np
import tensorflow as tf
# Compare the major version as an integer; comparing version strings is unreliable
assert int(tf.__version__.split('.')[0]) >= 2

from itertools import islice

# Keras
from keras.layers import Dense, Embedding, LSTM, Dropout, MaxPooling1D, Conv1D
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model, Sequential
from keras.preprocessing import sequence
from keras.datasets import imdb

from keras.callbacks import ModelCheckpoint, EarlyStopping

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Suppress warnings
import warnings; warnings.filterwarnings('ignore')

random_state = 42
np.random.seed(random_state)
tf.random.set_seed(random_state)

Using TensorFlow backend.


### Loading Dataset - Train & Test Split

In [4]:
vocab_size = 10000
maxlen = 300
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words = vocab_size)

x_train = pad_sequences(x_train, maxlen = maxlen, padding = 'pre')
x_test = pad_sequences(x_test, maxlen = maxlen, padding = 'pre')

X = np.concatenate((x_train, x_test), axis = 0)
y = np.concatenate((y_train, y_test), axis = 0)

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = random_state, shuffle = True)
x_train, x_valid, y_train, y_valid = train_test_split(x_train, y_train, test_size = 0.2, random_state = random_state, shuffle = True)

print('---'*20, f'\nNumber of rows in training dataset: {x_train.shape[0]}')
print(f'Number of columns in training dataset: {x_train.shape[1]}')
print(f'Number of unique words in training dataset: {len(np.unique(np.hstack(x_train)))}')


print('---'*20, f'\nNumber of rows in validation dataset: {x_valid.shape[0]}')
print(f'Number of columns in validation dataset: {x_valid.shape[1]}')
print(f'Number of unique words in validation dataset: {len(np.unique(np.hstack(x_valid)))}')


print('---'*20, f'\nNumber of rows in test dataset: {x_test.shape[0]}')
print(f'Number of columns in test dataset: {x_test.shape[1]}')
print(f'Number of unique words in test dataset: {len(np.unique(np.hstack(x_test)))}')


print('---'*20, f'\nUnique Categories: {np.unique(y_train), np.unique(y_valid), np.unique(y_test)}')

------------------------------------------------------------ 
Number of rows in training dataset: 32000
Number of columns in training dataset: 300
Number of unique words in training dataset: 9999
------------------------------------------------------------ 
Number of rows in validation dataset: 8000
Number of columns in validation dataset: 300
Number of unique words in validation dataset: 9984
------------------------------------------------------------ 
Number of rows in test dataset: 10000
Number of columns in test dataset: 300
Number of unique words in test dataset: 9995
------------------------------------------------------------ 
Unique Categories: (array([0, 1]), array([0, 1]), array([0, 1]))
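
Every split has exactly 300 columns because `pad_sequences` pads short reviews and truncates long ones. The following sketch mimics that behaviour for a single sequence in plain Python. Note that `truncating` also defaults to `'pre'` in `pad_sequences`, so reviews longer than `maxlen` keep their last tokens. The helper `pad_pre` is hypothetical, not part of Keras.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad short sequences and keep the last maxlen items of long ones."""
    if len(seq) >= maxlen:
        # 'pre' truncation drops tokens from the beginning
        return list(seq[-maxlen:])
    # 'pre' padding adds the padding value in front
    return [value] * (maxlen - len(seq)) + list(seq)

print(pad_pre([5, 8, 3], maxlen=5))              # short review: zeros in front
print(pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))  # long review: beginning dropped
```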


### Get word index and create a key-value pair for word and word id

In [5]:
def decode_review(x, y):
    # Word -> id mapping; shift ids by 3 to make room for the reserved tokens
    w2i = imdb.get_word_index()
    w2i = {k: (v + 3) for k, v in w2i.items()}
    w2i['<PAD>'] = 0
    w2i['<START>'] = 1
    w2i['<UNK>'] = 2
    # Reverse mapping: id -> word
    i2w = {i: w for w, i in w2i.items()}

    ws = ' '.join(i2w[i] for i in x)
    print(f'Review: {ws}')
    print(f'Actual Sentiment: {y}')
    return w2i, i2w

w2i, i2w = decode_review(x_train[0], y_train[0])

# get first 50 key, value pairs from id to word dictionary
print('---'*30, '\n', list(islice(i2w.items(), 0, 50)))

Review: the only possible way to enjoy this flick is to bang your head against the wall allow some internal of the brain let a bunch of your brain cells die and once you are officially mentally retarded perhaps then you might enjoy this film br br the only saving grace was the story between and stephanie govinda was excellent in the role of the cab driver and so was the brit girl perhaps if they would have created the whole movie on their in india and how they eventually fall in love would have made it a much more enjoyable film br br the only reason i gave it a 3 rating is because of and his ability as an actor when it comes to comedy br br and anil kapoor were wasted needlessly plus the scene at of the re union was just too much to being an international in the post 9 11 world anil kapoor would have got himself shot much before he even reached the sky bridge to his true love but then again the point of the movie was to defy logic gravity physics and throw an egg on the face of the ge
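
The offset of 3 used when decoding can be seen on a small example. This is a sketch with a hypothetical three-word index; the token names `<PAD>`, `<START>` and `<UNK>` follow a common convention and are assumed here.

```python
# Hypothetical raw word index in the style of get_word_index() (rank 1 = most frequent)
raw_index = {'the': 1, 'movie': 2, 'great': 3}

# Shift every id by 3 to make room for the reserved ids 0, 1 and 2
word_to_id = {word: rank + 3 for word, rank in raw_index.items()}
word_to_id.update({'<PAD>': 0, '<START>': 1, '<UNK>': 2})
id_to_word = {idx: word for word, idx in word_to_id.items()}

# A pre-padded review containing one out-of-vocabulary word (id 2)
encoded = [0, 0, 1, 4, 5, 2, 6]
decoded = ' '.join(id_to_word[i] for i in encoded)
print(decoded)
```

Without the shift, the ids 1 and 2 would be decoded as the two most frequent words instead of the start and unknown markers.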

### Build Keras Embedding Layer Model
We can think of the Embedding layer as a dictionary that maps an index assigned to a word to a word vector. This layer is very flexible and can be used in a few ways:

* The embedding layer can be used at the start of a larger deep learning model.
* We could also load pre-trained word embeddings into the embedding layer when we create our model.
* We can use the embedding layer to train our own word2vec-style embeddings.

The Keras embedding layer doesn't require us to one-hot encode our words; instead, we have to give each word a unique integer number as an id. For the IMDb dataset we've loaded, this has already been done, but if this wasn't the case we could use sklearn [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).
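
The "dictionary" view of the embedding layer can be sketched in plain Python: looking up a row by integer id gives the same result as multiplying a one-hot vector by the embedding matrix, which is why one-hot encoding is unnecessary. The matrix values and helper names below are made up for illustration.

```python
# Hypothetical embedding matrix: 4 word ids, 3 dimensions each (values made up)
embedding_matrix = [
    [0.0, 0.0, 0.0],   # id 0
    [0.1, 0.2, 0.3],   # id 1
    [0.4, 0.5, 0.6],   # id 2
    [0.7, 0.8, 0.9],   # id 3
]

def lookup(word_ids):
    """Embedding as a dictionary: pick one row per integer id."""
    return [embedding_matrix[i] for i in word_ids]

def one_hot_matmul(word_ids):
    """Same result via one-hot vectors multiplied by the matrix."""
    vocab, dim = len(embedding_matrix), len(embedding_matrix[0])
    result = []
    for i in word_ids:
        one_hot = [1.0 if j == i else 0.0 for j in range(vocab)]
        row = [sum(one_hot[j] * embedding_matrix[j][d] for j in range(vocab))
               for d in range(dim)]
        result.append(row)
    return result

word_ids = [2, 0, 3]
print(lookup(word_ids))
print(lookup(word_ids) == one_hot_matmul(word_ids))
```

The lookup touches one row per word, while the one-hot version multiplies by a mostly-zero vector of vocabulary size, so the integer-id form is far cheaper for a 10,000-word vocabulary.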

In [6]:
# Model
model = Sequential()
model.add(Embedding(vocab_size, 256, input_length = maxlen))
model.add(Dropout(0.25))
model.add(Conv1D(256, 5, padding = 'same', activation = 'relu', strides = 1))
model.add(Conv1D(128, 5, padding = 'same', activation = 'relu', strides = 1))
model.add(MaxPooling1D(pool_size = 2))
model.add(Conv1D(64, 5, padding = 'same', activation = 'relu', strides = 1))
model.add(MaxPooling1D(pool_size = 2))
model.add(LSTM(75))
model.add(Dense(1, activation = 'sigmoid'))
model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
print(model.summary())

# Adding callbacks
es = EarlyStopping(monitor = 'val_loss', mode = 'min', verbose = 1, patience = 0) 
mc = ModelCheckpoint('imdb_model.h5', monitor = 'val_loss', mode = 'min', save_best_only = True, verbose = 1)

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
embedding_1 (Embedding)      (None, 300, 256)          2560000
_________________________________________________________________
dropout_1 (Dropout)          (None, 300, 256)          0
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 300, 256)          327936
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 300, 128)          163968
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 150, 128)          0
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 150, 64)           41024
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 75, 64)            0
_________________________________________________________________
lstm_1 (LSTM)                (None, 75)                42000
_________________________________________________________________
dens
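
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for each layer type; the helper function names are hypothetical, and only the layers visible in the summary are covered.

```python
def embedding_params(vocab_size, dim):
    # One vector of length dim per word id
    return vocab_size * dim

def conv1d_params(in_channels, filters, kernel_size):
    # One kernel of shape (kernel_size, in_channels) per filter, plus one bias per filter
    return kernel_size * in_channels * filters + filters

def lstm_params(input_dim, units):
    # Four weight sets (input, forget and output gates plus the cell candidate),
    # each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

counts = {
    'embedding_1': embedding_params(10000, 256),
    'conv1d_1': conv1d_params(256, 256, 5),
    'conv1d_2': conv1d_params(256, 128, 5),
    'conv1d_3': conv1d_params(128, 64, 5),
    'lstm_1': lstm_params(64, 75),
}
for name, n in counts.items():
    print(f'{name}: {n}')

# Each MaxPooling1D(pool_size = 2) halves the sequence length: 300 -> 150 -> 75
print(300 // 2, 300 // 2 // 2)
```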

In [7]:
# Fit the model
model.fit(x_train, y_train, validation_data = (x_valid, y_valid), epochs = 3, batch_size = 64, verbose = True, callbacks = [es, mc])

# Evaluate the model
scores = model.evaluate(x_test, y_test, batch_size = 64)
print('Test accuracy: %.2f%%' % (scores[1]*100))

Train on 32000 samples, validate on 8000 samples
Epoch 1/3

Epoch 00001: val_loss improved from inf to 0.24669, saving model to imdb_model.h5
Epoch 2/3

Epoch 00002: val_loss did not improve from 0.24669
Epoch 00002: early stopping
Test accuracy: 90.14%
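
Training stopped after the second epoch because `EarlyStopping` was created with `patience = 0`. The following is a simplified sketch of that stopping rule, not the Keras implementation. The second validation loss (0.26) is a made-up value, since the log only states that it did not improve on 0.24669.

```python
def epochs_run(val_losses, patience=0):
    """Return the number of epochs completed before early stopping triggers."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # Improvement: remember the new best value and reset the counter
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)

# 0.24669 comes from the log above; 0.26 and 0.25 are hypothetical
print(epochs_run([0.24669, 0.26, 0.25], patience=0))
print(epochs_run([0.5, 0.4, 0.3], patience=0))
```

Because `ModelCheckpoint` uses `save_best_only = True`, the file `imdb_model.h5` holds the weights from the first epoch, while the in-memory `model` evaluated above carries the weights from the last epoch that ran.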


In [8]:
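# predict_classes was removed in newer TensorFlow releases; thresholding the sigmoid output at 0.5 is equivalent
y_pred = (model.predict(x_test) > 0.5).astype('int32')
# Note: sklearn's signature is classification_report(y_true, y_pred). The arguments are passed
# in the opposite order here, so the precision and recall columns below are swapped and the
# support column counts predicted labels; accuracy and F1-score are unaffected.
print(f'Classification Report:\n{classification_report(y_pred, y_test)}')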

Classification Report:
 precision recall f1-score support

 0 0.92 0.89 0.90 5086
 1 0.89 0.92 0.90 4914

 accuracy 0.90 10000
 macro avg 0.90 0.90 0.90 10000
weighted avg 0.90 0.90 0.90 10000
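
The columns of the report can be reproduced from raw counts. This sketch uses a made-up set of eight labels and a hypothetical helper, `precision_recall_f1`; it also shows that swapping the two arguments swaps precision and recall while leaving the F1-score unchanged.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one class from label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy labels, invented for illustration
toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 1, 0, 0, 0, 1, 1, 1]

p, r, f1 = precision_recall_f1(toy_true, toy_pred)
print(f'precision={p:.2f} recall={r:.2f} f1={f1:.2f}')

# Swapping the arguments exchanges precision and recall
p_sw, r_sw, f1_sw = precision_recall_f1(toy_pred, toy_true)
print(f'precision={p_sw:.2f} recall={r_sw:.2f} f1={f1_sw:.2f}')
```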



### Retrieve the output of each layer in Keras for a single test sample from the trained model

In [9]:
# Pick one random review from the test set
sample_x_test = x_test[np.random.randint(x_test.shape[0])]
for layer in model.layers:
    # Sub-model that maps the original input to this layer's output
    model_layer = Model(inputs = model.input, outputs = layer.output)
    output = model_layer.predict(sample_x_test.reshape(1, -1))
    print('\n', '--'*20, layer.name, 'layer', '--'*20, '\n')
    print(output)


 ---------------------------------------- embedding_1 layer ---------------------------------------- 

[[[ 4.74077724e-02 -1.45893563e-02 -1.92809459e-02 ... 1.59389190e-02
 -3.90756801e-02 -6.46728724e-02]
 [ 4.74077724e-02 -1.45893563e-02 -1.92809459e-02 ... 1.59389190e-02
 -3.90756801e-02 -6.46728724e-02]
 [ 4.74077724e-02 -1.45893563e-02 -1.92809459e-02 ... 1.59389190e-02
 -3.90756801e-02 -6.46728724e-02]
 ...
 [-5.12011871e-02 2.73237063e-04 -3.15764773e-05 ... 4.48421352e-02
 2.12928746e-02 -1.26087647e-02]
 [ 6.66740909e-02 1.52700637e-02 -7.01705664e-02 ... -9.86870304e-02
 4.93544117e-02 -3.51153836e-02]
 [-3.40692252e-02 -4.36996408e-02 4.43636142e-02 ... 1.14621185e-02
 2.80509088e-02 -2.31574550e-02]]]

 ---------------------------------------- dropout_1 layer ---------------------------------------- 

[[[ 4.74077724e-02 -1.45893563e-02 -1.92809459e-02 ... 1.59389190e-02
 -3.90756801e-02 -6.46728724e-02]
 [ 4.74077724e-02 -1.45893563e-02 -1.92809459e-02 ... 1.59389190e-02
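
The `dropout_1` output shown above matches the `embedding_1` output because dropout is only active during training; `predict` runs the layer in inference mode. A minimal sketch of inverted dropout in plain Python illustrates both modes. The function `dropout` and the sample values are hypothetical, not the Keras implementation.

```python
import random

def dropout(values, rate, training, rng=random):
    """Inverted dropout: zero a fraction of values and rescale the rest during training."""
    if not training:
        # Inference: values pass through unchanged
        return list(values)
    keep = 1.0 - rate
    # Training: keep each value with probability `keep`, scaled by 1 / keep
    return [v / keep if rng.random() < keep else 0.0 for v in values]

activations = [0.047, -0.015, -0.019, 0.016]

inference_out = dropout(activations, rate=0.25, training=False)
print(inference_out)

random.seed(42)
training_out = dropout(activations, rate=0.25, training=True)
print(training_out)
```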


In [10]:
decode_review(x_test[10], y_test[10])
print(f'Predicted sentiment: {y_pred[10][0]}')

Review: this movie was great and i was waiting for it for a long time when it finally came out i was really happy and looked forward to a 10 out of 10 it was great and lived up to my potential the performances were great on the part of the adults and most of the kids the only bad performance was by milo himself there was one problem that i encountered with this and others like it movie all of the characters i wanted to live were getting killed overall i give this movie an excellent 9 out of 10 maybe we should better people to kill next time though ok
Actual Sentiment: 1
Predicted sentiment: 1


### Conclusion
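* On the sentiment classification task for the IMDb dataset, the model achieves the following on the test dataset:
 * Accuracy: 90.14%
 * F1-score: 0.90
* The best validation loss was about 0.25 (0.24669 after the first epoch).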