{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# TF-IDF and N-Grams\n",
"The goal of this project was to predict the sentiment of an IMDB movie review using a binary classification system. The dataset was part of the [Bag of Words Meets Bag of Popcorn Competition](https://www.kaggle.com/c/word2vec-nlp-tutorial).\n",
"\n",
"Model Accuracy: 0.89532\n",
"\n",
"## Bag of Words & TF-IDF\n",
"\n",
"A Bag of Words (BoW) model is a simple algorithm used in Natural Language Processing. It simply counts the number of times a word appears in a document.\n",
"\n",
"TF-IDF (or Term Frequency-Inverse Document Frequency) on the other hand reflects how important a word is to a document, or corpus. With TF-IDF, words are given weight, measured by relevance, rather than frequency. \n",
"\n",
"\n",
"It is the product of two statistics:\n",
"1. Term Frequency (TF): The number of times a word appears in a given document.\n",
"2. Inverse Document Frequency (IDF): The more documents a word appears in, the less valuable that word is as a signal. Very common words, such as “a” or “the”, thereby receive heavily discounted tf-idf scores, in contrast to words that are very specific to the document in question. \n",
"\n",
"\n",
"\n",
"In the project, I used two separate TF-IDF vectorizers and merged them into a single bag of words.\n",
"* The first vectorizer (word_vectorizer) analyzed complete words.\n",
"* The second vectorizer (char_vectorizer) analyzed the frequency of character n-grams. An n-gram is a continous sequence of n items from a document. Using Trigrams (N-gram size = 3) yielded a high predictive score.\n",
"\n",
"Lastly, we used a Logistic Regression to predict the sentiment attached to each review. The hyperparameters of the model were tuned using a validation dataset prior to training the model.\n",
"\n",
"#### Interestingly, our model performed worse if we cleaned the text data in the usual methods. This includes removing html, removing unwanted punctuation, removing stopwords, stemming, or tokenizing."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading Required Libraries and Reading the Data into Python"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we need to load the required libraries and read the data into Python."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer\n",
"from sklearn.model_selection import train_test_split, GridSearchCV\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.metrics import classification_report, confusion_matrix, accuracy_score\n",
"from scipy.sparse import hstack\n",
"from time import time"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"train = pd.read_csv( \"labeledTrainData.tsv\", header=0, delimiter=\"\\t\")\n",
"test = pd.read_csv(\"testData.tsv\", header=0, delimiter=\"\\t\")\n",
"\n",
"train_text = train['review']\n",
"test_text = test['review']\n",
"y = train['sentiment']\n",
"\n",
"all_text = pd.concat([train_text, test_text])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## TF-IDF Vectorizers\n",
"First, we convert the reviews into a Bag of Words using the TF-IDF vectorizer for words and for character trigrams."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"word_vectorizer = TfidfVectorizer(analyzer='word', token_pattern=r'\\w{1,}', sublinear_tf=True, strip_accents='unicode', \n",
" stop_words='english', ngram_range=(1, 1), max_features=10000)\n",
"word_vectorizer.fit(train_text)\n",
"\n",
"train_word_features = word_vectorizer.transform(train_text)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"char_vectorizer = TfidfVectorizer(analyzer='char', sublinear_tf=True, strip_accents='unicode', \n",
" stop_words='english', ngram_range=(1, 3), max_features=50000)\n",
"char_vectorizer.fit(train_text)\n",
"\n",
"train_char_features = char_vectorizer.transform(train_text)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"train_features = hstack([train_word_features, train_char_features])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Hyperparameter Tuning of Logistic Regression\n",
"Since there are multiple hyperparameters to tune in the XGBoost model, we will use the [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) function of Sklearn to determine the optimal hyperparameter values. Next, I used the [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function to generate a validation set and find the best parameters."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"GridSearch took 350.08 seconds to complete.\n"
]
},
{
"data": {
"text/plain": [
"{'C': 1, 'solver': 'saga'}"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cross-Validated Score of the Best Estimator: 0.888\n"
]
}
],
"source": [
"X_train, X_test, y_train, y_test = train_test_split(train_features, y,test_size=0.3 ,random_state=1234)\n",
"\n",
"lr_model = LogisticRegression(random_state=1234)\n",
"param_dict = {'C': [0.001, 0.01, 0.1, 1, 10],\n",
" 'solver': ['sag', 'lbfgs', 'saga']}\n",
"\n",
"start = time()\n",
"grid_search = GridSearchCV(lr_model, param_dict)\n",
"grid_search.fit(X_train, y_train)\n",
"print(\"GridSearch took %.2f seconds to complete.\" % (time()-start))\n",
"display(grid_search.best_params_)\n",
"print(\"Cross-Validated Score of the Best Estimator: %.3f\" % grid_search.best_score_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's see how well our model does on the validation dataset and where any misclassifications occur.\n",
"\n",
"We have several metrics available for classification accuracy, including a confusion matrix and a classification report."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[3366 399]\n",
" [ 366 3369]]\n",
" precision recall f1-score support\n",
"\n",
" 0 0.90 0.89 0.90 3765\n",
" 1 0.89 0.90 0.90 3735\n",
"\n",
"avg / total 0.90 0.90 0.90 7500\n",
"\n",
"Accuracy Score: 0.898\n"
]
}
],
"source": [
"lr=LogisticRegression(C=1, solver ='saga')\n",
"lr.fit(X_train, y_train)\n",
"lr_preds=lr.predict(X_test)\n",
"\n",
"print(confusion_matrix(y_test, lr_preds))\n",
"print(classification_report(y_test, lr_preds))\n",
"print(\"Accuracy Score: %.3f\" % accuracy_score(y_test, lr_preds))"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"The number of false positives (FP = 366) is similar to the number of false negatives (FN = 399), suggesting that our model is not biased towards either specificity nor sensitivity."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Modelling Sentiment from Reviews\n",
"We will redo the steps taken above, this time we both the train and test dataset.\n",
"\n",
"1. Create a TF-IDF BoW for both words and trigrams.\n",
"2. Train the Logistic Regression model using the tuned hyperparameters.\n",
"3. Format predictions for submission to Kaggle Competition."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"word_vectorizer = TfidfVectorizer(analyzer='word', token_pattern=r'\\w{1,}', sublinear_tf=True, strip_accents='unicode', \n",
" stop_words='english', ngram_range=(1, 1), max_features=10000)\n",
"word_vectorizer.fit(all_text)\n",
"\n",
"train_word_features = word_vectorizer.transform(train_text)\n",
"test_word_features = word_vectorizer.transform(test_text)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"char_vectorizer = TfidfVectorizer(analyzer='char', sublinear_tf=True, strip_accents='unicode', \n",
" stop_words='english', ngram_range=(1, 3), max_features=50000)\n",
"char_vectorizer.fit(all_text)\n",
"\n",
"train_char_features = char_vectorizer.transform(train_text)\n",
"test_char_features = char_vectorizer.transform(test_text)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"train_features = hstack([train_char_features, train_word_features])\n",
"test_features = hstack([test_char_features, test_word_features])"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"lr=LogisticRegression(C=1,solver='saga')\n",
"lr.fit(train_features,y)\n",
"final_preds=lr.predict(test_features)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The predictions are then formatted in an appropriate layout for submission to Kaggle."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"test['sentiment'] = final_preds\n",
"test = test[['id', 'sentiment']]\n",
"test.to_csv('Submission.csv',index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Logistic Regression Sentiment Accuracy = 0.89532"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}