{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data source\n",
    "[polarity dataset v1.1](http://www.cs.cornell.edu/people/pabo/movie-review-data/) (training set) (2.2Mb) (includes README.1.1): approximately 700 positive and 700 negative processed reviews. Released November 2002. This alternative version was created by Nathan Treloar, who removed a few non-English/incomplete reviews and changing some of the labels (judging some polarities to be different from the original author's rating). The complete list of changes made to v1.1 can be found in diff.txt.\n",
    "\n",
    "[polarity dataset v0.9](http://www.cs.cornell.edu/people/pabo/movie-review-data/) (testing set) (2.8Mb) (includes a README):. 700 positive and 700 negative processed reviews. Introduced in Pang/Lee/Vaithyanathan EMNLP 2002. Released July 2002. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Loading data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# define a function to get .txt files in a folder\n",
    "from os import listdir\n",
    "def list_textfiles(directory):\n",
    "    \"Return a list of filenames ending in '.txt' in DIRECTORY.\"\n",
    "    textfiles = []\n",
    "    for filename in listdir(directory):\n",
    "        if filename.endswith(\".txt\"):\n",
    "            textfiles.append(directory + \"/\" + filename)\n",
    "    return textfiles   \n",
    "    \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# define a function to read the text in a .txt file\n",
    "import codecs\n",
    "def read_txt(filename):\n",
    "    try:\n",
    "        # using codecs to avoid error like \"'utf8' codec can't decode byte...\"\n",
    "        f = codecs.open(filename,'r',encoding='utf-8', errors='ignore')\n",
    "        text = f.read()\n",
    "    finally:\n",
    "        if f:\n",
    "            f.close()\n",
    "    return text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# import training data\n",
    "\n",
    "filenames_pos = list_textfiles(\"movieReview_data/tokens/pos\")\n",
    "filenames_neg = list_textfiles(\"movieReview_data/tokens/neg\")\n",
    "\n",
    "# create two lists to store reivew text and polarity\n",
    "data_train = []\n",
    "data_labels_train = []\n",
    "\n",
    "for f in filenames_pos:\n",
    "    data_train.append(read_txt(f))\n",
    "    data_labels_train.append('pos')\n",
    "\n",
    "for f in filenames_neg:\n",
    "    data_train.append(read_txt(f))\n",
    "    data_labels_train.append('neg')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Vectorization\n",
    "Next, we initialize a sckit-learn vector with the CountVectorizer class. Because the data could be in any format, we’ll set lowercase to False and exclude common words such as “the” or “and”. This vectorizer will transform our data into vectors of features. In this case, we use a CountVector, which means that our features are counts of the words that occur in our dataset. Once the CountVectorizer class is initialized, we fit it onto the data above and convert it to an array for easy usage."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "\n",
    "vectorizer = TfidfVectorizer(min_df=5,max_df=0.8, sublinear_tf=True,use_idf=True)\n",
    "\n",
    "features_train = vectorizer.fit_transform(data_train)\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Import Classifier\n",
    "#### SVM"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,\n",
       "  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',\n",
       "  max_iter=-1, probability=False, random_state=None, shrinking=True,\n",
       "  tol=0.001, verbose=False)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn import svm\n",
    "clf = svm.SVC()\n",
    "# train svm model\n",
    "clf.fit(features_train, data_labels_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Evaluation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn import metrics\n",
    "import numpy as np;\n",
    "\n",
    "# import test data\n",
    "filenames_pos_test = list_textfiles(\"mix20_rand700_tokens/tokens/pos\")\n",
    "filenames_neg_test = list_textfiles(\"mix20_rand700_tokens/tokens/neg\")\n",
    "\n",
    "# create two lists to store reivew text and polarity\n",
    "data_test = []\n",
    "data_labels_test = []\n",
    "\n",
    "for f in filenames_pos:\n",
    "    data_test.append(read_txt(f))\n",
    "    data_labels_test.append('pos')\n",
    "\n",
    "for f in filenames_neg:\n",
    "    data_test.append(read_txt(f))\n",
    "    data_labels_test.append('neg')\n",
    "\n",
    "# vectorize\n",
    "features_test = vectorizer.fit_transform(data_test)\n",
    "\n",
    "#features_nd_test = features_test.toarray() # for easy usage\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy score of SVM model:\n",
      "0.500721500722\n",
      "             precision    recall  f1-score   support\n",
      "\n",
      "        neg       0.00      0.00      0.00       692\n",
      "        pos       0.50      1.00      0.67       694\n",
      "\n",
      "avg / total       0.25      0.50      0.33      1386\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/zjm/anaconda/lib/python2.7/site-packages/sklearn/metrics/classification.py:1135: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.\n",
      "  'precision', 'predicted', average, warn_for)\n"
     ]
    }
   ],
   "source": [
    "predicted = clf.predict(features_test)\n",
    "\n",
    "# print the accuracy score\n",
    "from sklearn.metrics import accuracy_score\n",
    "from sklearn.metrics import classification_report\n",
    "print(\"Accuracy score of SVM model:\\n\"+ str(accuracy_score(data_labels_test,predicted)))\n",
    "# print evaluation report showing precision, recall, f1, support\n",
    "print(classification_report(data_labels_test, predicted))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Naive Bayes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.naive_bayes import MultinomialNB\n",
    "\n",
    "mnb = MultinomialNB()\n",
    "mnb.fit(features_train, data_labels_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Evaluation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy score of Naive Bayes model:\n",
      "0.968253968254\n",
      "             precision    recall  f1-score   support\n",
      "\n",
      "        neg       0.96      0.97      0.97       692\n",
      "        pos       0.97      0.96      0.97       694\n",
      "\n",
      "avg / total       0.97      0.97      0.97      1386\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from sklearn.metrics import classification_report\n",
    "mnb_predict = mnb.predict(features_test)\n",
    "print(\"Accuracy score of Naive Bayes model:\\n\"+ str(accuracy_score(data_labels_test,mnb_predict)))\n",
    "\n",
    "\n",
    "\n",
    "print(classification_report(data_labels_test, mnb_predict))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Dumping model\n",
    "\n",
    "joblib.dump(value, filename, compress=0, protocol=None, cache_size=None)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['sentNB.model']"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.externals import joblib\n",
    "joblib.dump(mnb,'sentNB.model')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question:\n",
    "\n",
    "Why Naive Bayes performed much better than SVM in this prediction?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}