{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这个shallow modle输出的结果是0.88160，比我之前的根据官方教程做的0.84要好很多。分类器恐怕不会造成这么大的差异，应该是在BOW_LR.py中，`lab_fea = select_feature('../../data/feature_chi.txt', max_feature)[\"1\"]`，这一行语句的效果。从中选了1000个作为feature的单词。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "error_bad_lines : boolean, default True\n",
    "Lines with too many fields (e.g. a csv line with too many commas) will by default cause an exception to be raised, \n",
    "and no DataFrame will be returned. \n",
    "If False, then these “bad lines” will dropped from the DataFrame that is returned.\n",
    "\n",
    "warn_bad_lines : boolean, default True\n",
    "If error_bad_lines is False, and warn_bad_lines is True, \n",
    "a warning for each “bad line” will be output.\n",
    "\"\"\"\n",
    "\n",
    "\n",
    "# Preprocessing Training Data\n",
    "train = pd.read_csv(\"../Sentiment/data/labeledTrainData.tsv\", header=0, delimiter='\\t', quoting=3, error_bad_lines=False)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>sentiment</th>\n",
       "      <th>review</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>\"5814_8\"</td>\n",
       "      <td>1</td>\n",
       "      <td>\"With all this stuff going down at the moment ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>\"2381_9\"</td>\n",
       "      <td>1</td>\n",
       "      <td>\"\\\"The Classic War of the Worlds\\\" by Timothy ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>\"7759_3\"</td>\n",
       "      <td>0</td>\n",
       "      <td>\"The film starts with a manager (Nicholas Bell...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>\"3630_4\"</td>\n",
       "      <td>0</td>\n",
       "      <td>\"It must be assumed that those who praised thi...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>\"9495_8\"</td>\n",
       "      <td>1</td>\n",
       "      <td>\"Superbly trashy and wondrously unpretentious ...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         id  sentiment                                             review\n",
       "0  \"5814_8\"          1  \"With all this stuff going down at the moment ...\n",
       "1  \"2381_9\"          1  \"\\\"The Classic War of the Worlds\\\" by Timothy ...\n",
       "2  \"7759_3\"          0  \"The film starts with a manager (Nicholas Bell...\n",
       "3  \"3630_4\"          0  \"It must be assumed that those who praised thi...\n",
       "4  \"9495_8\"          1  \"Superbly trashy and wondrously unpretentious ..."
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(25000, 3)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "25000"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train['review'].size"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这里为了加快运算，只取前100个样本好了"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "train = train[:100]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(100, 3)"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "100"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "num_reviews = train['review'].size\n",
    "num_reviews"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Cleaning and parsing the training set movie reviews\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Part 1 Shallow Model.ipynb  feature_chi.txt\r\n",
      "__init__.py                 \u001b[1m\u001b[36mutils\u001b[m\u001b[m/\r\n"
     ]
    }
   ],
   "source": [
    "ls"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from utils.TextPreprocess import review_to_words"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "clean_train_reviews = []\n",
    "for i in range(0, num_reviews):\n",
    "    clean_train_reviews.append(review_to_words(train['review'][i]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "调查一下问题，下面进入review_to_words："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'\"With all this stuff going down at the moment with MJ i\\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\\'s feeling towards the press and also the obvious message of drugs are bad m\\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci\\'s character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ\\'s music.<br /><br />Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.<br /><br />Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ\\'s bestest buddy in this movie is a girl! Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? Well, with all the attention i\\'ve gave this subject....hmmm well i don\\'t know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter.\"'"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "raw_review = train['review'][0]\n",
    "raw_review"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "import re\n",
    "import nltk\n",
    "from nltk.corpus import stopwords"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<html><body><p>\"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br/><br/>Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br/><br/>The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci's character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ's music.<br/><br/>Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.<br/><br/>Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ's bestest buddy in this movie is a girl! Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? Well, with all the attention i've gave this subject....hmmm well i don't know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter.\"</p></body></html>"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 1. Remove HTML\n",
    "review_text = BeautifulSoup(raw_review, \"lxml\")\n",
    "review_text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'\"With all this stuff going down at the moment with MJ i\\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\\'s feeling towards the press and also the obvious message of drugs are bad m\\'kay.Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci\\'s character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ\\'s music.Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ\\'s bestest buddy in this movie is a girl! Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? Well, with all the attention i\\'ve gave this subject....hmmm well i don\\'t know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter.\"'"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 1. Remove HTML\n",
    "review_text = BeautifulSoup(raw_review, \"lxml\").get_text()\n",
    "review_text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "' With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him The actual feature film bit when it finally starts is only on for    minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord  Why he wants MJ dead so bad is beyond me  Because MJ overheard his plans  Nah  Joe Pesci s character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno  maybe he just hates MJ s music Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence  Also  the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene Bottom line  this movie is for people who like MJ on one level or another  which i think is most people   If not  then stay away  It does try and give off a wholesome message and ironically MJ s bestest buddy in this movie is a girl  Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty  Well  with all the attention i ve gave this subject    hmmm well i don t know because people can be different behind closed doors  i know this for a fact  He is either an extremely nice but stupid guy or one of the most sickest liars  I hope he is not the latter  '"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 2. Remove non-letters        \n",
    "letters_only = re.sub(\"[^a-zA-Z]\", \" \", review_text) \n",
    "letters_only"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['with',\n",
       " 'all',\n",
       " 'this',\n",
       " 'stuff',\n",
       " 'going',\n",
       " 'down',\n",
       " 'at',\n",
       " 'the',\n",
       " 'moment',\n",
       " 'with',\n",
       " 'mj',\n",
       " 'i',\n",
       " 've',\n",
       " 'started',\n",
       " 'listening',\n",
       " 'to',\n",
       " 'his',\n",
       " 'music',\n",
       " 'watching',\n",
       " 'the']"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 3. Convert to lower case, split into individual words\n",
    "words = letters_only.lower().split()                             \n",
    "words[:20]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['out',\n",
       " 'why',\n",
       " 'because',\n",
       " 'that',\n",
       " 'other',\n",
       " 'themselves',\n",
       " 'not',\n",
       " 'just',\n",
       " 'this',\n",
       " 'so',\n",
       " 't',\n",
       " 'now',\n",
       " 'itself',\n",
       " 'most',\n",
       " 'didn',\n",
       " 'did',\n",
       " 'ourselves',\n",
       " 'i',\n",
       " 'very',\n",
       " 'which']"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 4. In Python, searching a set is much faster than searching\n",
    "#   a list, so convert the stop words to a set\n",
    "stops = set(stopwords.words(\"english\"))  \n",
    "list(stops)[:20]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['stuff',\n",
       " 'going',\n",
       " 'moment',\n",
       " 'mj',\n",
       " 'started',\n",
       " 'listening',\n",
       " 'music',\n",
       " 'watching',\n",
       " 'odd',\n",
       " 'documentary']"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 5. Remove stop words\n",
    "meaningful_words = [w for w in words if not w in stops]   \n",
    "meaningful_words[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "stuff going moment mj started listening music watching odd documentary watched wiz watched moonwalker maybe want get certain insight guy thought really cool eighties maybe make mind whether guilty innocent moonwalker part biography part feature film remember going see cinema originally released subtle messages mj feeling towards press also obvious message drugs bad kay visually impressive course michael jackson unless remotely like mj anyway going hate find boring may call mj egotist consenting making movie mj fans would say made fans true really nice actual feature film bit finally starts minutes excluding smooth criminal sequence joe pesci convincing psychopathic powerful drug lord wants mj dead bad beyond mj overheard plans nah joe pesci character ranted wanted people know supplying drugs etc dunno maybe hates mj music lots cool things like mj turning car robot whole speed demon sequence also director must patience saint came filming kiddy bad sequence usually directors hate working one kid let alone whole bunch performing complex dance scene bottom line movie people like mj one level another think people stay away try give wholesome message ironically mj bestest buddy movie girl michael jackson truly one talented people ever grace planet guilty well attention gave subject hmmm well know people different behind closed doors know fact either extremely nice stupid guy one sickest liars hope latter\n"
     ]
    }
   ],
   "source": [
    "# 6. Join the words back into one string separated by space, \n",
    "# and return the result.\n",
    "print(\" \".join( meaningful_words ))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def review_to_words(raw_review):\n",
    "    # Function to convert a raw review to a string of words\n",
    "    # The input is a single string (a raw movie review), and \n",
    "    # the output is a single string (a preprocessed movie review)\n",
    "    #\n",
    "    # 1. Remove HTML\n",
    "    review_text = BeautifulSoup(raw_review, \"lxml\").get_text() \n",
    "    #\n",
    "    # 2. Remove non-letters        \n",
    "    letters_only = re.sub(\"[^a-zA-Z]\", \" \", review_text) \n",
    "    #\n",
    "    # 3. Convert to lower case, split into individual words\n",
    "    words = letters_only.lower().split()                             \n",
    "    #\n",
    "    # 4. In Python, searching a set is much faster than searching\n",
    "    #   a list, so convert the stop words to a set\n",
    "    stops = set(stopwords.words(\"english\"))                  \n",
    "    # \n",
    "    # 5. Remove stop words\n",
    "    meaningful_words = [w for w in words if not w in stops]   \n",
    "    #\n",
    "    # 6. Join the words back into one string separated by space, \n",
    "    # and return the result.\n",
    "    return( \" \".join( meaningful_words )) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "重新回到runBow.py:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Cleaning and parsing the training set movie reviews...\n"
     ]
    }
   ],
   "source": [
    "print(\"Cleaning and parsing the training set movie reviews...\")\n",
    "clean_train_reviews = []\n",
    "for i in range(0, num_reviews):\n",
    "    clean_train_reviews.append(review_to_words(train[\"review\"][i]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "准备test data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Cleaning and parsing the test set movie reviews...\n"
     ]
    }
   ],
   "source": [
    "test = pd.read_csv(\"../Sentiment/data/testData.tsv\", header = 0, delimiter = \"\\t\", quoting = 3)\n",
    "\n",
    "test = test[:100]\n",
    "num_reviews = len(test[\"review\"])\n",
    "clean_test_reviews = []\n",
    "\n",
    "print(\"Cleaning and parsing the test set movie reviews...\")\n",
    "for i in range(0, num_reviews):\n",
    "    clean_review = review_to_words(test[\"review\"][i])\n",
    "    clean_test_reviews.append(clean_review)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(100, 2)"
      ]
     },
     "execution_count": 68,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "test.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "下面是进行分类，这个需要进入BOW_LR.py，查看构建的class BagOfWords(object)。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "'''\n",
    "Train and Test\n",
    "'''\n",
    "# import BOW_LR\n",
    "\n",
    "bow = BOW_LR.BagOfWords(vocab = True, tfidf = True, max_feature = 19000)\n",
    "bow.train_lr(clean_train_reviews, list(train[\"sentiment\"]), C = 1)\n",
    "result = bow.test_lr(clean_test_reviews)\n",
    "print(result)\n",
    "\n",
    "print(\"output...\")\n",
    "out = open(\"result\\\\BOW_chi_tfidf.csv\", 'w')\n",
    "out.write(\"\\\"id\\\"\" + \",\" + \"\\\"sentiment\\\"\")\n",
    "out.write(\"\\n\")\n",
    "for i, key in enumerate(list(test[\"id\"])):\n",
    "    out.write(str(key) + \",\" + str(result[i]) + \"\\n\")\n",
    "# out.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "上面的代码不用运行，path应该是基于windows的，我们先进入BOW_LR.py，下面是整个文件完整的内容。\n",
    "\n",
    "- 首先得到lab_fea是一个1000个单词的list，用于作为维度\n",
    "- 关于CountVectorizer和TfidfVectorizer的使用，看这里：https://zhangzirui.github.io/posts/Document-14%20(sklearn-feature).md。 默认都是在init的时候建好了分类器\n",
    "- "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from utils.feature_select import select_feature\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "from sklearn.model_selection import cross_val_score\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from scipy.sparse import bsr_matrix\n",
    "import numpy as np\n",
    "\n",
    "class BagOfWords(object):\n",
    "    \n",
    "    def __init__(self, vocab = False, tfidf = False, max_feature = 1000):\n",
    "        lab_fea = None\n",
    "        if(vocab == True):\n",
    "            print(\"select features...\")\n",
    "            lab_fea = select_feature('data\\\\feature_chi.txt', max_feature)[\"1\"]\n",
    "        \n",
    "        self.vectorizer = None\n",
    "        if(tfidf == True):\n",
    "            self.vectorizer = TfidfVectorizer(analyzer = \"word\",\n",
    "                                 tokenizer = None,\n",
    "                                 preprocessor = None,\n",
    "                                 stop_words = None,\n",
    "                                 vocabulary = lab_fea,\n",
    "                                 max_features = max_feature)\n",
    "        else:\n",
    "            self.vectorizer = CountVectorizer(analyzer = \"word\",\n",
    "                                 tokenizer = None,\n",
    "                                 preprocessor = None,\n",
    "                                 stop_words = None,\n",
    "                                 vocabulary = lab_fea,\n",
    "                                 max_features = max_feature)\n",
    "        self.lr = None\n",
    "        \n",
    "    def train_lr(self, train_data, lab_data, C = 1.0):\n",
    "        train_data_features = self.vectorizer.fit_transform(train_data)\n",
    "        train_data_features = bsr_matrix(train_data_features)\n",
    "        print (train_data_features.shape)\n",
    "        \n",
    "        print(\"Training the logistic regression...\")\n",
    "        self.lr = LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=C, fit_intercept=True, intercept_scaling=1.0, class_weight=None, random_state=None) \n",
    "        self.lr = self.lr.fit(train_data_features, lab_data)\n",
    "        \n",
    "    def test_lr(self, test_data):\n",
    "        test_data_features = self.vectorizer.transform(test_data)\n",
    "        test_data_features = bsr_matrix(test_data_features)\n",
    "    \n",
    "        result = self.lr.predict_proba(test_data_features)[:,1]\n",
    "        return result\n",
    "    \n",
    "    def validate_lr(self, train_data, lab_data, C = 1.0):\n",
    "        train_data_features = self.vectorizer.fit_transform(train_data)\n",
    "        train_data_features = bsr_matrix(train_data_features)\n",
    "        lab_data = np.array(lab_data)\n",
    "        \n",
    "        print(\"start k-fold validate...\")\n",
    "        lr = LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=C, fit_intercept=True, intercept_scaling=1.0, class_weight=None, random_state=None)\n",
    "        cv = np.mean(cross_val_score(lr, train_data_features, lab_data, cv=10, scoring='roc_auc'))\n",
    "        return cv\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "下面进入select_feature，这部分没有看懂究竟是想做什么，这个feature_chi.tex的文件又是从哪里来的？但输出的是1000个单词，这个应该就是用来选择作为维度的1000个单词"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import heapq \n",
    "\n",
    "def select_feature(filePath, k):\n",
    "\tread = open(filePath, 'r')\n",
    "\tlab_fea = {}\n",
    "\t\n",
    "\tfor line in read:\n",
    "\t\tline_arr = line.strip().split()\n",
    "\t\tif len(line_arr) - 1 <= k:\n",
    "\t\t\tlab_fea[line_arr[0]] = [kv.split(':')[0] for kv in line_arr[1 : ]]\n",
    "\t\telse:\n",
    "\t\t\theap = []\n",
    "\t\t\theapq.heapify(heap)\n",
    "\t\t\tfor kv in line_arr[1 : ]:\n",
    "\t\t\t\tkey, val = kv.split(':')\n",
    "\t\t\t\tif len(heap) < k:\n",
    "\t\t\t\t\theapq.heappush(heap, (float(val), key))\n",
    "\t\t\t\telse:\n",
    "\t\t\t\t\tif float(val) > heap[0][0]:\n",
    "\t\t\t\t\t\theapq.heappop(heap)\n",
    "\t\t\t\t\t\theapq.heappush(heap, (float(val), key))\n",
    "\t\t\tlab_fea[line_arr[0]] = [heapq.heappop(heap)[1] for i in range(len(heap))]\n",
    "\tread.close()\n",
    "\treturn lab_fea"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "lab_fea = select_feature('feature_chi.txt', 1000)['1']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1000"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(lab_fea)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "接上上面BagOfWords，这里先分析一下train_lr的过程："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "max_feature = 1000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "vectorizer = TfidfVectorizer(analyzer = \"word\",\n",
    "                                 tokenizer = None,\n",
    "                                 preprocessor = None,\n",
    "                                 stop_words = None,\n",
    "                                 vocabulary = lab_fea,\n",
    "                                 max_features = max_feature)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def train_lr(self, train_data, lab_data, C = 1.0):\n",
    "        train_data_features = self.vectorizer.fit_transform(train_data)\n",
    "        train_data_features = bsr_matrix(train_data_features)\n",
    "        print (train_data_features.shape)\n",
    "        \n",
    "        print(\"Training the logistic regression...\")\n",
    "        self.lr = LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=C, fit_intercept=True, intercept_scaling=1.0, class_weight=None, random_state=None) \n",
    "        self.lr = self.lr.fit(train_data_features, lab_data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "train_data_features = vectorizer.fit_transform(clean_train_reviews)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<100x1000 sparse matrix of type '<class 'numpy.float64'>'\n",
       "\twith 2607 stored elements in Compressed Sparse Row format>"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_data_features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "train_data_features = bsr_matrix(train_data_features)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<100x1000 sparse matrix of type '<class 'numpy.float64'>'\n",
       "\twith 2607 stored elements (blocksize = 1x1) in Block Sparse Row format>"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_data_features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(100, 1000)"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_data_features.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# target \n",
    "lab_data = np.array(list(train[\"sentiment\"]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "lr = LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1, fit_intercept=True, intercept_scaling=1.0, class_weight=None, random_state=None) \n",
    "lr = lr.fit(train_data_features, lab_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对test data进行预测："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# prepare test data\n",
    "test_data_features = vectorizer.transform(clean_test_reviews)\n",
    "test_data_features = bsr_matrix(test_data_features)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# predict\n",
    "result = lr.predict_proba(test_data_features)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 0.49322008,  0.50677992])"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这个结果应该是分别为0和1的概率，我们要的是1的概率"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "result = lr.predict_proba(test_data_features)[:, 1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "result2 = lr.predict(test_data_features)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,\n",
       "       0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1,\n",
       "       0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1,\n",
       "       0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0,\n",
       "       1, 0, 1, 0, 1, 0, 0, 0])"
      ]
     },
     "execution_count": 64,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>sentiment</th>\n",
       "      <th>review</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>\"5814_8\"</td>\n",
       "      <td>1</td>\n",
       "      <td>\"With all this stuff going down at the moment ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>\"2381_9\"</td>\n",
       "      <td>1</td>\n",
       "      <td>\"\\\"The Classic War of the Worlds\\\" by Timothy ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>\"7759_3\"</td>\n",
       "      <td>0</td>\n",
       "      <td>\"The film starts with a manager (Nicholas Bell...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>\"3630_4\"</td>\n",
       "      <td>0</td>\n",
       "      <td>\"It must be assumed that those who praised thi...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>\"9495_8\"</td>\n",
       "      <td>1</td>\n",
       "      <td>\"Superbly trashy and wondrously unpretentious ...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         id  sentiment                                             review\n",
       "0  \"5814_8\"          1  \"With all this stuff going down at the moment ...\n",
       "1  \"2381_9\"          1  \"\\\"The Classic War of the Worlds\\\" by Timothy ...\n",
       "2  \"7759_3\"          0  \"The film starts with a manager (Nicholas Bell...\n",
       "3  \"3630_4\"          0  \"It must be assumed that those who praised thi...\n",
       "4  \"9495_8\"          1  \"Superbly trashy and wondrously unpretentious ..."
      ]
     },
     "execution_count": 66,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "test.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [py35]",
   "language": "python",
   "name": "Python [py35]"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}