{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) 2015, 2016 [Sebastian Raschka](http://sebastianraschka.com)\n",
"\n",
"https://github.com/rasbt/python-machine-learning-book\n",
"\n",
"[MIT License](https://github.com/rasbt/python-machine-learning-book/blob/master/LICENSE.txt)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Python Machine Learning - Code Examples"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Chapter 8 - Applying Machine Learning To Sentiment Analysis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the optional watermark extension is a small IPython notebook plugin that I developed to make the code reproducible. You can just skip the following line(s)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Sebastian Raschka \n",
"last updated: 2016-09-29 \n",
"\n",
"CPython 3.5.2\n",
"IPython 5.1.0\n",
"\n",
"numpy 1.11.1\n",
"pandas 0.18.1\n",
"matplotlib 1.5.1\n",
"sklearn 0.18\n",
"nltk 3.2.1\n"
]
}
],
"source": [
"%load_ext watermark\n",
"%watermark -a 'Sebastian Raschka' -u -d -v -p numpy,pandas,matplotlib,sklearn,nltk"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"*The use of `watermark` is optional. You can install this IPython extension via \"`pip install watermark`\". For more information, please see: https://github.com/rasbt/watermark.*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Overview"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- [Obtaining the IMDb movie review dataset](#Obtaining-the-IMDb-movie-review-dataset)\n",
"- [Introducing the bag-of-words model](#Introducing-the-bag-of-words-model)\n",
"    - [Transforming words into feature vectors](#Transforming-words-into-feature-vectors)\n",
"    - [Assessing word relevancy via term frequency-inverse document frequency](#Assessing-word-relevancy-via-term-frequency-inverse-document-frequency)\n",
"    - [Cleaning text data](#Cleaning-text-data)\n",
"    - [Processing documents into tokens](#Processing-documents-into-tokens)\n",
"- [Training a logistic regression model for document classification](#Training-a-logistic-regression-model-for-document-classification)\n",
"- [Working with bigger data – online algorithms and out-of-core learning](#Working-with-bigger-data-–-online-algorithms-and-out-of-core-learning)\n",
"- [Summary](#Summary)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Version check to handle API changes introduced in scikit-learn 0.18\n",
"from distutils.version import LooseVersion as Version\n",
"from sklearn import __version__ as sklearn_version"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Obtaining the IMDb movie review dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The IMDb movie review set can be downloaded from [http://ai.stanford.edu/~amaas/data/sentiment/](http://ai.stanford.edu/~amaas/data/sentiment/).\n",
"After downloading the dataset, decompress the files.\n",
"\n",
"A) If you are working with Linux or Mac OS X, open a new terminal window, `cd` into the download directory, and execute\n",
"\n",
"`tar -zxf aclImdb_v1.tar.gz`\n",
"\n",
"B) If you are working with Windows, download an archiver such as [7-Zip](http://www.7-zip.org) to extract the files from the download archive."
]
},
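{
"cell_type": "markdown",
"metadata": {},
"source": [
"On any platform, the archive can also be unpacked from Python with the standard library's `tarfile` module instead of a terminal command or a separate archiver. The following is a minimal sketch, not part of the original chapter code: it assumes that `aclImdb_v1.tar.gz` was downloaded into the current working directory, and the helper name `extract_archive` is purely illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import tarfile\n",
"\n",
"\n",
"def extract_archive(archive_path, dest_dir='.'):\n",
"    # unpack a gzip-compressed tar archive into dest_dir\n",
"    with tarfile.open(archive_path, 'r:gz') as tar:\n",
"        tar.extractall(path=dest_dir)\n",
"\n",
"\n",
"# assumed file name and location; adjust to your download directory\n",
"archive = './aclImdb_v1.tar.gz'\n",
"if os.path.isfile(archive):\n",
"    extract_archive(archive)\n",
"else:\n",
"    print('Archive not found:', archive)"
]
},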
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Compatibility Note:\n",
"\n",
"I received an email from a reader who had trouble reading the movie review texts due to encoding issues. Typically, Python's default encoding is set to `'utf-8'`, which shouldn't cause problems when running this IPython notebook. You can check the encoding on your machine by starting a new Python interpreter from the command line and executing\n",
"\n",
"    >>> import sys\n",
"    >>> sys.getdefaultencoding()\n",
"    \n",
"If the returned result is **not** `'utf-8'`, you probably need to change your Python's encoding to `'utf-8'`, for example by typing `export PYTHONIOENCODING=utf8` in your terminal shell prior to running this IPython notebook. (Note that this is a temporary change, and it needs to be executed in the same shell that you'll use to launch `ipython notebook`.)\n",
"\n",
"Alternatively, you can replace the lines\n",
"\n",
"    with open(os.path.join(path, file), 'r') as infile:\n",
"    ...\n",
"    pd.read_csv('./movie_data.csv')\n",
"    ...\n",
"    df.to_csv('./movie_data.csv', index=False)\n",
"\n",
"with\n",
"\n",
"    with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:\n",
"    ...\n",
"    pd.read_csv('./movie_data.csv', encoding='utf-8')\n",
"    ...\n",
"    df.to_csv('./movie_data.csv', index=False, encoding='utf-8')\n",
"    \n",
"in the following cells to achieve the desired effect."
]
},
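{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a side note, in Python 3 the built-in `open()` falls back to the locale's preferred encoding when no `encoding` argument is given, so it can help to inspect that value as well. The cell below is a small sketch using only the standard library; it is an addition for illustration and not part of the original chapter code."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import locale\n",
"import sys\n",
"\n",
"# default encoding used for str <-> bytes conversions\n",
"print('sys.getdefaultencoding():', sys.getdefaultencoding())\n",
"\n",
"# encoding that open() uses when no `encoding` argument is passed\n",
"print('locale.getpreferredencoding():', locale.getpreferredencoding(False))"
]
},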
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"0% 100%\n",
"[##############################] | ETA: 00:00:00\n",
"Total time elapsed: 00:09:04\n"
]
}
],
"source": [
"import pyprind\n",
"import pandas as pd\n",
"import os\n",
"\n",
"# change the `basepath` to the directory of the\n",
"# unzipped movie dataset\n",
"\n",
"#basepath = '/Users/Sebastian/Desktop/aclImdb/'\n",
"basepath = './aclImdb'\n",
"\n",
"labels = {'pos': 1, 'neg': 0}\n",
"pbar = pyprind.ProgBar(50000)\n",
"rows = []  # collect [review, label] pairs and build the DataFrame once\n",
"for s in ('test', 'train'):\n",
"    for l in ('pos', 'neg'):\n",
"        path = os.path.join(basepath, s, l)\n",
"        for file in os.listdir(path):\n",
"            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:\n",
"                txt = infile.read()\n",
"            rows.append([txt, labels[l]])\n",
"            pbar.update()\n",
"df = pd.DataFrame(rows, columns=['review', 'sentiment'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Shuffling the DataFrame:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"np.random.seed(0)\n",
"df = df.reindex(np.random.permutation(df.index))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optional: Saving the assembled data as a CSV file:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df.to_csv('./movie_data.csv', index=False)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n", " | review | \n", "sentiment | \n", "
---|---|---|
0 | \n", "In 1974, the teenager Martha Moxley (Maggie Gr... | \n", "1 | \n", "
1 | \n", "OK... so... I really like Kris Kristofferson a... | \n", "0 | \n", "
2 | \n", "***SPOILER*** Do not read this, if you think a... | \n", "0 | \n", "