{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Natural Language Processing (NLP)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "\n", "*Adapted from [NLP Crash Course](http://files.meetup.com/7616132/DC-NLP-2013-09%20Charlie%20Greenbacker.pdf) by Charlie Greenbacker and [Introduction to NLP](http://spark-public.s3.amazonaws.com/nlp/slides/intro.pdf) by Dan Jurafsky*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What is NLP?\n", "\n", "- Using computers to process (analyze, understand, generate) natural human languages\n", "- Most knowledge created by humans is unstructured text, and we need a way to make sense of it\n", "- Build probabilistic model using data about a language" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What are some of the higher level task areas?\n", "\n", "- **Information retrieval**: Find relevant results and similar results\n", " - [Google](https://www.google.com/)\n", "- **Information extraction**: Structured information from unstructured documents\n", " - [Events from Gmail](https://support.google.com/calendar/answer/6084018?hl=en)\n", "- **Machine translation**: One language to another\n", " - [Google Translate](https://translate.google.com/)\n", "- **Text simplification**: Preserve the meaning of text, but simplify the grammar and vocabulary\n", " - [Rewordify](https://rewordify.com/)\n", " - [Simple English Wikipedia](https://simple.wikipedia.org/wiki/Main_Page)\n", "- **Predictive text input**: Faster or easier typing\n", " - [My application](https://justmarkham.shinyapps.io/textprediction/)\n", " - [A much better application](https://farsite.shinyapps.io/swiftkey-cap/)\n", "- **Sentiment analysis**: Attitude of speaker\n", " - [Hater News](http://haternews.herokuapp.com/)\n", "- **Automatic summarization**: Extractive or abstractive summarization\n", " - [autotldr](https://www.reddit.com/r/technology/comments/35brc8/21_million_people_still_use_aol_dialup/cr2zzj0)\n", "- **Natural Language Generation**: Generate text from data\n", " - [How a computer describes a sports match](http://www.bbc.com/news/technology-34204052)\n", " - [Publishers withdraw more than 120 gibberish papers](http://www.nature.com/news/publishers-withdraw-more-than-120-gibberish-papers-1.14763)\n", "- **Speech recognition and generation**: Speech-to-text, text-to-speech\n", " - [Google's Web Speech API demo](https://www.google.com/intl/en/chrome/demos/speech.html)\n", " - [Vocalware Text-to-Speech demo](https://www.vocalware.com/index/demo)\n", "- **Question answering**: Determine the intent of the question, match query with knowledge base, evaluate hypotheses\n", " - [How did supercomputer Watson beat Jeopardy champion Ken Jennings?](http://blog.ted.com/how-did-supercomputer-watson-beat-jeopardy-champion-ken-jennings-experts-discuss/)\n", " - [IBM's Watson Trivia Challenge](http://www.nytimes.com/interactive/2010/06/16/magazine/watson-trivia-game.html)\n", " - [The AI Behind Watson](http://www.aaai.org/Magazine/Watson/watson.php)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What are some of the lower level components?\n", "\n", "- **Tokenization**: breaking text into tokens (words, sentences, n-grams)\n", "- **Stopword removal**: a/an/the\n", "- **Stemming and lemmatization**: root word\n", "- **TF-IDF**: word importance\n", "- **Part-of-speech tagging**: noun/verb/adjective\n", "- **Named entity recognition**: person/organization/location\n", "- **Spelling correction**: 
\"New Yrok City\"\n", "- **Word sense disambiguation**: \"buy a mouse\"\n", "- **Segmentation**: \"New York City subway\"\n", "- **Language detection**: \"translate this page\"\n", "- **Machine learning**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Why is NLP hard?\n", "\n", "- **Ambiguity**:\n", " - Hospitals are Sued by 7 Foot Doctors\n", " - Juvenile Court to Try Shooting Defendant\n", " - Local High School Dropouts Cut in Half\n", "- **Non-standard English**: text messages\n", "- **Idioms**: \"throw in the towel\"\n", "- **Newly coined words**: \"retweet\"\n", "- **Tricky entity names**: \"Where is A Bug's Life playing?\"\n", "- **World knowledge**: \"Mary and Sue are sisters\", \"Mary and Sue are mothers\"\n", "\n", "NLP requires an understanding of the **language** and the **world**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1: Reading in the Yelp Reviews" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- \"corpus\" = collection of documents\n", "- \"corpora\" = plural form of corpus" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import scipy as sp\n", "from sklearn.cross_validation import train_test_split\n", "from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer\n", "from sklearn.naive_bayes import MultinomialNB\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn import metrics\n", "from textblob import TextBlob, Word\n", "from nltk.stem.snowball import SnowballStemmer\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# read yelp.csv into a DataFrame\n", "url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/yelp.csv'\n", "yelp = pd.read_csv(url)\n", "\n", "# create a new DataFrame that only contains the 5-star and 1-star reviews\n", "yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]\n", "\n", "# define X and y\n", "X = yelp_best_worst.text\n", "y = yelp_best_worst.stars\n", "\n", "# split the new DataFrame into training and testing sets\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 2: Tokenization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- **What:** Separate text into units such as sentences or words\n", "- **Why:** Gives structure to previously unstructured text\n", "- **Notes:** Relatively easy with English language text, not easy with some languages" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# use CountVectorizer to create document-term matrices from X_train and X_test\n", "vect = CountVectorizer()\n", "X_train_dtm = vect.fit_transform(X_train)\n", "X_test_dtm = vect.transform(X_test)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(3064, 16825)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# rows are documents, columns are terms (aka \"tokens\" or \"features\")\n", "X_train_dtm.shape" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[u'yyyyy', u'z11', u'za', u'zabba', u'zach', u'zam', u'zanella', u'zankou', u'zappos', 
u'zatsiki', u'zen', u'zero', u'zest', u'zexperience', u'zha', u'zhou', u'zia', u'zihuatenejo', u'zilch', u'zin', u'zinburger', u'zinburgergeist', u'zinc', u'zinfandel', u'zing', u'zip', u'zipcar', u'zipper', u'zippers', u'zipps', u'ziti', u'zoe', u'zombi', u'zombies', u'zone', u'zones', u'zoning', u'zoo', u'zoyo', u'zucca', u'zucchini', u'zuchinni', u'zumba', u'zupa', u'zuzu', u'zwiebel', u'zzed', u'\xe9clairs', u'\xe9cole', u'\xe9m']\n" ] } ], "source": [ "# last 50 features\n", "print vect.get_feature_names()[-50:]" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',\n", "        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',\n", "        lowercase=True, max_df=1.0, max_features=None, min_df=1,\n", "        ngram_range=(1, 1), preprocessor=None, stop_words=None,\n", "        strip_accents=None, token_pattern=u'(?u)\\\\b\\\\w\\\\w+\\\\b',\n", "        tokenizer=None, vocabulary=None)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# show vectorizer options\n", "vect" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[CountVectorizer documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- **lowercase:** boolean, True by default\n", "- Convert all characters to lowercase before tokenizing." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(3064, 20838)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# don't convert to lowercase\n", "vect = CountVectorizer(lowercase=False)\n", "X_train_dtm = vect.fit_transform(X_train)\n", "X_train_dtm.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- **ngram_range:** tuple (min_n, max_n)\n", "- The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used."
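, "\n", "For example, with `ngram_range=(1, 2)`, the text \"happy hour downtown\" yields the unigrams \"happy\", \"hour\", and \"downtown\" plus the bigrams \"happy hour\" and \"hour downtown\". Below is a minimal sketch of this (not part of the original lesson); the sentence is made up, and it assumes CountVectorizer has already been imported:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# minimal sketch (hypothetical sentence): extract 1-grams and 2-grams\n", "sketch_vect = CountVectorizer(ngram_range=(1, 2))\n", "sketch_vect.fit(['happy hour downtown'])\n", "\n", "# expect the three unigrams plus 'happy hour' and 'hour downtown'\n", "print sketch_vect.get_feature_names()"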
] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(3064, 169847)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# include 1-grams and 2-grams\n", "vect = CountVectorizer(ngram_range=(1, 2))\n", "X_train_dtm = vect.fit_transform(X_train)\n", "X_train_dtm.shape" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[u'zone out', u'zone when', u'zones', u'zones dolls', u'zoning', u'zoning issues', u'zoo', u'zoo and', u'zoo is', u'zoo not', u'zoo the', u'zoo ve', u'zoyo', u'zoyo for', u'zucca', u'zucca appetizer', u'zucchini', u'zucchini and', u'zucchini bread', u'zucchini broccoli', u'zucchini carrots', u'zucchini fries', u'zucchini pieces', u'zucchini strips', u'zucchini veal', u'zucchini very', u'zucchini with', u'zuchinni', u'zuchinni again', u'zuchinni the', u'zumba', u'zumba class', u'zumba or', u'zumba yogalates', u'zupa', u'zupa flavors', u'zuzu', u'zuzu in', u'zuzu is', u'zuzu the', u'zwiebel', u'zwiebel kr\\xe4uter', u'zzed', u'zzed in', u'\\xe9clairs', u'\\xe9clairs napoleons', u'\\xe9cole', u'\\xe9cole len\\xf4tre', u'\\xe9m', u'\\xe9m all']\n" ] } ], "source": [ "# last 50 features\n", "print vect.get_feature_names()[-50:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Predicting the star rating:**" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.918786692759\n" ] } ], "source": [ "# use default options for CountVectorizer\n", "vect = CountVectorizer()\n", "\n", "# create document-term matrices\n", "X_train_dtm = vect.fit_transform(X_train)\n", "X_test_dtm = vect.transform(X_test)\n", "\n", "# use Naive Bayes to predict the star rating\n", "nb = MultinomialNB()\n", "nb.fit(X_train_dtm, y_train)\n", "y_pred_class = nb.predict(X_test_dtm)\n", "\n", "# calculate accuracy\n", "print metrics.accuracy_score(y_test, y_pred_class)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.81996086105675148" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# calculate null accuracy\n", "y_test_binary = np.where(y_test==5, 1, 0)\n", "max(y_test_binary.mean(), 1 - y_test_binary.mean())" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# define a function that accepts a vectorizer and calculates the accuracy\n", "def tokenize_test(vect):\n", " X_train_dtm = vect.fit_transform(X_train)\n", " print 'Features: ', X_train_dtm.shape[1]\n", " X_test_dtm = vect.transform(X_test)\n", " nb = MultinomialNB()\n", " nb.fit(X_train_dtm, y_train)\n", " y_pred_class = nb.predict(X_test_dtm)\n", " print 'Accuracy: ', metrics.accuracy_score(y_test, y_pred_class)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Features: 169847\n", "Accuracy: 0.854207436399\n" ] } ], "source": [ "# include 1-grams and 2-grams\n", "vect = CountVectorizer(ngram_range=(1, 2))\n", "tokenize_test(vect)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 3: Stopword Removal" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- **What:** Remove common words that will likely 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Part 3: Stopword Removal" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- **What:** Remove common words that will likely appear in any text\n", "- **Why:** They don't tell you much about your text" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',\n", "        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',\n", "        lowercase=True, max_df=1.0, max_features=None, min_df=1,\n", "        ngram_range=(1, 2), preprocessor=None, stop_words=None,\n", "        strip_accents=None, token_pattern=u'(?u)\\\\b\\\\w\\\\w+\\\\b',\n", "        tokenizer=None, vocabulary=None)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# show vectorizer options\n", "vect" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- **stop_words:** string {'english'}, list, or None (default)\n", "- If 'english', a built-in stop word list for English is used.\n", "- If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.\n", "- If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra-corpus document frequency of terms." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Features: 16528\n", "Accuracy: 0.915851272016\n" ] } ], "source": [ "# remove English stop words\n", "vect = CountVectorizer(stop_words='english')\n", "tokenize_test(vect)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "frozenset(['all', 'six', 'less', 'being', 'indeed', 'over', 'move', 'anyway', 'four', 'not', 'own', 'through', 'yourselves', 'fify', 'where', 'mill', 'only', 'find', 'before', 'one', 'whose', 'system', 'how', 'somewhere', 'with', 'thick', 'show', 'had', 'enough', 'should', 'to', 'must', 'whom', 'seeming', 'under', 'ours', 'has', 'might', 'thereafter', 'latterly', 'do', 'them', 'his', 'around', 'than', 'get', 'very', 'de', 'none', 'cannot', 'every', 'whether', 'they', 'front', 'during', 'thus', 'now', 'him', 'nor', 'name', 'several', 'hereafter', 'always', 'who', 'cry', 'whither', 'this', 'someone', 'either', 'each', 'become', 'thereupon', 'sometime', 'side', 'two', 'therein', 'twelve', 'because', 'often', 'ten', 'our', 'eg', 'some', 'back', 'up', 'go', 'namely', 'towards', 'are', 'further', 'beyond', 'ourselves', 'yet', 'out', 'even', 'will', 'what', 'still', 'for', 'bottom', 'mine', 'since', 'please', 'forty', 'per', 'its', 'everything', 'behind', 'un', 'above', 'between', 'it', 'neither', 'seemed', 'ever', 'across', 'she', 'somehow', 'be', 'we', 'full', 'never', 'sixty', 'however', 'here', 'otherwise', 'were', 'whereupon', 'nowhere', 'although', 'found', 'alone', 're', 'along', 'fifteen', 'by', 'both', 'about', 'last', 'would', 'anything', 'via', 'many', 'could', 'thence', 'put', 'against', 'keep', 'etc', 'amount', 'became', 'ltd', 'hence', 'onto', 'or', 'con', 'among', 'already', 'co', 'afterwards', 'formerly', 'within', 'seems', 'into', 'others', 'while', 'whatever', 'except', 'down', 'hers', 'everyone', 'done', 'least', 'another', 'whoever', 'moreover', 'couldnt', 'throughout', 'anyhow', 'yourself', 'three', 'from', 'her', 'few', 'together', 'top', 'there', 'due', 'been', 'next', 'anyone', 'eleven', 'much', 'call', 'therefore', 'interest', 'then', 'thru', 'themselves', 'hundred', 'was', 'sincere', 'empty', 'more', 'himself', 'elsewhere', 'mostly', 'on', 'fire', 'am', 'becoming', 'hereby', 'amongst', 'else', 'part', 'everywhere', 'too', 'herself', 'former', 'those', 'he', 'me', 'myself', 'made', 'twenty', 'these', 'bill', 'cant', 'us', 'until', 'besides', 'nevertheless', 'below', 'anywhere', 'nine', 'can', 'of', 'toward', 'my', 'something', 'and', 'whereafter', 'whenever', 'give', 'almost', 'wherever', 'is', 'describe', 'beforehand', 'herein', 'an', 'as', 'itself', 'at', 'have', 'in', 'seem', 'whence', 'ie', 'any', 'fill', 'again', 'hasnt', 'inc', 'thereby', 'thin', 'no', 'perhaps', 'latter', 'meanwhile', 'when', 'detail', 'same', 'wherein', 'beside', 'also', 'that', 'other', 'take', 'which', 'becomes', 'you', 'if', 'nobody', 'see', 'though', 'may', 'after', 'upon', 'most', 'hereupon', 'eight', 'but', 'serious', 'nothing', 'such', 'your', 'why', 'a', 'off', 'whereby', 'third', 'i', 'whole', 'noone', 'sometimes', 'well', 'amoungst', 'yours', 'their', 'rather', 'without', 'so', 'five', 'the', 'first', 'whereas', 'once'])\n" ] } ], "source": [ "# set of stop words\n", "print vect.get_stop_words()" ] },
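{ "cell_type": "markdown", "metadata": {}, "source": [ "The `stop_words` parameter also accepts a custom list, per the documentation above. A minimal sketch (not part of the original lesson) with a made-up list of domain-specific words to ignore:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# minimal sketch: remove custom (hypothetical) stop words instead of the built-in list\n", "custom_vect = CountVectorizer(stop_words=['food', 'place', 'restaurant'])\n", "tokenize_test(custom_vect)" ] },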
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Part 4: Other CountVectorizer Options" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- **max_features:** int or None, default=None\n", "- If not None, build a vocabulary that only considers the top max_features ordered by term frequency across the corpus." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Features: 100\n", "Accuracy: 0.869863013699\n" ] } ], "source": [ "# remove English stop words and only keep 100 features\n", "vect = CountVectorizer(stop_words='english', max_features=100)\n", "tokenize_test(vect)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[u'amazing', u'area', u'atmosphere', u'awesome', u'bad', u'bar', u'best', u'better', u'big', u'came', u'cheese', u'chicken', u'clean', u'coffee', u'come', u'day', u'definitely', u'delicious', u'did', u'didn', u'dinner', u'don', u'eat', u'excellent', u'experience', u'favorite', u'feel', u'food', u'free', u'fresh', u'friendly', u'friends', u'going', u'good', u'got', u'great', u'happy', u'home', u'hot', u'hour', u'just', u'know', u'like', u'little', u'll', u'location', u'long', u'looking', u'lot', u'love', u'lunch', u'make', u'meal', u'menu', u'minutes', u'need', u'new', u'nice', u'night', u'order', u'ordered', u'people', u'perfect', u'phoenix', u'pizza', u'place', u'pretty', u'prices', u'really', u'recommend', u'restaurant', u'right', u'said', u'salad', u'sandwich', u'sauce', u'say', u'service', u'staff', u'store', u'sure', u'table', u'thing', u'things', u'think', u'time', u'times', u'took', u'town', u'tried', u'try', u've', u'wait', u'want', u'way', u'went', u'wine', u'work', u'worth', u'years']\n" ] } ], "source": [ "# all 100 features\n", "print vect.get_feature_names()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Features: 100000\n", "Accuracy: 0.885518590998\n" ] } ], "source": [ "# include 1-grams and 2-grams, and limit the number of features\n", "vect = CountVectorizer(ngram_range=(1, 2), max_features=100000)\n", "tokenize_test(vect)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- **min_df:** float in range [0.0, 1.0] or int, default=1\n", "- 
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Features: 43957\n", "Accuracy: 0.932485322896\n" ] } ], "source": [ "# include 1-grams and 2-grams, and only include terms that appear at least 2 times\n", "vect = CountVectorizer(ngram_range=(1, 2), min_df=2)\n", "tokenize_test(vect)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 5: Introduction to TextBlob" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "TextBlob: \"Simplified Text Processing\"" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "My wife took me here on my birthday for breakfast and it was excellent. The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure. Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning. It looked like the place fills up pretty quickly so the earlier you get here the better.\n", "\n", "Do yourself a favor and get their Bloody Mary. It was phenomenal and simply the best I've ever had. I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it. It was amazing.\n", "\n", "While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious. It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete. 
It was the best \"toast\" I've ever had.\n", "\n", "Anyway, I can't wait to go back!\n" ] } ], "source": [ "# print the first review\n", "print yelp_best_worst.text[0]" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# save it as a TextBlob object\n", "review = TextBlob(yelp_best_worst.text[0])" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "WordList(['My', 'wife', 'took', 'me', 'here', 'on', 'my', 'birthday', 'for', 'breakfast', 'and', 'it', 'was', 'excellent', 'The', 'weather', 'was', 'perfect', 'which', 'made', 'sitting', 'outside', 'overlooking', 'their', 'grounds', 'an', 'absolute', 'pleasure', 'Our', 'waitress', 'was', 'excellent', 'and', 'our', 'food', 'arrived', 'quickly', 'on', 'the', 'semi-busy', 'Saturday', 'morning', 'It', 'looked', 'like', 'the', 'place', 'fills', 'up', 'pretty', 'quickly', 'so', 'the', 'earlier', 'you', 'get', 'here', 'the', 'better', 'Do', 'yourself', 'a', 'favor', 'and', 'get', 'their', 'Bloody', 'Mary', 'It', 'was', 'phenomenal', 'and', 'simply', 'the', 'best', 'I', \"'ve\", 'ever', 'had', 'I', \"'m\", 'pretty', 'sure', 'they', 'only', 'use', 'ingredients', 'from', 'their', 'garden', 'and', 'blend', 'them', 'fresh', 'when', 'you', 'order', 'it', 'It', 'was', 'amazing', 'While', 'EVERYTHING', 'on', 'the', 'menu', 'looks', 'excellent', 'I', 'had', 'the', 'white', 'truffle', 'scrambled', 'eggs', 'vegetable', 'skillet', 'and', 'it', 'was', 'tasty', 'and', 'delicious', 'It', 'came', 'with', '2', 'pieces', 'of', 'their', 'griddled', 'bread', 'with', 'was', 'amazing', 'and', 'it', 'absolutely', 'made', 'the', 'meal', 'complete', 'It', 'was', 'the', 'best', 'toast', 'I', \"'ve\", 'ever', 'had', 'Anyway', 'I', 'ca', \"n't\", 'wait', 'to', 'go', 'back'])" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# list the words\n", "review.words" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[Sentence(\"My wife took me here on my birthday for breakfast and it was excellent.\"),\n", " Sentence(\"The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.\"),\n", " Sentence(\"Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.\"),\n", " Sentence(\"It looked like the place fills up pretty quickly so the earlier you get here the better.\"),\n", " Sentence(\"Do yourself a favor and get their Bloody Mary.\"),\n", " Sentence(\"It was phenomenal and simply the best I've ever had.\"),\n", " Sentence(\"I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it.\"),\n", " Sentence(\"It was amazing.\"),\n", " Sentence(\"While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.\"),\n", " Sentence(\"It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.\"),\n", " Sentence(\"It was the best \"toast\" I've ever had.\"),\n", " Sentence(\"Anyway, I can't wait to go back!\")]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# list the sentences\n", "review.sentences" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "TextBlob(\"my wife took me here on my 
birthday for breakfast and it was excellent. the weather was perfect which made sitting outside overlooking their grounds an absolute pleasure. our waitress was excellent and our food arrived quickly on the semi-busy saturday morning. it looked like the place fills up pretty quickly so the earlier you get here the better.\n", "\n", "do yourself a favor and get their bloody mary. it was phenomenal and simply the best i've ever had. i'm pretty sure they only use ingredients from their garden and blend them fresh when you order it. it was amazing.\n", "\n", "while everything on the menu looks excellent, i had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious. it came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete. it was the best \"toast\" i've ever had.\n", "\n", "anyway, i can't wait to go back!\")" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# some string methods are available\n", "review.lower()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 6: Stemming and Lemmatization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Stemming:**\n", "\n", "- **What:** Reduce a word to its base/stem/root form\n", "- **Why:** Often makes sense to treat related words the same way\n", "- **Notes:**\n", " - Uses a \"simple\" and fast rule-based approach\n", " - Stemmed words are usually not shown to users (used for analysis/indexing)\n", " - Some search engines treat words with the same stem as synonyms" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[u'my', u'wife', u'took', u'me', u'here', u'on', u'my', u'birthday', u'for', u'breakfast', u'and', u'it', u'was', u'excel', u'the', u'weather', u'was', u'perfect', u'which', u'made', u'sit', u'outsid', u'overlook', u'their', u'ground', u'an', u'absolut', u'pleasur', u'our', u'waitress', u'was', u'excel', u'and', u'our', u'food', u'arriv', u'quick', u'on', u'the', u'semi-busi', u'saturday', u'morn', u'it', u'look', u'like', u'the', u'place', u'fill', u'up', u'pretti', u'quick', u'so', u'the', u'earlier', u'you', u'get', u'here', u'the', u'better', u'do', u'yourself', u'a', u'favor', u'and', u'get', u'their', u'bloodi', u'mari', u'it', u'was', u'phenomen', u'and', u'simpli', u'the', u'best', u'i', u've', u'ever', u'had', u'i', u\"'m\", u'pretti', u'sure', u'they', u'onli', u'use', u'ingredi', u'from', u'their', u'garden', u'and', u'blend', u'them', u'fresh', u'when', u'you', u'order', u'it', u'it', u'was', u'amaz', u'while', u'everyth', u'on', u'the', u'menu', u'look', u'excel', u'i', u'had', u'the', u'white', u'truffl', u'scrambl', u'egg', u'veget', u'skillet', u'and', u'it', u'was', u'tasti', u'and', u'delici', u'it', u'came', u'with', u'2', u'piec', u'of', u'their', u'griddl', u'bread', u'with', u'was', u'amaz', u'and', u'it', u'absolut', u'made', u'the', u'meal', u'complet', u'it', u'was', u'the', u'best', u'toast', u'i', u've', u'ever', u'had', u'anyway', u'i', u'ca', u\"n't\", u'wait', u'to', u'go', u'back']\n" ] } ], "source": [ "# initialize stemmer\n", "stemmer = SnowballStemmer('english')\n", "\n", "# stem each word\n", "print [stemmer.stem(word) for word in review.words]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Lemmatization**\n", "\n", "- **What:** Derive the canonical form ('lemma') of a word\n", "- **Why:** Can be better than stemming\n", "- **Notes:** 
Uses a dictionary-based approach (slower than stemming)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['My', 'wife', 'took', 'me', 'here', 'on', 'my', 'birthday', 'for', 'breakfast', 'and', 'it', u'wa', 'excellent', 'The', 'weather', u'wa', 'perfect', 'which', 'made', 'sitting', 'outside', 'overlooking', 'their', u'ground', 'an', 'absolute', 'pleasure', 'Our', 'waitress', u'wa', 'excellent', 'and', 'our', 'food', 'arrived', 'quickly', 'on', 'the', 'semi-busy', 'Saturday', 'morning', 'It', 'looked', 'like', 'the', 'place', u'fill', 'up', 'pretty', 'quickly', 'so', 'the', 'earlier', 'you', 'get', 'here', 'the', 'better', 'Do', 'yourself', 'a', 'favor', 'and', 'get', 'their', 'Bloody', 'Mary', 'It', u'wa', 'phenomenal', 'and', 'simply', 'the', 'best', 'I', \"'ve\", 'ever', 'had', 'I', \"'m\", 'pretty', 'sure', 'they', 'only', 'use', u'ingredient', 'from', 'their', 'garden', 'and', 'blend', 'them', 'fresh', 'when', 'you', 'order', 'it', 'It', u'wa', 'amazing', 'While', 'EVERYTHING', 'on', 'the', 'menu', u'look', 'excellent', 'I', 'had', 'the', 'white', 'truffle', 'scrambled', u'egg', 'vegetable', 'skillet', 'and', 'it', u'wa', 'tasty', 'and', 'delicious', 'It', 'came', 'with', '2', u'piece', 'of', 'their', 'griddled', 'bread', 'with', u'wa', 'amazing', 'and', 'it', 'absolutely', 'made', 'the', 'meal', 'complete', 'It', u'wa', 'the', 'best', 'toast', 'I', \"'ve\", 'ever', 'had', 'Anyway', 'I', 'ca', \"n't\", 'wait', 'to', 'go', 'back']\n" ] } ], "source": [ "# assume every word is a noun\n", "print [word.lemmatize() for word in review.words]" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['My', 'wife', u'take', 'me', 'here', 'on', 'my', 'birthday', 'for', 'breakfast', 'and', 'it', u'be', 'excellent', 'The', 'weather', u'be', 'perfect', 'which', u'make', u'sit', 'outside', u'overlook', 'their', u'ground', 'an', 'absolute', 'pleasure', 'Our', 'waitress', u'be', 'excellent', 'and', 'our', 'food', u'arrive', 'quickly', 'on', 'the', 'semi-busy', 'Saturday', 'morning', 'It', u'look', 'like', 'the', 'place', u'fill', 'up', 'pretty', 'quickly', 'so', 'the', 'earlier', 'you', 'get', 'here', 'the', 'better', 'Do', 'yourself', 'a', 'favor', 'and', 'get', 'their', 'Bloody', 'Mary', 'It', u'be', 'phenomenal', 'and', 'simply', 'the', 'best', 'I', \"'ve\", 'ever', u'have', 'I', \"'m\", 'pretty', 'sure', 'they', 'only', 'use', 'ingredients', 'from', 'their', 'garden', 'and', 'blend', 'them', 'fresh', 'when', 'you', 'order', 'it', 'It', u'be', u'amaze', 'While', 'EVERYTHING', 'on', 'the', 'menu', u'look', 'excellent', 'I', u'have', 'the', 'white', 'truffle', u'scramble', u'egg', 'vegetable', 'skillet', 'and', 'it', u'be', 'tasty', 'and', 'delicious', 'It', u'come', 'with', '2', u'piece', 'of', 'their', u'griddle', 'bread', 'with', u'be', u'amaze', 'and', 'it', 'absolutely', u'make', 'the', 'meal', 'complete', 'It', u'be', 'the', 'best', 'toast', 'I', \"'ve\", 'ever', u'have', 'Anyway', 'I', 'ca', \"n't\", 'wait', 'to', 'go', 'back']\n" ] } ], "source": [ "# assume every word is a verb\n", "print [word.lemmatize(pos='v') for word in review.words]" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# define a function that accepts text and returns a list of lemmas\n", "def split_into_lemmas(text):\n", " text = unicode(text, 
'utf-8').lower()\n", " words = TextBlob(text).words\n", " return [word.lemmatize() for word in words]" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Features: 16452\n", "Accuracy: 0.920743639922\n" ] } ], "source": [ "# use split_into_lemmas as the feature extraction function (WARNING: SLOW!)\n", "vect = CountVectorizer(analyzer=split_into_lemmas)\n", "tokenize_test(vect)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[u'yuyuyummy', u'yuzu', u'z', u'z-grill', u'z11', u'zach', u'zam', u'zanella', u'zankou', u'zappos', u'zatsiki', u'zen', u'zen-like', u'zero', u'zero-star', u'zest', u'zexperience', u'zha', u'zhou', u'zia', u'zilch', u'zin', u'zinburger', u'zinburgergeist', u'zinc', u'zinfandel', u'zing', u'zip', u'zipcar', u'zipper', u'zipps', u'ziti', u'zoe', u'zombi', u'zombie', u'zone', u'zoning', u'zoo', u'zoyo', u'zucca', u'zucchini', u'zuchinni', u'zumba', u'zupa', u'zuzu', u'zwiebel-kr\\xe4uter', u'zzed', u'\\xe9clairs', u'\\xe9cole', u'\\xe9m']\n" ] } ], "source": [ "# last 50 features\n", "print vect.get_feature_names()[-50:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 7: Term Frequency-Inverse Document Frequency (TF-IDF)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- **What:** Computes \"relative frequency\" that a word appears in a document compared to its frequency across all documents\n", "- **Why:** More useful than \"term frequency\" for identifying \"important\" words in each document (high frequency in that document, low frequency in other documents)\n", "- **Notes:** Used for search engine scoring, text summarization, document clustering" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# example documents\n", "simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cabcallmepleasetonightyou
0010011
1111000
2011200
\n", "
" ], "text/plain": [ " cab call me please tonight you\n", "0 0 1 0 0 1 1\n", "1 1 1 1 0 0 0\n", "2 0 1 1 2 0 0" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Term Frequency\n", "vect = CountVectorizer()\n", "tf = pd.DataFrame(vect.fit_transform(simple_train).toarray(), columns=vect.get_feature_names())\n", "tf" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cabcallmepleasetonightyou
0132111
\n", "
" ], "text/plain": [ " cab call me please tonight you\n", "0 1 3 2 1 1 1" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Document Frequency\n", "vect = CountVectorizer(binary=True)\n", "df = vect.fit_transform(simple_train).toarray().sum(axis=0)\n", "pd.DataFrame(df.reshape(1, 6), columns=vect.get_feature_names())" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cabcallmepleasetonightyou
000.3333330.0011
110.3333330.5000
200.3333330.5200
\n", "
" ], "text/plain": [ " cab call me please tonight you\n", "0 0 0.333333 0.0 0 1 1\n", "1 1 0.333333 0.5 0 0 0\n", "2 0 0.333333 0.5 2 0 0" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Term Frequency-Inverse Document Frequency (simple version)\n", "tf/df" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cabcallmepleasetonightyou
00.0000000.3853720.0000000.0000000.6524910.652491
10.7203330.4254410.5478320.0000000.0000000.000000
20.0000000.2660750.3426200.9010080.0000000.000000
\n", "
" ], "text/plain": [ " cab call me please tonight you\n", "0 0.000000 0.385372 0.000000 0.000000 0.652491 0.652491\n", "1 0.720333 0.425441 0.547832 0.000000 0.000000 0.000000\n", "2 0.000000 0.266075 0.342620 0.901008 0.000000 0.000000" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# TfidfVectorizer\n", "vect = TfidfVectorizer()\n", "pd.DataFrame(vect.fit_transform(simple_train).toarray(), columns=vect.get_feature_names())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**More details:** [TF-IDF is about what matters](http://planspace.org/20150524-tfidf_is_about_what_matters/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 8: Using TF-IDF to Summarize a Yelp Review\n", "\n", "Reddit's autotldr uses the [SMMRY](http://smmry.com/about) algorithm, which is based on TF-IDF!" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(10000, 28881)" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create a document-term matrix using TF-IDF\n", "vect = TfidfVectorizer(stop_words='english')\n", "dtm = vect.fit_transform(yelp.text)\n", "features = vect.get_feature_names()\n", "dtm.shape" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def summarize():\n", " \n", " # choose a random review that is at least 300 characters\n", " review_length = 0\n", " while review_length < 300:\n", " review_id = np.random.randint(0, len(yelp))\n", " review_text = unicode(yelp.text[review_id], 'utf-8')\n", " review_length = len(review_text)\n", " \n", " # create a dictionary of words and their TF-IDF scores\n", " word_scores = {}\n", " for word in TextBlob(review_text).words:\n", " word = word.lower()\n", " if word in features:\n", " word_scores[word] = dtm[review_id, features.index(word)]\n", " \n", " # print words with the top 5 TF-IDF scores\n", " print 'TOP SCORING WORDS:'\n", " top_scores = sorted(word_scores.items(), key=lambda x: x[1], reverse=True)[:5]\n", " for word, score in top_scores:\n", " print word\n", " \n", " # print 5 random words\n", " print '\\n' + 'RANDOM WORDS:'\n", " random_words = np.random.choice(word_scores.keys(), size=5, replace=False)\n", " for word in random_words:\n", " print word\n", " \n", " # print the review\n", " print '\\n' + review_text" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TOP SCORING WORDS:\n", "restaurant\n", "machaca\n", "cactus\n", "pickup\n", "rental\n", "\n", "RANDOM WORDS:\n", "did\n", "pickup\n", "hole\n", "burrito\n", "arrived\n", "\n", "The place is a hole in the wall right behind the PHX car rental center. Arrived at dinner time, and had the best Mexican food! I had the #5 Machaca Plate w/rice, beans, & one of their homemade tortilla's burrito style. It was just amazing, and so cheap ($5.65). The only issue I have with this place is that its very well used, and looks very dirty, I did find it a bit sticky. Maybe this is a better place for pickup. 
If you want a clean restaurant, their North restaurant on Cactus Road restaurant is much cleaner.\n" ] } ], "source": [ "summarize()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 9: Sentiment Analysis" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "My wife took me here on my birthday for breakfast and it was excellent. The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure. Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning. It looked like the place fills up pretty quickly so the earlier you get here the better.\n", "\n", "Do yourself a favor and get their Bloody Mary. It was phenomenal and simply the best I've ever had. I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it. It was amazing.\n", "\n", "While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious. It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete. It was the best \"toast\" I've ever had.\n", "\n", "Anyway, I can't wait to go back!\n" ] } ], "source": [ "print review" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.40246913580246907" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# polarity ranges from -1 (most negative) to 1 (most positive)\n", "review.sentiment.polarity" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
business_iddatereview_idstarstexttypeuser_idcoolusefulfunnylength
09yKzy9PApeiPPOUJEtnvkg2011-01-26fWKvX83p0-ka4JS3dc6E5A5My wife took me here on my birthday for breakf...reviewrLtl8ZkDX5vH5nAx9C3q5Q250889
\n", "
" ], "text/plain": [ " business_id date review_id stars \\\n", "0 9yKzy9PApeiPPOUJEtnvkg 2011-01-26 fWKvX83p0-ka4JS3dc6E5A 5 \n", "\n", " text type \\\n", "0 My wife took me here on my birthday for breakf... review \n", "\n", " user_id cool useful funny length \n", "0 rLtl8ZkDX5vH5nAx9C3q5Q 2 5 0 889 " ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# understanding the apply method\n", "yelp['length'] = yelp.text.apply(len)\n", "yelp.head(1)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# define a function that accepts text and returns the polarity\n", "def detect_sentiment(text):\n", " return TextBlob(text.decode('utf-8')).sentiment.polarity" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# create a new DataFrame column for sentiment (WARNING: SLOW!)\n", "yelp['sentiment'] = yelp.text.apply(detect_sentiment)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYIAAAEaCAYAAAAcz1CnAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3X+cXHV97/HXJ4kQMMASpUT5FStqRSmLIpdKbA5abUAu\nhnujFtvK+ui1PlpzlavtFa26yX148Uf7uF011vZa7ab+QDHV+APlhzVnK7SIKBuxgrdRFzEYBJMI\nCSRC8rl/nDM7Z+fM7uzuzM73e2bez8dj4DszZ8585puz53u+P4+5OyIi0r8WhQ5ARETCUkEgItLn\nVBCIiPQ5FQQiIn1OBYGISJ9TQSAi0udUEEjbzOyQmd1uZuNm9m0z+60O7z8xsy+12GZ1p7+3G8xs\nwsyWN3l9X4e/p5L5I92xJHQA0hMedvezAczsJcC7gaTLMVwAPAT823w+bGYG4N2fWDPd93U6jjnn\nj5ktcffHOhyHREg1Aum044DdkJ1czewvzewOM/uumb0if33EzN6Rp3/XzMbybUfN7G/N7Ftm9gMz\ne2njzs1suZltNbPtZvZvZnamma0EXgf8j7xmsqrhMyeY2Y1m9j0z+0jtKtzMVubfsxm4Azhlmnin\n1EjMbJOZXZ6nJ8zsvfn23zSzpxa+c4uZ3Zo/np+//gQzu6EWC2DTZaSZ/Z98u6+Z2RPN7Klm9u3C\n+08rPi+8/gYz+/c8jz5lZqc15o+ZXWxmt5jZd/K8+bX8sxvM7ONmdhOw2cyelcd/e76/01v8+0sV\nubseerT1AB4DbgfuBPYCZ+ev/1fgBrKT3a8BdwMnAkcB3yO7Sr0LeEq+/SjwlTx9OnAPcCRZ7eJL\n+esfBN6Rpy8Abs/Tw8CbpolvE/CWPP27wGFgObASOAScO0O8K4rfX4jh1Xn6x8Bb8/QfFuL8FHB+\nnj4V+H6e/gDw9jx9US2WJjEfBi7L0+8APpinvw6claevAl7f5LM7gcfl6WOb5Q8wUEj/N+Cv8vQG\n4FvAkYV4X5WnlwBLQx9venT+oaYh6YRHvN40dB7wceDZwCrgU56dRX5uZmNkJ90vmdlrgW8Ab3T3\nH+f7ceAaAHffYWY/An6j4bvOB/5Lvs22/Ar7mPy96a6uzwfW5p+53sz2FN67291vLWzXGO/zgAdb\n/P6r8/9/GvjrPP07wDPzFieAY8zs8cALgEvzWL7SEEvRYeAzefoTwOfy9N8DrzGzNwGvyONr9F3g\nU2a2FdhaeL2YP6eY2TVkBd0RwI/y1x34orsfzJ//G/AXZnYy8Dl33zFNvFJhahqSjnL3W4AnmtkJ\nZCeV4snHqLd9/yZwP3BSi10ebvLatM0pM5juM/tbbOdkNZ7i38pRM3xP7fcZ8J/c/ez8cYq77y+8\nNxfFfPsccCFwMXCbuzcrSF4KfAh4DvAtM1vcZJsPAh9w998kazYq/qaHJ3+M+9XAfwYeAb5iZhfM\nMXapABUE0lFm9htkx9UDZFf8rzSzRXnB8ALg1rzN+k3A2cCFZnZu7ePAy/P+gqcCvw78oOErvgH8\nfv5dCXC/uz9E1hF6DM3dTHb1XOvMPn6a7Rrj/W3gVuAnwBlmdoSZDQAvbPjcKwv//9c8fQPwhkK+\nnJUn/wV4Vf7ahTPEsgh4eZ5+VR4b7n4AuB74MPAPjR+yrApyqrunwJVkfTbLKOfPscC9eXqouIuG\n/T3F3X/s7h8EvgCcOU28UmFqGpJOOMrMbs/TBlyeN6983rIhi9vJrmj/3N1/bmY3Am92911m9kfA\nqJk9L9/mJ2Qn32OB17n7r8zMqV8RbwA+Zmbbya7mL89f/xKwxcxeBqx395sL8W0ErjazPyRr6thF\ndmI8trBf3L1pvAB5M8r3yPoEvtPw+4/P4zkAXJa/9gbgQ/nrS4Ax4E8LsVxGVmjcPU2e7gfONbO3\nA/dRL2wg63+4lKywabQY+LiZHUf2b/F+d/9l3tldy5//nufjZ/Omqa8Dp9Wygakjll6R59ujwM+A\n/z1NvFJhlv29ioRnZv9A1tn6uZYbz22/RwCH3P1QfqL/kLs/Z477+Apwtbt/vOH1HwPPdffdnYu4\nZSx/Bhzj7sPd+k7pbaoRSD84FbjGzBYBvwJeO9PGZrYBeKq7/2HtNXe/aJrNF/RKysxGgXvcvTbc\n9vPAUyg3T4nMmwoCiYa7v2aB9ruDrON0Ifb96wux3xm+79Jufp/0B3UWS+WZ2VvM7Kdm9qCZ3WVm\nL8w7nK80sx1m9oCZfcbMjs+3X2lmh83s1WZ2t5ndb2Zvy99bA7y
VrNP4oVrfh5mleX8GZjZkZjfn\nE7725N/xfDN7jZn9xMzuM7NXF+I70sz+Kv+uXWb2YTNbmr+X5LG/Kf/cvWY2lL/3x2Qdxf8zj+UL\nXcxW6SMqCKTSzOwZwOuBc9z9WOAlwARZZ+0lZCN/ngTsIRtSWXQ+8HTgRcA7zewZ7n4d2UStT7v7\nMbX5EZQ7Uc8l61ReTjaP4BqyWsdTgT8ANpnZ0fm27yGbIHdW/v+TgHcW9nUiWcf1k4E/IutkPs7d\n/y/wSeC9eSwvm1cmibSggkCq7hDZ7ONnmdnj3P0n7v4jsrHxb3f3e939UbLROuvyfoKaje5+0N2/\nS3ZSrw3xNFqP9f+xu2/OR0ddQ3YS/1/u/qi730jWF3F6PpzztWSzeve6+z6ytZh+r7CvR/PPHnL3\nrwL7gGcU3p/PvAmRWVMfgVRaPgP5CrLhkM8ys+uBN5MtH/F5MytOSHuM7Oq7Zlch/TDZePvZuq+Q\nfiSP5f6G15YBJwBHA98uzDI2pl6E/cLdi3HONRaRtqhGIJXn7le7+wvIxsI78F6y+Qhr3P34wuNo\nd//ZbHbZwfAeICsUzijEMZA3Y82GxnfLglNBIJVmZk/PO4ePBA6STep6DPhb4CozOzXf7gQzu2SW\nu90FrLTCJfx85Vf6HwFG8tnKmNlJ+Qzn2biPbIa1yIJRQSBVdyRZm/v9ZDNfn0g26uf9wBeBG8zs\nQbIZxecWPjfTlfZn8///wsxua/J+Y8dxq/29BdgB3GJmvwRuJOukns1nP0q2vMUeM+voRDuRmrZn\nFpvZx8gWufq5uzddh8TMPkC2UNbDwJC7395sOxER6b5O1Aj+AVgz3ZtmdhFwurs/DfhjssWyREQk\nEm0XBO7+DbIx2tO5BNicb/tNYMDMTpxhexER6aJu9BGcRHanqZqfAid34XtFRGQWutVZ3OxmHyIi\nEoFuTCjbCZxSeH5y/toU+ZrzIiKyQNy96ZDobtQIvgi8GibvZ7vX3e9rtmGoGzcXH6tXDwePIbbH\n8LDypNlDx0r5AcqTWPNkJm3XCMzsamA12X1q7wGGgcflJ/a/8+wG3ReZ2Q6yuy4tyFLDIiIyP20X\nBO5+2Sy2Wd/u93TL0qUToUOIzsTEROgQoqRjpZmJ0AFEIU2zR2aCDRuyVJJkj9hoZnGDNWsGQ4cQ\nncFB5UkzOlbK1q5VnpTFnyfR3LPYzDyWWERkfpKkeCUsAAMDsHdv6CjAzPCAncUi0idiOOHFZlEF\nzrIVCLG7Ul3OlChPmlO+ZEZG6m3f27enk+mRkbBxhVTMkz174s8T3ZhGRNpyxRXZA2DJEjUNwdQ8\nedzj4s8T9RGISFuKI2Q2boTh4Swd6wiZbogxT2bqI1BBICIdYwb6M54qljxRZ/EcqN23THnSnPIl\ns2oVLF2aPSCdTK9aFTqycKqWJ+ojEJG2rFuX9Q0AjI3Beedl6bVrw8UU2k031dNmcOBAuFhmQzWC\nBkm/NmrOQHnSnPKlmSR0AFFYvx5WrswekEym10e6xoJqBCLSlsHB+vyBsbF6Z2g/T0hftw6e+MQs\nvXEjDA1l6VivHVQjaKB23zLlSXPKl8z4eHGUTDqZHh8PGVVM0tABtKQagYi0RTWCsve/H7Ztqz+v\nTSTbvj3OWoGGj4pIx8QyVDK0kRHYujVLj43B6tVZeu3a+kSzbtM8AmlLmsZ5FRPaqlVTR4f0q/Xr\n4ctfztJ33w2nnZalL74YNm0KF1csYikcNY9gDtTuWzY6moYOIUq33JKGDiEKO3dmTUNZ81A6md5Z\nuiFt/7j00mzV0YEBgHQyfemloSNrTn0EIvMUw1VeDN74RjjrrCy9cWO96UO1yOpQ05A0FeNaKTFY\ntQpuuy1LHzwIRx6Zpc85p3+biY46qvmEqaVL4ZFHuh9PDGLMk5mahlQjaKD28EzjCb92q71+9653\nTS0gr7wyS/fzMfPud0/fMdqvvvrV6S+kYqQaQYOhoZTR0SR0GFFRnjS3eHHKoUNJ6DCiYpbinoQO\nI7ipo4ZSVq9OgHhHDalG0GDXrtARxKefx4PP5AlPCB1BHIonPahf9YY86YW2YwdMTNSf19I7doSI\npjXVCFB7uMzPyEj/nuiK+nn4qFnTC2xgO3BGnl4CPJanvw+cVdq6G+c+1QhaKJ7w01Tt4TI7uj9v\npmrr6nTSbE7g2TyC2qn2N4E4Lr6LVBAwtUYwNpayYUMCqEZQk6apVtpsIptHkASOIrwtW+o1Aqj3\nJz3wgP5+MimxHycqCJh6wr/lFtUIZHrFi4brr68fK/180bBpU70JyGxq27jUlqKOm/oIGmzYoIKg\nkdrCm3v842H//tBRhBfjujpSpj6COejXq7qZbN2qP+iaYo3g4YdVIwCtPtoLVCNooPbwssHBlPHx\nJHQY0dGY+czUUUMpp52WAP0xamg2YpmHo0XnZM5GRupXubU11JOkvq56vzrzzOz+vLV79NbSZ54Z\nNq6QNm/Oho3efXf2vJbevDlsXLGoQj6oRiAtJUm9OUTqFi2Cw4dDRxHemWfCnXdm6UOHYPHiLP3M\nZ8Idd4SLKxZVWIZafQQi0pbjjqvXkA4dqqePOy5cTDI3ahpqMDKShg4hOs9+dho6hCgdfXQaOoQo\nDA7CihXZA9LJtDqLa9LQAbSkGkED3XC7bN260BHEozhqaP9+jRoCuPZa+MlP6s9r6WuvVWdxVaiP\noMHQEIyOho4iLlqau7nly2H37tBRdE+V1tWJSSxzk9RH0ELxKm/z5vpMwH6+yitSQVBXPFb27Omv\nGkGvrKvTbTEUAq2oIGDqH/HWrfW1hiQzMZES+1opYaQoXxqlKE+mqsLcJBUETL3K2769v67ypqNa\nUnPj41OH0tbSAwP9nS9SbeojaKA+grIlS+Cxx1pv1280j6AsljHzUqY+gjmowkqBEk6xpuSu2qP0\nBs0jaDAwkIYOIQqrVsHSpdnj0KF0Mr1qVejIYpKGDiBCaegAojM0lIYOoSXVCKSpm26qp484Ag4c\nCBdLTKbehKXejKibsGQuvzx0BPHZvDn+5mYVBA327k1ChxCdRYuS0CFEY+ptGZO+ui3jbMSwymZ8\nktABtKSmIWnpnHNCRyAiC0k1AqZ2AG7cmFIrwdUBmFm3LqUKVzXdMHX4aEqaJoCGj9ZUYcx896XE\n/vejgoCpJ/yJiWrMBOym667THcpqxsamrkdVSx9/vPJIqqvtgsDM1gAjwGLg7939vQ3vJ8AXgB/l\nL/2Tu7+r3e9dKCtXJqFDiM6uXUnoEKLxxjfCWfnyORs3JpMnf10EZ1QbKBseTkKH0FJbE8rMbDHw\nA+B3gJ3At4DL3P3OwjYJ8CZ3v6TFvqKYUKZ1dTJTm8tgeDhL93tz2apVcNttWfrgQTjyyCx9zjlT\nR1r1q1gWWJOyhbxV5bnADnefcPdHgU8DL2sWQ5vf0zVbtqShQ4hQGjqArjOzpo+bbzYOHswe8PXJ\n9M03N9++32R9bFKUVu
D2fu0WBCcB9xSe/zR/rciB55vZdjP7ipmdQcSuuy50BBIDd2/5gEWz2EYk\nfu32EczmSP8OcIq7P2xmFwJbgae3+b0L5rHHktAhRGHq6JhEi6s1lYQOIEJJ6ACiU4V+k3YLgp3A\nKYXnp5DVCia5+0OF9FfN7G/MbLm7l27pMTQ0xMp8sZ+BgQEGBwcnM7FWvVqI5yMjMDqaPb/77oQk\ngb17U1atgk2bFv77Y3w+NpZy112wdGn2/K67svd37IgjvhieZ7No44knhufKj3iej4+Ps3fvXgAm\nJiaYSbudxUvIOotfBNwL3Eq5s/hE4Ofu7mZ2LnCNu69ssq9gncWN8whqvfz93DGqPGkt1Zj5ErMU\n9yR0GFEZGkqjmHE9U2dx28tQ5809teGjH3X3d5vZ6wDc/e/M7PXAn5Ddv+5hshFEtzTZTxSjhlas\nSDVcssERR6T86ldJ6DCio4KgLJaTXkxiKRwXtCDolFgKgnPPhVtvDR1FXJQnIvMXyz0aFnL4aM85\nI+oxTWG8732hIxCRhaSCoMGuXWnoELpuujHztccFF8z8fr+Oma93kEqN8qSZNHQALWmtIaZ2jF5/\nff/ddapVk1wsbZyxGR3tj+NDep/6CBokydSbk0s8bZyxUb70tuXLYc+e0FFkjj8edpcG3M+N7lnc\nQrFGMDbWfzUCkU7ppbWG9uyJp6Bf6JZX1QgaLF+esnt3EjqMqKhpqDnlS1kv5UmnanydGGbciVg0\namgOjjgidATx0X1oRXqbagTAyAhs3Zqlx8Zg9eosvXatbjYi01MfQVkv5UlMv2WhawQqCBqos1hm\nq5fawzslppNnu2L6LWoa6rBWY+HHxkY0Zr6BxoY3lyRp6BAilIYOIDpV+Pvpu4Kg1frxz3ve4CzX\nohfpHcuXZ1ed7T6g/X0sXx42L/qRmoZKccRTHRTplpiO+1hiiSUOUNOQREDt4CK9TQVBSRo6gOjo\nPrTNVaHtt9uUJ2VVyBMVBCLzNDoaOgKRzlAfQSmOeNoFY6E8aa6X8iWm3xJLLLHEAQvfR6C1hhoM\nD4eOQERi4BhEMlLcC/9dCGoaaqCx4c2koQOIVBo6gOhUoT18tgzPLsPbfKTbtrW9D1vAQgBUEMgs\naK0hkd6mPgKReYqpDbldMf2WWGKJJQ7QPAKRjtMsWpGpVBA06KU2zk7ptTyp3XCk3ce2bWnb+4jl\nDlid0mvHSidUIU80aqiB7kMr/aifRshImfoISnHE0y4oCyOmf+NYYoklDognlljiAPURSAS01pBI\nb1NBUJKGDiA6WmuouSq0/Xab8qSsCnmigkBEpM+pj6AURzztgrHotTyJ6ffEEksscUA8scR0I8Lj\nj4fdu9vbh9YamgOtNSQi0LnCKJaCbSZqGmqgtYaaSUMHEKUqtP12m/KkmTR0AC2pRtDjli/vzKSl\nTlSTO1G97QSNmReZSn0EPS6mamksscQSB8QTS6+1h8cknn9j9RGIyAz6qT1cytRH0EBtnGXKk+aU\nL82koQOIzuWXp6FDaEkFQQPdh1ZEOmloKHQEramPoBRHb1VtY/o9scQSSxwQVyyd0Gu/p5dorSER\nEZmWCoKSNHQA0VFbeHPKl7IqtId3WxWOExUEItIxVWgPlzL1EZTi6K02zph+TyyxaMy8dNOGDXEs\n5T5TH0FPFQSdmkXbCdH8gcd01oM4SoIOiaVgk7jFcpz0TWdxJ+5F24n70MZ0L1qj/R+TbtvWkZv8\nWs8tpZCGDiA6VWgP7740dAAt9VRBICIic9dTTUOxVMEgnlhiiQPiiqUTeu33dEIs7eExieU46Zs+\nglgyHOKJJZY4IK5YOqHXfk8nKE/KYsmTBe0jMLM1ZnaXmf2Hmb1lmm0+kL+/3czObvc7F5LaOMuU\nJ81pzHwzaegAolOF46StgsDMFgObgDXAGcBlZvbMhm0uAk5396cBfwx8uJ3vFImFxszLbFThOGmr\nacjMfgsYdvc1+fMrAdz9PYVt/hbY5u6fyZ/fBax29/sa9qWmoR6OA+KKRRaG/o3jtZBNQycB9xSe\n/zR/rdU2J7f5vSIi0iHtFgSzLfsbS6EFuWbIbkHY3iNt8/O1h8dyL0Q68XPSTmQJxx8fOic6S30n\nZVVoD++2Khwn7d6hbCdwSuH5KWRX/DNtc3L+WsnQ0BArV64EYGBggMHBQZIkAeqZOdPzC9iG++y3\nb/Y8f3Hen689N0vZlqbz/nynnrebH0mSYJZNtOtEPNDd37+Qz8fHx6OKJ4bntfbwWOLp5+fj4+Ps\n3bsXgImJCWbSbh/BEuAHwIuAe4Fbgcvc/c7CNhcB6939IjM7Dxhx9/Oa7Et9BJHqpd/SSRozL7MR\ny3GyoPMIzOxCYARYDHzU3d9tZq8DcPe/y7epjSzaD7zG3b/TZD8qCCLVS7+lk5QvMhuxHCeaUDYH\naaE5J3QssTBLJ5uYpE75Utapv59eEstx0jeLzomIyNypIGigq5my4eEkdAiRSkIHEJ00TUKHEKEk\ndAAtqWlogcQUiywM/RuXKU/KYskTNQ3NQX2Yo9QoT5rTmPlm0tABRKcKx4kKApF5qsIaMhJeFY4T\nNQ0tkJhiEekWHffxUtOQiEggVWhZ7bmCQOvqdN7QUBo6hCip76Rs2bI0dAjRGR1NQ4fQUk8VBJ24\n6Xyn9rN7d9i86KTNm0NHIDEws5aPffsOttyml8wmTzZvvjX6POmpPoLOxKE2zkbKk+ZiWUMmtJER\n2Lo1S4+NwerVWXrtWrjiinBxhZSm9SahjRtheDhLJ0n2CKFvlpjoTBw66TVSnjSnfMmoIJjZwADk\ni4AGpc7iOUlDBxChNHQAXTebKj+si77K331p6ACiMDJSv/r/5S/TyfTISNi4ptPu/QhEetJsaqcn\nnZSyc+eWLkQTtyuuqF/5m1VjlMxCGxys1wLGxurNQYODwUKakZqGGqjdt0x50tyyZbBvX+gowoux\nPTwmS5fCgQOho1AfgUjHqD28bP16+PKXs/Tdd8Npp2Xpiy+GTZvCxRWLY46Bhx4KHYX6COZEY8PL\nlCfTSUMHEIV167JlFLKlFNLJ9Lp1IaMKq9hHsG+f+ghEekrV2n4ljGK/yRFHxN9vooKgge5HUKY8\nmU4SOoAojI8XT3TJZHpgoH/7CIr9Jo8+mkz2scXab6KCQETaUrz6XbYs/qvfbphaOBJ94aiCoMHQ\nUMroaBI6jKgoT6aTolrB1Kvf/ftTNmxIgHivfrthauGYRn/nNhUEDTZvhtHR0FHERXkiMyme8D/2\nMQ01blSFeYUqCEqS0AFEKAkdQKSS0AFEZ9GiJHQIUSjWkvbti7+PQPMISnFo/ZhGypPmBgeztmCp\nO/102LEjdBRxSZI4+k1mmkegGkFJSr9d6bVeE+dvMPvTlvuJoSDvphUrUvrtWGmmOMnuhz9MJ0eZ\n9fMku2KNYGws/n4TFQTS8gS+bFnKvn39dZKfjTVrQkcQh2LH6FFHxXH1G1rxhH/
[remainder of base64-encoded PNG omitted: box plot of sentiment grouped by stars]\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# box plot of sentiment grouped by stars\n", "yelp.boxplot(column='sentiment', by='stars')" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "254 Our server Gary was awesome. Food was amazing....\n", "347 3 syllables for this place. \\nA-MAZ-ING!\\n\\nTh...\n", "420 LOVE the food!!!!\n", "459 Love it!!! Wish we still lived in Arizona as C...\n", "679 Excellent burger\n", "Name: text, dtype: object" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# reviews with most positive sentiment\n", "yelp[yelp.sentiment == 1].text.head()" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "773 This was absolutely horrible. I got the suprem...\n", "1517 Nasty workers and over priced trash\n", "3266 Absolutely awful... these guys have NO idea wh...\n", "4766 Very bad food!\n", "5812 I wouldn't send my worst enemy to this place.\n", "Name: text, dtype: object" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# reviews with most negative sentiment\n", "yelp[yelp.sentiment == -1].text.head()" ] },
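{ "cell_type": "markdown", "metadata": {}, "source": [ "The box plot suggests that sentiment rises with the star rating. As a quick numeric check before inspecting individual reviews, we can aggregate the polarity scores directly. (This cell is an added sketch and was not run; it only assumes the `yelp` DataFrame and its `sentiment` column are still in memory.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# sketch: mean and count of sentiment for each star rating\n", "yelp.groupby('stars').sentiment.agg(['mean', 'count'])" ] },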
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
business_iddatereview_idstarstexttypeuser_idcoolusefulfunnylengthsentiment
390106JT5p8e8Chtd0CZpcARw2009-08-06KowGVoP_gygzdSu6Mt3zKQ5RIP AZ Coffee Connection. :( I stopped by two days ago unaware that they had closed. I am severely bummed. This place is irreplaceable! Damn you, Starbucks and McDonalds!reviewjKeaOrPyJ-dI9SNeVqrbww100175-0.302083
\n", "
" ], "text/plain": [ " business_id date review_id stars \\\n", "390 106JT5p8e8Chtd0CZpcARw 2009-08-06 KowGVoP_gygzdSu6Mt3zKQ 5 \n", "\n", " text \\\n", "390 RIP AZ Coffee Connection. :( I stopped by two days ago unaware that they had closed. I am severely bummed. This place is irreplaceable! Damn you, Starbucks and McDonalds! \n", "\n", " type user_id cool useful funny length sentiment \n", "390 review jKeaOrPyJ-dI9SNeVqrbww 1 0 0 175 -0.302083 " ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# negative sentiment in a 5-star review\n", "yelp[(yelp.stars == 5) & (yelp.sentiment < -0.3)].head(1)" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
business_iddatereview_idstarstexttypeuser_idcoolusefulfunnylengthsentiment
178153YGfwmbW73JhFiemNeyzQ2012-06-22Gi-4O3EhE175vujbFGDIew1If you like the stuck up Scottsdale vibe this is a good place for you. The food isn't impressive. Nice outdoor seating.reviewHqgx3IdJAAaoQjvrUnbNvw0121190.766667
\n", "
" ], "text/plain": [ " business_id date review_id stars \\\n", "1781 53YGfwmbW73JhFiemNeyzQ 2012-06-22 Gi-4O3EhE175vujbFGDIew 1 \n", "\n", " text \\\n", "1781 If you like the stuck up Scottsdale vibe this is a good place for you. The food isn't impressive. Nice outdoor seating. \n", "\n", " type user_id cool useful funny length sentiment \n", "1781 review Hqgx3IdJAAaoQjvrUnbNvw 0 1 2 119 0.766667 " ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# positive sentiment in a 1-star review\n", "yelp[(yelp.stars == 1) & (yelp.sentiment > 0.5)].head(1)" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# reset the column display width\n", "pd.reset_option('max_colwidth')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bonus: Adding Features to a Document-Term Matrix" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# create a DataFrame that only contains the 5-star and 1-star reviews\n", "yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]\n", "\n", "# define X and y\n", "feature_cols = ['text', 'sentiment', 'cool', 'useful', 'funny']\n", "X = yelp_best_worst[feature_cols]\n", "y = yelp_best_worst.stars\n", "\n", "# split into training and testing sets\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(3064, 16825)\n", "(1022, 16825)\n" ] } ], "source": [ "# use CountVectorizer with text column only\n", "vect = CountVectorizer()\n", "X_train_dtm = vect.fit_transform(X_train.text)\n", "X_test_dtm = vect.transform(X_test.text)\n", "print X_train_dtm.shape\n", "print X_test_dtm.shape" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(3064, 4)" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# shape of other four feature columns\n", "X_train.drop('text', axis=1).shape" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(3064, 4)" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# cast other feature columns to float and convert to a sparse matrix\n", "extra = sp.sparse.csr_matrix(X_train.drop('text', axis=1).astype(float))\n", "extra.shape" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(3064, 16829)" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# combine sparse matrices\n", "X_train_dtm_extra = sp.sparse.hstack((X_train_dtm, extra))\n", "X_train_dtm_extra.shape" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(1022, 16829)" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# repeat for testing set\n", "extra = sp.sparse.csr_matrix(X_test.drop('text', axis=1).astype(float))\n", "X_test_dtm_extra = sp.sparse.hstack((X_test_dtm, extra))\n", "X_test_dtm_extra.shape" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"0.917808219178\n" ] } ], "source": [ "# use logistic regression with text column only\n", "logreg = LogisticRegression(C=1e9)\n", "logreg.fit(X_train_dtm, y_train)\n", "y_pred_class = logreg.predict(X_test_dtm)\n", "print metrics.accuracy_score(y_test, y_pred_class)" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.922700587084\n" ] } ], "source": [ "# use logistic regression with all features\n", "logreg = LogisticRegression(C=1e9)\n", "logreg.fit(X_train_dtm_extra, y_train)\n", "y_pred_class = logreg.predict(X_test_dtm_extra)\n", "print metrics.accuracy_score(y_test, y_pred_class)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bonus: Fun TextBlob Features" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "TextBlob(\"15 minutes late\")" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# spelling correction\n", "TextBlob('15 minuets late').correct()" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[('part', 0.9929478138222849), (u'parrot', 0.007052186177715092)]" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# spellcheck\n", "Word('parot').spellcheck()" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[u'tip laterally',\n", " u'enclose with a bank',\n", " u'do business with a bank or keep an account at a bank',\n", " u'act as the banker in a game or in gambling',\n", " u'be in the banking business',\n", " u'put into a bank account',\n", " u'cover with ashes so to control the rate of burning',\n", " u'have confidence or faith in']" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# definitions\n", "Word('bank').define('v')" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "u'es'" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# language identification\n", "TextBlob('Hola amigos').detect_language()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "- NLP is a gigantic field\n", "- Understanding the basics broadens the types of data you can work with\n", "- Simple techniques go a long way\n", "- Use scikit-learn for NLP whenever possible" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }