{ "metadata": { "name": "", "signature": "sha256:28540433bd11eedc4e697845d59b2a5620c46bbc127a315683c869d5c0d2d7bd" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Natural Language Processing (NLP)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1: Introduction\n", "\n", "*Adapted from [NLP Crash Course](http://files.meetup.com/7616132/DC-NLP-2013-09%20Charlie%20Greenbacker.pdf) by Charlie Greenbacker and [Introduction to NLP](http://spark-public.s3.amazonaws.com/nlp/slides/intro.pdf) by Dan Jurafsky*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What is NLP?\n", "\n", "- Using computers to process (analyze, understand, generate) natural human languages" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Why is NLP useful?\n", "\n", "- Most knowledge created by humans is unstructured text\n", "- Need some way to make sense of it\n", "- Enables quantitative analysis of text data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What are some of the higher level task areas?\n", "\n", "- **Speech recognition and generation**: Apple Siri\n", " - Speech to text\n", " - Text to speech\n", "- **Question answering**: IBM Watson\n", " - Match query with knowledge base\n", " - Reasoning about intent of question\n", "- **Machine translation**: Google Translate\n", " - One language to another to another\n", "- **Information retrieval**: Google\n", " - Finding relevant results\n", " - Finding similar results\n", "- **Information extraction**: Gmail\n", " - Structured information from unstructured documents\n", "- **Assistive technologies**: Google autocompletion\n", " - Predictive text input\n", " - Text simplification\n", "- **Natural Language Generation**: computer-generated articles\n", " - Generating text from data\n", "- **Automatic summarization**: Google News\n", " - Extractive summarization\n", " - Abstractive summarization\n", "- **Sentiment analysis**: Twitter analysis\n", " - Attitude of speaker" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What are some of the lower level components?\n", "\n", "- **Tokenization**: breaking text into tokens (words, sentences, n-grams)\n", "- **Stopword removal**: a/an/the\n", "- **Stemming and lemmatization**: root word\n", "- **TF-IDF**: word importance\n", "- **Part-of-speech tagging**: noun/verb/adjective\n", "- **Named entity recognition**: person/organization/location\n", "- **Spelling correction**: \"New Yrok City\"\n", "- **Word sense disambiguation**: \"buy a mouse\"\n", "- **Segmentation**: \"New York City subway\"\n", "- **Language detection**: \"translate this page\"\n", "- **Machine learning**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Why is NLP hard?\n", "\n", "- **Ambiguity**:\n", " - Teacher Strikes Idle Kids\n", " - Red Tape Holds Up New Bridges\n", " - Hospitals are Sued by 7 Foot Doctors\n", " - Juvenile Court to Try Shooting Defendant\n", " - Local High School Dropouts Cut in Half\n", "- **Non-standard English**: tweets/text messages\n", "- **Idioms**: \"throw in the towel\"\n", "- **Newly coined words**: \"retweet\"\n", "- **Tricky entity names**: \"Where is A Bug's Life playing?\"\n", "- **World knowledge**: \"Mary and Sue are sisters\", \"Mary and Sue are mothers\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How does NLP work?\n", "\n", "- Build probabilistic model using data about a language\n", "- Requires an understanding 
of the language\n", "- Requires an understanding of the world (or a particular domain)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 2: Reading in the Yelp Reviews" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- \"corpus\" = collection of documents\n", "- \"corpora\" = plural form of corpus" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "import numpy as np\n", "import scipy as sp\n", "from sklearn.cross_validation import train_test_split\n", "from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer\n", "from sklearn.naive_bayes import MultinomialNB\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn import metrics\n", "from textblob import TextBlob, Word\n", "from nltk.stem.snowball import SnowballStemmer\n", "%matplotlib inline" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "# read yelp.csv into a DataFrame\n", "url = 'https://raw.githubusercontent.com/justmarkham/DAT7/master/data/yelp.csv'\n", "yelp = pd.read_csv(url)\n", "\n", "# create a new DataFrame that only contains the 5-star and 1-star reviews\n", "yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]\n", "\n", "# split the new DataFrame into training and testing sets\n", "X_train, X_test, y_train, y_test = train_test_split(yelp_best_worst.text, yelp_best_worst.stars, random_state=1)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 3: Tokenization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- **What:** Separate text into units such as sentences or words\n", "- **Why:** Gives structure to previously unstructured text\n", "- **Notes:** Relatively easy with English language text, not easy with some languages" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# use CountVectorizer to create document-term matrices from X_train and X_test\n", "vect = CountVectorizer()\n", "train_dtm = vect.fit_transform(X_train)\n", "test_dtm = vect.transform(X_test)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "# rows are documents, columns are terms (aka \"tokens\" or \"features\")\n", "train_dtm.shape" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 4, "text": [ "(3064, 16825)" ] } ], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "# last 50 features\n", "print vect.get_feature_names()[-50:]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[u'yyyyy', u'z11', u'za', u'zabba', u'zach', u'zam', u'zanella', u'zankou', u'zappos', u'zatsiki', u'zen', u'zero', u'zest', u'zexperience', u'zha', u'zhou', u'zia', u'zihuatenejo', u'zilch', u'zin', u'zinburger', u'zinburgergeist', u'zinc', u'zinfandel', u'zing', u'zip', u'zipcar', u'zipper', u'zippers', u'zipps', u'ziti', u'zoe', u'zombi', u'zombies', u'zone', u'zones', u'zoning', u'zoo', u'zoyo', u'zucca', u'zucchini', u'zuchinni', u'zumba', u'zupa', u'zuzu', u'zwiebel', u'zzed', u'\\xe9clairs', u'\\xe9cole', u'\\xe9m']\n" ] } ], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "# show vectorizer options\n", "vect" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", 
"prompt_number": 6, "text": [ "CountVectorizer(analyzer=u'word', binary=False, charset=None,\n", " charset_error=None, decode_error=u'strict',\n", " dtype=, encoding=u'utf-8', input=u'content',\n", " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n", " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n", " strip_accents=None, token_pattern=u'(?u)\\\\b\\\\w\\\\w+\\\\b',\n", " tokenizer=None, vocabulary=None)" ] } ], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "[CountVectorizer documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- **lowercase:** boolean, True by default\n", "- Convert all characters to lowercase before tokenizing." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# don't convert to lowercase\n", "vect = CountVectorizer(lowercase=False)\n", "train_dtm = vect.fit_transform(X_train)\n", "train_dtm.shape" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 7, "text": [ "(3064, 20838)" ] } ], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "- **token_pattern:** string\n", "- Regular expression denoting what constitutes a \"token\". The default regexp select tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# allow tokens of one character\n", "vect = CountVectorizer(token_pattern=r'(?u)\\b\\w+\\b')\n", "train_dtm = vect.fit_transform(X_train)\n", "train_dtm.shape" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 8, "text": [ "(3064, 16861)" ] } ], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "- **ngram_range:** tuple (min_n, max_n)\n", "- The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "# include 1-grams and 2-grams\n", "vect = CountVectorizer(ngram_range=(1, 2))\n", "train_dtm = vect.fit_transform(X_train)\n", "train_dtm.shape" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 9, "text": [ "(3064, 169847)" ] } ], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "# last 50 features\n", "print vect.get_feature_names()[-50:]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[u'zone out', u'zone when', u'zones', u'zones dolls', u'zoning', u'zoning issues', u'zoo', u'zoo and', u'zoo is', u'zoo not', u'zoo the', u'zoo ve', u'zoyo', u'zoyo for', u'zucca', u'zucca appetizer', u'zucchini', u'zucchini and', u'zucchini bread', u'zucchini broccoli', u'zucchini carrots', u'zucchini fries', u'zucchini pieces', u'zucchini strips', u'zucchini veal', u'zucchini very', u'zucchini with', u'zuchinni', u'zuchinni again', u'zuchinni the', u'zumba', u'zumba class', u'zumba or', u'zumba yogalates', u'zupa', u'zupa flavors', u'zuzu', u'zuzu in', u'zuzu is', u'zuzu the', u'zwiebel', u'zwiebel kr\\xe4uter', u'zzed', u'zzed in', u'\\xe9clairs', u'\\xe9clairs napoleons', u'\\xe9cole', u'\\xe9cole len\\xf4tre', u'\\xe9m', u'\\xe9m all']\n" ] } ], "prompt_number": 10 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Predicting the star rating:**" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# use default options for CountVectorizer\n", "vect = CountVectorizer()\n", "\n", "# create document-term matrices\n", "train_dtm = vect.fit_transform(X_train)\n", "test_dtm = vect.transform(X_test)\n", "\n", "# use Naive Bayes to predict the star rating\n", "nb = MultinomialNB()\n", "nb.fit(train_dtm, y_train)\n", "y_pred_class = nb.predict(test_dtm)\n", "\n", "# calculate accuracy\n", "print metrics.accuracy_score(y_test, y_pred_class)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0.918786692759\n" ] } ], "prompt_number": 11 }, { "cell_type": "code", "collapsed": false, "input": [ "# calculate null accuracy\n", "y_test_binary = np.where(y_test==5, 1, 0)\n", "y_test_binary.mean()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 12, "text": [ "0.81996086105675148" ] } ], "prompt_number": 12 }, { "cell_type": "code", "collapsed": false, "input": [ "# define a function that accepts a vectorizer and returns the accuracy\n", "def tokenize_test(vect):\n", " train_dtm = vect.fit_transform(X_train)\n", " print 'Features: ', train_dtm.shape[1]\n", " test_dtm = vect.transform(X_test)\n", " nb = MultinomialNB()\n", " nb.fit(train_dtm, y_train)\n", " y_pred_class = nb.predict(test_dtm)\n", " print 'Accuracy: ', metrics.accuracy_score(y_test, y_pred_class)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 13 }, { "cell_type": "code", "collapsed": false, "input": [ "# include 1-grams and 2-grams\n", "vect = CountVectorizer(ngram_range=(1, 2))\n", "tokenize_test(vect)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Features: 169847\n", "Accuracy: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 0.854207436399\n" ] } ], "prompt_number": 14 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 4: Stopword Removal" ] }, { "cell_type": "markdown", 
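"metadata": {}, "source": [ "*Added illustration (a minimal sketch, using two made-up documents): compare the vocabulary CountVectorizer builds with and without the built-in English stop word list.*" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# minimal sketch: two made-up documents\n", "docs = ['the food was good', 'we liked the food']\n", "\n", "# vocabulary with every token kept\n", "print CountVectorizer().fit(docs).get_feature_names()\n", "\n", "# vocabulary after the built-in English stop words are removed\n", "print CountVectorizer(stop_words='english').fit(docs).get_feature_names()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown",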
"metadata": {}, "source": [ "- **What:** Remove common words that will likely appear in any text\n", "- **Why:** They don't tell you much about your text" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# show vectorizer options\n", "vect" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 15, "text": [ "CountVectorizer(analyzer=u'word', binary=False, charset=None,\n", " charset_error=None, decode_error=u'strict',\n", " dtype=, encoding=u'utf-8', input=u'content',\n", " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n", " ngram_range=(1, 2), preprocessor=None, stop_words=None,\n", " strip_accents=None, token_pattern=u'(?u)\\\\b\\\\w\\\\w+\\\\b',\n", " tokenizer=None, vocabulary=None)" ] } ], "prompt_number": 15 }, { "cell_type": "markdown", "metadata": {}, "source": [ "- **stop_words:** string {'english'}, list, or None (default)\n", "- If 'english', a built-in stop word list for English is used.\n", "- If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.\n", "- If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# remove English stop words\n", "vect = CountVectorizer(stop_words='english')\n", "tokenize_test(vect)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Features: 16528\n", "Accuracy: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 0.915851272016\n" ] } ], "prompt_number": 16 }, { "cell_type": "code", "collapsed": false, "input": [ "# set of stop words\n", "print vect.get_stop_words()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "frozenset(['all', 'six', 'less', 'being', 'indeed', 'over', 'move', 'anyway', 'four', 'not', 'own', 'through', 'yourselves', 'fify', 'where', 'mill', 'only', 'find', 'before', 'one', 'whose', 'system', 'how', 'somewhere', 'with', 'thick', 'show', 'had', 'enough', 'should', 'to', 'must', 'whom', 'seeming', 'under', 'ours', 'has', 'might', 'thereafter', 'latterly', 'do', 'them', 'his', 'around', 'than', 'get', 'very', 'de', 'none', 'cannot', 'every', 'whether', 'they', 'front', 'during', 'thus', 'now', 'him', 'nor', 'name', 'several', 'hereafter', 'always', 'who', 'cry', 'whither', 'this', 'someone', 'either', 'each', 'become', 'thereupon', 'sometime', 'side', 'two', 'therein', 'twelve', 'because', 'often', 'ten', 'our', 'eg', 'some', 'back', 'up', 'go', 'namely', 'towards', 'are', 'further', 'beyond', 'ourselves', 'yet', 'out', 'even', 'will', 'what', 'still', 'for', 'bottom', 'mine', 'since', 'please', 'forty', 'per', 'its', 'everything', 'behind', 'un', 'above', 'between', 'it', 'neither', 'seemed', 'ever', 'across', 'she', 'somehow', 'be', 'we', 'full', 'never', 'sixty', 'however', 'here', 'otherwise', 'were', 'whereupon', 'nowhere', 'although', 'found', 'alone', 're', 'along', 'fifteen', 'by', 'both', 'about', 'last', 'would', 'anything', 'via', 'many', 'could', 'thence', 'put', 'against', 'keep', 'etc', 'amount', 'became', 'ltd', 'hence', 'onto', 'or', 'con', 'among', 'already', 'co', 'afterwards', 'formerly', 'within', 'seems', 'into', 'others', 'while', 'whatever', 'except', 'down', 'hers', 'everyone', 'done', 'least', 'another', 'whoever', 'moreover', 'couldnt', 
'throughout', 'anyhow', 'yourself', 'three', 'from', 'her', 'few', 'together', 'top', 'there', 'due', 'been', 'next', 'anyone', 'eleven', 'much', 'call', 'therefore', 'interest', 'then', 'thru', 'themselves', 'hundred', 'was', 'sincere', 'empty', 'more', 'himself', 'elsewhere', 'mostly', 'on', 'fire', 'am', 'becoming', 'hereby', 'amongst', 'else', 'part', 'everywhere', 'too', 'herself', 'former', 'those', 'he', 'me', 'myself', 'made', 'twenty', 'these', 'bill', 'cant', 'us', 'until', 'besides', 'nevertheless', 'below', 'anywhere', 'nine', 'can', 'of', 'toward', 'my', 'something', 'and', 'whereafter', 'whenever', 'give', 'almost', 'wherever', 'is', 'describe', 'beforehand', 'herein', 'an', 'as', 'itself', 'at', 'have', 'in', 'seem', 'whence', 'ie', 'any', 'fill', 'again', 'hasnt', 'inc', 'thereby', 'thin', 'no', 'perhaps', 'latter', 'meanwhile', 'when', 'detail', 'same', 'wherein', 'beside', 'also', 'that', 'other', 'take', 'which', 'becomes', 'you', 'if', 'nobody', 'see', 'though', 'may', 'after', 'upon', 'most', 'hereupon', 'eight', 'but', 'serious', 'nothing', 'such', 'your', 'why', 'a', 'off', 'whereby', 'third', 'i', 'whole', 'noone', 'sometimes', 'well', 'amoungst', 'yours', 'their', 'rather', 'without', 'so', 'five', 'the', 'first', 'whereas', 'once'])\n" ] } ], "prompt_number": 17 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 5: Other CountVectorizer Options" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- **max_features:** int or None, default=None\n", "- If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# remove English stop words and only keep 100 features\n", "vect = CountVectorizer(stop_words='english', max_features=100)\n", "tokenize_test(vect)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Features: 100\n", "Accuracy: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 0.869863013699\n" ] } ], "prompt_number": 18 }, { "cell_type": "code", "collapsed": false, "input": [ "# all 100 features\n", "print vect.get_feature_names()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[u'amazing', u'area', u'atmosphere', u'awesome', u'bad', u'bar', u'best', u'better', u'big', u'came', u'cheese', u'chicken', u'clean', u'coffee', u'come', u'day', u'definitely', u'delicious', u'did', u'didn', u'dinner', u'don', u'eat', u'excellent', u'experience', u'favorite', u'feel', u'food', u'free', u'fresh', u'friendly', u'friends', u'going', u'good', u'got', u'great', u'happy', u'home', u'hot', u'hour', u'just', u'know', u'like', u'little', u'll', u'location', u'long', u'looking', u'lot', u'love', u'lunch', u'make', u'meal', u'menu', u'minutes', u'need', u'new', u'nice', u'night', u'order', u'ordered', u'people', u'perfect', u'phoenix', u'pizza', u'place', u'pretty', u'prices', u'really', u'recommend', u'restaurant', u'right', u'said', u'salad', u'sandwich', u'sauce', u'say', u'service', u'staff', u'store', u'sure', u'table', u'thing', u'things', u'think', u'time', u'times', u'took', u'town', u'tried', u'try', u've', u'wait', u'want', u'way', u'went', u'wine', u'work', u'worth', u'years']\n" ] } ], "prompt_number": 19 }, { "cell_type": "code", "collapsed": false, "input": [ "# include 1-grams and 2-grams, and limit the number of features\n", "vect = CountVectorizer(ngram_range=(1, 2), 
max_features=100000)\n", "tokenize_test(vect)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Features: 100000\n", "Accuracy: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 0.885518590998\n" ] } ], "prompt_number": 20 }, { "cell_type": "markdown", "metadata": {}, "source": [ "- **min_df:** float in range [0.0, 1.0] or int, default=1\n", "- When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# include 1-grams and 2-grams, and only include terms that appear at least 2 times\n", "vect = CountVectorizer(ngram_range=(1, 2), min_df=2)\n", "tokenize_test(vect)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Features: 43957\n", "Accuracy: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 0.932485322896\n" ] } ], "prompt_number": 21 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 6: Introduction to TextBlob" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "TextBlob: \"Simplified Text Processing\"" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# print the first review\n", "print yelp_best_worst.text[0]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "My wife took me here on my birthday for breakfast and it was excellent. The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure. Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning. It looked like the place fills up pretty quickly so the earlier you get here the better.\r\n", "\r\n", "Do yourself a favor and get their Bloody Mary. It was phenomenal and simply the best I've ever had. I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it. It was amazing.\r\n", "\r\n", "While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious. It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete. 
It was the best \"toast\" I've ever had.\r\n", "\r\n", "Anyway, I can't wait to go back!\n" ] } ], "prompt_number": 22 }, { "cell_type": "code", "collapsed": false, "input": [ "# save it as a TextBlob object\n", "review = TextBlob(yelp_best_worst.text[0])" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 23 }, { "cell_type": "code", "collapsed": false, "input": [ "# list the words\n", "review.words" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 24, "text": [ "WordList(['My', 'wife', 'took', 'me', 'here', 'on', 'my', 'birthday', 'for', 'breakfast', 'and', 'it', 'was', 'excellent', 'The', 'weather', 'was', 'perfect', 'which', 'made', 'sitting', 'outside', 'overlooking', 'their', 'grounds', 'an', 'absolute', 'pleasure', 'Our', 'waitress', 'was', 'excellent', 'and', 'our', 'food', 'arrived', 'quickly', 'on', 'the', 'semi-busy', 'Saturday', 'morning', 'It', 'looked', 'like', 'the', 'place', 'fills', 'up', 'pretty', 'quickly', 'so', 'the', 'earlier', 'you', 'get', 'here', 'the', 'better', 'Do', 'yourself', 'a', 'favor', 'and', 'get', 'their', 'Bloody', 'Mary', 'It', 'was', 'phenomenal', 'and', 'simply', 'the', 'best', 'I', \"'ve\", 'ever', 'had', 'I', \"'m\", 'pretty', 'sure', 'they', 'only', 'use', 'ingredients', 'from', 'their', 'garden', 'and', 'blend', 'them', 'fresh', 'when', 'you', 'order', 'it', 'It', 'was', 'amazing', 'While', 'EVERYTHING', 'on', 'the', 'menu', 'looks', 'excellent', 'I', 'had', 'the', 'white', 'truffle', 'scrambled', 'eggs', 'vegetable', 'skillet', 'and', 'it', 'was', 'tasty', 'and', 'delicious', 'It', 'came', 'with', '2', 'pieces', 'of', 'their', 'griddled', 'bread', 'with', 'was', 'amazing', 'and', 'it', 'absolutely', 'made', 'the', 'meal', 'complete', 'It', 'was', 'the', 'best', 'toast', 'I', \"'ve\", 'ever', 'had', 'Anyway', 'I', 'ca', \"n't\", 'wait', 'to', 'go', 'back'])" ] } ], "prompt_number": 24 }, { "cell_type": "code", "collapsed": false, "input": [ "# list the sentences\n", "review.sentences" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 25, "text": [ "[Sentence(\"My wife took me here on my birthday for breakfast and it was excellent.\"),\n", " Sentence(\"The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.\"),\n", " Sentence(\"Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.\"),\n", " Sentence(\"It looked like the place fills up pretty quickly so the earlier you get here the better.\"),\n", " Sentence(\"Do yourself a favor and get their Bloody Mary.\"),\n", " Sentence(\"It was phenomenal and simply the best I've ever had.\"),\n", " Sentence(\"I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it.\"),\n", " Sentence(\"It was amazing.\"),\n", " Sentence(\"While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.\"),\n", " Sentence(\"It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.\"),\n", " Sentence(\"It was the best \"toast\" I've ever had.\"),\n", " Sentence(\"Anyway, I can't wait to go back!\")]" ] } ], "prompt_number": 25 }, { "cell_type": "code", "collapsed": false, "input": [ "# some string methods are available\n", "review.lower()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", 
"prompt_number": 26, "text": [ "TextBlob(\"my wife took me here on my birthday for breakfast and it was excellent. the weather was perfect which made sitting outside overlooking their grounds an absolute pleasure. our waitress was excellent and our food arrived quickly on the semi-busy saturday morning. it looked like the place fills up pretty quickly so the earlier you get here the better.\n", "\n", "do yourself a favor and get their bloody mary. it was phenomenal and simply the best i've ever had. i'm pretty sure they only use ingredients from their garden and blend them fresh when you order it. it was amazing.\n", "\n", "while everything on the menu looks excellent, i had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious. it came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete. it was the best \"toast\" i've ever had.\n", "\n", "anyway, i can't wait to go back!\")" ] } ], "prompt_number": 26 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 7: Stemming and Lemmatization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Stemming:**\n", "\n", "- **What:** Reduce a word to its base/stem/root form\n", "- **Why:** Often makes sense to treat related words the same way\n", "- **Notes:**\n", " - Uses a \"simple\" and fast rule-based approach\n", " - Stemmed words are usually not shown to users (used for analysis/indexing)\n", " - Some search engines treat words with the same stem as synonyms" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# initialize stemmer\n", "stemmer = SnowballStemmer('english')\n", "\n", "# stem each word\n", "print [stemmer.stem(word) for word in review.words]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[u'my', u'wife', u'took', u'me', u'here', u'on', u'my', u'birthday', u'for', u'breakfast', u'and', u'it', u'was', u'excel', u'the', u'weather', u'was', u'perfect', u'which', u'made', u'sit', u'outsid', u'overlook', u'their', u'ground', u'an', u'absolut', u'pleasur', u'our', u'waitress', u'was', u'excel', u'and', u'our', u'food', u'arriv', u'quick', u'on', u'the', u'semi-busi', u'saturday', u'morn', u'it', u'look', u'like', u'the', u'place', u'fill', u'up', u'pretti', u'quick', u'so', u'the', u'earlier', u'you', u'get', u'here', u'the', u'better', u'do', u'yourself', u'a', u'favor', u'and', u'get', u'their', u'bloodi', u'mari', u'it', u'was', u'phenomen', u'and', u'simpli', u'the', u'best', u'i', u've', u'ever', u'had', u'i', u\"'m\", u'pretti', u'sure', u'they', u'onli', u'use', u'ingredi', u'from', u'their', u'garden', u'and', u'blend', u'them', u'fresh', u'when', u'you', u'order', u'it', u'it', u'was', u'amaz', u'while', u'everyth', u'on', u'the', u'menu', u'look', u'excel', u'i', u'had', u'the', u'white', u'truffl', u'scrambl', u'egg', u'veget', u'skillet', u'and', u'it', u'was', u'tasti', u'and', u'delici', u'it', u'came', u'with', u'2', u'piec', u'of', u'their', u'griddl', u'bread', u'with', u'was', u'amaz', u'and', u'it', u'absolut', u'made', u'the', u'meal', u'complet', u'it', u'was', u'the', u'best', u'toast', u'i', u've', u'ever', u'had', u'anyway', u'i', u'ca', u\"n't\", u'wait', u'to', u'go', u'back']\n" ] } ], "prompt_number": 27 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Lemmatization**\n", "\n", "- **What:** Derive the canonical form ('lemma') of a word\n", "- **Why:** Can be better than stemming\n", "- **Notes:** Uses a dictionary-based approach 
(slower than stemming)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# assume every word is a noun\n", "print [word.lemmatize() for word in review.words]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['My', 'wife', 'took', 'me', 'here', 'on', 'my', 'birthday', 'for', 'breakfast', 'and', 'it', u'wa', 'excellent', 'The', 'weather', u'wa', 'perfect', 'which', 'made', 'sitting', 'outside', 'overlooking', 'their', u'ground', 'an', 'absolute', 'pleasure', 'Our', 'waitress', u'wa', 'excellent', 'and', 'our', 'food', 'arrived', 'quickly', 'on', 'the', 'semi-busy', 'Saturday', 'morning', 'It', 'looked', 'like', 'the', 'place', u'fill', 'up', 'pretty', 'quickly', 'so', 'the', 'earlier', 'you', 'get', 'here', 'the', 'better', 'Do', 'yourself', 'a', 'favor', 'and', 'get', 'their', 'Bloody', 'Mary', 'It', u'wa', 'phenomenal', 'and', 'simply', 'the', 'best', 'I', \"'ve\", 'ever', 'had', 'I', \"'m\", 'pretty', 'sure', 'they', 'only', 'use', u'ingredient', 'from', 'their', 'garden', 'and', 'blend', 'them', 'fresh', 'when', 'you', 'order', 'it', 'It', u'wa', 'amazing', 'While', 'EVERYTHING', 'on', 'the', 'menu', u'look', 'excellent', 'I', 'had', 'the', 'white', 'truffle', 'scrambled', u'egg', 'vegetable', 'skillet', 'and', 'it', u'wa', 'tasty', 'and', 'delicious', 'It', 'came', 'with', '2', u'piece', 'of', 'their', 'griddled', 'bread', 'with', u'wa', 'amazing', 'and', 'it', 'absolutely', 'made', 'the', 'meal', 'complete', 'It', u'wa', 'the', 'best', 'toast', 'I', \"'ve\", 'ever', 'had', 'Anyway', 'I', 'ca', \"n't\", 'wait', 'to', 'go', 'back']\n" ] } ], "prompt_number": 28 }, { "cell_type": "code", "collapsed": false, "input": [ "# assume every word is a verb\n", "print [word.lemmatize(pos='v') for word in review.words]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['My', 'wife', u'take', 'me', 'here', 'on', 'my', 'birthday', 'for', 'breakfast', 'and', 'it', u'be', 'excellent', 'The', 'weather', u'be', 'perfect', 'which', u'make', u'sit', 'outside', u'overlook', 'their', u'ground', 'an', 'absolute', 'pleasure', 'Our', 'waitress', u'be', 'excellent', 'and', 'our', 'food', u'arrive', 'quickly', 'on', 'the', 'semi-busy', 'Saturday', 'morning', 'It', u'look', 'like', 'the', 'place', u'fill', 'up', 'pretty', 'quickly', 'so', 'the', 'earlier', 'you', 'get', 'here', 'the', 'better', 'Do', 'yourself', 'a', 'favor', 'and', 'get', 'their', 'Bloody', 'Mary', 'It', u'be', 'phenomenal', 'and', 'simply', 'the', 'best', 'I', \"'ve\", 'ever', u'have', 'I', \"'m\", 'pretty', 'sure', 'they', 'only', 'use', 'ingredients', 'from', 'their', 'garden', 'and', 'blend', 'them', 'fresh', 'when', 'you', 'order', 'it', 'It', u'be', u'amaze', 'While', 'EVERYTHING', 'on', 'the', 'menu', u'look', 'excellent', 'I', u'have', 'the', 'white', 'truffle', u'scramble', u'egg', 'vegetable', 'skillet', 'and', 'it', u'be', 'tasty', 'and', 'delicious', 'It', u'come', 'with', '2', u'piece', 'of', 'their', u'griddle', 'bread', 'with', u'be', u'amaze', 'and', 'it', 'absolutely', u'make', 'the', 'meal', 'complete', 'It', u'be', 'the', 'best', 'toast', 'I', \"'ve\", 'ever', u'have', 'Anyway', 'I', 'ca', \"n't\", 'wait', 'to', 'go', 'back']\n" ] } ], "prompt_number": 29 }, { "cell_type": "code", "collapsed": false, "input": [ "# define a function that accepts text and returns a list of lemmas\n", "def split_into_lemmas(text):\n", " text = unicode(text, 'utf-8').lower()\n", " words = TextBlob(text).words\n", 
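" # lemmatize each word (lemmatize() assumes each word is a noun unless a pos argument is given)\n",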
" return [word.lemmatize() for word in words]" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 30 }, { "cell_type": "code", "collapsed": false, "input": [ "# use split_into_lemmas as the feature extraction function\n", "vect = CountVectorizer(analyzer=split_into_lemmas)\n", "tokenize_test(vect)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Features: 16452\n", "Accuracy: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 0.920743639922\n" ] } ], "prompt_number": 31 }, { "cell_type": "code", "collapsed": false, "input": [ "# last 50 features\n", "print vect.get_feature_names()[-50:]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[u'yuyuyummy', u'yuzu', u'z', u'z-grill', u'z11', u'zach', u'zam', u'zanella', u'zankou', u'zappos', u'zatsiki', u'zen', u'zen-like', u'zero', u'zero-star', u'zest', u'zexperience', u'zha', u'zhou', u'zia', u'zilch', u'zin', u'zinburger', u'zinburgergeist', u'zinc', u'zinfandel', u'zing', u'zip', u'zipcar', u'zipper', u'zipps', u'ziti', u'zoe', u'zombi', u'zombie', u'zone', u'zoning', u'zoo', u'zoyo', u'zucca', u'zucchini', u'zuchinni', u'zumba', u'zupa', u'zuzu', u'zwiebel-kr\\xe4uter', u'zzed', u'\\xe9clairs', u'\\xe9cole', u'\\xe9m']\n" ] } ], "prompt_number": 32 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 8: Term Frequency - Inverse Document Frequency (TF-IDF)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- **What:** Computes \"relative frequency\" that a word appears in a document compared to its frequency across all documents\n", "- **Why:** More useful than \"term frequency\" for identifying \"important\" words in each document (high frequency in that document, low frequency in other documents)\n", "- **Notes:** Used for search engine scoring, text summarization, document clustering" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# example documents\n", "train_simple = ['call you tonight',\n", " 'Call me a cab',\n", " 'please call me... PLEASE!']" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 33 }, { "cell_type": "code", "collapsed": false, "input": [ "# CountVectorizer\n", "vect = CountVectorizer()\n", "pd.DataFrame(vect.fit_transform(train_simple).toarray(), columns=vect.get_feature_names())" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cabcallmepleasetonightyou
0010011
1111000
2011200
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 34, "text": [ " cab call me please tonight you\n", "0 0 1 0 0 1 1\n", "1 1 1 1 0 0 0\n", "2 0 1 1 2 0 0" ] } ], "prompt_number": 34 }, { "cell_type": "code", "collapsed": false, "input": [ "# TfidfVectorizer\n", "vect = TfidfVectorizer()\n", "pd.DataFrame(vect.fit_transform(train_simple).toarray(), columns=vect.get_feature_names())" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cabcallmepleasetonightyou
00.0000000.3853720.0000000.0000000.6524910.652491
10.7203330.4254410.5478320.0000000.0000000.000000
20.0000000.2660750.3426200.9010080.0000000.000000
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 35, "text": [ " cab call me please tonight you\n", "0 0.000000 0.385372 0.000000 0.000000 0.652491 0.652491\n", "1 0.720333 0.425441 0.547832 0.000000 0.000000 0.000000\n", "2 0.000000 0.266075 0.342620 0.901008 0.000000 0.000000" ] } ], "prompt_number": 35 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 9: Using TF-IDF to Summarize a Yelp Review" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# create a document-term matrix using TF-IDF\n", "vect = TfidfVectorizer(stop_words='english')\n", "dtm = vect.fit_transform(yelp.text)\n", "features = vect.get_feature_names()\n", "dtm.shape" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 36, "text": [ "(10000, 28881)" ] } ], "prompt_number": 36 }, { "cell_type": "code", "collapsed": false, "input": [ "def summarize():\n", " \n", " # choose a random review that is at least 300 characters\n", " review_length = 0\n", " while review_length < 300:\n", " review_id = np.random.randint(0, len(yelp))\n", " review_text = unicode(yelp.text[review_id], 'utf-8')\n", " review_length = len(review_text)\n", " \n", " # create a dictionary of words and their TF-IDF scores\n", " word_scores = {}\n", " for word in TextBlob(review_text).words:\n", " word = word.lower()\n", " if word in features:\n", " word_scores[word] = dtm[review_id, features.index(word)]\n", " \n", " # print words with the top 5 TF-IDF scores\n", " print 'TOP SCORING WORDS:'\n", " top_scores = sorted(word_scores.items(), key=lambda x: x[1], reverse=True)[:5]\n", " for word, score in top_scores:\n", " print word\n", " \n", " # print 5 random words\n", " print '\\n' + 'RANDOM WORDS:'\n", " random_words = np.random.choice(word_scores.keys(), size=5, replace=False)\n", " for word in random_words:\n", " print word\n", " \n", " # print the review\n", " print '\\n' + review_text" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 37 }, { "cell_type": "code", "collapsed": false, "input": [ "summarize()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "TOP SCORING WORDS:\n", "insurance\n", "surgery\n", "dr\n", "estimate\n", "fish\n", "\n", "RANDOM WORDS:\n", "check\n", "credits\n", "small\n", "called\n", "allowed\n", "\n", "Dr Fish is an ok oral surgeon. However, his financial practices are shady. He gave us a high estimate for surgery, 6k; we had to pay 4k before the surgery. He then lowered the estimate after the surgery but refused to credit our card; we had insurance that paid 2k. Two months later, we received a check for a small amount of the overage. Their office person told me they dont do credit card credits. I then called our insurance company and learned Dr. Fish had received $1800 more from us than the insurance co allowed in Dr Fish's agreement with the co. Now we have to go fight him for the refund. His office person said, oh, insurance co's dont like to pay. Hmmm--why contract with them then? I would say, look elsewhere!\n" ] } ], "prompt_number": 38 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 10: Sentiment Analysis" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print review" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "My wife took me here on my birthday for breakfast and it was excellent. 
The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure. Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning. It looked like the place fills up pretty quickly so the earlier you get here the better.\r\n", "\r\n", "Do yourself a favor and get their Bloody Mary. It was phenomenal and simply the best I've ever had. I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it. It was amazing.\r\n", "\r\n", "While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious. It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete. It was the best \"toast\" I've ever had.\r\n", "\r\n", "Anyway, I can't wait to go back!\n" ] } ], "prompt_number": 39 }, { "cell_type": "code", "collapsed": false, "input": [ "# polarity ranges from -1 (most negative) to 1 (most positive)\n", "review.sentiment.polarity" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 40, "text": [ "0.40246913580246907" ] } ], "prompt_number": 40 }, { "cell_type": "code", "collapsed": false, "input": [ "# understanding the apply method\n", "yelp['length'] = yelp.text.apply(len)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 41 }, { "cell_type": "code", "collapsed": false, "input": [ "# define a function that accepts text and returns the polarity\n", "def detect_sentiment(text):\n", " return TextBlob(text.decode('utf-8')).sentiment.polarity" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 42 }, { "cell_type": "code", "collapsed": false, "input": [ "# create a new DataFrame column for sentiment\n", "yelp['sentiment'] = yelp.text.apply(detect_sentiment)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 43 }, { "cell_type": "code", "collapsed": false, "input": [ "# boxplot of sentiment grouped by stars\n", "yelp.boxplot(column='sentiment', by='stars')" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 44, "text": [ "" ] }, { "metadata": {}, "output_type": "display_data", "png": 
"iVBORw0KGgoAAAANSUhEUgAAAYIAAAEaCAYAAAAcz1CnAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3X+cXHV97/HXJ4kQMMASpUT5FStqRSmLIpdKbA5abUAu\nhnujFtvK+ui1PlpzlavtFa26yX148Uf7uF011vZa7ab+QDHV+APlhzVnK7SIKBuxgrdRFzEYBJMI\nCSRC8rl/nDM7Z+fM7uzuzM73e2bez8dj4DszZ8585puz53u+P4+5OyIi0r8WhQ5ARETCUkEgItLn\nVBCIiPQ5FQQiIn1OBYGISJ9TQSAi0udUEEjbzOyQmd1uZuNm9m0z+60O7z8xsy+12GZ1p7+3G8xs\nwsyWN3l9X4e/p5L5I92xJHQA0hMedvezAczsJcC7gaTLMVwAPAT823w+bGYG4N2fWDPd93U6jjnn\nj5ktcffHOhyHREg1Aum044DdkJ1czewvzewOM/uumb0if33EzN6Rp3/XzMbybUfN7G/N7Ftm9gMz\ne2njzs1suZltNbPtZvZvZnamma0EXgf8j7xmsqrhMyeY2Y1m9j0z+0jtKtzMVubfsxm4Azhlmnin\n1EjMbJOZXZ6nJ8zsvfn23zSzpxa+c4uZ3Zo/np+//gQzu6EWC2DTZaSZ/Z98u6+Z2RPN7Klm9u3C\n+08rPi+8/gYz+/c8jz5lZqc15o+ZXWxmt5jZd/K8+bX8sxvM7ONmdhOw2cyelcd/e76/01v8+0sV\nubseerT1AB4DbgfuBPYCZ+ev/1fgBrKT3a8BdwMnAkcB3yO7Sr0LeEq+/SjwlTx9OnAPcCRZ7eJL\n+esfBN6Rpy8Abs/Tw8CbpolvE/CWPP27wGFgObASOAScO0O8K4rfX4jh1Xn6x8Bb8/QfFuL8FHB+\nnj4V+H6e/gDw9jx9US2WJjEfBi7L0+8APpinvw6claevAl7f5LM7gcfl6WOb5Q8wUEj/N+Cv8vQG\n4FvAkYV4X5WnlwBLQx9venT+oaYh6YRHvN40dB7wceDZwCrgU56dRX5uZmNkJ90vmdlrgW8Ab3T3\nH+f7ceAaAHffYWY/An6j4bvOB/5Lvs22/Ar7mPy96a6uzwfW5p+53sz2FN67291vLWzXGO/zgAdb\n/P6r8/9/GvjrPP07wDPzFieAY8zs8cALgEvzWL7SEEvRYeAzefoTwOfy9N8DrzGzNwGvyONr9F3g\nU2a2FdhaeL2YP6eY2TVkBd0RwI/y1x34orsfzJ//G/AXZnYy8Dl33zFNvFJhahqSjnL3W4AnmtkJ\nZCeV4snHqLd9/yZwP3BSi10ebvLatM0pM5juM/tbbOdkNZ7i38pRM3xP7fcZ8J/c/ez8cYq77y+8\nNxfFfPsccCFwMXCbuzcrSF4KfAh4DvAtM1vcZJsPAh9w998kazYq/qaHJ3+M+9XAfwYeAb5iZhfM\nMXapABUE0lFm9htkx9UDZFf8rzSzRXnB8ALg1rzN+k3A2cCFZnZu7ePAy/P+gqcCvw78oOErvgH8\nfv5dCXC/uz9E1hF6DM3dTHb1XOvMPn6a7Rrj/W3gVuAnwBlmdoSZDQAvbPjcKwv//9c8fQPwhkK+\nnJUn/wV4Vf7ahTPEsgh4eZ5+VR4b7n4AuB74MPAPjR+yrApyqrunwJVkfTbLKOfPscC9eXqouIuG\n/T3F3X/s7h8EvgCcOU28UmFqGpJOOMrMbs/TBlyeN6983rIhi9vJrmj/3N1/bmY3Am92911m9kfA\nqJk9L9/mJ2Qn32OB17n7r8zMqV8RbwA+Zmbbya7mL89f/xKwxcxeBqx395sL8W0ErjazPyRr6thF\ndmI8trBf3L1pvAB5M8r3yPoEvtPw+4/P4zkAXJa/9gbgQ/nrS4Ax4E8LsVxGVmjcPU2e7gfONbO3\nA/dRL2wg63+4lKywabQY+LiZHUf2b/F+d/9l3tldy5//nufjZ/Omqa8Dp9Wygakjll6R59ujwM+A\n/z1NvFJhlv29ioRnZv9A1tn6uZYbz22/RwCH3P1QfqL/kLs/Z477+Apwtbt/vOH1HwPPdffdnYu4\nZSx/Bhzj7sPd+k7pbaoRSD84FbjGzBYBvwJeO9PGZrYBeKq7/2HtNXe/aJrNF/RKysxGgXvcvTbc\n9vPAUyg3T4nMmwoCiYa7v2aB9ruDrON0Ifb96wux3xm+79Jufp/0B3UWS+WZ2VvM7Kdm9qCZ3WVm\nL8w7nK80sx1m9oCZfcbMjs+3X2lmh83s1WZ2t5ndb2Zvy99bA7yVrNP4oVrfh5mleX8GZjZkZjfn\nE7725N/xfDN7jZn9xMzuM7NXF+I70sz+Kv+uXWb2YTNbmr+X5LG/Kf/cvWY2lL/3x2Qdxf8zj+UL\nXcxW6SMqCKTSzOwZwOuBc9z9WOAlwARZZ+0lZCN/ngTsIRtSWXQ+8HTgRcA7zewZ7n4d2UStT7v7\nMbX5EZQ7Uc8l61ReTjaP4BqyWsdTgT8ANpnZ0fm27yGbIHdW/v+TgHcW9nUiWcf1k4E/IutkPs7d\n/y/wSeC9eSwvm1cmibSggkCq7hDZ7ONnmdnj3P0n7v4jsrHxb3f3e939UbLROuvyfoKaje5+0N2/\nS3ZSrw3xNFqP9f+xu2/OR0ddQ3YS/1/u/qi730jWF3F6PpzztWSzeve6+z6ytZh+r7CvR/PPHnL3\nrwL7gGcU3p/PvAmRWVMfgVRaPgP5CrLhkM8ys+uBN5MtH/F5MytOSHuM7Oq7Zlch/TDZePvZuq+Q\nfiSP5f6G15YBJwBHA98uzDI2pl6E/cLdi3HONRaRtqhGIJXn7le7+wvIxsI78F6y+Qhr3P34wuNo\nd//ZbHbZwfAeICsUzijEMZA3Y82GxnfLglNBIJVmZk/PO4ePBA6STep6DPhb4CozOzXf7gQzu2SW\nu90FrLTCJfx85Vf6HwFG8tnKmNlJ+Qzn2biPbIa1yIJRQSBVdyRZm/v9ZDNfn0g26uf9wBeBG8zs\nQbIZxecWPjfTlfZn8///wsxua/J+Y8dxq/29BdgB3GJmvwRuJOukns1nP0q2vMUeM+voRDuRmrZn\nFpvZx8gWufq5uzddh8TMPkC2UNbDwJC7395sOxER6b5O1Aj+AVgz3ZtmdhFwurs/DfhjssWyREQk\nEm0XBO7+DbIx2tO5BNicb/tNYMDMTpxhexER6aJu9BGcRHanqZqfAid34XtFRGQWutVZ3OxmHyIi\nEoFuTCjbCZxSeH5y/toU+ZrzIiKyQNy96ZDobtQIvgi8GibvZ7vX3e9rtmGoGzcXH6tXDwePIbbH\n8LDypNlDx0r5AcqTWPNkJm3XCMzsamA12X1q7wGGgcflJ/a/8+wG3ReZ2Q6yuy4tyFLDIiIyP20X\nBO5+2Sy2Wd/u93TL0qUToUOIzsTEROgQoqRjpZmJ0AFEIU2zR2aCDRuyVJJkj9hoZnGDNWsGQ4cQ\nncFB5UkzOlbK1q5VnpTFnyfR3LPYzDyWWERkfpKkeCUs
AAMDsHdv6CjAzPCAncUi0idiOOHFZlEF\nzrIVCLG7Ul3OlChPmlO+ZEZG6m3f27enk+mRkbBxhVTMkz174s8T3ZhGRNpyxRXZA2DJEjUNwdQ8\nedzj4s8T9RGISFuKI2Q2boTh4Swd6wiZbogxT2bqI1BBICIdYwb6M54qljxRZ/EcqN23THnSnPIl\ns2oVLF2aPSCdTK9aFTqycKqWJ+ojEJG2rFuX9Q0AjI3Beedl6bVrw8UU2k031dNmcOBAuFhmQzWC\nBkm/NmrOQHnSnPKlmSR0AFFYvx5WrswekEym10e6xoJqBCLSlsHB+vyBsbF6Z2g/T0hftw6e+MQs\nvXEjDA1l6VivHVQjaKB23zLlSXPKl8z4eHGUTDqZHh8PGVVM0tABtKQagYi0RTWCsve/H7Ztqz+v\nTSTbvj3OWoGGj4pIx8QyVDK0kRHYujVLj43B6tVZeu3a+kSzbtM8AmlLmsZ5FRPaqlVTR4f0q/Xr\n4ctfztJ33w2nnZalL74YNm0KF1csYikcNY9gDtTuWzY6moYOIUq33JKGDiEKO3dmTUNZ81A6md5Z\nuiFt/7j00mzV0YEBgHQyfemloSNrTn0EIvMUw1VeDN74RjjrrCy9cWO96UO1yOpQ05A0FeNaKTFY\ntQpuuy1LHzwIRx6Zpc85p3+biY46qvmEqaVL4ZFHuh9PDGLMk5mahlQjaKD28EzjCb92q71+9653\nTS0gr7wyS/fzMfPud0/fMdqvvvrV6S+kYqQaQYOhoZTR0SR0GFFRnjS3eHHKoUNJ6DCiYpbinoQO\nI7ipo4ZSVq9OgHhHDalG0GDXrtARxKefx4PP5AlPCB1BHIonPahf9YY86YW2YwdMTNSf19I7doSI\npjXVCFB7uMzPyEj/nuiK+nn4qFnTC2xgO3BGnl4CPJanvw+cVdq6G+c+1QhaKJ7w01Tt4TI7uj9v\npmrr6nTSbE7g2TyC2qn2N4E4Lr6LVBAwtUYwNpayYUMCqEZQk6apVtpsIptHkASOIrwtW+o1Aqj3\nJz3wgP5+MimxHycqCJh6wr/lFtUIZHrFi4brr68fK/180bBpU70JyGxq27jUlqKOm/oIGmzYoIKg\nkdrCm3v842H//tBRhBfjujpSpj6COejXq7qZbN2qP+iaYo3g4YdVIwCtPtoLVCNooPbwssHBlPHx\nJHQY0dGY+czUUUMpp52WAP0xamg2YpmHo0XnZM5GRupXubU11JOkvq56vzrzzOz+vLV79NbSZ54Z\nNq6QNm/Oho3efXf2vJbevDlsXLGoQj6oRiAtJUm9OUTqFi2Cw4dDRxHemWfCnXdm6UOHYPHiLP3M\nZ8Idd4SLKxZVWIZafQQi0pbjjqvXkA4dqqePOy5cTDI3ahpqMDKShg4hOs9+dho6hCgdfXQaOoQo\nDA7CihXZA9LJtDqLa9LQAbSkGkED3XC7bN260BHEozhqaP9+jRoCuPZa+MlP6s9r6WuvVWdxVaiP\noMHQEIyOho4iLlqau7nly2H37tBRdE+V1tWJSSxzk9RH0ELxKm/z5vpMwH6+yitSQVBXPFb27Omv\nGkGvrKvTbTEUAq2oIGDqH/HWrfW1hiQzMZES+1opYaQoXxqlKE+mqsLcJBUETL3K2769v67ypqNa\nUnPj41OH0tbSAwP9nS9SbeojaKA+grIlS+Cxx1pv1280j6AsljHzUqY+gjmowkqBEk6xpuSu2qP0\nBs0jaDAwkIYOIQqrVsHSpdnj0KF0Mr1qVejIYpKGDiBCaegAojM0lIYOoSXVCKSpm26qp484Ag4c\nCBdLTKbehKXejKibsGQuvzx0BPHZvDn+5mYVBA327k1ChxCdRYuS0CFEY+ptGZO+ui3jbMSwymZ8\nktABtKSmIWnpnHNCRyAiC0k1AqZ2AG7cmFIrwdUBmFm3LqUKVzXdMHX4aEqaJoCGj9ZUYcx896XE\n/vejgoCpJ/yJiWrMBOym667THcpqxsamrkdVSx9/vPJIqqvtgsDM1gAjwGLg7939vQ3vJ8AXgB/l\nL/2Tu7+r3e9dKCtXJqFDiM6uXUnoEKLxxjfCWfnyORs3JpMnf10EZ1QbKBseTkKH0FJbE8rMbDHw\nA+B3gJ3At4DL3P3OwjYJ8CZ3v6TFvqKYUKZ1dTJTm8tgeDhL93tz2apVcNttWfrgQTjyyCx9zjlT\nR1r1q1gWWJOyhbxV5bnADnefcPdHgU8DL2sWQ5vf0zVbtqShQ4hQGjqArjOzpo+bbzYOHswe8PXJ\n9M03N9++32R9bFKUVuD2fu0WBCcB9xSe/zR/rciB55vZdjP7ipmdQcSuuy50BBIDd2/5gEWz2EYk\nfu32EczmSP8OcIq7P2xmFwJbgae3+b0L5rHHktAhRGHq6JhEi6s1lYQOIEJJ6ACiU4V+k3YLgp3A\nKYXnp5DVCia5+0OF9FfN7G/MbLm7l27pMTQ0xMp8sZ+BgQEGBwcnM7FWvVqI5yMjMDqaPb/77oQk\ngb17U1atgk2bFv77Y3w+NpZy112wdGn2/K67svd37IgjvhieZ7No44knhufKj3iej4+Ps3fvXgAm\nJiaYSbudxUvIOotfBNwL3Eq5s/hE4Ofu7mZ2LnCNu69ssq9gncWN8whqvfz93DGqPGkt1Zj5ErMU\n9yR0GFEZGkqjmHE9U2dx28tQ5809teGjH3X3d5vZ6wDc/e/M7PXAn5Ddv+5hshFEtzTZTxSjhlas\nSDVcssERR6T86ldJ6DCio4KgLJaTXkxiKRwXtCDolFgKgnPPhVtvDR1FXJQnIvMXyz0aFnL4aM85\nI+oxTWG8732hIxCRhaSCoMGuXWnoELpuujHztccFF8z8fr+Oma93kEqN8qSZNHQALWmtIaZ2jF5/\nff/ddapVk1wsbZyxGR3tj+NDep/6CBokydSbk0s8bZyxUb70tuXLYc+e0FFkjj8edpcG3M+N7lnc\nQrFGMDbWfzUCkU7ppbWG9uyJp6Bf6JZX1QgaLF+esnt3EjqMqKhpqDnlS1kv5UmnanydGGbciVg0\namgOjjgidATx0X1oRXqbagTAyAhs3Zqlx8Zg9eosvXatbjYi01MfQVkv5UlMv2WhawQqCBqos1hm\nq5fawzslppNnu2L6LWoa6rBWY+HHxkY0Zr6BxoY3lyRp6BAilIYOIDpV+Pvpu4Kg1frxz3ve4CzX\nohfpHcuXZ1ed7T6g/X0sXx42L/qRmoZKccRTHRTplpiO+1hiiSUOUNOQREDt4CK9TQVBSRo6gOjo\nPrTNVaHtt9uUJ2VVyBMVBCLzNDoaOgKRzlAfQSmOeNoFY6E8aa6X8iWm3xJLLLHEAQvfR6C1hhoM\nD4eOQERi4BhEMlLcC/9dCGoaaqCx4c2koQOIVBo6gOhUoT18tgzPLsPbfKTbtrW9D1vAQgBUEMgs\naK0hkd6mPgKReYqpDbldMf2WWGKJJQ7QPAKRjtMsWpGpVBA06KU2zk7ptTyp3XCk3ce2bWnb+4jl\nDlid0mvHSid
UIU80aqiB7kMr/aifRshImfoISnHE0y4oCyOmf+NYYoklDognlljiAPURSAS01pBI\nb1NBUJKGDiA6WmuouSq0/Xab8qSsCnmigkBEpM+pj6AURzztgrHotTyJ6ffEEksscUA8scR0I8Lj\nj4fdu9vbh9YamgOtNSQi0LnCKJaCbSZqGmqgtYaaSUMHEKUqtP12m/KkmTR0AC2pRtDjli/vzKSl\nTlSTO1G97QSNmReZSn0EPS6mamksscQSB8QTS6+1h8cknn9j9RGIyAz6qT1cytRH0EBtnGXKk+aU\nL82koQOIzuWXp6FDaEkFQQPdh1ZEOmloKHQEramPoBRHb1VtY/o9scQSSxwQVyyd0Gu/p5dorSER\nEZmWCoKSNHQA0VFbeHPKl7IqtId3WxWOExUEItIxVWgPlzL1EZTi6K02zph+TyyxaMy8dNOGDXEs\n5T5TH0FPFQSdmkXbCdH8gcd01oM4SoIOiaVgk7jFcpz0TWdxJ+5F24n70MZ0L1qj/R+TbtvWkZv8\nWs8tpZCGDiA6VWgP7740dAAt9VRBICIic9dTTUOxVMEgnlhiiQPiiqUTeu33dEIs7eExieU46Zs+\nglgyHOKJJZY4IK5YOqHXfk8nKE/KYsmTBe0jMLM1ZnaXmf2Hmb1lmm0+kL+/3czObvc7F5LaOMuU\nJ81pzHwzaegAolOF46StgsDMFgObgDXAGcBlZvbMhm0uAk5396cBfwx8uJ3vFImFxszLbFThOGmr\nacjMfgsYdvc1+fMrAdz9PYVt/hbY5u6fyZ/fBax29/sa9qWmoR6OA+KKRRaG/o3jtZBNQycB9xSe\n/zR/rdU2J7f5vSIi0iHtFgSzLfsbS6EFuWbIbkHY3iNt8/O1h8dyL0Q68XPSTmQJxx8fOic6S30n\nZVVoD++2Khwn7d6hbCdwSuH5KWRX/DNtc3L+WsnQ0BArV64EYGBggMHBQZIkAeqZOdPzC9iG++y3\nb/Y8f3Hen689N0vZlqbz/nynnrebH0mSYJZNtOtEPNDd37+Qz8fHx6OKJ4bntfbwWOLp5+fj4+Ps\n3bsXgImJCWbSbh/BEuAHwIuAe4Fbgcvc/c7CNhcB6939IjM7Dxhx9/Oa7Et9BJHqpd/SSRozL7MR\ny3GyoPMIzOxCYARYDHzU3d9tZq8DcPe/y7epjSzaD7zG3b/TZD8qCCLVS7+lk5QvMhuxHCeaUDYH\naaE5J3QssTBLJ5uYpE75Utapv59eEstx0jeLzomIyNypIGigq5my4eEkdAiRSkIHEJ00TUKHEKEk\ndAAtqWlogcQUiywM/RuXKU/KYskTNQ3NQX2Yo9QoT5rTmPlm0tABRKcKx4kKApF5qsIaMhJeFY4T\nNQ0tkJhiEekWHffxUtOQiEggVWhZ7bmCQOvqdN7QUBo6hCip76Rs2bI0dAjRGR1NQ4fQUk8VBJ24\n6Xyn9rN7d9i86KTNm0NHIDEws5aPffsOttyml8wmTzZvvjX6POmpPoLOxKE2zkbKk+ZiWUMmtJER\n2Lo1S4+NwerVWXrtWrjiinBxhZSm9SahjRtheDhLJ0n2CKFvlpjoTBw66TVSnjSnfMmoIJjZwADk\ni4AGpc7iOUlDBxChNHQAXTebKj+si77K331p6ACiMDJSv/r/5S/TyfTISNi4ptPu/QhEetJsaqcn\nnZSyc+eWLkQTtyuuqF/5m1VjlMxCGxys1wLGxurNQYODwUKakZqGGqjdt0x50tyyZbBvX+gowoux\nPTwmS5fCgQOho1AfgUjHqD28bP16+PKXs/Tdd8Npp2Xpiy+GTZvCxRWLY46Bhx4KHYX6COZEY8PL\nlCfTSUMHEIV167JlFLKlFNLJ9Lp1IaMKq9hHsG+f+ghEekrV2n4ljGK/yRFHxN9vooKgge5HUKY8\nmU4SOoAojI8XT3TJZHpgoH/7CIr9Jo8+mkz2scXab6KCQETaUrz6XbYs/qvfbphaOBJ94aiCoMHQ\nUMroaBI6jKgoT6aTolrB1Kvf/ftTNmxIgHivfrthauGYRn/nNhUEDTZvhtHR0FHERXkiMyme8D/2\nMQ01blSFeYUqCEqS0AFEKAkdQKSS0AFEZ9GiJHQIUSjWkvbti7+PQPMISnFo/ZhGypPmBgeztmCp\nO/102LEjdBRxSZI4+k1mmkegGkFJSr9d6bVeE+dvMPvTlvuJoSDvphUrUvrtWGmmOMnuhz9MJ0eZ\n9fMku2KNYGws/n4TFQTS8gS+bFnKvn39dZKfjTVrQkcQh2LH6FFHxXH1G1rxhH/LLfH3m6ggaDA8\nnIQOIQpTR4LE38YZwuBgEjqEKBSPlQMHdKyUJaEDaEkFQYPYS+5u2bKlvn4M1EcNPfCA/rhrRkeV\nF9Larl2hI2hNaw010Lo6mU2bYGIieyxalE6mtYhY3fh4GjqECKWhA4hQGjqAllQjkKaK1f3Dh1F1\nP1fMl+3blS8w9bd/4AOqVcPUDvTt2+v5E2sHugqCBlpXp5kkdACRSkIHEJ3ly5PQIUSh2IGeJEn0\nHeiaRyAtaR5Bc7GMDw9NN6aZWSzHieYRzIHW1ckUbzYCKStXJoBuNlK0dGmKagVTT/if+ER9zLxk\nnv3slNiPExUEDbSujsyW5hFkijWCH/5Q/SaNqnCDHjUNleJQM0gj5YnMlpbdKEvTOApENQ3JnBVH\nPUD8ox4kHI2kmlksBcFMVCMoxZHinoQOIypHHply8GASOozopGmqUWYNTj89ZceOJHQYUYml31E1\nAmnLIk07lFlatix0BHEo1pI2b4aVK7N0rLUk1QgarFkD110XOorwNCRQZkvHysw0fLSCqrAuiEhM\nGk/4mllcPX1XELRee/8LmL2s5X5iqL10T0rs46BDUB9B2cREio4V3Y8ges1O4MURMmNjKatXZ9v0\n8wgZrR8j8zE4GDqCOBT/fiYm4v/76buCoJnBQdi7N0uPjSWT/4A6qDOnnpqEDiFKqg2U6R4NZbVZ\n+TFTQUA2AabYmVNLDwzEWY3rBo0Nl/mowpj5bqtCfqggKElRG2dj1VbrxzSjPoIy9RE0kxJ7nqgg\nYOqSsYsWxTHUS6QqqjZmXsrmPY/AzJYDnwFOAyaAV7j73ibbTQAPAoeAR9393Gn2F8U8gqOOgkce\nCR1FXFTdl9nasCH+jtF+NdM8gnbmjF4J3OjuTwf+OX/ejAOJu589XSEQ2shI/erlwIF6emQkbFyx\nUCEg0tvaKQguATbn6c3A2hm2bTV4P6grrqhXb5cuTSfT/Tp0tNH69WnoEKKk+1uXDQykoUOIzshI\nGjqEltopCE509/vy9H3AidNs58DXzOw2M3ttG9/XFQcPho4gPjfdFDoCqQoNuS6rwrLcM3YWm9mN\nwIomb/1F8Ym7u5lN18B/vrv/zMxOAG40s7vc/RvzC3dhFDu73BMNlWwwMJCEDiFKGjFUpjwpq/w8\nAnd/8XTvmdl9ZrbC3XeZ2ZOAn0+zj5/l/7/fzD4PnAs0LQiGhoZYmQ85
GBgYYHBwcPLAqlXDF+L5\nli2wZUuaR5EwOgoHDqRs374w31eF5+vXp9x0U1YIjI3B4GD2/tBQkjelxRWvnsfxHLIJmbHEE+r5\nyEjK+HhWCGzcWBtWm/39dCt/xsfH2ZvPlJ2YmGAm7Ywaeh/wC3d/r5ldCQy4+5UN2xwNLHb3h8zs\n8cANwEZ3v6HJ/oKNGpq6emLK8HACqEZQMziYMj6ehA4jOqnmEZTEsvZ+TGLJk4VaffQ9wDVm9kfk\nw0fzL3sy8BF3fylZs9Ln8oXelgCfbFYIhFY84W/cqOFvItJf5l0QuPtu4HeavH4v8NI8/SMg+u6j\nYo0A1EfQaGgoCR1ClFQbyEydUJZoQlmDKvz9aGaxtKRhtDKTxhO+atRTVaEwbGf4aI9KQwcQnXpH\noBQpX8pqnaJSV4XjRDUCpl7RXHWVrmhE5kvzCKpJ9yxG91wVkd63UGsNSZ+oQM1WIqFjpawKeaKC\noCQNHUB0RkfT0CFEqQptv92mY6WsCnmiPgKmNgF95CPqIxCRztm1K3QErakgaHDCCUnoEKKgseGt\naR5BRsdKWTFPrr8+/rlJ6ixuMDQEo6Oho4iLbjYis6VjpSxbWyh0FAu3xETPmHpFk06uFhhr6d1t\nug9tc1rHd2cOAAAHL0lEQVRrqEzHSqZ4Thkbq9/zO9ZzigoCpv7jTEzoiqaRxobLbOlYyRTPKbfc\nEv85RU1DDVS1FZFOiuWconkEcxBjtU1EqqsK5xQVBCVp6ACio/HyzSlfypQnzaShA2hJBYG0VIV7\nrorI/KkgKElCBxCdvXuT0CFESSOGypQnZVXIExUEDTSHQET6jYaPNhgfT1GtoHwf51qexDoOOgTN\nIyhTnpRVIU9UEDD1pLd9O9FPB+8Gza0Q6R9qGipJQgcQndpMa5kq9qu8EJQnZVXIE00oa7BsGezb\nFzqKuKRp/9aMZG50rMRLE8paSNP67L/9+9PJtIZE16ShA4iSxsyXVWHt/W6rwnGigkBEpM+ps7gk\nCR1AdKrQxhmC8iWj+xHMrArHifoIGgwOaiatyHzFssCalKmPYA5WrEhDhxCdKrRxhqB8KcvuRyBF\nVThOVBA0WLMmdAQi1aX7EVSTmoZERPqAmoZERGRaKggaVKE9r9uUJ80pX8qUJ2VVyBMVBCIifU59\nBA00RV5EepH6COagArU4EZGOUkHQQOOgy6rQxhmC8qVMeVJWhTzREhM0TpFHU+RFpK+oj6CBpsiL\nSC9SH4GIdEUFWkGkCRUEDQYG0tAhRKcKbZwhKF/KdD+CsiocJyoIGmitFBHpN+ojEJG2FAdbbNwI\nw8NZWoMt4jJTH4FGDYlIWxpP+BpsUT1qGmpQhfa8blOeNKd8KdM8nLIqHCcqCESkY9THVk3qIxAR\n6QOaRyAiItOad0FgZi83s383s0Nm9pwZtltjZneZ2X+Y2Vvm+33dUoX2vG5TnjSnfClTnpRVIU/a\nqRHcAVwK/Mt0G5jZYmATsAY4A7jMzJ7ZxncuuPHx8dAhREd50pzypUx5UlaFPJl3QeDud7n7/2ux\n2bnADnefcPdHgU8DL5vvd3bDVVftDR1CdPbuVZ40o3wpU56UVSFPFrqP4CTgnsLzn+avReuBB0JH\nICLSXTNOKDOzG4EVTd56m7t/aRb7r9wwIPeJ0CFEZ2JiInQIUVK+lClPyqqQJ20PHzWzbcCb3f07\nTd47D9jg7mvy528FDrv7e5tsW7lCQ0SkShZ6iYmmOwduA55mZiuBe4FXApc123C6AEVEZGG1M3z0\nUjO7BzgPuNbMvpq//mQzuxbA3R8D1gPXA98HPuPud7YftoiIdEo0M4tFRCQMzSzOmdnHzOw+M7sj\ndCyxMLNTzGxbPnHwe2b2htAxhWZmS83sm2Y2bmbfN7N3h44pFma22MxuN7PZDCTpeWY2YWbfzfPk\n1tDxzEQ1gpyZvQDYB/yju58ZOp4YmNkKYIW7j5vZMuDbwNp+b94zs6Pd/WEzWwLcBPyZu98UOq7Q\nzOxNwHOBY9z9ktDxhGZmPwae6+67Q8fSimoEOXf/BrAndBwxcfdd7j6ep/cBdwJPDhtVeO7+cJ48\nAlgMRP+HvtDM7GTgIuDvmX7wSD+qRF6oIJBZyUd+nQ18M2wk4ZnZIjMbB+4Dtrn790PHFIG/Bv4c\nOBw6kIg48DUzu83MXhs6mJmoIJCW8mahLcAb85pBX3P3w+4+CJwM/LaZJYFDCsrMLgZ+7u63U5Er\n4C45393PBi4EXp83P0dJBYHMyMweB/wT8Al33xo6npi4+y+Ba4FzQscS2POBS/I28auBF5rZPwaO\nKTh3/1n+//uBz5OtvRYlFQQyLTMz4KPA9919JHQ8MTCzJ5rZQJ4+CngxcHvYqMJy97e5+ynu/hTg\n94Cvu/urQ8cVkpkdbWbH5OnHAy8hW7E5SioIcmZ2NfCvwNPN7B4ze03omCJwPvAHwAX5ELjbzWxN\n6KACexLw9byP4JvAl9z9nwPHFBsNRYQTgW8UjpMvu/sNgWOaloaPioj0OdUIRET6nAoCEZE+p4JA\nRKTPqSAQEelzKghERPqcCgIRkT6ngkBkBmZ2RT5xTKRnaR6ByAzyZRPOcfdfzOEzi9xdi69JZXTq\nnsUilZcvBXANcBLZ8tKfJVt2e5uZ3e/uLzKzD5OtLXQUsMXdN+SfnQA+TbbkxPvM7ETgdcBjZEt0\nNL1Xt0gMVBCI1K0Bdrr7SwHM7FjgNUBSuLnI29x9j5ktJlti+Nnu/j2yZRUecPfn5p/dCax090fz\n/YhES30EInXfBV5sZu8xs1Xu/mCTbV5pZt8GvgM8Czij8N5nGvb1KTP7feDQgkUs0gEqCERy7v4f\nZDffuQN4l5m9s/i+mT0FeDPwQnc/i2wJ6qWFTfYX0i8FPgQ8B/hWXoMQiZIKApGcmT0JOODunwT+\niqxQeBCoNe0cS3ayfzDvA7hwmv0YcKq7p8CVwHHA4xc2epH5Ux+BSN2ZwF+a2WHgV8CfkN105Toz\n25l3Ft8O3AXcQ3bj+mYWAx83s+PI7tj1/mmamUSioOGjIiJ9Tk1DIiJ9TgWBiEifU0EgItLnVBCI\niPQ5FQQiIn1OBYGISJ9TQSAi0udUEIiI9Ln/D6Qmug9VL5qWAAAAAElFTkSuQmCC\n", "text": [ "" ] } ], "prompt_number": 44 }, { "cell_type": "code", "collapsed": false, "input": [ "# reviews with most positive sentiment\n", "yelp[yelp.sentiment == 1].text.head()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 45, "text": [ "254 Our server Gary was awesome. 
Food was amazing....\n", "347 3 syllables for this place. \\r\\nA-MAZ-ING!\\r\\n...\n", "420 LOVE the food!!!!\n", "459 Love it!!! Wish we still lived in Arizona as C...\n", "679 Excellent burger\n", "Name: text, dtype: object" ] } ], "prompt_number": 45 }, { "cell_type": "code", "collapsed": false, "input": [ "# reviews with most negative sentiment\n", "yelp[yelp.sentiment == -1].text.head()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 46, "text": [ "773 This was absolutely horrible. I got the suprem...\n", "1517 Nasty workers and over priced trash\n", "3266 Absolutely awful... these guys have NO idea wh...\n", "4766 Very bad food!\n", "5812 I wouldn't send my worst enemy to this place.\n", "Name: text, dtype: object" ] } ], "prompt_number": 46 }, { "cell_type": "code", "collapsed": false, "input": [ "# widen the column display\n", "pd.set_option('max_colwidth', 500)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 47 }, { "cell_type": "code", "collapsed": false, "input": [ "# negative sentiment in a 5-star review\n", "yelp[(yelp.stars == 5) & (yelp.sentiment < -0.3)].head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
business_iddatereview_idstarstexttypeuser_idcoolusefulfunnylengthsentiment
390106JT5p8e8Chtd0CZpcARw2009-08-06KowGVoP_gygzdSu6Mt3zKQ5RIP AZ Coffee Connection. :( I stopped by two days ago unaware that they had closed. I am severely bummed. This place is irreplaceable! Damn you, Starbucks and McDonalds!reviewjKeaOrPyJ-dI9SNeVqrbww100175-0.302083
128757-dgZzOnLox6eudArRKgw2008-08-28sksXE8krD3WvqSOhtlSUyQ5Obsessed. Like, I've-got-the-Twangy-Tart-withdrawal-shakes level of addiction to this place. Please make one in Arcadia! Pleeeaaassse.reviewgEnU4BqTK-4abqYl_Ljjfg335134-0.625000
3075PwtYeGu-19v9bU4nbP9UbA2011-12-058yfOlQGxQlCgQL9TnnzQkw5Unfortunately Out of Business.review0fOPM1H03gF5EJooYvkL1Q02030-0.500000
3516Bc4DoKgrKCtCuN-0O5He3A2009-12-19-qqrl4101KbQKIdar1lMRw5Cashew brittle, almond brittle, bacon brittle! Go now, before it's too late!reviewwHg1YkCzdZq9WBJOTRgxHQ98677-0.375000
6726FURgKkRFtMK5yKbjYZVVwA2012-08-138xx8i94sKvBhWZv8ZVyfBA5Brown bag chicken sammich, mac n cheese, fried okra, and the bourbon drink. Nuff said.reviewhFP7Si9jvdOUmmMesg4ghw00087-0.600000
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 48, "text": [ " business_id date review_id stars \\\n", "390 106JT5p8e8Chtd0CZpcARw 2009-08-06 KowGVoP_gygzdSu6Mt3zKQ 5 \n", "1287 57-dgZzOnLox6eudArRKgw 2008-08-28 sksXE8krD3WvqSOhtlSUyQ 5 \n", "3075 PwtYeGu-19v9bU4nbP9UbA 2011-12-05 8yfOlQGxQlCgQL9TnnzQkw 5 \n", "3516 Bc4DoKgrKCtCuN-0O5He3A 2009-12-19 -qqrl4101KbQKIdar1lMRw 5 \n", "6726 FURgKkRFtMK5yKbjYZVVwA 2012-08-13 8xx8i94sKvBhWZv8ZVyfBA 5 \n", "\n", " text \\\n", "390 RIP AZ Coffee Connection. :( I stopped by two days ago unaware that they had closed. I am severely bummed. This place is irreplaceable! Damn you, Starbucks and McDonalds! \n", "1287 Obsessed. Like, I've-got-the-Twangy-Tart-withdrawal-shakes level of addiction to this place. Please make one in Arcadia! Pleeeaaassse. \n", "3075 Unfortunately Out of Business. \n", "3516 Cashew brittle, almond brittle, bacon brittle! Go now, before it's too late! \n", "6726 Brown bag chicken sammich, mac n cheese, fried okra, and the bourbon drink. Nuff said. \n", "\n", " type user_id cool useful funny length sentiment \n", "390 review jKeaOrPyJ-dI9SNeVqrbww 1 0 0 175 -0.302083 \n", "1287 review gEnU4BqTK-4abqYl_Ljjfg 3 3 5 134 -0.625000 \n", "3075 review 0fOPM1H03gF5EJooYvkL1Q 0 2 0 30 -0.500000 \n", "3516 review wHg1YkCzdZq9WBJOTRgxHQ 9 8 6 77 -0.375000 \n", "6726 review hFP7Si9jvdOUmmMesg4ghw 0 0 0 87 -0.600000 " ] } ], "prompt_number": 48 }, { "cell_type": "code", "collapsed": false, "input": [ "# positive sentiment in a 1-star review\n", "yelp[(yelp.stars == 1) & (yelp.sentiment > 0.5)].head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
business_iddatereview_idstarstexttypeuser_idcoolusefulfunnylengthsentiment
178153YGfwmbW73JhFiemNeyzQ2012-06-22Gi-4O3EhE175vujbFGDIew1If you like the stuck up Scottsdale vibe this is a good place for you. The food isn't impressive. Nice outdoor seating.reviewHqgx3IdJAAaoQjvrUnbNvw0121190.766667
23533Srfy_VeCgwDbo4iyUFOtw2006-08-23K8tXedC2NMBEZ8p77zg23Q1My co-workers and I refer to this place as \"Pizza n' Ants\". The staff will be happy to serve you with bare hands, right after using the till. Also, as the nickname suggests, there has been a noticable insect problem. \\r\\n\\r\\n\\r\\n\\r\\nAs if that could all be overlooked, the pizza isn't even good. If you are in this part of town, go to Z Pizza or Slices for great pizza instead!reviewrPGZttaVjRoVi3GYbs62cg0103720.567143
5257cXx-fHY11Se8rFHkkUeaUg2009-10-272yHyr0N_XNZggmIfZ7JaHw1Remember how I said that the Trivia was the best thing about this place? Well, they got rid of long time Triva host, Dave (who had been featured in the College Times and was the best thing about the trivia). Without Dave's personality, this place just doesn't cut it. Will never go here again. Bummer.reviewnx2PS25Qe3MCEFUdO_XOtw2403040.650000
6222fDZzCjlxaA4OOmnFO-i0vw2012-07-09F5aRE4oqmHthiHudmnShLQ1My mother always told me, if I didn't have anything nice to say, say nothing!reviewJ92bzxYVmyoLHULzh9xNCA121770.750000
670277oW-QeIXbUoTbUbrdD2aA2012-01-05oVYk9Gxa3TY63FAeoeCEzg1Most livable city my eye!\\r\\nPlastic yuppies around every corner looking for a reason to belong. I can't wait for the homosexuals to take control of this dog park and give it some class.\\r\\n\\r\\nAvoid at all cost.reviewek4GWXatDshMorJwGC2JAw1242070.625000
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 49, "text": [ " business_id date review_id stars \\\n", "1781 53YGfwmbW73JhFiemNeyzQ 2012-06-22 Gi-4O3EhE175vujbFGDIew 1 \n", "2353 3Srfy_VeCgwDbo4iyUFOtw 2006-08-23 K8tXedC2NMBEZ8p77zg23Q 1 \n", "5257 cXx-fHY11Se8rFHkkUeaUg 2009-10-27 2yHyr0N_XNZggmIfZ7JaHw 1 \n", "6222 fDZzCjlxaA4OOmnFO-i0vw 2012-07-09 F5aRE4oqmHthiHudmnShLQ 1 \n", "6702 77oW-QeIXbUoTbUbrdD2aA 2012-01-05 oVYk9Gxa3TY63FAeoeCEzg 1 \n", "\n", " text \\\n", "1781 If you like the stuck up Scottsdale vibe this is a good place for you. The food isn't impressive. Nice outdoor seating. \n", "2353 My co-workers and I refer to this place as \"Pizza n' Ants\". The staff will be happy to serve you with bare hands, right after using the till. Also, as the nickname suggests, there has been a noticable insect problem. \\r\\n\\r\\n\\r\\n\\r\\nAs if that could all be overlooked, the pizza isn't even good. If you are in this part of town, go to Z Pizza or Slices for great pizza instead! \n", "5257 Remember how I said that the Trivia was the best thing about this place? Well, they got rid of long time Triva host, Dave (who had been featured in the College Times and was the best thing about the trivia). Without Dave's personality, this place just doesn't cut it. Will never go here again. Bummer. \n", "6222 My mother always told me, if I didn't have anything nice to say, say nothing! \n", "6702 Most livable city my eye!\\r\\nPlastic yuppies around every corner looking for a reason to belong. I can't wait for the homosexuals to take control of this dog park and give it some class.\\r\\n\\r\\nAvoid at all cost. \n", "\n", " type user_id cool useful funny length sentiment \n", "1781 review Hqgx3IdJAAaoQjvrUnbNvw 0 1 2 119 0.766667 \n", "2353 review rPGZttaVjRoVi3GYbs62cg 0 1 0 372 0.567143 \n", "5257 review nx2PS25Qe3MCEFUdO_XOtw 2 4 0 304 0.650000 \n", "6222 review J92bzxYVmyoLHULzh9xNCA 1 2 1 77 0.750000 \n", "6702 review ek4GWXatDshMorJwGC2JAw 1 2 4 207 0.625000 " ] } ], "prompt_number": 49 }, { "cell_type": "code", "collapsed": false, "input": [ "# reset the column display width\n", "pd.reset_option('max_colwidth')" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 50 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 11: Adding Features to a Document-Term Matrix" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# create a new DataFrame that only contains the 5-star and 1-star reviews\n", "yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]\n", "\n", "# split the new DataFrame into training and testing sets\n", "feature_cols = ['text', 'sentiment', 'cool', 'useful', 'funny']\n", "X = yelp_best_worst[feature_cols]\n", "y = yelp_best_worst.stars\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 51 }, { "cell_type": "code", "collapsed": false, "input": [ "# use CountVectorizer with text column only\n", "vect = CountVectorizer()\n", "train_dtm = vect.fit_transform(X_train[:, 0])\n", "test_dtm = vect.transform(X_test[:, 0])\n", "print train_dtm.shape\n", "print test_dtm.shape" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "(3064, 16825)\n", "(1022, 16825)\n" ] } ], "prompt_number": 52 }, { "cell_type": "code", "collapsed": false, "input": [ "# shape of other four feature columns\n", "X_train[:, 1:].shape" ], "language": "python", "metadata": {}, "outputs": [ { 
"metadata": {}, "output_type": "pyout", "prompt_number": 53, "text": [ "(3064L, 4L)" ] } ], "prompt_number": 53 }, { "cell_type": "code", "collapsed": false, "input": [ "# cast other feature columns to float and convert to a sparse matrix\n", "extra = sp.sparse.csr_matrix(X_train[:, 1:].astype(float))\n", "extra.shape" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 54, "text": [ "(3064, 4)" ] } ], "prompt_number": 54 }, { "cell_type": "code", "collapsed": false, "input": [ "# combine sparse matrices\n", "train_dtm_extra = sp.sparse.hstack((train_dtm, extra))\n", "train_dtm_extra.shape" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 55, "text": [ "(3064, 16829)" ] } ], "prompt_number": 55 }, { "cell_type": "code", "collapsed": false, "input": [ "# repeat for testing set\n", "extra = sp.sparse.csr_matrix(X_test[:, 1:].astype(float))\n", "test_dtm_extra = sp.sparse.hstack((test_dtm, extra))\n", "test_dtm_extra.shape" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 56, "text": [ "(1022, 16829)" ] } ], "prompt_number": 56 }, { "cell_type": "code", "collapsed": false, "input": [ "# use logistic regression with text column only\n", "logreg = LogisticRegression(C=1e9)\n", "logreg.fit(train_dtm, y_train)\n", "y_pred_class = logreg.predict(test_dtm)\n", "print metrics.accuracy_score(y_test, y_pred_class)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0.917808219178\n" ] } ], "prompt_number": 57 }, { "cell_type": "code", "collapsed": false, "input": [ "# use logistic regression with all features\n", "logreg = LogisticRegression(C=1e9)\n", "logreg.fit(train_dtm_extra, y_train)\n", "y_pred_class = logreg.predict(test_dtm_extra)\n", "print metrics.accuracy_score(y_test, y_pred_class)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0.922700587084\n" ] } ], "prompt_number": 58 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 12: Fun TextBlob Features" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# spelling correction\n", "TextBlob('15 minuets late').correct()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 59, "text": [ "TextBlob(\"15 minutes late\")" ] } ], "prompt_number": 59 }, { "cell_type": "code", "collapsed": false, "input": [ "# spellcheck\n", "Word('parot').spellcheck()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 60, "text": [ "[('part', 0.9929478138222849), (u'parrot', 0.007052186177715092)]" ] } ], "prompt_number": 60 }, { "cell_type": "code", "collapsed": false, "input": [ "# definitions\n", "Word('bank').define('v')" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 61, "text": [ "[u'tip laterally',\n", " u'enclose with a bank',\n", " u'do business with a bank or keep an account at a bank',\n", " u'act as the banker in a game or in gambling',\n", " u'be in the banking business',\n", " u'put into a bank account',\n", " u'cover with ashes so to control the rate of burning',\n", " u'have confidence or faith in']" ] } ], "prompt_number": 61 }, { "cell_type": "code", "collapsed": false, "input": [ "# language identification\n", "TextBlob('Hola 
{ "cell_type": "code", "collapsed": false, "input": [ "# language identification\n", "TextBlob('Hola amigos').detect_language()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 62, "text": [ "u'es'" ] } ], "prompt_number": 62 } ], "metadata": {} } ] }