{
 "metadata": {
  "name": "",
  "signature": "sha256:217a675665d9d52af77e2656091e39dd1b522040e8fd3c0e64135e07865179bf"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Natural Language Processing (NLP)"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "What is NLP?\n",
      "* 'Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human\u2013computer interaction. Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input...' - Wikipedia\n",
      "* Using computers to process (analyze, understand, generate) natural human languages (as opposed to unnatural computer languages).\n",
      "* NLP is concerned with the interface between human and computer language.  However, language is often ambiguous, so this isn't always a straightforward task.\n",
      "\n",
      "Why NLP?\n",
      "* Most knowledge created by humans is unstructured text, so we need some way to make sense of it.\n",
      "* Enables quantitative analysis of text data at large scale.\n",
      "* Provides a repeatable, \"unbiased\" way to look at text."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Motivation"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "For the purpose of this class, we can pretend we are Data Scientists working for Kaggle.  Everyone loves Kaggle, but we're interested in digging a little deeper into what the web is saying about Kaggle."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Imports"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import tweepy # Twitter API wrapper\n",
      "import nltk # Classic NLP package"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Getting Data"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Variables that contain the user credentials for the Twitter API\n",
      "consumer_key = \"Consumer_Key\" # Replace with your own consumer_key\n",
      "consumer_secret = \"Consumer_Secret\" # Replace with your own consumer_secret\n",
      "\n",
      "# Create authorization for API\n",
      "auth = tweepy.auth.OAuthHandler(consumer_key, consumer_secret)\n",
      "#auth.set_access_token(access_token, access_token_secret)\n",
      "\n",
      "# Initialize API object by passing it your credentials\n",
      "api = tweepy.API(auth)\n",
      "\n",
      "# Use the api to search\n",
      "tweets = api.search(q=\"kaggle\", count=10, result_type=\"recent\")\n",
      "print tweets[0]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Status(contributors=None, truncated=False, text=u'37 spots up. Is anybody else even trying? #kaggle - https://t.co/9GLyaK3OhT', in_reply_to_status_id=None, id=595283912592588800L, favorite_count=0, _api=<tweepy.api.API object at 0x000000002631D080>, author=User(follow_request_sent=None, profile_use_background_image=True, _json={u'follow_request_sent': None, u'profile_use_background_image': True, u'default_profile_image': False, u'id': 88491546, u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme1/bg.png', u'verified': False, u'profile_text_color': u'333333', u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/2112597225/421667_3138983843457_1530736888_3009828_102270204_n_normal.jpg', u'profile_sidebar_fill_color': u'DDEEF6', u'entities': {u'description': {u'urls': []}}, u'followers_count': 14, u'profile_sidebar_border_color': u'C0DEED', u'id_str': u'88491546', u'profile_background_color': u'C0DEED', u'listed_count': 0, u'is_translation_enabled': False, u'utc_offset': -7200, u'statuses_count': 10, u'description': u'', u'friends_count': 66, u'location': u'', u'profile_link_color': u'0084B4', u'profile_image_url': u'http://pbs.twimg.com/profile_images/2112597225/421667_3138983843457_1530736888_3009828_102270204_n_normal.jpg', u'following': None, u'geo_enabled': True, u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme1/bg.png', u'screen_name': u'KidOfSahel', u'lang': u'fr', u'profile_background_tile': False, u'favourites_count': 4, u'name': u'Mamadou Diaby', u'notifications': None, u'url': None, u'created_at': u'Sun Nov 08 19:39:55 +0000 2009', u'contributors_enabled': False, u'time_zone': u'Greenland', u'protected': False, u'default_profile': True, u'is_translator': False}, time_zone=u'Greenland', id=88491546, _api=<tweepy.api.API object at 0x000000002631D080>, verified=False, profile_text_color=u'333333', 
profile_image_url_https=u'https://pbs.twimg.com/profile_images/2112597225/421667_3138983843457_1530736888_3009828_102270204_n_normal.jpg', profile_sidebar_fill_color=u'DDEEF6', is_translator=False, geo_enabled=True, entities={u'description': {u'urls': []}}, followers_count=14, protected=False, id_str=u'88491546', default_profile_image=False, listed_count=0, lang=u'fr', utc_offset=-7200, statuses_count=10, description=u'', friends_count=66, profile_link_color=u'0084B4', profile_image_url=u'http://pbs.twimg.com/profile_images/2112597225/421667_3138983843457_1530736888_3009828_102270204_n_normal.jpg', notifications=None, profile_background_image_url_https=u'https://abs.twimg.com/images/themes/theme1/bg.png', profile_background_color=u'C0DEED', profile_background_image_url=u'http://abs.twimg.com/images/themes/theme1/bg.png', name=u'Mamadou Diaby', is_translation_enabled=False, profile_background_tile=False, favourites_count=4, screen_name=u'KidOfSahel', url=None, created_at=datetime.datetime(2009, 11, 8, 19, 39, 55), contributors_enabled=False, location=u'', profile_sidebar_border_color=u'C0DEED', default_profile=True, following=False), _json={u'contributors': None, u'truncated': False, u'text': u'37 spots up. Is anybody else even trying? 
#kaggle - https://t.co/9GLyaK3OhT', u'in_reply_to_status_id': None, u'id': 595283912592588800L, u'favorite_count': 0, u'source': u'<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>', u'retweeted': False, u'coordinates': None, u'entities': {u'symbols': [], u'user_mentions': [], u'hashtags': [{u'indices': [42, 49], u'text': u'kaggle'}], u'urls': [{u'url': u'https://t.co/9GLyaK3OhT', u'indices': [52, 75], u'expanded_url': u'https://www.kaggle.com/c/facebook-recruiting-iv-human-or-bot', u'display_url': u'kaggle.com/c/facebook-rec\\u2026'}]}, u'in_reply_to_screen_name': None, u'in_reply_to_user_id': None, u'retweet_count': 0, u'id_str': u'595283912592588800', u'favorited': False, u'user': {u'follow_request_sent': None, u'profile_use_background_image': True, u'default_profile_image': False, u'id': 88491546, u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme1/bg.png', u'verified': False, u'profile_text_color': u'333333', u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/2112597225/421667_3138983843457_1530736888_3009828_102270204_n_normal.jpg', u'profile_sidebar_fill_color': u'DDEEF6', u'entities': {u'description': {u'urls': []}}, u'followers_count': 14, u'profile_sidebar_border_color': u'C0DEED', u'id_str': u'88491546', u'profile_background_color': u'C0DEED', u'listed_count': 0, u'is_translation_enabled': False, u'utc_offset': -7200, u'statuses_count': 10, u'description': u'', u'friends_count': 66, u'location': u'', u'profile_link_color': u'0084B4', u'profile_image_url': u'http://pbs.twimg.com/profile_images/2112597225/421667_3138983843457_1530736888_3009828_102270204_n_normal.jpg', u'following': None, u'geo_enabled': True, u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme1/bg.png', u'screen_name': u'KidOfSahel', u'lang': u'fr', u'profile_background_tile': False, u'favourites_count': 4, u'name': u'Mamadou Diaby', u'notifications': None, u'url': None, u'created_at': u'Sun Nov 
08 19:39:55 +0000 2009', u'contributors_enabled': False, u'time_zone': u'Greenland', u'protected': False, u'default_profile': True, u'is_translator': False}, u'geo': None, u'in_reply_to_user_id_str': None, u'possibly_sensitive': False, u'lang': u'en', u'created_at': u'Mon May 04 17:48:39 +0000 2015', u'in_reply_to_status_id_str': None, u'place': None, u'metadata': {u'iso_language_code': u'en', u'result_type': u'recent'}}, coordinates=None, entities={u'symbols': [], u'user_mentions': [], u'hashtags': [{u'indices': [42, 49], u'text': u'kaggle'}], u'urls': [{u'url': u'https://t.co/9GLyaK3OhT', u'indices': [52, 75], u'expanded_url': u'https://www.kaggle.com/c/facebook-recruiting-iv-human-or-bot', u'display_url': u'kaggle.com/c/facebook-rec\\u2026'}]}, in_reply_to_screen_name=None, in_reply_to_user_id=None, retweet_count=0, id_str=u'595283912592588800', favorited=False, source_url=u'http://twitter.com', user=User(follow_request_sent=None, profile_use_background_image=True, _json={u'follow_request_sent': None, u'profile_use_background_image': True, u'default_profile_image': False, u'id': 88491546, u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme1/bg.png', u'verified': False, u'profile_text_color': u'333333', u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/2112597225/421667_3138983843457_1530736888_3009828_102270204_n_normal.jpg', u'profile_sidebar_fill_color': u'DDEEF6', u'entities': {u'description': {u'urls': []}}, u'followers_count': 14, u'profile_sidebar_border_color': u'C0DEED', u'id_str': u'88491546', u'profile_background_color': u'C0DEED', u'listed_count': 0, u'is_translation_enabled': False, u'utc_offset': -7200, u'statuses_count': 10, u'description': u'', u'friends_count': 66, u'location': u'', u'profile_link_color': u'0084B4', u'profile_image_url': u'http://pbs.twimg.com/profile_images/2112597225/421667_3138983843457_1530736888_3009828_102270204_n_normal.jpg', u'following': None, u'geo_enabled': True, 
u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme1/bg.png', u'screen_name': u'KidOfSahel', u'lang': u'fr', u'profile_background_tile': False, u'favourites_count': 4, u'name': u'Mamadou Diaby', u'notifications': None, u'url': None, u'created_at': u'Sun Nov 08 19:39:55 +0000 2009', u'contributors_enabled': False, u'time_zone': u'Greenland', u'protected': False, u'default_profile': True, u'is_translator': False}, time_zone=u'Greenland', id=88491546, _api=<tweepy.api.API object at 0x000000002631D080>, verified=False, profile_text_color=u'333333', profile_image_url_https=u'https://pbs.twimg.com/profile_images/2112597225/421667_3138983843457_1530736888_3009828_102270204_n_normal.jpg', profile_sidebar_fill_color=u'DDEEF6', is_translator=False, geo_enabled=True, entities={u'description': {u'urls': []}}, followers_count=14, protected=False, id_str=u'88491546', default_profile_image=False, listed_count=0, lang=u'fr', utc_offset=-7200, statuses_count=10, description=u'', friends_count=66, profile_link_color=u'0084B4', profile_image_url=u'http://pbs.twimg.com/profile_images/2112597225/421667_3138983843457_1530736888_3009828_102270204_n_normal.jpg', notifications=None, profile_background_image_url_https=u'https://abs.twimg.com/images/themes/theme1/bg.png', profile_background_color=u'C0DEED', profile_background_image_url=u'http://abs.twimg.com/images/themes/theme1/bg.png', name=u'Mamadou Diaby', is_translation_enabled=False, profile_background_tile=False, favourites_count=4, screen_name=u'KidOfSahel', url=None, created_at=datetime.datetime(2009, 11, 8, 19, 39, 55), contributors_enabled=False, location=u'', profile_sidebar_border_color=u'C0DEED', default_profile=True, following=False), geo=None, in_reply_to_user_id_str=None, possibly_sensitive=False, lang=u'en', created_at=datetime.datetime(2015, 5, 4, 17, 48, 39), in_reply_to_status_id_str=None, place=None, source=u'Twitter Web Client', retweeted=False, metadata={u'iso_language_code': u'en', 
u'result_type': u'recent'})\n"
       ]
      }
     ],
     "prompt_number": 122
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "tweets[0].text"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 123,
       "text": [
        "u'37 spots up. Is anybody else even trying? #kaggle - https://t.co/9GLyaK3OhT'"
       ]
      }
     ],
     "prompt_number": 123
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Some quick vocab:\n",
      "* **corpus** - collection of documents\n",
      "* **corpora** - plural form of corpus\n",
      "\n",
      "Let's build our corpus of tweets."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "tweets_text = []\n",
      "for tweet in tweepy.Cursor(api.search, q='kaggle', result_type='recent').items(1000):\n",
      "    tweets_text.append(tweet.text.encode('ascii','ignore'))\n",
      "print tweets_text[0]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "37 spots up. Is anybody else even trying? #kaggle - https://t.co/9GLyaK3OhT\n"
       ]
      }
     ],
     "prompt_number": 121
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Since I did not provide my credentials above, I saved all of the tweet text to a CSV file.  We can read it back into a list."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Running this will overwrite the current data.\n",
      "'''\n",
      "with open('../data/kaggle_tweets.csv','w') as f:\n",
      "    for tweet in tweets_text:\n",
      "        f.write('\"%s\"\\n' % tweet)\n",
      "'''"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "with open('../data/kaggle_tweets.csv','r') as f:\n",
      "    tweets_text = [tweet.replace('\\n','').replace('\"','') for tweet in f.readlines()]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 2
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "tweets_text[0]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 3,
       "text": [
        "'37 spots up. Is anybody else even trying? #kaggle - https://t.co/9GLyaK3OhT'"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Tokenization"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The first thing we need to do with our corpus of text is to break the documents into smaller units.  This is known as **tokenization**.  There are two natural ways to go about this:  breaking the documents apart into sentences or into words.  This gives more structure to the previously unstructured text.  This also allows us to more easily perform other tasks upon our corpus.\n",
      "\n",
      "**Note**:  Breaking documents and paragraphs into sentences and words is easier in some languages than others.  English has obvious (to us) breaks in the text for sentences and words.  However, this might not be the case for other languages.  In addition, there are nuances in the English language with hyphenated words and phrases that are independent clauses but might be part of a larger sentence.\n",
      "\n",
      "First let's try breaking our tweets down into sentences."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Tokenize into sentences\n",
      "sentences = []\n",
      "for tweet in tweets_text:\n",
      "    for sent in nltk.sent_tokenize(tweet):\n",
      "        sentences.append(sent)\n",
      "sentences[:10]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 4,
       "text": [
        "['37 spots up.',\n",
        " 'Is anybody else even trying?',\n",
        " '#kaggle - https://t.co/9GLyaK3OhT',\n",
        " 'RT @antgoldbloom: Proud to share that over the weekend Kaggle passed a big milestone: over 1MM machine learning models have been submitted ',\n",
        " '@alexjc There was an Elo-based Kaggle comp recently; the forums might have useful info.',\n",
        " 'https://t.co/NUxbAvTTcO',\n",
        " 'RT @benhamner: The most brutal way to learn about overfitting?',\n",
        " 'Watching yourself drop hundreds of places when a @kaggle final leaderboard ',\n",
        " 'RT @benhamner: The most brutal way to learn about overfitting?',\n",
        " 'Watching yourself drop hundreds of places when a @kaggle final leaderboard ']"
       ]
      }
     ],
     "prompt_number": 4
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now, let's break our tweets into individual words, referred to as tokens."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Tokenize into words\n",
      "tokens = []\n",
      "for tweet in tweets_text:\n",
      "    for word in nltk.word_tokenize(tweet):\n",
      "        tokens.append(word)\n",
      "tokens[:10]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 5,
       "text": [
        "['37', 'spots', 'up', '.', 'Is', 'anybody', 'else', 'even', 'trying', '?']"
       ]
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This is really messy though.  Do we care about analyzing punctuation, numbers, and other tokens that aren't words?  We will keep only the tokens that start with a letter, using a regular expression."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Only keep tokens that start with a letter (using regular expressions)\n",
      "import re\n",
      "clean_tokens = [token for token in tokens if re.search('^[a-zA-Z]+', token)]\n",
      "clean_tokens[:20]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 6,
       "text": [
        "['spots',\n",
        " 'up',\n",
        " 'Is',\n",
        " 'anybody',\n",
        " 'else',\n",
        " 'even',\n",
        " 'trying',\n",
        " 'kaggle',\n",
        " 'https',\n",
        " 'RT',\n",
        " 'antgoldbloom',\n",
        " 'Proud',\n",
        " 'to',\n",
        " 'share',\n",
        " 'that',\n",
        " 'over',\n",
        " 'the',\n",
        " 'weekend',\n",
        " 'Kaggle',\n",
        " 'passed']"
       ]
      }
     ],
     "prompt_number": 6
    },
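    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "To make the regular-expression filter above concrete, here is a small sketch on a hand-made token list.  The `sample_tokens` values are made up for illustration and are not taken from the corpus.  The pattern `^[a-zA-Z]+` only checks the start of a token, so numbers and punctuation are dropped, while a token such as `scikit-learn` is kept whole."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import re\n",
      "\n",
      "# Hypothetical tokens for illustration only (not from the tweet corpus)\n",
      "sample_tokens = ['37', 'spots', '.', '#', 'kaggle', 'scikit-learn', '9GLyaK3OhT', '?']\n",
      "\n",
      "# Keep a token only if it starts with at least one ASCII letter\n",
      "starts_with_letter = re.compile(r'^[a-zA-Z]+')\n",
      "kept = [t for t in sample_tokens if starts_with_letter.search(t)]\n",
      "dropped = [t for t in sample_tokens if not starts_with_letter.search(t)]\n",
      "\n",
      "print(kept)\n",
      "print(dropped)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },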
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can now perform the \"hello world\" task of text analysis and get a list of the most popular words."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Count the tokens\n",
      "from collections import Counter\n",
      "c = Counter(clean_tokens)\n",
      "c.most_common(25) # Most frequent tokens"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 7,
       "text": [
        "[('http', 780),\n",
        " ('kaggle', 482),\n",
        " ('RT', 471),\n",
        " ('Kaggle', 364),\n",
        " ('to', 347),\n",
        " ('https', 304),\n",
        " ('a', 236),\n",
        " ('the', 194),\n",
        " ('for', 147),\n",
        " ('I', 142),\n",
        " ('and', 139),\n",
        " ('up', 126),\n",
        " ('of', 120),\n",
        " ('in', 115),\n",
        " ('DataScience', 115),\n",
        " ('on', 111),\n",
        " ('video', 109),\n",
        " ('with', 108),\n",
        " ('MachineLearning', 103),\n",
        " ('BigData', 99),\n",
        " ('is', 88),\n",
        " ('machinelearning', 88),\n",
        " ('R', 86),\n",
        " ('benhamner', 85),\n",
        " ('via', 84)]"
       ]
      }
     ],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "What do you notice about this list of words?  Are there any duplicated words?  What should we do about that?"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Stemming and Lemmatizing (Normalizing)"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Stemming** reduces a word to its base (stem) form.  It often makes sense to treat multiple word forms the same way.  \n",
      "\n",
      "Stemming uses a \"simple\" rule-based approach that runs very quickly.  The output isn't always the best for irregular words.  Stemmed words are not usually shown to users but rather used for analysis/indexing."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Initialize stemmer\n",
      "from nltk.stem.snowball import SnowballStemmer\n",
      "stemmer = SnowballStemmer('english')\n",
      "\n",
      "# Some examples\n",
      "print 'charge:', stemmer.stem('charge')\n",
      "print 'charging:', stemmer.stem('charging')\n",
      "print 'charged:', stemmer.stem('charged')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "charge: charg\n",
        "charging: charg\n",
        "charged: charg\n"
       ]
      }
     ],
     "prompt_number": 8
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's stem all of our tokens and recompute the count of most popular tokens."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Stem the tokens\n",
      "stemmed_tokens = [stemmer.stem(t) for t in clean_tokens]\n",
      "\n",
      "# Count the stemmed tokens\n",
      "c = Counter(stemmed_tokens)\n",
      "c.most_common(25)       # all lowercase"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 9,
       "text": [
        "[(u'kaggl', 848),\n",
        " (u'http', 780),\n",
        " ('rt', 471),\n",
        " ('to', 348),\n",
        " (u'https', 304),\n",
        " ('a', 261),\n",
        " (u'the', 220),\n",
        " (u'machinelearn', 191),\n",
        " (u'competit', 172),\n",
        " (u'for', 148),\n",
        " ('i', 145),\n",
        " (u'and', 142),\n",
        " (u'datasci', 140),\n",
        " ('up', 127),\n",
        " ('of', 120),\n",
        " ('in', 118),\n",
        " (u'bigdata', 113),\n",
        " ('on', 113),\n",
        " (u'predict', 112),\n",
        " (u'video', 111),\n",
        " (u'with', 110),\n",
        " (u'data', 102),\n",
        " ('is', 99),\n",
        " (u'model', 92),\n",
        " (u'facebook', 89)]"
       ]
      }
     ],
     "prompt_number": 9
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "However, some of these are still a bit uninterpretable for humans.  That's where lemmatizing comes in.\n",
      "\n",
      "**Lemmatization**, or normalization, derives the canonical form (i.e. lemma) of a word.  This can be better than stemming in some cases, because it reduces words to a \"normal\" form.  This often uses a dictionary-based approach and can be slower than stemming.  This is the tradeoff for \"better\" results."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Initialize lemmatizer\n",
      "lemmatizer = nltk.WordNetLemmatizer()\n",
      "\n",
      "# Compare stemmer to lemmatizer\n",
      "print 'dogs - stemmed:', stemmer.stem('dogs'), ', lemmatized:', lemmatizer.lemmatize('dogs')\n",
      "\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "dogs - stemmed: dog , lemmatized: "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "dog\n"
       ]
      }
     ],
     "prompt_number": 10
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print 'wolves - stemmed:', stemmer.stem('wolves'), ', lemmatized:', lemmatizer.lemmatize('wolves')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "wolves - stemmed: wolv , lemmatized: wolf\n"
       ]
      }
     ],
     "prompt_number": 11
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's lemmatize our Twitter dataset."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Lemmatize the tokens\n",
      "lemmatized_tokens = [lemmatizer.lemmatize(t).lower() for t in clean_tokens] # I lowercased things too.\n",
      "\n",
      "# Count the lemmatized tokens\n",
      "c = Counter(lemmatized_tokens)\n",
      "c.most_common(25)       # all lowercase"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 12,
       "text": [
        "[(u'http', 1084),\n",
        " ('kaggle', 847),\n",
        " ('rt', 471),\n",
        " ('to', 348),\n",
        " ('a', 264),\n",
        " ('the', 220),\n",
        " ('machinelearning', 191),\n",
        " (u'competition', 168),\n",
        " ('for', 148),\n",
        " ('i', 145),\n",
        " ('and', 142),\n",
        " ('datascience', 140),\n",
        " ('up', 127),\n",
        " ('of', 120),\n",
        " ('in', 118),\n",
        " ('on', 113),\n",
        " ('bigdata', 113),\n",
        " ('video', 111),\n",
        " ('with', 110),\n",
        " ('data', 102),\n",
        " ('is', 99),\n",
        " ('facebook', 89),\n",
        " ('r', 87),\n",
        " ('benhamner', 85),\n",
        " ('via', 84)]"
       ]
      }
     ],
     "prompt_number": 12
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The lemmatizing didn't do much here since most of the popular words don't have significantly different normal forms."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# One more example\n",
      "print 'is - stemmed:', stemmer.stem('is'), ', lemmatized:', lemmatizer.lemmatize('is')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "is - stemmed: is , lemmatized: is\n"
       ]
      }
     ],
     "prompt_number": 13
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This is not what I learned in grammar school.  Why isn't the result \"be\"? \n",
      "\n",
      "The lemmatizer assumes everything is a noun unless explicitly told otherwise."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "lemmatizer.lemmatize('is',pos='v')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 15,
       "text": [
        "u'be'"
       ]
      }
     ],
     "prompt_number": 15
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "nltk.pos_tag(nltk.word_tokenize('Lloyld loves NLP'))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 17,
       "text": [
        "[('Lloyld', 'NNP'), ('loves', 'VBZ'), ('NLP', 'NNP')]"
       ]
      }
     ],
     "prompt_number": 17
    },
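    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The two cells above can be combined.  `nltk.pos_tag` returns Penn Treebank tags, while `lemmatize` expects a WordNet-style `pos` letter (`'n'`, `'v'`, `'a'` or `'r'`).  The helper below is a rough sketch of one common mapping; the name `penn_to_wordnet` is hypothetical and not part of NLTK.  Anything it does not recognize falls back to noun, which matches the lemmatizer's default.  Each resulting pair could then be passed along as `lemmatizer.lemmatize(word, pos=wn_pos)`."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Hypothetical helper: map a Penn Treebank tag to a WordNet pos letter\n",
      "def penn_to_wordnet(tag):\n",
      "    if tag.startswith('J'):\n",
      "        return 'a'  # adjective\n",
      "    if tag.startswith('V'):\n",
      "        return 'v'  # verb\n",
      "    if tag.startswith('R'):\n",
      "        return 'r'  # adverb\n",
      "    return 'n'      # noun, and the fallback for everything else\n",
      "\n",
      "# Tagged tokens copied from the output of the cell above\n",
      "tagged = [('Lloyld', 'NNP'), ('loves', 'VBZ'), ('NLP', 'NNP')]\n",
      "wordnet_tags = [(word, penn_to_wordnet(tag)) for word, tag in tagged]\n",
      "print(wordnet_tags)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },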
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Stopword Removal"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Stopwords** are common words that will most likely appear in any text.  They are \"useless\" words that don't contain much information.  For the purpose of word counts and other word frequencies, they are not particularly useful.  \n",
      "\n",
      "Let's remove the stopwords from our tweets and look at the most popular words."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# View the list of stopwords\n",
      "stopwords = nltk.corpus.stopwords.words('english')\n",
      "print stopwords[0:25]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u'her', u'hers', u'herself', u'it', u'its', u'itself', u'they']\n"
       ]
      }
     ],
     "prompt_number": 18
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now that we have a list of stopwords, we can remove them from our token list."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Stem the stopwords\n",
      "stemmed_stops = [stemmer.stem(t) for t in stopwords]\n",
      "\n",
      "# Remove stopwords from the stemmed tokens (already stemmed, so no second stem() call)\n",
      "stemmed_tokens_no_stop = [t for t in stemmed_tokens if t not in stemmed_stops]\n",
      "c = Counter(stemmed_tokens_no_stop)\n",
      "most_common_stemmed = c.most_common(25)\n",
      "\n",
      "# Remove stopwords from cleaned tokens\n",
      "clean_tokens_no_stop = [t.lower() for t in clean_tokens if t.lower() not in stopwords]\n",
      "c = Counter(clean_tokens_no_stop)\n",
      "most_common_not_stemmed = c.most_common(25)\n",
      "\n",
      "# Compare the most common results for stemmed words and non stemmed words\n",
      "for i in range(25):\n",
      "    text_list = most_common_stemmed[i][0] + '  ' + str(most_common_stemmed[i][1]) + ' '*25\n",
      "    text_list = text_list[0:30]\n",
      "    text_list += most_common_not_stemmed[i][0] + '  ' + str(most_common_not_stemmed[i][1])\n",
      "    print text_list"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "kaggl  848                    kaggle  847\n",
        "http  780                     http  780\n",
        "rt  471                       rt  471\n",
        "https  304                    https  304\n",
        "machinelearn  191             machinelearning  191\n",
        "competit  172                 competition  147\n",
        "datasci  140                  datascience  140\n",
        "bigdata  113                  bigdata  113\n",
        "predict  112                  video  109\n",
        "video  111                    data  102\n",
        "data  102                     facebook  89\n",
        "model  92                     r  87\n",
        "facebook  89                  benhamner  85\n",
        "r  87                         via  84\n",
        "benhamn  85                   new  80\n",
        "via  84                       prediction  77\n",
        "new  80                       challenge  77\n",
        "challeng  77                  scikit-learn  72\n",
        "scikit-learn  72              top  69\n",
        "spot  69                      intro  69\n",
        "top  69                       spots  67\n",
        "intro  69                     model  62\n",
        "learn  68                     kirkdborne  61\n",
        "use  61                       recruiting  58\n",
        "kirkdborn  61                 training  57\n"
       ]
      }
     ],
     "prompt_number": 19
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "These results are a bit more interesting.  You can see the most popular words that occur in the tweets about \"kaggle\".  We could dig further into this and look at the specific tweets for some of the more interesting occurences.  For instance, it seeems that there are more occurences of \"R\" than \"Python\" (which doesn't even show up in our top list).  Does that mean that R is more popular than Python for Kaggle competitions?"
     ]
    },
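    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As a minimal sketch of that kind of follow-up, the helper below pulls out the tweets that contain a given token so they can be read directly.  The function name `tweets_with_token` and the sample tweets are made up for illustration; in this notebook you would pass `tweets_text` instead.  Splitting on a simple regular expression is also an assumption, chosen so that a one-letter token like `r` only matches as a whole word.\n",
      "\n",
      "```python\n",
      "import re\n",
      "\n",
      "def tweets_with_token(tweets, token):\n",
      "    # Return the tweets that contain `token` as a whole word, ignoring case.\n",
      "    # Hypothetical helper, not part of the original analysis.\n",
      "    token = token.lower()\n",
      "    matches = []\n",
      "    for tweet in tweets:\n",
      "        words = re.findall('[a-z0-9]+', tweet.lower())\n",
      "        if token in words:\n",
      "            matches.append(tweet)\n",
      "    return matches\n",
      "\n",
      "# Made-up sample tweets for illustration only\n",
      "sample_tweets = [\n",
      "    'Learning R for my first Kaggle competition',\n",
      "    'Python and scikit-learn tutorial on Kaggle',\n",
      "    'RT great intro to machine learning',\n",
      "]\n",
      "\n",
      "r_tweets = tweets_with_token(sample_tweets, 'r')\n",
      "python_tweets = tweets_with_token(sample_tweets, 'python')\n",
      "print(len(r_tweets))\n",
      "print(len(python_tweets))\n",
      "```\n",
      "\n",
      "Reading the matched tweets, rather than only counting them, is what would tell us whether `r` really refers to the language."
     ]
    },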
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Named Entity Recognition"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Named Entity Recognition (NER)** is the automatic extraction of names, places, organizations, etc.  This can help you identify \"important\" words.  NER classifiers can work in different ways, but the most interesting and relevant ones use some sort of supervised machine learning technique.  There is some sort of tagged dataset that has a model/algorithm fit to it.  With what we've learned in class so far, you could build your own NER classifier!  However, it's often better to use existing classifiers.  \n",
      "\n",
      "First, let's build a NER extraction function that takes in a sentence."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def extract_entities(text):\n",
      "    entities = []\n",
      "    # tokenize into sentences\n",
      "    for sentence in nltk.sent_tokenize(text):\n",
      "        # tokenize sentences into words\n",
      "        # add part-of-speech tags\n",
      "        # use NLTK's NER classifier\n",
      "        chunks = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))\n",
      "        # parse the results\n",
      "        entities.extend([chunk for chunk in chunks if hasattr(chunk, 'label')])\n",
      "    return entities"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 21
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Let's look at all of the words in this dataset and see which named entities are identified.\n",
      "for entity in extract_entities('Kevin and Brandon are instructors for General Assembly in Washington, D.C.'):\n",
      "    print '[' + entity.label() + '] ' + ' '.join(c[0] for c in entity.leaves())"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[PERSON] Kevin\n",
        "[PERSON] Brandon\n",
        "[ORGANIZATION] General Assembly\n",
        "[GPE] Washington\n"
       ]
      }
     ],
     "prompt_number": 22
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This seems to work pretty well!  But how resilient is it?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for entity in extract_entities('kevin and BRANDON are instructors for @GA_DC, DC'):\n",
      "    print '[' + entity.label() + '] ' + ' '.join(c[0] for c in entity.leaves())"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[ORGANIZATION] BRANDON\n",
        "[ORGANIZATION] DC\n"
       ]
      }
     ],
     "prompt_number": 24
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The accuracy decreased dramatically!  There are companies who are working to solve this problem, but as you get into more unstructured, \"wild\" data (like social media), this gets harder to do. \n",
      "\n",
      "We could run this on our entire dataset, but it would take a while, so I'll provide the code, but not actually run it.\n",
      "\n",
      "``` {python}\n",
      "named_entities = []\n",
      "for tweet in tweets_text:\n",
      "    temp_entities = extract_entities(tweet)\n",
      "    for temp_entity in temp_entities:\n",
      "        named_entities.append((temp_entity.label(), temp_entity.leaves()[0][0]))\n",
      "```\n",
      "\n",
      "Let's at least run it on one tweet."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print tweets_text[21]\n",
      "for entity in extract_entities(tweets_text[21]):\n",
      "    print '[' + entity.label() + '] ' + ' '.join(c[0] for c in entity.leaves())"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Data Science Challenge: predict Taxi trajectories. @Gabriellilor, @siffolone  are you in? http://t.co/yO9iCOnTlA\n",
        "[PERSON] Data"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "[ORGANIZATION] Science\n",
        "[GPE] Gabriellilor\n"
       ]
      }
     ],
     "prompt_number": 25
    },
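    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "If the full loop over the tweets were run, the resulting `named_entities` list of `(label, text)` pairs could be summarized with a `Counter`, much like the token counts earlier.  The sketch below assumes that shape and uses made-up pairs, so the counts are illustrative only and are not results from the Kaggle tweets.\n",
      "\n",
      "```python\n",
      "from collections import Counter\n",
      "\n",
      "# Hypothetical (label, text) pairs shaped like the named_entities list\n",
      "named_entities = [\n",
      "    ('PERSON', 'Kevin'),\n",
      "    ('ORGANIZATION', 'Kaggle'),\n",
      "    ('GPE', 'Washington'),\n",
      "    ('ORGANIZATION', 'Kaggle'),\n",
      "    ('PERSON', 'Kevin'),\n",
      "    ('ORGANIZATION', 'Facebook'),\n",
      "]\n",
      "\n",
      "# How many entities of each type were found\n",
      "label_counts = Counter(label for label, text in named_entities)\n",
      "\n",
      "# Which organizations were mentioned most often\n",
      "org_counts = Counter(text for label, text in named_entities if label == 'ORGANIZATION')\n",
      "\n",
      "print(label_counts.most_common())\n",
      "print(org_counts.most_common(2))\n",
      "```\n",
      "\n",
      "Given how noisy the tags are on tweets, counts like these are best treated as a starting point for manual review."
     ]
    },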
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Topic Modeling"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Topic modeling allows us to discover \"topic grougs\" in our dataset.  There are several different versions of this, but we'll talk about one specifically, LDA.\n",
      "\n",
      "**Latent Dirichlet Allocation (LDA)** is a topic modeling method that allows us to discover clusters of words that appear together frequently.  We can use this to look for clusters of words in our Kaggle corpus.\n",
      "\n",
      "While the code below intorduces some new specifics, the overall process is similar to what we've done before."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import lda # Latent Dirichlet Allocation\n",
      "import numpy as np\n",
      "from sklearn.feature_extraction.text import CountVectorizer\n",
      "\n",
      "# Instantiate a count vectorizer with two additional parameters\n",
      "vect = CountVectorizer(stop_words='english', ngram_range=[1,3]) \n",
      "sentences_train = vect.fit_transform(np.array(tweets_text))\n",
      "\n",
      "# Instantiate an LDA model\n",
      "model = lda.LDA(n_topics=10, n_iter=500)\n",
      "model.fit(sentences_train) # Fit the model \n",
      "n_top_words = 10\n",
      "topic_word = model.topic_word_\n",
      "for i, topic_dist in enumerate(topic_word):\n",
      "    topic_words = np.array(vect.get_feature_names())[np.argsort(topic_dist)][:-n_top_words:-1]\n",
      "    print('Topic {}: {}'.format(i+1, ', '.join(topic_words)))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stderr",
       "text": [
        "WARNING:lda:all zero row in document-term matrix found\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Topic 1: kaggle, http, 10, 10 packages, packages, facebook, packages kaggle, rt, 10 packages kaggle\n",
        "Topic 2: kaggle, http, data, rt, stories, big, mining, big data, data mining\n",
        "Topic 3: http, kaggle, model, training, model training, video model, video model training, nearest, prediction nearest\n",
        "Topic 4: kaggle, http, https, rt, kaggle http, science, kaggle https, places, data science"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "Topic 5: kaggle, rt, competition, https, http, kaggle https, kaggle competition, kaggle competition http, machinelearning"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "Topic 6: kaggle, http, rt, benhamner, microsoft, azure, ve, challenge, got\n",
        "Topic 7: http, rt, kaggle, new, datascience, bigdata, intro, intro machinelearning, machinelearning"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "Topic 8: prediction, neighbors, prediction nearest neighbors, learning, nearest neighbors http, machine, machine learning, https, kaggle scikit learn\n",
        "Topic 9: http, kaggle, rt, based, classification, tree, methods, classification methods, based classification"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "Topic 10: kaggle, https, spots, spots kaggle, kaggle https, rt, top10, moved, http"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      }
     ],
     "prompt_number": 26
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "These results could be interesting.  There are vague clusters about machinelearning/scikit learn, methods of classification, and others.  These are not hard and fast clusters or groups, but they are something to investigate.\n",
      "\n",
      "Let's try this again on a different corpus."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Imports\n",
      "import requests\n",
      "from bs4 import BeautifulSoup\n",
      "\n",
      "# Get Data Science Wiki page\n",
      "r = requests.get(\"http://en.wikipedia.org/wiki/Data_science\")\n",
      "b = BeautifulSoup(r.text)\n",
      "paragraphs = b.find(\"body\").findAll(\"p\")\n",
      "paragraphs_text = [p.text for p in paragraphs]\n",
      "text = \"\"\n",
      "for paragraph in paragraphs:\n",
      "    text += paragraph.text + \" \"\n",
      "\n",
      "# Data Science corpus\n",
      "text[:500]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 27,
       "text": [
        "u'In general terms, Data Science is the extraction of knowledge from data.[1][2] It employs techniques and theories drawn from many fields within the broad areas of mathematics, statistics, information theory and information technology, including signal processing, probability models, machine learning, statistical learning, computer programming, data engineering, pattern recognition and learning, visualization, predictive analytics, uncertainty modeling, data warehousing, data compression and high'"
       ]
      }
     ],
     "prompt_number": 27
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# tokenize into sentences\n",
      "sentences = [sent for sent in nltk.sent_tokenize(text)]\n",
      "sentences[0]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 28,
       "text": [
        "u'In general terms, Data Science is the extraction of knowledge from data.'"
       ]
      }
     ],
     "prompt_number": 28
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can try running LDA using the paragraphs as documents."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Instantiate a count vectorizer with two additional parameters\n",
      "vect = CountVectorizer(stop_words='english', ngram_range=[1,3]) \n",
      "sentences_train = vect.fit_transform(paragraphs_text)\n",
      "\n",
      "# Instantiate an LDA model\n",
      "model = lda.LDA(n_topics=10, n_iter=500)\n",
      "model.fit(sentences_train) # Fit the model \n",
      "n_top_words = 10\n",
      "topic_word = model.topic_word_\n",
      "for i, topic_dist in enumerate(topic_word):\n",
      "    topic_words = np.array(vect.get_feature_names())[np.argsort(topic_dist)][:-n_top_words:-1]\n",
      "    print('Topic {}: {}'.format(i+1, ', '.join(topic_words)))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stderr",
       "text": [
        "WARNING:lda:all zero row in document-term matrix found\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Topic 1: business, processing, sciences, finance, subject, research, social, emerging, medical\n",
        "Topic 2: clinical, clinical data, data scientist, trials, clinical trials, safety, genomics, handling, efficacy novel\n",
        "Topic 3: areas, models, technical, international, computing, discipline, theory, cleveland, broad"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "Topic 4: term, term data, term data science, statistician, data scientist, lecture, entitled statistics data, data, survey\n",
        "Topic 5: methods, used, computer, applications, published, issues, related, naur, conference\n",
        "Topic 6: learning, machine, machine learning, information, big, statistics machine learning, knowledge, big data, signal processing\n",
        "Topic 7: data, scientists, data scientists, large, marketing, structured, software, crucial, work"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "Topic 8: data, science, data science, statistics, scientist, statistical, field, analysis, science data\n",
        "Topic 9: journal, needed, field statistics, april, data collection, jeff, publication, mahalanobis, research\n",
        "Topic 10: security, fraud, insights, using, using data, security data, statistics machine, security data science, security fraud\n"
       ]
      }
     ],
     "prompt_number": 29
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can also use the sentences as documents.  While topic modeling usually does better with longer documents (emails vs. tweets), LDA has been shown to do well even with short documents (specifically tweets)."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Instantiate a count vectorizer with two additional parameters\n",
      "vect = CountVectorizer(stop_words='english', ngram_range=[1,3]) \n",
      "sentences_train = vect.fit_transform(paragraphs_text)\n",
      "\n",
      "# Instantiate an LDA model\n",
      "model = lda.LDA(n_topics=10, n_iter=500)\n",
      "model.fit(sentences_train) # Fit the model \n",
      "n_top_words = 10\n",
      "topic_word = model.topic_word_\n",
      "for i, topic_dist in enumerate(topic_word):\n",
      "    topic_words = np.array(vect.get_feature_names())[np.argsort(topic_dist)][:-n_top_words:-1]\n",
      "    print('Topic {}: {}'.format(i+1, ', '.join(topic_words)))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Since topic modeling is more of a clustering technique, it can be difficult to determine which one is \"better\"."
     ]
    },
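    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The vectorizer cells above pass `ngram_range=[1,3]`, which is why multi-word terms such as `kaggle competition http` appear among the topic words.  The sketch below shows the idea with a hypothetical helper, `word_ngrams`, written in plain Python.  It is not scikit-learn's implementation, and it skips the tokenizing, lowercasing, and stop word removal that `CountVectorizer` also performs.\n",
      "\n",
      "```python\n",
      "def word_ngrams(tokens, min_n=1, max_n=3):\n",
      "    # Build every word n-gram with min_n <= n <= max_n, joined by spaces.\n",
      "    # Hypothetical helper that mimics the idea behind ngram_range.\n",
      "    grams = []\n",
      "    for n in range(min_n, max_n + 1):\n",
      "        for start in range(len(tokens) - n + 1):\n",
      "            grams.append(' '.join(tokens[start:start + n]))\n",
      "    return grams\n",
      "\n",
      "# Made-up token list for illustration\n",
      "tokens = ['kaggle', 'competition', 'http']\n",
      "grams = word_ngrams(tokens)\n",
      "print(grams)\n",
      "```\n",
      "\n",
      "Three tokens already produce six features, so the vocabulary grows quickly with longer n-grams, which is worth keeping in mind when reading the topic lists."
     ]
    },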
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Textblob"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's talk about a new NLP package, Textblob.  Textblob's tagline is \"Simplified Text Processing\".  You can do many of the same things in Textblob that you can do in NLTK, but it is \"simpler\" to use.  That's obviously a subjective thing, but it's good to be aware of both packages."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from textblob import TextBlob, Word\n",
      "\n",
      "# Textblob has a different syntax, but it generally performs the same functions as NLTK.\n",
      "blob = TextBlob('Kevin and Brandon are instructors for General Assembly in Washington, D.C.  They both love Data Science.')\n",
      "print 'Sentences:', blob.sentences\n",
      "print 'Words:', blob.words\n",
      "print 'Noun Phrases:', blob.noun_phrases"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Sentences: [Sentence(\"Kevin and Brandon are instructors for General Assembly in Washington, D.C.\"), Sentence(\"They both love Data Science.\")]\n",
        "Words: ['Kevin', 'and', 'Brandon', 'are', 'instructors', 'for', 'General', 'Assembly', 'in', 'Washington', 'D.C', 'They', 'both', 'love', 'Data', 'Science']\n",
        "Noun Phrases: "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "['kevin', 'brandon', u'general assembly', 'washington', 'd.c', 'data']\n"
       ]
      }
     ],
     "prompt_number": 30
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Textblob has many useful functionalities:  \n",
      "* Singularizing and pluralizing words\n",
      "* Spell check\n",
      "* Word defintions\n",
      "* Translation"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Singularize and pluralize\n",
      "blob = TextBlob('Put away the dishes.')\n",
      "print [word.singularize() for word in blob.words]\n",
      "print [word.pluralize() for word in blob.words]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "['Put', 'away', 'the', 'dish']\n",
        "['Puts', 'aways', 'thes', 'dishess']\n"
       ]
      }
     ],
     "prompt_number": 31
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Spelling correction\n",
      "blob = TextBlob('15 minuets late')\n",
      "print 'Original: 15 minuets late    Corrected:', blob.correct()\n",
      "\n",
      "# Spellcheck\n",
      "print 'Original: parot    Corrected:', Word('parot').spellcheck()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Original: 15 minuets late    Corrected: "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "15 minutes late\n",
        "Original: parot    Corrected: [('part', 0.9929478138222849), (u'parrot', 0.007052186177715092)]\n"
       ]
      }
     ],
     "prompt_number": 32
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Definitions\n",
      "print Word('bank').define()\n",
      "print ' '\n",
      "print Word('bank').define('v')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# translation and language identification\n",
      "blob = TextBlob('Welcome to the classroom.')\n",
      "print 'English: \"Welcome to the classroom.\"    Spanish:', blob.translate(to='es')\n",
      "print ''\n",
      "blob = TextBlob('Hola amigos')\n",
      "print '\"Hola amigos\" is the language', blob.detect_language()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "English: \"Welcome to the classroom.\"    Spanish: "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Bienvenido a la sala de clase .\n",
        "\n",
        "\"Hola amigos\" is the language "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "es\n"
       ]
      }
     ],
     "prompt_number": 33
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Sentiment"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Sentiment allows us to convert a limited range of emotion into a number.  It gives us an idea of how \"positive\", \"negative\", or \"neutral\" a piece of text is.  We built a sentiment API function in our API's class, so we could use that.  But for the sake of variety, let's use the built in functionality of Textblob.\n",
      "\n",
      "Textblob has two different \"types\" of sentiment, a polarity sentiment (positive/negative) and a subjectivity sentiment."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# The sentiment polarity score is a float within the range [-1.0, 1.0].\n",
      "print 'I love pizza    Sentiment =', TextBlob('I love pizza').sentiment.polarity\n",
      "print 'I hatee pizza    Sentiment =', TextBlob('I hate pizza').sentiment.polarity\n",
      "print 'I feel nothing about pizza    Sentiment =', TextBlob('I feel nothing about pizza').sentiment.polarity"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "I love pizza    Sentiment = "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "0.5\n",
        "I hatee pizza    Sentiment = -0.8\n",
        "I feel nothing about pizza    Sentiment = 0.0\n"
       ]
      }
     ],
     "prompt_number": 34
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# The subjectivity is a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective.\n",
      "print 'I am a cool person    Subjectivity =', TextBlob(\"I am a cool person\").sentiment.subjectivity # Pretty subjective\n",
      "print 'I am a person    Subjectivity =', TextBlob(\"I am a person\").sentiment.subjectivity # Pretty objective"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "I am a cool person    Subjectivity = 0.65\n",
        "I am a person    Subjectivity = 0.0\n"
       ]
      }
     ],
     "prompt_number": 35
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "But once again, it's not perfect."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# different scores for essentially the same sentence\n",
      "print TextBlob('Kevin and Brandon are instructors for General Assembly in Washington, D.C.').sentiment.subjectivity\n",
      "print TextBlob('Kevin and Brandon are instructors in Washington, D.C.').sentiment.subjectivity"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "0.5\n",
        "0.0\n"
       ]
      }
     ],
     "prompt_number": 36
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "With this idea of sentiment in mind, let's see how positive, negative, and neutral people are about our Kaggle tweets."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Let's loop through our tweets and calculate sentiment\n",
      "sentiments = [TextBlob(tweet).sentiment.polarity for tweet in tweets_text]\n",
      "print tweets_text[0], sentiments[0]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "37 spots up. Is anybody else even trying? #kaggle - https://t.co/9GLyaK3OhT 0.0\n"
       ]
      }
     ],
     "prompt_number": 37
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Average sentiment\n",
      "avg_sentiment = np.sum(sentiments)/len(sentiments)\n",
      "print avg_sentiment"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "0.0880874324259\n"
       ]
      }
     ],
     "prompt_number": 38
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This average sentiment is pretty neutral.  In my experience, this is usually the case; people don't express as much sentiment as you think and the positives and negatives often cancel each other out.\n",
      "\n",
      "Let's look at the distribution of the sentiment to get a better idea."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%matplotlib inline\n",
      "import seaborn as sns\n",
      "import matplotlib.pyplot as plt\n",
      "sns.distplot(sentiments)\n",
      "plt.title('Distribution of Sentiment')\n",
      "plt.xlabel('Sentiment (-1 to 1)')\n",
      "plt.ylabel('Frequency')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 39,
       "text": [
        "<matplotlib.text.Text at 0x110eb8a90>"
       ]
      },
      {
       "metadata": {},
       "output_type": "display_data",
       "png": "iVBORw0KGgoAAAANSUhEUgAAAfUAAAFwCAYAAAChNeJ6AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3Xl8XHd97//XjKTRvtiWLNuxY4ckH7I0kBSSlIQlLSSX\nnfJrCaSXspbbe+kPeoH2FugtpP114/KD5sdSCi1b2AqUpQFCQgIEEkISsptsHxwnsY1tSbaWGWmk\nmZFmfn+cM7IiaxlJczTy0fv5ePjhWc98dDSj93yX8z0gIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIi\nIiIiIrViZp8ws/9dpW2dbGYZM0uE128yszdXY9vh9q41sz+s1vaW8Lp/a2YDZnZwtV97jloyZrar\n1nWI1Fqi1gWIrDYzexzYDEwCU8CDwNXAp9y9tIxtvcndf7SE5/wY+IK7f2YprxU+90rgVHdf9RCf\nVcfJwMPADnc/Os9j3gv8EdADDAM/c/fXVOG1byLYf59e6bZWajm/f5EoJWtdgEgNlICXunsHcDLw\nj8BfAMsJiRILfDk2s/plVbj2nQwcXSDQXw+8Fni+u7cDzwRurNJrL+mLV8QW/P2LrDa9GWXdMbPH\ngDfPbF2Z2fnAbcA57v6gmX0O2O/uf2Vm3cDngIuBIvAA8DyC1v0fADmCFv9fA/8B7CVoob4feAx4\nffh/vbsXw5b6z4HnA2cAPwbe6O5DZnYJQSt0x4zaHgfeDDQA/0nwuc0Be9z9vJkt17CL/y/D128G\nrgPe5u7psHt6L/AG4P8BWoB/cve/n2c/dQIfBV4IZIF/Bf4+rPsaoDG8/evu/qZZz/0oMOnu71hg\n2x8GXhTu088C7w/3zxvC+n8e/tzDwFvd/Toz+zuCL2AFgp6Wz7r7282sCJzm7nvD310W2AU8B7gX\neBXwHuB1wGHgCne/N6xlW/hzPgcYDffJR8P7rgTOAsaBVwL7gNe7+11m9gVm/f7d/f+d6+cVWS1q\nqYsA7v4L4ADBH3YIWmDlFuG7gP1AN0G3/XvcvRR2ge8jaPW3z/qD/lyCwP4vHP/lOUEQLm8EthKE\n00cWKK8ElNz9OoJQ/ffw9c6bo9Y3EnyJuAR4CtAGfGzW9i4GjCCc32dmZ8zzuh8F2oFTCL7EvI7g\ny8eNBGF8MKzjTXM89zbgdWb2Z2b2TDOrm3X/54A8cCpwHnAZQZCXXUDQvb8J+D+EvSju/pfAzcCf\nhK/99nlqfxXBl5vu8HVuA34BbCT44vVhADNLAt8B7gG2hfvkf5rZZTO29TLgK0AnwZeZj4W1LPT7\nF6kJhbrIMQcJ/ujPlicI313uPuXuP6tgW1e6+7i75+a4rwRc7e4PunsW+Cvg8vJEukUkWLiH7b8C\nH3L3x919jKB1+powvMr+2t1z7n4/cB/w9NkbCUP41QRfYMbc/QngQ0B5LH/BWt39S8DbCL7U3AT0\nmdn/CrfdS/Cl4B3hPhoArgJmjrc/4e6fDuc4XA1sNbPNs/bDfErAN939nnD/fwsYc/cvhtv7GsEX\nCYDzgW53/1t3n3T3x4B/m1XLze5+XfjcLzLH/hJZK+I63ieyHNuBwRnXy8HxQeBK4AdmBsGEug8s\nsq39S7h/H0HXenfFlc5vK/DErG3XA70zbjs843IWaJ1jO91hTbO3dVKlhbj7l4Evh18QXgl8yczu\nJehObwAOhfsTggbGvrlqdPds+Lg2oD+8ebFx9f4ZlydmXR8PtwWwE9hmZkMz7q8Dfjrjet+My1mg\nycyS7l5cpAaRVadQF2F6TH0bcMvs+9x9FPgz4M/M7GzgR2Z2h7v/mPnDZbHQOXnW5QJwBBgjGOsu\n11VHMHu80u0eJBhLnrntSYJgOnmuJ8zjSFjTLuChGds6sIRtAODuU8B/mNlfAGcTdGXngE3LDMZq\nTpTbDzzm7jbP/Yu91lqatCeiUJd1q3zMeAf
B+PdVBJPNHph5f/iYlxKM7z4KpAkmRZXDqI9gXHgp\nhzQlgNea2dUELeG/IZhsVjIzJ2gJvhi4AXgvwYS0ssPAC8wsMc/hd18B/sLMvk8QzOUx+OKMVvFc\n9TyJu0+Z2deAvzOz1xGMbb+DoNdiUeHs9wGC8e8xgm74s4Hb3f2wmf0A+LCZ/VV4/ynASe7+0/m2\nOUN5n1f88yzgDiATDg18lGCo5Uygyd3vrGBby/n9i0RGY+qyXn3HzNIEXb7vIRgvfuOM+2dOPjuN\nIGAzwK3Ax939J+F9/wD8bzMbMrN3znjubKVZl68mmCx2CEgBbwdw9xHgrQTjugcIZmPP7Kr/evj/\nUTO7c47X+QzwBYLu470E3cVvm6eOhW4jfN5YuJ2bgS8RzFJf7HkQfPl5L8GXliGCwwb/u7vfGt7/\nOoKf+0GCIY+vA1tmbHf2tmde//+A3zezQTO7ap6fZ/b+nnN7YS/CS4Fzw59zAPgU0FFhLXP9/kVq\nJrJD2sxsB8Efrs0EH4JPuftHwkNE/ojgwwPBRJzroqpDRERkvYgy1LcAW9z9XjNrA+4Cfhe4HMi4\n+4ejem0REZH1KLIxdXc/TDiD1d1Hzewhjs2c1aI3IiIiVbYq4RquZPUTgoky7yIYuxwB7gTe5e7D\nq1GHiIhInEU+US7sev8P4E/DQ4M+QTDT9VyCSUIfiroGERGR9SDSlrqZNQDfBb7v7sfNUg1b8N9x\n93Pm20axWCwlEuqtFxGR9SOxzOCLbEw9XPLy08CDMwPdzLa6+6Hw6iuB3QttJ5FIMDCQiapMCfX0\ntGs/R0z7OHrax6tD+3ntinLxmYsJTr14v5ndE972XuAKMzuX4DC3x4A/jrAGERGRdSPK2e+3MPeY\n/fejek0REZH1TCvKiYiIxIRCXUREJCYU6iIiIjGhUBcREYkJhbqIiEhMKNRFRERiQqEuIiISEwp1\nERGRmFCoi4iIxIRCXUREJCYU6iIiIjGhUBcREYkJhbqIiEhMKNRFRERiQqEuIiISEwp1ERGRmFCo\ni4iIxIRCXUREJCYU6iIiIjGhUBcREYkJhbqIiEhMKNRFRERiQqEuIiISEwp1ERGRmFCoi4iIxIRC\nXUREJCYU6iIiIjGhUBcREYkJhbqIiEhMKNRFRERiQqEuIiISEwp1ERGRmFCoi4iIxIRCXUREJCYU\n6iIiIjGhUBcREYkJhbqIiEhMKNRFRERiQqEuIiISEwp1ERGRmFCoi4iIxIRCXUREJCYU6iIiIjGh\nUBcREYmJ+loXIBJ3pVKJTCZNKlUknc4A0N7eQSKRqHFlIhI3CnWRiGUyaW64fQ89PRsZHcsxnh3j\n0gtPo6Ojs9aliUjMKNRFVkFzSyutbR0Umah1KSISYxpTFxERiQmFuoiISEwo1EVERGJCoS4iIhIT\nCnUREZGYUKiLiIjERGSHtJnZDuBqYDNQAj7l7h8xs43AV4GdwOPA5e4+HFUdIiIi60WULfUC8A53\nPxv4LeBPzOxM4N3ADe5uwA/D6yIiIrJCkYW6ux9293vDy6PAQ8BJwMuBz4cP+zzwu1HVICIisp6s\nypi6me0CzgNuB3rdvS+8qw/oXY0aRERE4i7yZWLNrA34BvCn7p4xs+n73L1kZqXFttHT0x5hhVKm\n/RyNVKpIW+sgAO1tTSTJ093dTmen9ncU9D5eHdrPa1OkoW5mDQSB/gV3/3Z4c5+ZbXH3w2a2Fehf\nbDsDA5koyxSCD6j2czTS6QyjYzla2yAzOkF2LMeRIxnyeR18Um16H68O7ee1K7K/KmaWAD4NPOju\nV8246xrg9eHl1wPfnv1cERERWbooW+oXA68F7jeze8Lb3gP8I/A1M3sz4SFtEdYgIiKybkQW6u5+\nC/P3BLwgqtcVERFZrzSoJyIiEhMKdRERkZhQqIuIiMSEQl1ERCQmFOoiIiIxoVAXERGJCYW6iIhI\nTCjURUR
EYkKhLiIiEhMKdRERkZhQqIuIiMSEQl1ERCQmFOoiIiIxoVAXERGJCYW6iIhITCjURURE\nYkKhLiIiEhMKdRERkZhQqIuIiMSEQl1ERCQmFOoiIiIxoVAXERGJCYW6iIhITCjURUREYkKhLiIi\nEhMKdRERkZhQqIuIiMSEQl1ERCQmFOoiIiIxoVAXERGJCYW6iIhITCjURUREYkKhLiIiEhMKdRER\nkZhQqIuIiMSEQl1ERCQmFOoiIiIxoVAXERGJCYW6iIhITCjURUREYkKhLiIiEhMKdRERkZhQqIuI\niMSEQl1ERCQmFOoiIiIxoVAXERGJCYW6iIhITCjURUREYkKhLiIiEhMKdRERkZhQqIuIiMSEQl1E\nRCQm6qPcuJl9BngJ0O/u54S3XQn8ETAQPuw97n5dlHWIiIisB5GGOvBZ4KPA1TNuKwEfdvcPR/za\nIiIi60qk3e/ufjMwNMddiShfV0REZD2q1Zj628zsPjP7tJl11agGERGRWKlFqH8COAU4FzgEfKgG\nNYiIiMRO1GPqx3H3/vJlM/s34DuLPaenpz3SmiSg/RyNVKpIW+sgAO1tTSTJ093dTmen9ncU9D5e\nHdrPa9Oqh7qZbXX3Q+HVVwK7F3vOwEAm2qKEnp527eeIpNMZRsdytLZBZnSC7FiOI0cy5PM6orTa\n9D5eHdrPa1fUh7R9BXge0G1m+4H3A5eY2bkEs+AfA/44yhpERETWi0hD3d2vmOPmz0T5miIiIuuV\n+v9ERERiYtGWupldD3wM+K67l6IvSURERJajkpb6J4H/Cew1s78ws00R1yQiIiLLsGiou/s33f35\nwIuBk4AHzOxqM3tG5NWJiIhIxZYypl4iWN61AEwAV5uZ1m8XERFZIyoZU/994K3AVoKx9TPdfdTM\n6oE9wDujLVFEREQqUckhbW8EPgD8YOZEOXefNLO3R1aZiIiILEklof7S2bPezSzh7iV3vyaiukRE\nRGSJKhlTv9nMNpSvhLPffxJdSSIiIrIclYR6m7tPnxPd3Y8CWslfRERkjakk1JNm1lq+YmZtQEN0\nJYmIiMhyVDKm/hXgBjP7Z4JD2v4H8KVIqxIREZElWzTU3f0fzOwg8AqCY9X/xd2vjrwyERERWZKK\nztLm7p8HPh9xLSIiIrIClSw+0wu8DTh1xuNL7n55lIWJiIjI0lTSUv8G8CBwA1AMb9PZ2kRERNaY\nSkK9y93/W+SViIiIyIpUckjbL83spMgrERERkRWppKW+EdhtZj8jODsbaExdRERkzakk1L8c/ptJ\nY+oiIiJrTCXHqX9uFeoQERGRFVp0TN0Ct5jZ4+H13zSzK6MuTERERJamkolynwD+DhgOr98HaDxd\nRERkjakk1Dvd/fuE4+juPgXkI61KRERElqySUJ80s1T5Snh421R0JYmIiMhyVNr9/k2g28z+GrgF\n+FCkVYmIiMiSVTL7/fNmthd4GdAMvM7db468MhEREVmSSs/SdjOgIBcREVnDKjlL2y/muLnk7hdE\nUI+IiIgsUyUt9T+fcbkJuAI4GE05IiIislyVjKnfNPO6mV0P/CyqgkRERGR5Kpn9Plsn0FvtQkRE\nRGRlljqmngSegg5pExERWXOWOqY+Cex1d42pi4iIrDFLHlMXERGRtamS7vcBgnXfE3PcXXL3zVWv\nSkRERJasku73fwE2Ap8iCPY3A0PAZyKsS0RERJaoklB/sbs/Y8b1t5nZne7+vqiKEhERkaWr5JC2\nDjPrKV8JL3dEV5KIiIgsRyUt9auAe83suwTd7y8G/j7SqkRERGTJFm2pu/vHgRcBDwC7gRe5+z9H\nXZiIiIgsTUVnaQMeB37m7ncBmFnC3UuRVSUiIiJLtmhL3cxeTNBK/2Z4/XzgmojrEhERkSWqZKLc\n3wAXAIMA7v4L4NQoixIREZGlq+iELu5+aNZN+QhqERERkRWoJNTTZralf
MXMLiFYfEZERETWkEom\nyr0HuBbYZWY/AU4HXh5pVSIiIrJkC4a6mSWBCeB3gIvCm2919+GoCxMREZGlWTDU3b1oZl9093MI\nWusiIiKyRlUypv4rMzsl8kpERERkRSoZU+8A7jezW4DR8LaSu18eXVkiIiKyVPOGupl9yN3fBXwR\n+BpPPoxNq8mJiIisMQu11H8HwN0/Z2b3uPt5S924mX0GeAnQH47LY2Ybga8COwmWn71cE+9ERERW\nrqLFZ1bgs8ALZ932buAGdzfgh+F1ERERWaGFWupNZnYWwelWy5enufuDi23c3W82s12zbn458Lzw\n8ueBm1Cwi4iIrNhCod4MfC+8nJhxuWy5M+J73b0vvNwH9C5zOyIiIjLDvKHu7ruifnF3L5mZJt2J\niIhUQaXnU6+mPjPb4u6HzWwr0L/YE3p62lehLNF+jkYqVaStdRCA9rYmkuTp7m6ns1P7Owp6H68O\n7ee1qRahfg3weuAD4f/fXuwJAwOZqGta93p62rWfI5JOZxgdy9HaBpnRCbJjOY4cyZDPRz1Pdf3R\n+3h1aD+vXZGGupl9hWBSXLeZ7QfeB/wj8DUzezPhIW1R1iAiIrJeRBrq7n7FPHe9IMrXFRERWY/U\n/yciIhITCnUREZGYUKiLiIjEhEJdREQkJhTqIiIiMaFQFxERiQmFuoiISEwo1EVERGJCoS4iIhIT\nCnUREZGYUKiLiIjEhEJdREQkJhTqIiIiMaFQFxERiQmFusgqKJVKtS5BRNYBhbpIxJ7oG+Nbtx7i\nQH+m1qWISMwp1EUi9qtfZygWwfcN17oUEYk5hbpIxI6mcwAc6B+tcSUiEncKdZGIlUM9k82TyeZr\nXI2IxJlCXSRiR0Zy05cPHc3WsBIRiTuFukiEJqeKDI3maWoIPmqHFeoiEiGFukiEjo5MUCpB74ZG\nWprqOTyY1eFtIhIZhbpIhPqHxwFoba5n++Z2JvJTjGQna1yViMSVQl0kQv1DQai3NdWxfXNbcNtw\nbqGniIgsm0JdJEID5ZZ6U71CXUQip1AXidB0S725jvaWFB0tDRwZyTNV1Li6iFSfQl0kQgPD4zSl\nkqTqg4/aps4mJqdKpMcKNa5MROJIoS4SkVKpxMDwON0djSQSCQCaG+sByIwr1EWk+hTqIhEZHs2T\nnyyyqbNx+ramVB0Ao+OaAS8i1adQF4lIeZJcd8fMUFdLXUSio1AXiUh5kly3WuoiskoU6iIRKS88\ns2lmS70xDHUtQCMiEVCoi0RkYK5QV/e7iERIoS4Skf6hceqSCTa0paZvU/e7iERJoS4SkYHhcbo7\nm0gmE9O31dclqa9LqKUuIpFQqItEYCI/yeh4gZ6u5uPua2xIqqUuIpFQqItEYDQbtMQ7WlPH3dfY\nkCQzXtApWEWk6hTqIhEod6+3NTccd19Tqo5iEbI5tdZFpLoU6iIRyIQt9faW40O9sSH42KXH8qta\nk4jEn0JdJAKj40Fgz9VSL4d6OfhFRKpFoS4SgfKYelvz3GPqoJa6iFSfQl0kAuUx9bm635sagmPV\nM1mFuohUl0JdJAKjC0yUa0yFLXV1v4tIlSnURSIwWslEObXURaTKFOoiEciMF0gArU0LTJTTmLqI\nVJlCXSQCmWye1uaGJy0RW9bYkCSBut9FpPoU6iIRGB0vzDmeDpBIJGhtrtfsdxGpOoW6SJUVS6Ug\n1OcYTy9ra67X7HcRqTqFukiVZScmKZWgfZ6WOgT3jU1MMjlVXMXKRCTuFOoiVbbQ4Wxlbc31gFaV\nE5HqUqiLVNmxw9mOX02u7FioqwteRKpHoS5SZZkF1n0vK3fN61h1EakmhbpIlS208EzZdEt9TN3v\nIlI99bV6YTN7HEgDU0DB3S+oVS0i1bTQudTL2tRSF5EI1CzUgRJwibsP1rAGkaqbPkPbAi319rCl\nrlAXkWqqdff78cttiZzgymPqCx3S1
tai7ncRqb5ahnoJuNHM7jSzt9SwDpGqWuhc6mWaKCciUahl\n9/vF7n7IzHqAG8zsYXe/ea4H9vS0r3Jp65P2c3VMTBapr0tw8vYuEokEqVSRttZglKm9rYkkeU7a\n2kmqPsl4fkr7vcq0P1eH9vPaVLNQd/dD4f8DZvYt4AJgzlAfGMisZmnrUk9Pu/ZzlQylJ2htbuDI\nkVEA0ukMo2M5WtsgMzpBdizH0aOjtLc0MDgyof1eRXofrw7t57WrJt3vZtZiZu3h5VbgMmB3LWoR\nqbbRbGHB8fSy9pYU6WyeUqm0ClWJyHpQqzH1XuBmM7sXuB34rrv/oEa1iFTN5FSRbG5ywcPZyjpa\nUxQmi0zkp1ahMhFZD2rS/e7ujwHn1uK1RaI0Vj5GfYElYsvKi9NksnmaG2s5vUVE4qLWh7SJxEp5\n4ZlKut87wuBP66QuIlIlCnWRKjp2OFtlY+oAmTEd1iYi1aFQF6mi8mlXF1r3vayjVceqi0h1KdRF\nqmh63fdKQl3d7yJSZQp1kSoazZaXiK1kopy630WkuhTqIlVUyRnayjpayy11hbqIVIdCXaSKljKm\nfuyQNnW/i0h1KNRFqiizhNnv9XVJWhrr1VIXkapRqItU0chojubGOlINdRU9vr01pTF1EakahbpI\nFQ2P5ulsbaz48R0tDWTGCxSLWv9dRFZOa1OKVMnkVJHR8QLbe1orfk5HS4pSKRiLL0+cW4pSqUQm\nkz7u9vb2DhKJxJK3JyInNoW6SJWkw270zrbKW+rtM2bALyfUM5k0N9y+h+aWY18kxrNjXHrhaXR0\ndC55eyJyYlOoi1TJ8GgY6ksI547yDPixPPQs73WbW1ppaW1f3pNFJFY0pi5SJSOjOQC6ltBSP3as\nug5rE5GVU6iLVMnwdPf7UlrqWoBGRKpHoS5SJdMt9SV0v888p7qIyEop1EWqZCRsqXcsp/t9TN3v\nIrJyCnWRKhkJJ8p1LaH7ffqkLmqpi0gVKNRFqmR4NDe99GulWprqqUsmNKYuIlWhUBepkpGxPF1t\nqSUt+pJMJGhraSCj7ncRqQKFukgVFEsl0mP5Jc18L+toSamlLiJVoVAXqYLRbIGpYomuJaz7XtbR\n0sBEfop8YSqCykRkPVGoi1TBcHg423Ja6jOXihURWQmFukgVjCxj3feyjukZ8BpXF5GVUaiLVMHw\nMhaeKSsvQJPWedVFZIUU6iJVUD5GfSUtdXW/i8hKKdRFqmA5C8+Ulb8IDGVyVa1JRNYfhbpIFQyP\nlSfKLb2l3ruxGYC+wfGq1iQi649CXaQKRkbzJBMJ2psblvzc7s4m6pIJ+oayEVQmIuuJQl2kCoZH\nc7S3NpBMVr6aXFldMklPVzOHj2YplUoRVCci64VCXWSFSqVSsETsMhaeKduysYVsbpLMuA5rE5Hl\nU6iLrNB4borCZHFZC8+UbdnYAkDfoLrgRWT5FOoiKzQSTpJbzsz3si2bglA/fFShLiLLV/k5IkVk\nTsPlY9RX0P3euyGYAX9Yk+WWrVQqMTIyQjqdmb6tvb1jSWfNEznRKdRFVmhktAot9enudx3WtlyZ\nTJrrf76fYin4szaeHePSC0+jo6OzxpWJrB6FusgK9Q0FQbyps2nZ2+hoTdHcWMdhjamvSEtLK0WW\n/+VK5ESnMXWRFXricNDdu7O3fdnbSCQS9G5ooX8oS7Gow9pEZHkU6iIr9ERfhq621LJWk5tpy8YW\nJqdKHE1PLOl5xWKJPQdGdEIYEVH3u8hKpMfyDGVyPP3UTSveVnlc/fBglp6u5oqekytMceuDgxwe\nypFMwNmnbOTULep+Flmv1FIXWYF9fWHX+5bld72X9c4I9Uqks3k+/p/O4aEcvRuaaW6sZ/feQX5w\n1wCDOjmMyLqkUBdZgSf6Vj6eXrZlCaE+VSzy4a/ey77+LDs3N3Pp+Tt4+bNP4axdG8jmpvjyDx+n\nq
CVnRdYdhbrICkxPkqtKS718trbFQ/1nuw+zr2+UZ5y+kWdaF8lkgob6JM94ag/bNjWx5+AoN/5i\n/4prEpETi0JdZAWe6MvQ1tzAhvaVTZIDaErV09WWWjTUJ/KTfOvmvaQakrz8opOetLhKIpHgN0/r\npK25nv/4yV5+PTC64rpE5MShUBdZprGJAgPDE+zc0l61Vcu2bGzhaDpHrjA172Ouv2M/I6N5XnjB\nyXS2Hj8prilVx6sv2cnkVJFPfedBcoUpSqUS6fTIcf/idFa4OP0sIsul2e8iy7SvL2gFV2M8vWzn\nlnYe3jfM/Y8e5fwzNh93/8hojutu30dHa4oXXngy+YmxObdzzildXHLuNm669yCfvfYhrrhkOzfe\n8SjNLa3Tj4nLimuD6Qk+9Z0H2d+fobuzka2b2jhlW0etyxKpCYW6yDJVczy97LlP38b1d+znhjv3\nzxnqX7/pUXKFKV79O6fRlKonv8Ah7X9wqfHrI2Pc8VA/3e31tLS00tJavVrXgj0HRvjYt3aTHsvT\n0VLP/v4s+/uz/OrACM87Z0OtyxNZdep+F1mm6cPZetuqts2tm1p52qmb2HNghMcOpZ903/2PHuXW\nXx5m55Z2nvP0rYtuq74uyZ+88hw2djRy7R0HefTQ2IJd1CdaF/1dj/TzgS/fzWi2wBXPP52/fv3T\n+L3nbucp2zoYyuT4xSPDOgJA1h2FusgyPdGXobmxvuKFYir1gmduB+DGO4/NXh/PTfL56x6mLpng\nTS8+k7pkZR/djtYUb/+9p9HYkOSePSNcf8d+huY5hj2TSXPD7Xu4Zfeh6X833L6HTCY95+Nrac+B\nET55zYPU1yd556ufzqXn7yCRSNDe0sBFv7GF3g3N/ProBNf/4lCtS122ub5krdUvWLJ2KNQjUCqV\njvsn8TKUyXH4aJadvW1VP7Xn2bs2snVTC3c81M9weAa4r/94D0OZHC951k52bF5az8DJve28+zVn\ns21TE/1D43z3Z49z3e37eOTAKIcGx5/Umm0Ou+jL/2aOwZfVOmz6hrJ85Bv3UywWecNlp7B9Yx3p\n9AiZTJoSJZLJBM87bxutTXVcf+ch7nqkf9Vqq6bZX7LW6hcsWVs0pl5lhUKB/7zhNppajo1d5sfT\n/F8vem4Nq1ofSqXScX/0ymEzM3hnn2N7qc8rlUp88QePUAIuOKu32j8GiUSCS5+5g6uvf4Srvn4f\n2YlJjoxMcFJPKy+9aNeytrmhPcVFZ21kcCzBLx8bZGBonP6hcXY/9iCtTc5pJ3XSuyHFcGac7o0N\ntDTW05SGVEWwAAASEElEQVSqm3Nb5bApB352bJRnnd1Le/uTJ6dFcS7zvqEs//S1+xgdL3DOyc0c\nHclyy+7gEMDBI330bN5MY3MjTal6LjprIzfdf4TPXPswO3rb2TxPj8pcv/+o6l+q5hjOg5BoKdSr\nrFQq0dzWSWvHsbXAs7X9u7BuzAybUqnEYKbA4OAgiUQdnZ0ddLU1MJkfP27Gd/l5Tc0t5AtFiiXI\npgeoq2uga2Pwe5w5U/yOh/q551dHOOPkLp779G2R/CzP+o0tfPvmvezrG6WtuYFzT+vmVb99KvV1\nK+tc2765je2b25jIT7L3wFEgyRP9We579OiMRw1PX0rVJ/nZA4N0tTfT2BCEfKFQ4PDQOIXJLPnJ\nIpNTU1x3T5qG+iQtjXW0NtXR0QyvePapnHby5gWDca5AnS9M73y4n89c+xAT+Skue8YWOlqSTwq8\n7NiTj8nvbG3gVc/dyZd/9Dif+PYvee9rn0FD/fH7b/aXFIjPkQGy/ijUJVaKyUYeOZjj0V+PkJ2Y\nLN8KBKHV2VLP4fRjbN7YRntLisx4noMDaZ44PMbYRIbCVBGARAKaGwps6R5l55Z2upqDJVzTY3m+\ndIOTakjyhhedQTKillxjQx3vf+MF5ApT9G5ornqLsSlVz67eFp5
9zlY6OjoZHs2xZ18/tz7Qz8Rk\ngvHcFBO5SbITBYZGCxwaPH6afVOqjlR9koa6EnXJBFOlJCNjkwyNFgB48CsPsKnjUc4+ZSO/ccpG\nbEcXHbOOq5+r1f/bzzyV5pY2CpNFRsby7Pn1CA8+PsTdPkCqIclbXnoWZ5/czC27Fx8vv+CMTewb\nyHHL7kN85UbnD//LU+fcl2oRS1zUJNTN7IXAVUAd8G/u/oFa1FFtpVKJsfGCzoddA6PjBb51y35+\nurufUgka6pKctr2TZHGCZCJBXaqZI8MTHBmZ4K5fDQKDT3p+XTKYZNXWkqIumWAkkyWbK7L3YJq9\nB9MkEnDtHX0US1CYLHLF809n84aWqtU/V4u1DujdsDpdwF1tjdj2DvqHxma1fjM8+5ytYciWgBKZ\nTIY7H+mjtS3obj/Sf4hkso6N3ZvDz8Akjx88Sq4wxWOHx/npfQf56X0HgWDi3kndraTCFvNELs/h\noXFyhVHyk0WKxRLX3n3PnDVu72njv7/ibLZ1t5JOj1T8s/3Xy4zHD6e56d6DjIzlefNLzqSlqWHe\nx08Vi2ti1vxEfopDgxPUZ4IjGYqTeab0t0UWseqhbmZ1wMeAFwC/Bn5hZte4+0OrXUs1pLN57nq4\nn4eeGML3D5POBq2U+rpBOlpTnNzbzuaW+VcHq6WZQZJKFUmnM2tiHHEp0tk8N939a37wi/1kc5O0\nNtXx9NN62LW1nfq65JMCB2A0k+Y3nrKJyVIjI2M52ltSNNcXuP/RI9MhBUFQJRJJig0d7OvLcOjI\nKC1NDSSTSU49qZPnP2N7Vcdi13oXcEN9HQ3hX4vJfN28P18ikaCtpYEtHSXyuQJnnL+ZwUyBvuEc\nR4fHKRThoSeGnrztugTNTQ20NTdQnCpAqUhjYyN1SWioT9LWWOJlFz+FU3cs3JU/n8aGOv7sNefx\nyWse4J5fHeFvPncnL7t4Fz1dzbS3NPCrfUPc/1ia0Ylh0mMFRscLJBLwk/uPsnlDK2fu3MC5p3dz\nUndr5J+N0fECtz1wmLt9AD8wTLH45PvveGSY88/YzIVn9WI7uk6oz6qsjlq01C8A9rj74wBm9u/A\nK4ATJtTHc5M88NggP3/gMPc/enT62/OG9kbOecoGDh8dZbKYZDiTYzAdzF7efeA2nvHUzTzDetix\nuY1ksvYfxnR6hOtv20NDqpnW1kbSw8Nc9lunr4kQWcjoeAHfP8x9e47w8wf6mJwq0tpUzysu2k4y\nMXXchK2ZkskEG9sbn/QzptMjc/5xTCQS9HQ109PVTHZ783RX9cznVTOIK+kCnv1FIpNJQ2nhx8z3\nuKg1NbfQ2tZBaxvs2Hqs1d/Y3Dbdm5Udy3D7Q33TP/fsL2Hlx2zualpRgHW0pnjXq8/lWzfv5Xs/\nf4JPf2/uPzfNjXX0bmhmcmqSwmSJh54Y4qEnhvjmT/fS3dnEuad1c+7p3Zy+vWvO8fnlyE5M8tAT\ng9z5yAB3PTLA5FSRBLBjcwstjUnaW1uYnCoyksnSP5LnpnsPctO9B9mysYXnPn0bF57VW5VzD8xl\nvi+u3d3VW5tBqqsWoX4SMPP0UQeAC2tQx4IKk1MMpnNM5KcYz00yMDxO39A4jx1K4/uHp4N8x+Y2\nLj5nK+ee3k1PZxOFQoEbbnuY1o5N5AtTHBgY5dH9RzkyPM53b32c7976OI0NSXb2trJraxcbO5rY\n2N5Ic2M9DfVJUg3JsFWUpD6ZqOgP2cyHlEqQn5wil58iXygyUZgkly+SHstxJD3BYDrH0ZEJjqYn\nGBnNEfwYxz60P/rlvUFNHY1s6mia/tfe0kCqoW66vsb6JPX1SUql4INfLJUolTj2f7FEsVgiV5hi\nIl/+N8lIZpSJfJF8YYqJQpFcfopcYYrJYolkIkEiAclEglQqRTK8TAImJvKMjOUZHitwZOTYcdbd\nHY0892mbufDMTeQnxrjvsbm
XTY3K7CCuJFArDd35HnfbA/00twZfJAaP9NHS2kFL27EaxrNj/OTu\nwelJfvM9brW/IMy1rYnx0UW3tdwa5nreC87dxFNPamb/QJbBdI50dpINrTCRh229G0iFkwHLX0Dq\nUi3sfvQo9/zqCLv3HuHGuw5w410HSCagu7OR3g3BZ6M5VU9LUx3NqTo2drWTaqgjQfmzGbyvS6XS\n9N+TvqNpjqbz9A9PsL9/jHKv+tZNLTznadt41tm9JIoT3LL70PT7KzuW4llnb+HQcJFbdh/izocH\n+NqP9/C1H+9hW3fQo7BlYwubOpvobE3RUJekri5BfV2S+rokdclE+FkNP5/h5za4DFNTRbITk2TG\nC4xm82SyBY6OjLLnwDCFYoJcvshUsUSxWKS5aTftzSnaWxpob03R0RJcbm1qoKWpnubGeloag/9T\nDUmSiQT19Uk6Wo4/V4FUVy1C/YQYFPr7L9w9fa7s2Xb2tvO0Uzfxm9Zz3BKhiUSCwvgI2VLQ5b6l\nFZo3j5LrbSGTa6B/ZJLhsSJ+IIMfmHv7UUsmoLM1xUndTUzkJkk11FNXl2Q0m6Ohvo7hzAQHj6xu\nOFYqVZ9gU1sdzfU5upphx5Y2JrLD/OSuYYYGj9Da2gEzvuRMjI+RTNaTHQv29Xh2bM6AGM8++edd\n7vOGjvZz3cH9dHYdW6K0XFdbWyPZsdyCj5lZ+0KPK4d6UGt2us6Ztc82+3Gzt7/UGsqPm72v5rtt\nuduqdH9NjI9RXw9TxcSCz0sm6+js2kBjAnpaj21rMt/IZD54XPn33d4OZ+1oZkdXJ62Jo2QLzfSP\nTDKSLXI0PUH/8NwL+SxFV0uSDa1w2TNP4oxdPcEX+eLEce+v8ewY2bEM2zd28JrnbedlF27hLh/k\noX0j7DmYifwzm6pPUFeXJJkMvij0D4+zr39pZwG8/LdP44UXnhxRhQK1CfVfAztmXN9B0FqfU0KD\nRiKyDny01gWsgu9+uNYVxF8tQv1O4HQz2wUcBF4NXFGDOkRERGJl1ZeJdfdJ4P8GrgceBL56os58\nFxERERERERERERERERERERERkbVvzR0uZmavAq4EzgDOd/e753lcLNePXy1mthH4KrATeBy43N2H\n53jc4wSr00wBBXe/YBXLPCFV8t40s48ALwKywBvcfe4Fz2VOi+1jM7sE+E9gb3jTN9z9b1e1yBOc\nmX0GeAnQ7+7nzPMYvY9XYLF9vJz38arPfq/AbuCVwE/ne8CM9eNfCJwFXGFmZ65OebHxbuAGdzfg\nh+H1uZSAS9z9PAX64ip5b5rZi4HT3P104L8Bn1j1Qk9gS/j8/yR8356nQF+WzxLs4znpfVwVC+7j\n0JLex2su1N39YXf3RR42vX68uxeA8vrxUrmXA58PL38e+N0FHrvmenTWsErem9P73t1vB7rMrHd1\nyzyhVfr51/t2Bdz9ZmBogYfofbxCFexjWOL7eM2FeoXmWj/+pBrVcqLqdfe+8HIfMN+HsQTcaGZ3\nmtlbVqe0E1ol7825HrM94rripJJ9XAIuMrP7zOxaMztr1apbP/Q+jt6S38e1Op/6DcCWOe56r7t/\np4JNnBDrx9faAvv5L2decfeSmc23Ty9290Nm1gPcYGYPh98uZW6Vvjdnf/vWe7pyleyru4Ed7p41\nsxcB3wYs2rLWJb2Po7Xk93FNQt3dL13hJpa0fvx6tdB+NrM+M9vi7ofNbCvQP882DoX/D5jZtwi6\nPhXq86vkvTn7MdvD26Qyi+5jd8/MuPx9M/tnM9vo7oOrVON6oPdxxJbzPl7r3e/zjSVMrx9vZimC\n9eOvWb2yYuEa4PXh5dcTfAN8EjNrMbP28HIrcBnBREaZXyXvzWuA1wGY2W8BwzOGQmRxi+5jM+s1\ns0R4+QIgoUCvOr2PI7ac9/Gam0hiZq8EPgJ0AyPAPe7+IjPbBvyru78kfNyLOHZIy6fd/R9qV
fOJ\nKDyk7WvAycw4pG3mfjazpwDfDJ9SD3xJ+3lxc703zeyPAdz9k+FjyrO3x4A3znfopsxtsX1sZn8C\n/A9gkuBwq3e6+201K/gEZGZfAZ5H8Le4D3g/0AB6H1fLYvtY72MRERERERERERERERERERERERER\nERERERERERGR9cHMXmVmd5vZPWb2kJl9aYXb6zSz/zXrtn81s4tXVumSarjSzBoWuD9hZjeb2Zzr\nfpvZa83sfjMrhMfdzredncs9v4CZXRaen2DCzD44674Phqd0Fom9tb6inMgJI1xu9+PAy8LTJJ4J\nfHCRpy1mA/DnM29w97e4+89WuN2leB+QWuD+lxGcNW2+pZrvIVj17cssvDb4KQSn8FyOR4E3M/f+\n/jDBzyASezVZ+10kprYABWB6GUd3v7d82cwuBP4B6Ahvep+7X2tmuwiWPv0X4MVAC/DmMLg/TnBK\ny3uAMXd/tpndBHzQ3b9nZp8DJoDTgVMJlvv9HkGIbQf+yd0/Er7+U4F/Ili9KgVc5e6fC+8rEpzo\n55XAJuDP3f2bZvbxsNZbw8dc4u4js37utwAfnW+nuPsDM15joVUsPw7sCn/WX7n75WZ2PsEKky0E\nq5a93d3vnOM1Hg1f47hTCIcnJDpiZhe5+60LvL7ICU8tdZHquRe4A9hnZl83sz8Nl+PFzLqATwB/\n4O7PJGjdftLMygG/EbjV3X8T+BvgA+HtbyVYU/s8d392eFuJJ7d4zyJYqvNM4ArgNe7+HOBi4O/C\nNfzrCVrK73D3C4DnAO8xs5lnfBoJ7/tDgiDF3cvd5c8Ka3hSoJtZEnguUI2lK98KPBi+zuXhuu7f\nIDh749OBvwK+Ef4sS3Ur8Pwq1CiypinURarE3Uvu/krgEuDHwEuA+81sA3ARQffy98OW6LVAETgt\nfPqou18bXr6doNUNi5+foQR8290L7j4OPELQUsfdDwJDBC12A84A/j18/Z8SrDF95oxt/fuM198W\nhupiuoGku6creOxiZv+sTwVy7v5jAHf/IZAPb1+qA8BTVlaeyNqn7neRKgu7mx8A/tnMHiAI+Rxw\nv7s/b/bjw+733IybpljaZ3P2cyfm2FYCOOLu5y2wnYmw/qmwAV9PEKIVM7NzgKvDqz9y93fNekgt\nz7e95k5gJVJtaqmLVImZbTOzZ824vh3oAfYSdP+ebmaXzLj//Ao2mwZazKxuCaXMFV4PA1kze+2M\n1z+jfGrdRWSArnnuOwKUyttx991h9/l5cwR6Yp7aytJA54zrjwCp8j4zs98h+KLxyALbmG/72wl+\nDyKxppa6SPXUA1ea2U5gnOBL81+6+30AZvZy4INmdhXBRLVHCcbW4fgWbAnA3QfDw+J2m9ngjHH1\n4x67wPVy6/tlwFVm9ucEpyw9DFxewTY+BPzIzLLAb88cV3f3opn9FHgW8IM5asPMrgD+D8FM/peb\n2buBS9394VkPvQ94xMx2Aw+F4+q/B3zEzFqBUeD33X1yjtd4NvAVgkmICTN7DfAmd78hfMizCMbk\nRUREZD5m9rtm9tla1zEfM9saDoOIxJ6630VkRdz928CpZnZSrWuZxzuBK2tdhIiIiIiIiIiIiIiI\niIiIiIiIiIiIiIiIiIiIiEiM/P/DG6b5DXqngAAAAABJRU5ErkJggg==\n",
       "text": [
        "<matplotlib.figure.Figure at 0x114cb9d10>"
       ]
      }
     ],
     "prompt_number": 39
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "It looks like people tend to be more positive about Kaggle than negative.\n",
      "\n",
      "Let's look at the really negative tweets!  We'll define negative as less than or equal to -0.25.  We'll exclude all the tweets with links in them."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Loop through sentiments and look for negative sentiments. \n",
      "for i in range(len(sentiments)):\n",
      "    if sentiments[i] <= -0.25 and 'http' not in tweets_text[i]:\n",
      "        print tweets_text[i], sentiments[i]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "#kaggle kaggle is really so slow today -0.3\n",
        "Are you serious teacher, HE IS GOING TO FRIGGIN PUT THE ASSIGNMENT ON KAGGLE FFFFFFFFFFFFFFF -0.333333333333\n",
        "RT @ewald_zietsman: A forest of random forests. I'm making one. #noobtactics #machinelearning #kaggle -0.5\n",
        "RT @ewald_zietsman: A forest of random forests. I'm making one. #noobtactics #machinelearning #kaggle -0.5\n",
        "A forest of random forests. I'm making one. #noobtactics #machinelearning #kaggle -0.5\n",
        "@kaggle The bad thing about semi-automatic tweets when one moves up the leaderboard is that they take up almost all #kaggle tagging. -0.7\n"
       ]
      }
     ],
     "prompt_number": 40
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's look at the postive ones, too.  We'll define positive as greater than or equal to 0.25."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Loop through sentiments and look for positive sentiments. \n",
      "for i in range(len(sentiments)):\n",
      "    if sentiments[i] >= 0.25 and 'http' not in tweets_text[i]:\n",
      "        print tweets_text[i], sentiments[i]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "RT @antgoldbloom: Proud to share that over the weekend Kaggle passed a big milestone: over 1MM machine learning models have been submitted  0.4\n",
        "RT @antgoldbloom: Proud to share that over the weekend Kaggle passed a big milestone: over 1MM machine learning models have been submitted  0.4\n",
        "RT @antgoldbloom: Proud to share that over the weekend Kaggle passed a big milestone: over 1MM machine learning models have been submitted  0.4\n",
        "RT @antgoldbloom: Proud to share that over the weekend Kaggle passed a big milestone: over 1MM machine learning models have been submitted  0.4\n",
        "Proud to share that over the weekend Kaggle passed a big milestone: over 1MM machine learning models have been submitted to our competitions 0.4\n",
        "@what_mojo @KaggleStatus @kaggle  hahah then thanks both! 0.25\n",
        "@what_mojo  now they do! check @KaggleStatus, thanks @kaggle! 0.25\n",
        "You can follow @KaggleStatus for updates on our platform's performance. (Relevant information available now.) 0.4\n",
        "@gef0rce @kaggle @kdd_news who knows :D 1.0\n",
        "Kdd cup 2015 website is total ripoff of @kaggle . only difference: it doesnt work :P :P @kdd_news 0.375\n",
        "if you're a data scientist, many cool projects to try on kaggle now. itching to have a go (incl. one on west nile virus). but work calls. 0.425\n",
        "RT @mariandragt: My first @Kaggle @OttoGroup_Com challenge submission with #AzureML! Lot to improve, but happy with this first step! #Machi 0.475\n",
        "My first @Kaggle @OttoGroup_Com challenge submission with #AzureML! Lot to improve, but happy with this first step! #MachineLearning 0.475\n",
        "@tompy_ Kaggle is cool, I agree. My class assignment on there? Not so fun 0.325\n",
        "Need more ram to read data from @kaggle , or use #spark..the wonders of the latter 0.25\n",
        "Moved up 221 places on #kaggle. Top 3% now. Thank you, no-learn and lasagne. (And #python in general.) 0.275\n",
        "@benhamner @kaggle @Azure I also look 69 when I'm wearing sunglasses. But in my mirror shades: a 69-year-old Fat Maverick from Top Gun. 0.5\n",
        "Thanks @mjcavaretta, I'll be sure to take a look @kaggle @kdnuggets 0.35\n",
        "Submitting my first #Kaggle. #Predict all them bike share demand #machinelearning #DCTech #DataScience 0.25\n",
        "@benhamner @kaggle @Azure I think they just wanted a geotagged, targeting cookie bound, dataset of human faces :-) 0.25\n",
        "@HeatherOhana The MSPA does have weight. Do some Kaggle comps for experience/resume lines too, I respect that as a hiring mgr. Also, hi :) 0.5\n",
        "@AdamTaran @kaggle @combine_au Thanks for the mention, I hope you enjoy the series! 0.35\n",
        "@benhamner @kaggle Great visualization. I am now interested in this contest. 0.525\n",
        "@benhamner @kaggle You started on that quickly! 0.416666666667\n",
        "Digging the new scripts functionality on the @kaggle competitions. Great way to learn EDA methodologies. 0.468181818182\n",
        "@kaggle Shouldn't it be better to change it to something like #KaggleChallenge, with the name of the actual challenge? 0.25\n",
        "I am a giant nerd, but that's okay. #kaggle 0.25\n",
        "@treycausey They're selecting for the perfect FB dev-- smart enough to win Kaggle while not smart enough to avoid working for free. 0.329591836735\n",
        "Got an email with the title Our A/B tests predict you will open this email. Good job #kaggle 0.35\n",
        "Should you get a Coursera certificate if they already have Master's or PhD? Coursera/Kaggle projects are great for real-world data/projects 0.8\n",
        "decades of worldwide seismic data readings and live data is now accessible. Earthquake prediction is an ideal open data prize @kaggle 0.352840909091\n",
        "91 places up &amp; Top 10% , aww yeah ! 0.4375\n",
        "@PeterDiamandis hey Peter. I read your book, Bold, and am curious if kaggle is the best platform to crowdsource data mining task? 0.411111111111\n"
       ]
      }
     ],
     "prompt_number": 41
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "It seems like there might be some false positives in there.  However, it's what we have to work with, so let's get a count of the \"negative\" (-1.00 to -0.25), \"neutral\" (-0.25 to 0.25), and \"positive\" (0.25 to 1.00) tweets."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Loop through all of the sentiments and put them into the appropriate group\n",
      "pos_neg_neutral = []\n",
      "for sentiment in sentiments:\n",
      "    if sentiment <= -0.25:\n",
      "        pos_neg_neutral.append('negative')\n",
      "    elif sentiment >= 0.25:\n",
      "        pos_neg_neutral.append('positive')\n",
      "    elif sentiment > -0.25 and sentiment < 0.25:\n",
      "        pos_neg_neutral.append('neutral')\n",
      "\n",
      "sns.barplot(np.array(pos_neg_neutral))\n",
      "plt.title('Positive, Negative, and Neutral Sentiment')\n",
      "plt.xlabel('Sentiment Category')\n",
      "plt.ylabel('Frequency')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 42,
       "text": [
        "<matplotlib.text.Text at 0x110e6a810>"
       ]
      },
      {
       "metadata": {},
       "output_type": "display_data",
       "png": "iVBORw0KGgoAAAANSUhEUgAAAfQAAAFwCAYAAABO94lEAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3XucXVV5+P9PSEAMAUw0CVcBkUeJ1RZEpFUrIvaHVgFb\nBbyiUNuvKKC2VGKrpVpuXhCrUu8SqkSCXBqVSxCKRagiCoJEfEScSkCS4AyQGNEQ5vfHWiOHcS5n\nIDtnsvm8X6+8su/72efsOc9ea6+9F0iSJEmSJEmSJEmSJEmSJEmSJEmSJLVXRMyPiM+OMf91EXHp\nhoxpMoiIiyLiDb2OYyIioi8iXtzrOB6piDghIv6z13EARMSqiNi513Fo8prW6wC08YuIPmAOsA74\nNXAx8PbM/PUj2V5mntyx7Z2B24Bpmflgnf9l4MuPLurx1eN6PLBLZq6p0/4GeF1mvqjhfZ8A7JqZ\nv0/gmfmyJvfZkMH67w9ExJnAG4HnZub36rSnApmZmzzaHUfElcB/ZubnH8VmRoy9Yx/vAf4GmA3c\nA1ydmYc9iv0NbfdKhsWemVs+2u0+wlj6gCMy84pe7F/de9R/NBLlR+/l9QdnT2Av4J/X8z6mrOft\ndWsT4Nge7fuxoB/4t4a2PV4y7qZAM+p5FxGHA68HXlzP/b2Ab04owtGNGfsGNkjv/v40AZbQtV5l\n5p0RcQnwRwARcSBwMrAdcAPw1sy8pc57N3A0sBVwJ3BUZl4xrHT6P3XT90TEIPAXwNOBIzPzBRHx\nH8DqzDxuKIaI+C/gysz8aERsB3wceAGwGvhoZn68y8MZBD4M/GNEnJGZ9w5fICKeXre/J7ASeG9m\nnlvnPRE4E/hz4CfAEuCFmfmCOv9jwCuBrYGfAu/IzG9HxAHAfGBKRBwM3JqZewyV2oAvAcuB52Xm\nzXVbs4H/A56cmXdHxMspiXInYCnw/zLzpvEOOCKeULe/N+X34eq67h11/pWU72Q/4FnA/wKvzcxf\n1flvqPvdAjhtnN0NAguA10bEn2fm/wxfICK2rtt5KfAg8EXgXzLzweG1GB21OZsC76d85/tExOnA\nFzPzmIh4EHg78E7Kxdquo30P431WlAR+aWb+HCAzlwOf6zL2N1FK9v8LHEkp3R+VmZdExIljxP7U\nzLyt1m6sAXauy94AvJpy3rwRuAt4TWbeUGMZ9e+gfo7zgN/Uz+EXwOGZ+f16u+HJwNciYh3wr5n5\n4S4+G/WAJXStL1MAImJHyg/YDyIigLOBY4AnARdRfhg2jYinAW8D9srMrSiJuq9uq7N08oL6/9aZ\nuVVmfmfYfs8GDh0aiYiZwEuAhRGxCfA14HrKBcWLgXdExF9M4LiuA64E/mH4jIjYAriMkgBnA4cB\nZ0TE7nWRTwKrgLnA4ZQf2s5juxb4Y2BmPY5zI2KzzLwEOAn4SmZumZl71OUHgcHM/C1wHvCajm0d\nQrmIuTsi9gA+D7wFmAV8GlgcEZt1cbyb1HWfXP/9BvjEsGVeA7yJcptls6HPJiLmAWcAr6N83k8E\ndhhnf2vqsZ44yvwzgd8BuwJ7UM6Tv6nzRivFDmbmPwFXAW+rn+ExHfMPAp5DSWIwyvcwTtwA3wHe\nGBH/EBF7RcTUCcQO5aLpFsrn9EHK5844sXd6NfBPlL+t39V4vkf5zr9KvaDq8u/gFcBCykXNYup3\nXi+WfkGtgTOZT24mdK0PU4ALI2KA8kN0JaVUfijw9cy8PDPXUUq7jwf+lHK//XHAMyJi08z8RWbe\n1rE9RhgeybeBwYgYSvyvAq7JzLsoP9pPysx/y8wHaknqc5TE261B4H3A0RHxpGHzXg78PDMXZOaD\ntTR0PvDq+uP+V5QS2f2Z+WNKafT3x5OZX87MgbruafXzeFrHcY917GcPO47X1mkAfwt8OjO/l5mD\nmXkW8Ftgn/EONjP7M/OCGvNqSrJ94bDP44u
ZeWtm3g8sAv6kznsV8LXM/HZm/g54L6VkOpZBygXH\nk2vNxO9FxFzKxeE7M/M3mbkSOL3juLupBh5pmZMz8556YTTe9zCq2pbjaOD/o5zzyyPiH7uMHeD/\nMvPzmTkInAVsGxFzxol9yCBwfmZeX4/jAuDXmfmlur1FlIsI6O7v4KrMvKSu+yXKBY42Mla5a30Y\nBA4a3mgmIralXN0DkJmDEXE7sH1m/k9EvAM4gZLULwXelZm/nMiO6za/Qik1XkVJbGfV2TsB29UL\njSFTeagav9t93BwRXweOB37cMWsn4LnDtj+t7v9Jdfj2jnnLOrcbEf8AHEEpNQ1Sbj0Mv2gYzZXA\n9IjYG1hB+QG+oCOuN0bE0R3LbwpsO95GI2I68FFKkppZJ8+IiCn1xx5Kde6Q3wAz6vB2dBxjZq6J\niF+Nt8/M/F1EfAD4AA9PMjvVuH9ZKnuAUgj5Bd0bqRTf+Z08qu8hM88Gzq4XcK8EvhwRN1Cq0MeL\n/a6O7aypy82gfJ+jxd5pRcfw/cPGO7+Xbv4OlncMrwE2j4hNhhqiauNgQleT7gSeOTQSEVOAHYE7\nADJzIaVqfEtKKe1USrV0p24aBy0ElkTEqZRqzIPq9F9QStAx6prd+xfgB8BHOqb9AvhWZv5BFX79\ngX+Acrw/rZN37Jj/AuA4YL+O++D9PFQqG/O4M3NdRCyiXMisoJSMh54q+AVwYmaeNKEjLP4eCGDv\nzFwREX9COe4p48VE+b6HbjcMXRw8cZx1ho73TODdwF93zLudUrPwxFESy2pgesf4NsPmj1ol3xHj\neN9DV2oN1Fdru5BnUM7JsWIfz/psFHc7Y/8djLevydRAT2MwoatJi4DjI2I/Sun5WEpJ4pp6f30H\nSqOr39bpI/2IrqRU2+7KQ4nxYTLzhoi4m1KNeElm3ldnXQusqtWgH6fcZ9wd2Dwzr4uIfYErsotH\npDLzZxFxTj2GG+vkbwCnRMTrgXPqtD8BVmXmLRFxPnBCfdRtJ+ANlIZrAFtSEv7d9X7t8ZSS4ZC7\ngP2HlYzh4Z/R2cB/AXcD7+mY/lnggoj4JuWe6nRgX8rFx+raoGowM988wqHOoJTu7o2IWZQLmeFG\nS3bnAd+JiOfV/b6fsW/rdd5+eCAi/oXyPQ1N+2VELAFOi4j3Uh6J3IVaw0NpCPbu2m7jPkqDsE7L\nKefNWMb7HkZVW7mvpJzbv6bUajwD+G5m3jVO7OMZL/aJXHCM+XfQxbaGYvGxtUnOe+hqTGYm5bGe\nj1N++P4SeEVmPkC5T3lynf5LShXn0A/y759dzvL894nA1RHRHxHP7Zzf4WxKy+uh+8jUktHLKUn2\ntrqvz/DQD/aOlAuKbr2fkhyHYltFaeh0GKXW4Zf1mIYaVL2d0sjoLsr984WUH1OAS+q/pDQG/A0P\nr449t/7/q4i4rmP67487M6+llFK3pTz7PzT9+5QGcZ+gPBb2Ux7eIG9HStuDkZxOaedwN3BN3e7w\nz3pw2PDQ53EzpaHj2ZTSej/DqrdH2E7nthbW9TqnvZHyeS6t2zuXWhLPzG9SLqRupFxAfG3Yuh8D\nXlXPm9NHiWG872HU5+gpFxHvoVykDQCnUJ4IuGa82EfZ7kRiH77+qNurtQdj/R2MF8vJwD9HxEBE\nvGuEWDRJNPpsYUQcS2nVOQX4bGZ+rF71n0MpsfQBh2TmPXX5+ZR7WeuAYzJzSZPx6bEtytvoFmXm\nZRtof6cCc0YpGW8QtRR6PfCs+kMvqSUaS+gR8UeUK+7nAGspV8H/D/g74O7M/GC93zQzM4+vj7yc\nXZffnvKChrBRhjZW9dG8xwE3Uc7rb1Cen1/c08AktVKTVe5Pp9xLur+WBL5FafByIKX6kfr/wXX4\nIGBhZq7NzD7gVkoDJ2ljtSXlvvJq4CvAh03mkprSZKO4HwEn1ir2+4GXUV7SMTfLG5WgNLaYW4e3\no7wYYcg
ySkld2ijVBke79ToOSY8NjZXQs7ze81TK6y4vprRIXTdsmbEanDDOPEmSVDX62FpmfgH4\nAkB9P/EyytuUtqmPdWzLQy9DuIOO53QpjzTdMdb21659YHDatOFvW5Qkqb2mTJkyYvu3RhN6RMyp\nL6d4MuU1mPtQnsU8nFJ6Pxy4sC6+mPLGpdMoVe27UZ6fHNXAwJqmQpckaaPS9HPoX42ImynJ+qgs\nvVWdArwkIpLy3PApAJm5lPIikqWUKvqjhr1QQ5IkjWKj7uN2xYr7TPiSpMeUOXO2GjF3+6Y4SZJa\nwIQuSVILmNAlSWoBE7okSS1gQpckqQVM6JIktYAJXZKkFjChS5LUAiZ0SZJawIQuSVILmNAlSWoB\nE7okSS1gQpckqQVM6JIktYAJXZKkFpjW6wCkNli3bh19fbf1Ogw1ZOedn8LUqVN7HYY0JhO6tB70\n9d3GsR9azPSt5/Q6FK1na+5dwceOO5Bdd92t16FIYzKhS+vJ9K3nMGPm9r0OQ9JjlPfQJUlqARO6\nJEktYEKXJKkFTOiSJLWACV2SpBYwoUuS1AImdEmSWqDR59AjYj7weuBB4CbgzcAWwDnATkAfcEhm\n3tOx/BHAOuCYzFzSZHySJLVFYyX0iNgZeAuwZ2Y+E5gKHAYcD1yWmQFcXseJiHnAocA84ADgjIiw\nBkGSpC40mTDvA9YC0yNiGjAduBM4EFhQl1kAHFyHDwIWZubazOwDbgX2bjA+SZJao7GEnpn9wEeA\nX1AS+T2ZeRkwNzOX18WWA3Pr8HbAso5NLAN8j6YkSV1ossp9V+AdwM6UZD0jIl7fuUxmDgKDY2xm\nrHmSJKlqslHcXsA1mfkrgIg4H/hT4K6I2CYz74qIbYEVdfk7gB071t+hThvVzJnTmTbNLg3VewMD\nM3odgho0a9YMZs/estdhSGNqMqHfArw3Ih4P3A/sD1wL/Bo4HDi1/n9hXX4xcHZEnEapat+tLj+q\ngYE1zUQuTVB//+peh6AG9fevZuXKVb0OQxpTk/fQfwicBVwH3FgnfwY4BXhJRCSwXx0nM5cCi4Cl\nwMXAUbVKXpIkjaPR59Az84PAB4dN7qeU1kda/iTgpCZjkiSpjXzOW5KkFjChS5LUAiZ0SZJawIQu\nSVILmNAlSWoBE7okSS1gQpckqQVM6JIktYAJXZKkFjChS5LUAiZ0SZJawIQuSVILmNAlSWoBE7ok\nSS1gQpckqQVM6JIktYAJXZKkFjChS5LUAiZ0SZJawIQuSVILmNAlSWoBE7okSS1gQpckqQVM6JIk\ntYAJXZKkFpjW5MYj4mnAVzomPQV4L/Al4BxgJ6APOCQz76nrzAeOANYBx2TmkiZjlCSpDRotoWfm\nTzJzj8zcA3g2sAa4ADgeuCwzA7i8jhMR84BDgXnAAcAZEWEtgiRJ49iQyXJ/4NbMvB04EFhQpy8A\nDq7DBwELM3NtZvYBtwJ7b8AYJUnaKG3IhH4YsLAOz83M5XV4OTC3Dm8HLOtYZxmw/YYJT5KkjdcG\nSegRsRnwCuDc4fMycxAYHGP1seZJkiQabhTX4aXA9zNzZR1fHhHbZOZdEbEtsKJOvwPYsWO9Heq0\nEc2cOZ1p06Y2ErA0EQMDM3odgho0a9YMZs/estdhSGPaUAn9NTxU3Q6wGDgcOLX+f2HH9LMj4jRK\nVftuwLWjbXRgYE0jwUoT1d+/utchqEH9/atZuXJVr8OQxtR4lXtEbEFpEHd+x+RTgJdERAL71XEy\ncymwCFgKXAwcVavkJUnSGBovoWfmr4EnDZvWT0nyIy1/EnBS03FJktQmPuMtSVILmNAlSWoBE7ok\nSS1gQpckqQVM6JIktYAJXZKkFjChS5LUAiZ0SZJawIQuSVILmNAlSWoBE7okSS1gQpckqQVM6JIk\ntYAJXZKkFjChS5LUAiZ0SZJawIQuSVILmNAlSWoBE7okSS1gQpckqQVM6
JIktYAJXZKkFjChS5LU\nAiZ0SZJawIQuSVILTGt6BxHxBOBzwDOAQeDNwE+Bc4CdgD7gkMy8py4/HzgCWAcck5lLmo5RkqSN\n3YYooX8MuCgzdweeBdwCHA9clpkBXF7HiYh5wKHAPOAA4IyIsBZBkqRxNJosI2Jr4AWZ+QWAzHwg\nM+8FDgQW1MUWAAfX4YOAhZm5NjP7gFuBvZuMUZKkNmi6yn0XYGVEfBH4Y+D7wDuAuZm5vC6zHJhb\nh7cDvtOx/jJg+4ZjlCRpo9d0dfY0YE/gjMzcE/g1tXp9SGYOUu6tj2aseZIkieZL6MuAZZn5vTr+\nVWA+cFdEbJOZd0XEtsCKOv8OYMeO9Xeo00Y0c+Z0pk2b2kDY0sQMDMzodQhq0KxZM5g9e8tehyGN\nqdGEXhP27RERmZnA/sDN9d/hwKn1/wvrKouBsyPiNEpV+27AtaNtf2BgTZPhS13r71/d6xDUoP7+\n1axcuarXYUhjavyxNeBo4MsRsRnwM8pja1OBRRFxJPWxNYDMXBoRi4ClwAPAUbVKXpIkjaHxhJ6Z\nPwSeM8Ks/UdZ/iTgpEaDkiSpZXzGW5KkFjChS5LUAiZ0SZJawIQuSVILmNAlSWoBE7okSS1gQpck\nqQVM6JIktYAJXZKkFjChS5LUAiZ0SZJawIQuSVILmNAlSWoBE7okSS1gQpckqQVM6JIktYAJXZKk\nFjChS5LUAiZ0SZJawIQuSVILmNAlSWoBE7okSS0wbbwFIuJS4BPA1zNzsPmQJEnSRHVTQv808A7g\ntoh4d0Q8seGYJEnSBI2b0DPz/Mx8MfAyYHvg5og4KyKe3Xh0kiSpKxO5hz4ITAHWAvcDZ0XEaY1E\nJUmSJqSbe+ivAo4CtqXcS989M1dHxDTgVuBd46zfB9wHrAPWZubeETELOAfYCegDDsnMe+ry84Ej\n6vLHZOaSR3ZokiQ9dnRTQn8zcCowLzM/mZmrATLzAeCYLtYfBPbNzD0yc+867XjgsswM4PI6TkTM\nAw4F5gEHAGdEhC3xJUkaRzfJ8uWZeWlnC/eImAKQmYu73M+UYeMHAgvq8ALg4Dp8ELAwM9dmZh+l\nBmBvJEnSmLpJ6FdFxMyhkdrK/VsT2Mcg8M2IuC4i3lKnzc3M5XV4OTC3Dm8HLOtYdxmlIZ4kSRrD\nuPfQgRmZOTA0kpm/iogtJ7CP52XmLyNiNnBZRNzSOTMzByNirOfbR503c+Z0pk2bOoFQpGYMDMzo\ndQhq0KxZM5g9eyI/e9KG101C3yQitsjMXwNExAxg0253kJm/rP+vjIgLKFXoyyNim8y8KyK2BVbU\nxe8AduxYfYc6bUQDA2u6DUNqVH//6l6HoAb1969m5cpVvQ5DGlM3Ve4LKSXr10fEG4AlwJe72XhE\nTB8qzUfEFsBfADcBi4HD62KHAxfW4cXAYRGxWUTsAuwGXNvtwUiS9Fg1bgk9M0+OiDspDdYGgU9l\n5lldbn8ucEFEDO3ry5m5JCKuAxZFxJHUx9bqvpZGxCJgKfAAcJSvm5UkaXzDW59vVFasuM9kr0nh\nZz/7KfM/8x1mzLQNZ9usHriDk/92H3bddbdehyIBMGfOViPm7m5eLDMXOBrYtWP5wcw8ZP2FJ0mS\nHo1uGsWdR6kCvwx4sE6zZCxJ0iTSTUJ/Qmb+beORSJKkR6ybVu4/ighvDEqSNIl1U0KfBdwUEVdT\nelkD76FLkjSpdJPQz67/OnkPXZKkSaSb59DP3ABxSJKkR2Hce+hRfLv2a05E7BkRJzQdmCRJ6l43\njeL+AzgRuKeO/5D6ZjdJkjQ5dJPQt87Mi6n3zTNzHfC7RqOSJEkT0k1CfyAiNhsaqY+wrWsuJEmS\nNFHdVrmfDzwpIv4V+DbwkUajkiRJE9JNK/cFEXEb8Arg8cAbM/OqxiOTJEld6+Y5dGoCN4lLkjRJ\nddPb2vdGmDyYmXs3EI8kSXoEuimhH
9cxvDnwGuDOZsKRJEmPRDf30K/sHI+IS4GrmwpIkiRNXDet\n3IfbGpi7vgORJEmP3ETvoW8CPAUfW5MkaVKZ6D30B4DbMtN76JIkTSITvocuSZImn26q3FdS3uM+\nZYTZg5k5Z71HJUmSJqSbKvdPAbOAz1CS+pHAAPCFBuOSJEkT0E1Cf1lmPrtj/OiIuC4z39dUUJIk\naWK6eWxtq4iYPTRSh7dqLiRJkjRR3ZTQTwduiIivU6rcXwac1O0OImIqcB2wLDNfERGzgHOAnYA+\n4JDMvKcuOx84gtI96zGZuWQCxyJJ0mPWuCX0zPwk8FLgZuAm4KWZecYE9nEssJTSsA7geOCyzAzg\n8jpORMwDDgXmAQcAZ0TEI3nxjSRJjzndJsw+4OrM/Hhm3hQRI7V4/wMRsQOlRP85HmolfyCwoA4v\nAA6uwwcBCzNzbWb2AbcCdgAjSVIXxk3oEfEySun8/Dr+HGBxl9v/KOXFNA92TJubmcvr8HIeeo3s\ndsCyjuWWAdt3uR9Jkh7TurmH/n5KSfkigMz8XkTsOt5KEfFyYEVmXh8R+460TGYORsTgSPOqseYx\nc+Z0pk2bOl4oUuMGBmb0OgQ1aNasGcyevWWvw5DG1E1CJzN/GRGdk37XxWp/BhxYS/ibU1rL/yew\nPCK2ycy7ImJbYEVd/g5gx471d6jTRjUwsKab8KXG9fev7nUIalB//2pWrlzV6zCkMXVzD/2+iNhm\naKSWtgfGWykz35OZO2bmLsBhwBWZ+QZKdf3hdbHDgQvr8GLgsIjYLCJ2AXYDru36SCRJegzrpoQ+\nn1LdvnNEfIuSaA98BPsaqj4/BVgUEUdSH1sDyMylEbGI0iL+AeCozByzyl2SJBVjJvT62Nj9wH6U\nKnSAa4aeG+9WZn4L+FYd7gf2H2W5k5jAM+6SJKkYM6Fn5oMR8aXMfCa1UZwkSZp8urmH/tN6T1uS\nJE1S3dxD3wq4MSK+DQw15R3MzEOaC0uSJE3EqAk9Ij6SmX8PfAlYxMMfVbOxmiRJk8hYJfT9ADLz\nzIi4PjP32EAxSZKkCbLzE0mSWmCsEvrmtQe0KR3Dv5eZSxuNTJIkdW2shP544Bt1eErH8BBbvkuS\nNEmMmtAzc+cNGIckSXoUvIcuSVILmNAlSWoBE7okSS1gQpckqQVM6JIktYAJXZKkFjChS5LUAiZ0\nSZJawIQuSVILmNAlSWoBE7okSS1gQpckqQVM6JIktYAJXZKkFjChS5LUAiZ0SZJaYFpTG46IzYFv\nAY8DNgP+KzPnR8Qs4BxgJ6APOCQz76nrzAeOANYBx2TmkqbikySpTRoroWfm/cCLMvNPgGcBL4qI\n5wPHA5dlZgCX13EiYh5wKDAPOAA4IyKsQZAkqQuNJszMXFMHNwOmAgPAgcCCOn0BcHAdPghYmJlr\nM7MPuBXYu8n4JElqi0YTekRsEhE3AMuB/87Mm4G5mbm8LrIcmFuHtwOWday+DNi+yfgkSWqLpkvo\nD9Yq9x2AP4+IFw2bPwgMjrGJseZJkqSqsUZxnTLz3oj4BvBsYHlEbJOZd0XEtsCKutgdwI4dq+1Q\np41q5szpTJs2tZGYpYkYGJjR6xDUoFmzZjB79pa9DkMaU5Ot3J8EPJCZ90TE44GXAP8KLAYOB06t\n/19YV1kMnB0Rp1Gq2ncDrh1rHwMDa8aaLW0w/f2rex2CGtTfv5qVK1f1OgxpTE1WuW8LXFHvoX8X\n+FpmXg6cArwkIhLYr46TmUuBRcBS4GLgqFolL0mSxtFYCT0zbwL2HGF6P7D/KOucBJzUVEySJLWV\nz3lLktQCJnRJklrAhC5JUguY0CVJagETuiRJLWBClySpBUzokiS1gAldkqQWMKFLktQCJnRJklrA\nhC5JUguY0CVJagETuiRJLWBClySpBUzokiS1gAldkqQWMKFLktQCJnRJklrAhC5JUguY0CVJagET\nu
iRJLWBClySpBUzokiS1wLReByBJ+kPr1q2jr++2Xoehhuy881OYOnXqet2mCV2SJqG+vts49kOL\nmb71nF6HovVszb0r+NhxB7Lrrrut1+02mtAjYkfgLGAOMAh8JjP/PSJmAecAOwF9wCGZeU9dZz5w\nBLAOOCYzlzQZoyRNVtO3nsOMmdv3OgxtJJq+h74WeGdmPgPYB3hbROwOHA9clpkBXF7HiYh5wKHA\nPOAA4IyI8D6/JEnjaDRZZuZdmXlDHV4N/BjYHjgQWFAXWwAcXIcPAhZm5trM7ANuBfZuMkZJktpg\ng5V+I2JnYA/gu8DczFxeZy0H5tbh7YBlHasto1wASJKkMWyQhB4RM4DzgGMzc1XnvMwcpNxfH81Y\n8yRJEhuglXtEbEpJ5v+ZmRfWycsjYpvMvCsitgVW1Ol3ADt2rL5DnTaimTOnM23a+m32Lz0SAwMz\neh2CGjRr1gxmz95yg+7Tc6rdmjinmm7lPgX4PLA0M0/vmLUYOBw4tf5/Ycf0syPiNEpV+27AtaNt\nf2BgTRNhSxPW37+61yGoQf39q1m5ctX4C67nfaq9mjinmi6hPw94PXBjRFxfp80HTgEWRcSR1MfW\nADJzaUQsApYCDwBH1Sp5SZI0hkYTemZ+m9Hv0+8/yjonASc1FpQkSS3kM96SJLWACV2SpBYwoUuS\n1AImdEmSWsCELklSC5jQJUlqARO6JEktYEKXJKkFTOiSJLWACV2SpBYwoUuS1AImdEmSWsCELklS\nC5jQJUlqARO6JEktYEKXJKkFTOiSJLWACV2SpBYwoUuS1AImdEmSWsCELklSC5jQJUlqARO6JEkt\nYEKXJKkFTOiSJLXAtCY3HhFfAP4SWJGZz6zTZgHnADsBfcAhmXlPnTcfOAJYBxyTmUuajE+SpLZo\nuoT+ReCAYdOOBy7LzAAur+NExDzgUGBeXeeMiLAGQZKkLjSaMDPzKmBg2OQDgQV1eAFwcB0+CFiY\nmWszsw+4Fdi7yfgkSWqLXpSA52bm8jq8HJhbh7cDlnUstwzYfkMGJknSxqqnVdqZOQgMjrHIWPMk\nSVLVaKO4USyPiG0y866I2BZYUaffAezYsdwOddqoZs6czrRpUxsKU+rewMCMXoegBs2aNYPZs7fc\noPv0nGq3Js6pXiT0xcDhwKn1/ws7pp8dEadRqtp3A64da0MDA2saDFPqXn//6l6HoAb1969m5cpV\nG3yfaq8mzqmmH1tbCLwQeFJE3A68DzgFWBQRR1IfWwPIzKURsQhYCjwAHFWr5CVJ0jgaTeiZ+ZpR\nZu0/yvJ76OSXAAALYElEQVQnASc1F5EkSe3kc96SJLWACV2SpBYwoUuS1AImdEmSWsCELklSC5jQ\nJUlqARO6JEktYEKXJKkFTOiSJLWACV2SpBYwoUuS1AImdEmSWsCELklSC5jQJUlqARO6JEktYEKX\nJKkFTOiSJLWACV2SpBaY1usAemHdunX09d3W6zDUkJ13fgpTp07tdRiStEE9JhN6X99tHPuhxUzf\nek6vQ9F6tubeFXzsuAPZddfdeh2KJG1Qj8mEDjB96znMmLl9r8OQJGm98B66JEktYEKXJKkFTOiS\nJLWACV2SpBaYdI3iIuIA4HRgKvC5zDy1xyFJkjTpTaoSekRMBT4BHADMA14TEbv3NipJkia/SZXQ\ngb2BWzOzLzPXAl8BDupxTJIkTXqTLaFvD9zeMb6sTpMkSWOYbPfQBzfUjtbcu2JD7UobUC+/V8+p\ndvKc0vr2mPheI2KfiLikY3x+RLy7lzFJkrQxmGwl9OuA3SJiZ+BO4FDgNT2NSJKkjcCkuoeemQ8A\nbwcuBZYC52Tmj3sblSRJkiRJkiRJkiRJkiRJmpQiYuuIeGvH+HYRcW4vY9LGKSJ2iohH9BRKRKxe\n3/Fo4xQRfxcRb6jDb4qIbTvmfdbXf0ujiIidI+KmXsehjV9E7Bs
RXxtl3piPxEbEqmai0sYsIv47\nIp7d6zg2VlN6HYAerj6DfzFwFfBnwB2U99lvT+m4ZjawBnhLZv4kInYFvgxMBxYDx2bmlhExA7gQ\nmAlsCvxzZi6OiK8ABwI/AZYAZwBfz8xnRsR3gCMyc2mN5UrgXXXZjwPPqNs6ITMXN/xRqCGP4Bw7\nE/haZp5X119Vz7HvAE8Hfg4sAAaAvwa2oDwS+3Lgvxh2DnZuY4McsBpTz6VLKO8Q2RO4GXgj5bz6\nEOVdJ98D3pqZv4uIU4BXAA8Al2bmP0bECcAqoA84k3I+rqnbuAT4e2AvYNfM/Me63zcBz87MoyPi\n9cDRwGbAd4GjMvPBhg99UppUz6Hr954KfCIz/wi4h/Ij+Wng6MzcCziOkogBPgZ8NDOfxcPfg/8b\n4JWZ+WxgP+Ajdfq7gZ9l5h6Z+W4eflH3FeAQgFrttU1m/gD4J+DyzHxu3daHImL6+j5obVATOcdG\neyXzu4Gr6rl0OuVc2gP468x8EXA/I5+DapcAPpmZ84D7KAn4i8Ah9XdpGvDWiJgFHJyZz8jMPwb+\nra4/CAzWC8brgNdm5p6Zef/QPOA84JUd+zwEWFir4w8B/iwz9wAeBF7X8PFOWib0yennmXljHf4+\nsDPlavXciLge+BSwTZ2/DzB0D3xhxzY2AU6OiB8ClwHbRcQcxq6VORd4VR0+pGO7fwEcX/f938Dj\ngB0f2aFpkpjIOTaa4efSILAkM++p46Odg2qX2zPzf+vwlygXb7dl5q112gLgz4F7gfsj4vMR8UpK\noWMkf/AblZl3A7dFxHMj4onA0zPzGuDFwLOB6+p5ux+wy/o6sI3NZHv1q4rfdgyvA+YC99Qr0G69\nDngSsGdmrouInwObj7VCZt4REb+KiGdSEvrfdcz+q8z86QT2r8ltIufYA9SL/4jYhFK1OZo1HcMT\nPge1UeqswZlCqfF54rBp1HNgb0oSfhXlraAvHmd7nYZqEG8Bzu+YviAz3/PIQm8XS+gbh/soV6ev\nAoiIKRHxrDrvOzxUqj6sY52tgBX1j+hFwE51+ipgrHuX51CqUrfKzB/VaZcCxwwtEBETubDQxmGs\nc6yPUgqC0v5i0zo8/FwaXrIa7RxUuzw5Ivapw6+lVJvvXNv3ALwBuDIitgCekJkXU9rm/HGdP4WH\nzp1VlPNmJBcAB1P69/hKnXY58KqImA0QEbMi4snr57A2Pib0yWn4Feog8HrgyIi4AfgR5YcV4B3A\nu+r0XSnVWlAayu0VETdS/qB+DJCZvwKujoibIuJUHrpHNeSrlE5xFnVM+wCwaUTcGBE/Av51/Rym\nemgi59hngRfW6fsAQ4+c/RBYFxE3RMQ7+MNzacRzcJT9a+P1E+BtEbEU2Bo4DXgz5fbNjZQank9R\nEvXX6i2Yq4B31vU7z5szgU9FxA8i4mG1OfVWzlLgyZl5XZ32Y+CfgSV1u0sY/1aRNDlFxOM7hg+L\niAt6GY+kxw4fg51cvIe+8Xt2RHyCUmU1ABzR43gkPbZY2yJJkiRJkiRJkiRJkiRJktYvO2eRGhQR\nrwbmU/7WNgd+kJmP+F3TEbE18HeZ+cGOaZ8FzszMqx9tvF3GcAJwYmauHWX+lsBJwAGUZ9bXAedl\n5snjbPdNwNW+kVB6ZHyxjNSQ2sHNJ4FX1A5Mdqf0QPVozKR0nPJ7mfmWDZXMq/cxyutfI2IKcBHl\nUabd66tkn0d5A9h43kTp6GODGK+LV2ljYwldakh9Re7Xgadm5h90RBERzwVO5qFXXb4vMy+qXVJe\nR3m71ssoXeMemZlXR8Q3KJ3l/Aj4dWY+v3Zz+6HM/Ebt6vR+YDfKmwMvBL5BScI7UHrm+/e6/6cB\nH6W8b30z4PTMPLPOe5DSy94rKe/lPi4zz4+ITwJvBW6i9Gy1b2YOvZ2QiNgf+DzwlMxcN8Ixv5jy\n5sHNKe/BODEzz4mINwP/Dqy
g9tiVmVdExLuBv6rL3kHp0nV5ran4AjCvTr8TWJ6Zx9Wugz9O6XIT\n4KzM/FDd/5XA9ZQ33vVTun7ty8wPd3xnCzPz6cNjlyY7S+hSc24ArgV+ERHnRsSxtQtJIuIJwH9Q\nuorci9JH9KcjYii5zwKuycw9gfcDp9bpR1E7UcnM59dpw1+5Oo9S3b075b3Xh2XmCygl5RMjYnot\nnZ4NvDMz9wZeAMyPiM4S8r113hsoyZbMfFud96c1hnt5uD2B74+UzKvvA8+vx/US4MMRsXVmfpFy\nEXN03e4VtZ/rpwD71C5YL+ahLljfB/yq1nq8Gnh+x2fw3hrrMyk9yB0eEQd0fFa7AM/LzL+k9P/e\n2QnR2ym1KtJGx4QuNSQzBzPzlcC+lG5n/xK4MSJmUhLNLsDFtdvHiygl3qfW1Vdn5kV1+LuU0jaM\nX6s2CFyYmWtrrcBPKCV0MvNOytsEd6BUbT8d+Erd//9QOl3ZvWNbQx1gfJfS9elYvax17n+sGOcA\n59XXhV5CuXB5Wsf8znUPBPYHflBjPIqHOnjZl9LnNpk5QKmJGPJiyvvnycxVlG6F9++Yf3ZmPljn\n30LplOaA+r28gvI+cWmj4z0kqWGZeTNwM3BGRNxMSUa/BW7MzBcOX75WuQ/v3nQif6vD171/hG1N\nAe4ep0ve+2v862rBfRrwu3H2/QNKRx1TRyml/wflguOVABHxEx7eperw14h+YOg2wAimjDI80rzO\n7a4etuy/Uy4WnkFpvNfN/X5p0rGELjUkIraLiD/tGN8BmA3cBlwD7BYR+3bMf04Xm70PmB4RUycQ\nykgl5luANbVae2j/T68t1MezCnjCSDMy83JgGfCRiNi0bvdxETFUVb818H91+kt4qEYCyrF1bncx\n5eLgCR3bGerS9UrgjXX6E3ioZziAbwJH1nlbUnoPvGyM47mIUkvwTqxu10bMhC41ZxpwQkTcUquM\nvwH8U2b+sHYFeSDwL7X70aWU+8JDRurelMzsp3RLelNEfHuU/Y64bqdaen4FcFhE/LB2i/sJHurr\nfKxtfAS4onZxufUI+38p5dh/XLvP/B6wRZ13POW++fWUe98/7FjvM8D7IuL6iNgvM79Uj/VbtWvM\n6yi3KqC0K5gTET8Gzq/zhu7nfwCYUqv1r6E0ilsyQpxDn8UgcBZwW2b+aLTlJEnSehYR0yLicXV4\nq3pRst+j2N5lEfHX6y9CacPzHrqkjdEs4KJ662Fz4MuZecVENxIRe1Ea//0gM89bzzFKkiRJkiRJ\nkiRJkiRJkiRJkiRJkiRtrP5/RXcswMJz7x8AAAAASUVORK5CYII=\n",
       "text": [
        "<matplotlib.figure.Figure at 0x103b2f810>"
       ]
      }
     ],
     "prompt_number": 42
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Most tweets are neutral, but among the rest, positive tweets far outnumber negative ones."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Conclusion"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We've made a few interesting discoveries here, but there is a lot more to explore. You could collect more data, look at the sentiment for specific topics, track topic velocity over time, try more complex topic models, and much more. Natural Language Processing is a vast field with many subareas. We've provided some links in the DAT5 README to help you explore it more deeply."
     ]
    }
   ],
   "metadata": {}
  }
 ]
}