{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Twitter text analysis\n", "\n", "Let's load one day's worth of tweets from India. These were\n", "[captured](https://github.com/gramener/twitter-stream) via the\n", "[Twitter API](https://dev.twitter.com/). The file is at .\n", "It's just under 7MB.\n", "\n", "First, let's download the file." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import os\n", "import urllib\n", "\n", "tweetfile = 'tweets.json.gz'\n", "if not os.path.exists(tweetfile):\n", " url = 'http://files.gramener.com/data/tweets.20130919.json.gz'\n", " urllib.urlretrieve(url, tweetfile)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This file is not *quite* a gzipped JSON file, despite the file name. Each row is a JSON string. Some lines might be blank -- especially alternate lines." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\"created_at\":\"Wed Sep 18 03:39:02 +0000 2013\",\"id\":380174094936702976,\"id_str\":\n", "{\"created_at\":\"Wed Sep 18 03:39:02 +0000 2013\",\"id\":380174096635416577,\"id_str\":\n", "{\"created_at\":\"Wed Sep 18 03:39:06 +0000 2013\",\"id\":380174111076405248,\"id_str\":\n", "{\"created_at\":\"Wed Sep 18 03:39:16 +0000 2013\",\"id\":380174154751696896,\"id_str\":\n" ] } ], "source": [ "import gzip\n", "for line in gzip.open(tweetfile).readlines()[:8]:\n", " if line.strip():\n", " print line[:80]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's load this into a Pandas data structure. After some experimentation, I find that this is a reasonably fast way of loading it." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n", "import json\n", "\n", "series = pd.Series([\n", " line for line in gzip.open(tweetfile) if line.strip()\n", "]).apply(json.loads)\n", "\n", "data = pd.DataFrame({\n", " 'id' : series.apply(lambda t: t['id_str']),\n", " 'name': series.apply(lambda t: t['user']['screen_name']),\n", " 'text': series.apply(lambda t: t['text']),\n", "}).set_index('id')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We've extracted just a few things from the tweets -- such as the ID (which we set as the index), the person who tweeted it, the text of the tweet." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " name \\\n", "id \n", "380174094936702976 rgokul \n", "380174096635416577 fknadaf \n", "380174111076405248 neetakolhatkar \n", "380174154751696896 pinashah1 \n", "380174182803202050 MeghaLvsShaleen \n", "\n", " text \n", "id \n", "380174094936702976 பின்னாடி பாத்தா பர்ஸ்னாலிட்டி, முன்னாடி பாத்தா... \n", "380174096635416577 @rehu123 \\nHi..re h r u..???? \n", "380174111076405248 @sohamsabnis mhanunach jau dya..tyat phile jod... \n", "380174154751696896 @Miragpur7 jok of tha day \n", "380174182803202050 @ilovearrt @shweet_tasu @akanksha_pooh31 @Miss... " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Pure Python\n", "\n", "Now let's do some basic text analysis on this.\n", "\n", "## Most frequent words: `.split(' ')` and `.value_counts()`\n", "\n", "Let's get the full text as a string and count the words. Let's assume that words are split by a single space." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "to 3256\n", "the 3235\n", " 2441\n", "in 2275\n", "a 2193\n", "dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "words = pd.Series(' '.join(data['text']).split(' '))\n", "words.value_counts().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are lots of errors in the assumption that words are split by a single space. That ignores punctuation, multiple spaces, hyphenation, and a lot of other things. But **it's not a bad starting point** and you can start making reasonable inferences as a first approximation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### NLTK: `.word_tokenize()`\n", "\n", "The process of converting a sentence into words is called tokenization. NLTK offers an `nltk.word_tokenize()` function for this. Let's try it out:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "@sohamsabnis mhanunach jau dya..tyat phile jodi ne ahe...mhanje imagination la break nahi\n", "[u'@', u'sohamsabnis', u'mhanunach', u'jau', u'dya..tyat', u'phile', u'jodi', u'ne', u'ahe', u'...', u'mhanje', u'imagination', u'la', u'break', u'nahi']\n", "\n", "@Miragpur7 jok of tha day\n", "[u'@', u'Miragpur7', u'jok', u'of', u'tha', u'day']\n", "\n", "@ilovearrt @shweet_tasu @akanksha_pooh31 @MissHal96 @Mishtithakur @SalgaonkarPriya @Shaleen_Ki_Pari Super cute :p\n", "[u'@', u'ilovearrt', u'@', u'shweet_tasu', u'@', u'akanksha_pooh31', u'@', u'MissHal96', u'@', u'Mishtithakur', u'@', u'SalgaonkarPriya', u'@', u'Shaleen_Ki_Pari', u'Super', u'cute', u':', u'p']\n", "\n", "Looking forward to interacting with the dynamic students, faculty & team of @SriSriU. Its fast becoming a global centre of excellence !\n", "[u'Looking', u'forward', u'to', u'interacting', u'with', u'the', u'dynamic', u'students', u',', u'faculty', u'&', u'amp', u';', u'team', u'of', u'@', u'SriSriU', u'.', u'Its', u'fast', u'becoming', u'a', u'global', u'centre', u'of', u'excellence', u'!']\n", "\n" ] } ], "source": [ "import nltk\n", "for i in range(2, 6):\n", " print data['text'][i]\n", " print nltk.word_tokenize(data['text'][i])\n", " print ''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are a few problems with this. User names like `@ilovearrt` are split into `@` and `iloverrrt`. 
Similarly, `&` is split. And so on.\n", "\n", "NLTK offers other tokenizers, including the ability to custom-write your own. But for now, we'll just go with our simple list of space-separated words.\n", "\n", "**NOTE**: Tokenization is usually specific to a given dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# NLTK\n", "\n", "## Remove stopwords: `nltk.corpus.stopwords` and `.drop()`\n", "\n", "The bigger problem is that the most common words are also the most often used -- to, the, in, a, etc. These are called **stopwords**. We need a way of finding and removing them.\n", "\n", "NLTK offers a standard list of stopwords. This is what we get if we remove those." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ " 2441\n", "I 1817\n", "I'm 970\n", "u 695\n", "- 604\n", "@ 507\n", "The 503\n", ":) 493\n", "& 467\n", "like 390\n", "hai 364\n", "(@ 363\n", "good 333\n", "one 285\n", "get 285\n", "! 281\n", "time 280\n", "love 269\n", "n 266\n", "r 263\n", "day 245\n", "RT 244\n", "people 242\n", ":D 240\n", "7 240\n", "#ForSale 227\n", "#Flat 226\n", "don't 223\n", "iOS 222\n", "ur 222\n", " ... \n", "gained 1\n", "grips 1\n", "agreed.. 1\n", "election.#congreefights 1\n", "din.... 1\n", "http://t.co/RXM8hgBBoS 1\n", "langsunglah 1\n", "policies. 1\n", "thriller 1\n", "dummy 1\n", "Amen! 1\n", "थीं 1\n", "meetings 1\n", "#Mathura 1\n", "@AmypichardAmy 1\n", "http://t.co/necdoOUAHU 1\n", "Ravjiani, 1\n", "#TheAsianAge 1\n", "coffee.!! 1\n", "race's 1\n", "http://t.co/aH4i8A0Nz1\" 1\n", "real… 1\n", "http://t.co/GNzghJBYX1 1\n", "update?? 1\n", "#lazy 1\n", "107,#Gurgaon, 1\n", "annaru 1\n", "snooping 1\n", "@BangaloreAshram 1\n", "जैन। 1\n", "dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from nltk.corpus import stopwords\n", "ignore = set(stopwords.words('english')) & set(words.unique())\n", "words.value_counts().drop(ignore)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Still, it's not really clear what the words are. We need to go further.\n", "\n", "- Let's use lowecase for standardisation.\n", "- Let's remove punctuations. Maybe any word that *even contains punctuation*, like \"I'm\" or \"&\"\n", "- All single-letter words are a good idea to drop off too, like \"u\"." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [], "source": [ "relevant_words = words.str.lower()\n", "relevant_words = relevant_words[~relevant_words.str.contains(r'[^a-z]')]\n", "relevant_words = relevant_words[relevant_words.str.len() > 1]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "good 543\n", "like 418\n", "hai 386\n", "one 365\n", "love 351\n", "time 321\n", "get 300\n", "new 298\n", "people 297\n", "see 273\n", "day 271\n", "ios 255\n", "rt 247\n", "ki 242\n", "ur 242\n", "know 228\n", "go 221\n", "life 219\n", "best 214\n", "se 205\n", "back 201\n", "morning 200\n", "make 192\n", "never 192\n", "hi 192\n", "follow 188\n", "still 188\n", "want 185\n", "india 180\n", "way 178\n", " ... 
\n", "chattarpur 1\n", "pleaseeeeee 1\n", "bhujiya 1\n", "chuploo 1\n", "enuff 1\n", "roost 1\n", "cantt 1\n", "parsvnath 1\n", "expired 1\n", "beam 1\n", "beshak 1\n", "cld 1\n", "pace 1\n", "mushtaq 1\n", "howdy 1\n", "ghalib 1\n", "leya 1\n", "pudhcha 1\n", "pilgrim 1\n", "soiled 1\n", "lool 1\n", "krissh 1\n", "imo 1\n", "muaaaaah 1\n", "pranam 1\n", "bevkoof 1\n", "destroyed 1\n", "quater 1\n", "vasundhara 1\n", "validity 1\n", "dtype: int64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ignore = set(stopwords.words('english')) & set(relevant_words.unique())\n", "relevant_words.value_counts().drop(ignore)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This list is a lot more meaningful.\n", "\n", "But before we go ahead, let's take a quick look at the *words we've ignored* to see if we should've taken something from there." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ " 2441\n", "a 2377\n", "i 2161\n", "i'm 980\n", "u 778\n", "- 604\n", "@ 507\n", ":) 493\n", "& 467\n", "(@ 363\n", "don't 292\n", "n 291\n", ":p 287\n", "r 285\n", "! 281\n", "it's 243\n", ":d 241\n", "7 240\n", "#ios7 232\n", "#forsale 227\n", "#flat 226\n", "2 217\n", ". 216\n", "? 215\n", "!! 204\n", "#residential 204\n", ".. 196\n", ", 191\n", "#bappamorya 189\n", ":-) 173\n", "dtype: int64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "words.drop(relevant_words.index).str.lower().value_counts().head(30)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "... Ah! We're missing all the smileys (which may be OK) and the hashtags (which could be useful). Should we just pull in the hashtags alone? Let's do that. We'll allow `#` as an exception. We'll also ignore `@` which usually indicates reply to a person." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "good 543\n", "like 418\n", "hai 386\n", "one 365\n", "love 351\n", "time 321\n", "get 300\n", "new 298\n", "people 297\n", "see 273\n", "day 271\n", "ios 255\n", "rt 247\n", "ki 242\n", "ur 242\n", "know 228\n", "#forsale 227\n", "#flat 226\n", "go 221\n", "life 219\n", "best 214\n", "se 205\n", "#residential 204\n", "back 201\n", "morning 200\n", "never 192\n", "make 192\n", "hi 192\n", "#bappamorya 189\n", "follow 188\n", " ... \n", "circumstances 1\n", "vaat 1\n", "parag 1\n", "recreate 1\n", "#pounding 1\n", "#nestle 1\n", "meuble 1\n", "#thingsthatmakemehappy 1\n", "primarily 1\n", "kanipinchadu 1\n", "#kathmandont 1\n", "ruhu 1\n", "kashif 1\n", "tidak 1\n", "bl 1\n", "dekhaunchu 1\n", "jokingly 1\n", "inclination 1\n", "bd 1\n", "bf 1\n", "sants 1\n", "@itweetfacts 1\n", "dictated 1\n", "bk 1\n", "#instaholic 1\n", "jaaoege 1\n", "mahmood 1\n", "br 1\n", "#justbeingme 1\n", "betch 1\n", "dtype: int64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "relevant_words = words.str.lower()\n", "relevant_words = relevant_words[~relevant_words.str.contains(r'[^#@a-z]')]\n", "relevant_words = relevant_words[relevant_words.str.len() > 1]\n", "ignore = set(stopwords.words('english')) & set(relevant_words.unique())\n", "relevant_words.value_counts().drop(ignore)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We haven't added anything to the list of top words, but further down, it may be useful." 
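] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To see which hashtags we just let through, here is a minimal sketch (not part of the original analysis) that pulls them straight out of the raw tweet text with a pandas regex, independent of the space-based split above. The `#\\w+` pattern is an assumption -- it keeps only ASCII word characters, so hashtags in other scripts are dropped." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Sketch: count hashtags directly from the raw text (assumes ASCII-only hashtags)\n", "hashtag_lists = data['text'].str.lower().str.findall(r'#\\w+')\n", "hashtags = pd.Series([tag for tags in hashtag_lists for tag in tags])\n", "hashtags.value_counts().head(10)"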
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Word stems: `nltk.PorterStemmer()`\n", "\n", "Let's look at all the words that start with `time`, like `timing`, `timer`, etc." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "time 321\n", "times 51\n", "timeline 6\n", "timings 3\n", "timeless 3\n", "timesnow 2\n", "timetable 1\n", "tim 1\n", "timely 1\n", "timezone 1\n", "timing 1\n", "timed 1\n", "timline 1\n", "timesheet 1\n", "timepass 1\n", "timro 1\n", "dtype: int64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "relevant_words[relevant_words.str.startswith('tim')].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At the very least, we want `time` and `times` to mean the same word. These are word stems. Here's one way of doing this in NLTK." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "time 378\n", "timelin 6\n", "timeless 3\n", "timesnow 2\n", "timlin 1\n", "timezon 1\n", "tim 1\n", "timet 1\n", "timesheet 1\n", "timepass 1\n", "timro 1\n", "dtype: int64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "porter = nltk.PorterStemmer()\n", "stemmed_words = relevant_words.apply(porter.stem)\n", "stemmed_words[stemmed_words.str.startswith('tim')].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that this introduces words like `timelin` instead of `timeline`. These can be avoided through the use of a process called `lemmatization` (see `nltk.WordNetLemmatizer()`). However, this is relatively slower.\n", "\n", "For now, we'll just stick to the original words." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bigrams: `nltk.collocations`\n", "\n", "What if we want to find phrases? If we're looking for 2-word combinations (bigrams), we can use the `nltk.collocations.BigramCollocationFinder`. These are the top 30 word pairs." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#bappamorya #bappamorya\n", "good morning\n", "#flat #forsale\n", "#jacksonville #jobs\n", "will be\n", "#residentialplot #land\n", "to be\n", "agle baras\n", "international airport\n", "#land #forsale\n", "posted photo\n", "baras tu\n", "tu jaldi\n", "#smwmumbai #mumbaiisamazing\n", "now trending\n", "happy birthday\n", "just posted\n", "in the\n", "waiting for\n", "trending topic\n", "jaldi aa\n", "gracious acts\n", "#apartment #flat\n", "@smwmumbai #smwmumbai\n", "follow back\n", "the best\n", "railway station\n", "cycling km\n", "god bless\n", "shows up\n" ] } ], "source": [ "from nltk.collocations import BigramCollocationFinder\n", "from nltk.metrics import BigramAssocMeasures\n", "\n", "bcf = BigramCollocationFinder.from_words(relevant_words)\n", "for pair in bcf.nbest(BigramAssocMeasures.likelihood_ratio, 30):\n", " print ' '.join(pair)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## See this as a word cloud\n", "\n", "Let's get the data into a DataFrame" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " word count\n", "0 good 543\n", "1 like 418\n", "2 hai 386\n", "3 one 365\n", "4 love 351" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "top_words = relevant_words.value_counts().drop(ignore).reset_index()\n", "top_words.columns = ['word', 'count']\n", "top_words.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(Work in progress...)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# sklearn" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import re\n", "\n", "re_separator = re.compile(r'[\\s\"#\\.\\?,;\\(\\)!/]+')\n", "re_url = re.compile(r'http.*?($|\\s)')\n", "def tokenize(sentence):\n", " sentence = re_url.sub('', sentence)\n", " words = re_separator.split(sentence)\n", " return [word for word in words if\n", " len(word) > 1]\n", "\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "vectorizer = CountVectorizer(\n", " # analyser='word', # Separate using punctuations\n", " # analyzer=re_separator.split, # Separate using spaces\n", " # analyzer=re_separator.split, # Separate using custom separator\n", " analyzer=tokenize, # Separate using custom separator\n", " min_df=10, # Ignore words that occur less than 10 times in the corpus\n", ")" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Note: for these 18,000 documents, sklearn takes about ~0.5 seconds on my system\n", "X = vectorizer.fit_transform(data['text'])" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "# terms: 2482\n", "^_^ 869\n", ":-D 51\n", "I’m 482\n", "don’t 1203\n", "& 0\n", "[pic] 867\n", ":-P 52\n", "-www 7\n", "[pic]: 868\n", "> 1\n", "alert: 908\n", "here: 1437\n", "100% 11\n", "Job: 490\n", "IN: 454\n", "Café 292\n", ":3 53\n", ":o 57\n", ":p 58\n", ":O 55\n", ":D 54\n", ":P 56\n", "Méridien 592\n", "-_- 6\n", "< 2\n", "10:30 12\n" ] } ], "source": [ "# Here are some of the terms that have special characters \n", "print '# terms: %d' % len(vectorizer.vocabulary_)\n", "for key in vectorizer.vocabulary_.keys():\n", " if re.search('\\W', key) and not re.search(r'[@#\\']', key) and re.search('\\w', key):\n", " print key, vectorizer.vocabulary_[key]" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Apply TF-IDF\n", "from sklearn.feature_extraction.text import TfidfTransformer\n", "transformer = TfidfTransformer()\n", "tfidf = transformer.fit_transform(X)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "6 [u'sorry'] @b50 oops...sorry typo. 'Than'\n", "7 [u'place'] 9h09 place au someil maintenant\n", "24 [u'GM'] @satish_bsk GM\n", "25 [u'org'] @mrlumpyU_U xbek menindas org yg xblik msia. Ngagaha\n", "26 [u'Hey'] Hey evrybuddy http://t.co/vH89PFhyYg\n", "35 [u'ha'] @gauthamvarma04 ha ha\n", "85 [u'ma'] @bindeshpandya gm$... Jay ma bharat..vande mataram.. 
Namo namah...@BJYM @BJP_Gujarat\n" ] } ], "source": [ "# Let's see the unusual terms\n", "import numpy as np\n", "terms = np.array(vectorizer.get_feature_names())\n", "for index in range(100):\n", " t = terms[(tfidf[index] >= 0.99).toarray()[0]]\n", " if len(t):\n", " print index, t, data['text'][index]" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Segment by those with above median followers\n", "followers_count = series.map(lambda v: v['user']['followers_count'])\n", "segment = followers_count.values > followers_count.median()\n", "count1 = X[segment].sum(axis=0)\n", "count2 = X[~segment].sum(axis=0)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b term\n", "0 261 242 &\n", "1 74 53 >\n", "2 48 44 <\n", "3 4 7 's\n", "4 4 16 --" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Count of term in each segment\n", "df = pd.DataFrame(np.concatenate([count1, count2]).T).astype(float)\n", "df.columns = ['a', 'b']\n", "df['term'] = terms\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [], "source": [ "total = df['a'] + df['b']\n", "contrast = df['a'] / total - 0.5\n", "freq = total.rank() / len(df)\n", "df['significance'] = freq / 2 + contrast.abs()" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b term significance\n", "664 290 1 Property 0.985282\n", "370 239 0 ForSale 0.985093\n", "365 232 0 Flat 0.983783\n", "252 0 189 BappaMorya 0.980459\n", "688 222 2 Residential 0.973747" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sort_values('significance', ascending=False).head()" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def termdiff(terms, counts, segment):\n", " df = pd.DataFrame(np.concatenate([\n", " counts[segment].sum(axis=0),\n", " counts[~segment].sum(axis=0)\n", " ]).T).astype(float)\n", " df.columns = ['a', 'b']\n", " df['term'] = terms\n", " total = df['a'] + df['b']\n", " df['contrast'] = 2 * (df['a'] / total - 0.5)\n", " df['freq'] = total.rank() / len(df)\n", " df['significance'] = (df['freq'] + df['contrast'].abs()) / 2\n", " return df.sort_values('significance', ascending=False)" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b term contrast freq significance\n", "664 290 1 Property 0.993127 0.977438 0.985282\n", "370 239 0 ForSale 1.000000 0.970185 0.985093\n", "365 232 0 Flat 1.000000 0.967566 0.983783\n", "252 0 189 BappaMorya -1.000000 0.960919 0.980459\n", "688 222 2 Residential 0.982143 0.965351 0.973747" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "termdiff(terms, X, segment).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There seem to be several influential people on Twitter -- accounts with above-median follower counts -- tweeting about properties for sale. Less influential accounts are tweeting about BappaMorya." ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": false }, "outputs": [], "source": [ "with_hashtags = series.apply(lambda v: len(v['entities']['hashtags']) > 0).values" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b term contrast freq significance\n", "2419 0 135 टन -1.000000 0.943191 0.971595\n", "451 35 946 I'm -0.928644 0.995568 0.962106\n", "551 2 134 Maharashtra -0.970588 0.943795 0.957192\n", "2399 3 98 और -0.940594 0.922643 0.931619\n", "1163 2 76 dear -0.948718 0.891620 0.920169\n", "2409 10 156 के -0.879518 0.956285 0.917902\n", "1604 0 56 lessons -1.000000 0.835012 0.917506\n", "2465 19 246 है -0.856604 0.974416 0.915510\n", "697 0 54 Rumi -1.000000 0.829371 0.914686\n", "2468 1 63 है। -0.968750 0.860596 0.914673" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tdiff = termdiff(terms, X, with_hashtags)\n", "tdiff[tdiff['b'] > tdiff['a']].head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tweets without hashtags tend to be Hindi tweets.\n", "\n", "The word \"I'm\" often is used without hashtags. (These are typically tweets that say \"I'm at\".)" ] }, { "cell_type": "code", "execution_count": 84, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([u\"I'm at LINK (Mumbai, Maharashtra) http://t.co/ComXHpCbua\",\n", " u\"I'm getting fragrance of a dish being cooked in pure ghee... seems yum\",\n", " u\"I'm at Le Meridien - @spg (Bangalore, Karnataka) http://t.co/GhzDYpdTRu\",\n", " u\"I'm at Lajpat Nagar Metro Station (New Delhi, new delhi) http://t.co/MNEHQ9Qesg\",\n", " u\"I'm at Godrej Memorial Hospital (Mumbai, Maharashtra) http://t.co/8lieJFiZH5\"], dtype=object)" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.ix[X.T[451].toarray()[0] > 0]['text'].values[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The word \"dear\" is often used in hashtags. These are typically replies." ] }, { "cell_type": "code", "execution_count": 85, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([u\"@Ghislainemonie hi dear what's new for dinner today. I can't decide..\",\n", " u'@bbhhaappyy So nice of you dear! Trust me and have a great day ahead! Stay blessed and keep connected. Rabh rakha hai!',\n", " u'@MJCfan keep smiling dear..have a great day ahead..',\n", " u'@sonamakapoor @PerniaQureshi. Hi good morning dear',\n", " u'@2ps664 @skelkar07 @keerti07 @TahminaJaved @sheetal3176 @Jyoramesh10 hi dear how r u'], dtype=object)" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.ix[X.T[1163].toarray()[0] > 0]['text'].values[:5]" ] }, { "cell_type": "code", "execution_count": 92, "metadata": { "collapsed": false }, "outputs": [], "source": [ "tdiff = termdiff(terms, X, series.map(lambda v: v['user']['location'].lower().startswith('bangalore')))" ] }, { "cell_type": "code", "execution_count": 96, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b term contrast freq significance\n", "435 842 18155 Hi -0.911354 0.999799 0.955576\n", "1921 842 18155 re -0.911354 0.999799 0.955576\n", "0 0 0 & NaN 0.499799 NaN\n", "1 0 0 > NaN 0.499799 NaN\n", "2 0 0 < NaN 0.499799 NaN" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tdiff.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lessons learnt\n", "\n", "1. Tokenisation and filtering of words *always* have a manual element -- so make that easy.\n", "    - But are there some robust English tokenisation patterns?\n", "1. Have a single function that tells me what token is unusual about a group\n", "1. For each token, show the concordance for context" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# spaCy\n", "\n", "Install [spaCy](https://spacy.io/)\n", "\n", "    conda config --add channels spacy\n", "    conda install spacy\n", "    python -m spacy.en.download all\n", "\n", "If you get an SSL error, run:\n", "\n", "    conda config --set ssl_verify False\n", "\n", "and re-run the above commands. This adds an `ssl_verify: False` line to `~/.condarc`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.11" } }, "nbformat": 4, "nbformat_minor": 0 }