{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Twitter text analysis\n", "\n", "Let's load one day's worth of tweets from India. These were\n", "[captured](https://github.com/gramener/twitter-stream) via the\n", "[Twitter API](https://dev.twitter.com/). The file is at .\n", "It's just under 7MB.\n", "\n", "First, let's download the file." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import os\n", "import urllib\n", "\n", "tweetfile = 'tweets.json.gz'\n", "if not os.path.exists(tweetfile):\n", " url = 'http://files.gramener.com/data/tweets.20130919.json.gz'\n", " urllib.urlretrieve(url, tweetfile)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This file is not *quite* a gzipped JSON file, despite the file name. Each row is a JSON string. Some lines might be blank -- especially alternate lines." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\"created_at\":\"Wed Sep 18 03:39:02 +0000 2013\",\"id\":380174094936702976,\"id_str\":\n", "{\"created_at\":\"Wed Sep 18 03:39:02 +0000 2013\",\"id\":380174096635416577,\"id_str\":\n", "{\"created_at\":\"Wed Sep 18 03:39:06 +0000 2013\",\"id\":380174111076405248,\"id_str\":\n", "{\"created_at\":\"Wed Sep 18 03:39:16 +0000 2013\",\"id\":380174154751696896,\"id_str\":\n" ] } ], "source": [ "import gzip\n", "for line in gzip.open(tweetfile).readlines()[:8]:\n", " if line.strip():\n", " print line[:80]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's load this into a Pandas data structure. After some experimentation, I find that this is a reasonably fast way of loading it." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n", "import json\n", "\n", "series = pd.Series([\n", " line for line in gzip.open(tweetfile) if line.strip()\n", "]).apply(json.loads)\n", "\n", "data = pd.DataFrame({\n", " 'id' : series.apply(lambda t: t['id_str']),\n", " 'name': series.apply(lambda t: t['user']['screen_name']),\n", " 'text': series.apply(lambda t: t['text']),\n", "}).set_index('id')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We've extracted just a few things from the tweets -- such as the ID (which we set as the index), the person who tweeted it, the text of the tweet." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " name \\\n", "id \n", "380174094936702976 rgokul \n", "380174096635416577 fknadaf \n", "380174111076405248 neetakolhatkar \n", "380174154751696896 pinashah1 \n", "380174182803202050 MeghaLvsShaleen \n", "\n", " text \n", "id \n", "380174094936702976 பின்னாடி பாத்தா பர்ஸ்னாலிட்டி, முன்னாடி பாத்தா... \n", "380174096635416577 @rehu123 \\nHi..re h r u..???? \n", "380174111076405248 @sohamsabnis mhanunach jau dya..tyat phile jod... \n", "380174154751696896 @Miragpur7 jok of tha day \n", "380174182803202050 @ilovearrt @shweet_tasu @akanksha_pooh31 @Miss... " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Pure Python\n", "\n", "Now let's do some basic text analysis on this.\n", "\n", "## Most frequent words: `.split(' ')` and `.value_counts()`\n", "\n", "Let's get the full text as a string and count the words. Let's assume that words are split by a single space." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "to 3256\n", "the 3235\n", " 2441\n", "in 2275\n", "a 2193\n", "dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "words = pd.Series(' '.join(data['text']).split(' '))\n", "words.value_counts().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are lots of errors in the assumption that words are split by a single space. That ignores punctuation, multiple spaces, hyphenation, and a lot of other things. But **it's not a bad starting point** and you can start making reasonable inferences as a first approximation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### NLTK: `.word_tokenize()`\n", "\n", "The process of converting a sentence into words is called tokenization. NLTK offers an `nltk.word_tokenize()` function for this. Let's try it out:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "@sohamsabnis mhanunach jau dya..tyat phile jodi ne ahe...mhanje imagination la break nahi\n", "[u'@', u'sohamsabnis', u'mhanunach', u'jau', u'dya..tyat', u'phile', u'jodi', u'ne', u'ahe', u'...', u'mhanje', u'imagination', u'la', u'break', u'nahi']\n", "\n", "@Miragpur7 jok of tha day\n", "[u'@', u'Miragpur7', u'jok', u'of', u'tha', u'day']\n", "\n", "@ilovearrt @shweet_tasu @akanksha_pooh31 @MissHal96 @Mishtithakur @SalgaonkarPriya @Shaleen_Ki_Pari Super cute :p\n", "[u'@', u'ilovearrt', u'@', u'shweet_tasu', u'@', u'akanksha_pooh31', u'@', u'MissHal96', u'@', u'Mishtithakur', u'@', u'SalgaonkarPriya', u'@', u'Shaleen_Ki_Pari', u'Super', u'cute', u':', u'p']\n", "\n", "Looking forward to interacting with the dynamic students, faculty & team of @SriSriU. Its fast becoming a global centre of excellence !\n", "[u'Looking', u'forward', u'to', u'interacting', u'with', u'the', u'dynamic', u'students', u',', u'faculty', u'&', u'amp', u';', u'team', u'of', u'@', u'SriSriU', u'.', u'Its', u'fast', u'becoming', u'a', u'global', u'centre', u'of', u'excellence', u'!']\n", "\n" ] } ], "source": [ "import nltk\n", "for i in range(2, 6):\n", " print data['text'][i]\n", " print nltk.word_tokenize(data['text'][i])\n", " print ''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are a few problems with this. User names like `@ilovearrt` are split into `@` and `iloverrrt`. 
Similarly, `&` is split. And so on.\n", "\n", "NLTK offers other tokenizers, including the ability to custom-write your own. But for now, we'll just go with our simple list of space-separated words.\n", "\n", "**NOTE**: Tokenization is usually specific to a given dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# NLTK\n", "\n", "## Remove stopwords: `nltk.corpus.stopwords` and `.drop()`\n", "\n", "The bigger problem is that the most common words are also the most often used -- to, the, in, a, etc. These are called **stopwords**. We need a way of finding and removing them.\n", "\n", "NLTK offers a standard list of stopwords. This is what we get if we remove those." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ " 2441\n", "I 1817\n", "I'm 970\n", "u 695\n", "- 604\n", "@ 507\n", "The 503\n", ":) 493\n", "& 467\n", "like 390\n", "hai 364\n", "(@ 363\n", "good 333\n", "one 285\n", "get 285\n", "! 281\n", "time 280\n", "love 269\n", "n 266\n", "r 263\n", "day 245\n", "RT 244\n", "people 242\n", ":D 240\n", "7 240\n", "#ForSale 227\n", "#Flat 226\n", "don't 223\n", "iOS 222\n", "ur 222\n", " ... \n", "gained 1\n", "grips 1\n", "agreed.. 1\n", "election.#congreefights 1\n", "din.... 1\n", "http://t.co/RXM8hgBBoS 1\n", "langsunglah 1\n", "policies. 1\n", "thriller 1\n", "dummy 1\n", "Amen! 1\n", "थीं 1\n", "meetings 1\n", "#Mathura 1\n", "@AmypichardAmy 1\n", "http://t.co/necdoOUAHU 1\n", "Ravjiani, 1\n", "#TheAsianAge 1\n", "coffee.!! 1\n", "race's 1\n", "http://t.co/aH4i8A0Nz1\" 1\n", "real… 1\n", "http://t.co/GNzghJBYX1 1\n", "update?? 1\n", "#lazy 1\n", "107,#Gurgaon, 1\n", "annaru 1\n", "snooping 1\n", "@BangaloreAshram 1\n", "जैन। 1\n", "dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from nltk.corpus import stopwords\n", "ignore = set(stopwords.words('english')) & set(words.unique())\n", "words.value_counts().drop(ignore)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Still, it's not really clear what the words are. We need to go further.\n", "\n", "- Let's use lowecase for standardisation.\n", "- Let's remove punctuations. Maybe any word that *even contains punctuation*, like \"I'm\" or \"&\"\n", "- All single-letter words are a good idea to drop off too, like \"u\"." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [], "source": [ "relevant_words = words.str.lower()\n", "relevant_words = relevant_words[~relevant_words.str.contains(r'[^a-z]')]\n", "relevant_words = relevant_words[relevant_words.str.len() > 1]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "good 543\n", "like 418\n", "hai 386\n", "one 365\n", "love 351\n", "time 321\n", "get 300\n", "new 298\n", "people 297\n", "see 273\n", "day 271\n", "ios 255\n", "rt 247\n", "ki 242\n", "ur 242\n", "know 228\n", "go 221\n", "life 219\n", "best 214\n", "se 205\n", "back 201\n", "morning 200\n", "make 192\n", "never 192\n", "hi 192\n", "follow 188\n", "still 188\n", "want 185\n", "india 180\n", "way 178\n", " ... 
\n", "chattarpur 1\n", "pleaseeeeee 1\n", "bhujiya 1\n", "chuploo 1\n", "enuff 1\n", "roost 1\n", "cantt 1\n", "parsvnath 1\n", "expired 1\n", "beam 1\n", "beshak 1\n", "cld 1\n", "pace 1\n", "mushtaq 1\n", "howdy 1\n", "ghalib 1\n", "leya 1\n", "pudhcha 1\n", "pilgrim 1\n", "soiled 1\n", "lool 1\n", "krissh 1\n", "imo 1\n", "muaaaaah 1\n", "pranam 1\n", "bevkoof 1\n", "destroyed 1\n", "quater 1\n", "vasundhara 1\n", "validity 1\n", "dtype: int64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ignore = set(stopwords.words('english')) & set(relevant_words.unique())\n", "relevant_words.value_counts().drop(ignore)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This list is a lot more meaningful.\n", "\n", "But before we go ahead, let's take a quick look at the *words we've ignored* to see if we should've taken something from there." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ " 2441\n", "a 2377\n", "i 2161\n", "i'm 980\n", "u 778\n", "- 604\n", "@ 507\n", ":) 493\n", "& 467\n", "(@ 363\n", "don't 292\n", "n 291\n", ":p 287\n", "r 285\n", "! 281\n", "it's 243\n", ":d 241\n", "7 240\n", "#ios7 232\n", "#forsale 227\n", "#flat 226\n", "2 217\n", ". 216\n", "? 215\n", "!! 204\n", "#residential 204\n", ".. 196\n", ", 191\n", "#bappamorya 189\n", ":-) 173\n", "dtype: int64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "words.drop(relevant_words.index).str.lower().value_counts().head(30)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "... Ah! We're missing all the smileys (which may be OK) and the hashtags (which could be useful). Should we just pull in the hashtags alone? Let's do that. We'll allow `#` as an exception. We'll also ignore `@` which usually indicates reply to a person." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "good 543\n", "like 418\n", "hai 386\n", "one 365\n", "love 351\n", "time 321\n", "get 300\n", "new 298\n", "people 297\n", "see 273\n", "day 271\n", "ios 255\n", "rt 247\n", "ki 242\n", "ur 242\n", "know 228\n", "#forsale 227\n", "#flat 226\n", "go 221\n", "life 219\n", "best 214\n", "se 205\n", "#residential 204\n", "back 201\n", "morning 200\n", "never 192\n", "make 192\n", "hi 192\n", "#bappamorya 189\n", "follow 188\n", " ... \n", "circumstances 1\n", "vaat 1\n", "parag 1\n", "recreate 1\n", "#pounding 1\n", "#nestle 1\n", "meuble 1\n", "#thingsthatmakemehappy 1\n", "primarily 1\n", "kanipinchadu 1\n", "#kathmandont 1\n", "ruhu 1\n", "kashif 1\n", "tidak 1\n", "bl 1\n", "dekhaunchu 1\n", "jokingly 1\n", "inclination 1\n", "bd 1\n", "bf 1\n", "sants 1\n", "@itweetfacts 1\n", "dictated 1\n", "bk 1\n", "#instaholic 1\n", "jaaoege 1\n", "mahmood 1\n", "br 1\n", "#justbeingme 1\n", "betch 1\n", "dtype: int64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "relevant_words = words.str.lower()\n", "relevant_words = relevant_words[~relevant_words.str.contains(r'[^#@a-z]')]\n", "relevant_words = relevant_words[relevant_words.str.len() > 1]\n", "ignore = set(stopwords.words('english')) & set(relevant_words.unique())\n", "relevant_words.value_counts().drop(ignore)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We haven't added anything to the list of top words, but further down, it may be useful." 
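] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To see which hashtags we just let through, here is a minimal sketch (not part of the original analysis) that pulls them straight out of the raw tweet text with a pandas regex, independent of the space-based split above. The `#\\w+` pattern is an assumption -- it keeps only ASCII word characters, so hashtags in other scripts are dropped." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Sketch: count hashtags directly from the raw text (assumes ASCII-only hashtags)\n", "hashtag_lists = data['text'].str.lower().str.findall(r'#\\w+')\n", "hashtags = pd.Series([tag for tags in hashtag_lists for tag in tags])\n", "hashtags.value_counts().head(10)"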
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Word stems: `nltk.PorterStemmer()`\n", "\n", "Let's look at all the words that start with `time`, like `timing`, `timer`, etc." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "time 321\n", "times 51\n", "timeline 6\n", "timings 3\n", "timeless 3\n", "timesnow 2\n", "timetable 1\n", "tim 1\n", "timely 1\n", "timezone 1\n", "timing 1\n", "timed 1\n", "timline 1\n", "timesheet 1\n", "timepass 1\n", "timro 1\n", "dtype: int64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "relevant_words[relevant_words.str.startswith('tim')].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At the very least, we want `time` and `times` to mean the same word. These are word stems. Here's one way of doing this in NLTK." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "time 378\n", "timelin 6\n", "timeless 3\n", "timesnow 2\n", "timlin 1\n", "timezon 1\n", "tim 1\n", "timet 1\n", "timesheet 1\n", "timepass 1\n", "timro 1\n", "dtype: int64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "porter = nltk.PorterStemmer()\n", "stemmed_words = relevant_words.apply(porter.stem)\n", "stemmed_words[stemmed_words.str.startswith('tim')].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that this introduces words like `timelin` instead of `timeline`. These can be avoided through the use of a process called `lemmatization` (see `nltk.WordNetLemmatizer()`). However, this is relatively slower.\n", "\n", "For now, we'll just stick to the original words." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bigrams: `nltk.collocations`\n", "\n", "What if we want to find phrases? If we're looking for 2-word combinations (bigrams), we can use the `nltk.collocations.BigramCollocationFinder`. These are the top 30 word pairs." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#bappamorya #bappamorya\n", "good morning\n", "#flat #forsale\n", "#jacksonville #jobs\n", "will be\n", "#residentialplot #land\n", "to be\n", "agle baras\n", "international airport\n", "#land #forsale\n", "posted photo\n", "baras tu\n", "tu jaldi\n", "#smwmumbai #mumbaiisamazing\n", "now trending\n", "happy birthday\n", "just posted\n", "in the\n", "waiting for\n", "trending topic\n", "jaldi aa\n", "gracious acts\n", "#apartment #flat\n", "@smwmumbai #smwmumbai\n", "follow back\n", "the best\n", "railway station\n", "cycling km\n", "god bless\n", "shows up\n" ] } ], "source": [ "from nltk.collocations import BigramCollocationFinder\n", "from nltk.metrics import BigramAssocMeasures\n", "\n", "bcf = BigramCollocationFinder.from_words(relevant_words)\n", "for pair in bcf.nbest(BigramAssocMeasures.likelihood_ratio, 30):\n", " print ' '.join(pair)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## See this as a word cloud\n", "\n", "Let's get the data into a DataFrame" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " word count\n", "0 good 543\n", "1 like 418\n", "2 hai 386\n", "3 one 365\n", "4 love 351" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "top_words = relevant_words.value_counts().drop(ignore).reset_index()\n", "top_words.columns = ['word', 'count']\n", "top_words.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(Work in progress...)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# sklearn" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import re\n", "\n", "re_separator = re.compile(r'[\\s\"#\\.\\?,;\\(\\)!/]+')\n", "re_url = re.compile(r'http.*?($|\\s)')\n", "def tokenize(sentence):\n", " sentence = re_url.sub('', sentence)\n", " words = re_separator.split(sentence)\n", " return [word for word in words if\n", " len(word) > 1]\n", "\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "vectorizer = CountVectorizer(\n", " # analyser='word', # Separate using punctuations\n", " # analyzer=re_separator.split, # Separate using spaces\n", " # analyzer=re_separator.split, # Separate using custom separator\n", " analyzer=tokenize, # Separate using custom separator\n", " min_df=10, # Ignore words that occur less than 10 times in the corpus\n", ")" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Note: for these 18,000 documents, sklearn takes about ~0.5 seconds on my system\n", "X = vectorizer.fit_transform(data['text'])" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "# terms: 2482\n", "^_^ 869\n", ":-D 51\n", "I’m 482\n", "don’t 1203\n", "& 0\n", "[pic] 867\n", ":-P 52\n", "-www 7\n", "[pic]: 868\n", "> 1\n", "alert: 908\n", "here: 1437\n", "100% 11\n", "Job: 490\n", "IN: 454\n", "Café 292\n", ":3 53\n", ":o 57\n", ":p 58\n", ":O 55\n", ":D 54\n", ":P 56\n", "Méridien 592\n", "-_- 6\n", "< 2\n", "10:30 12\n" ] } ], "source": [ "# Here are some of the terms that have special characters \n", "print '# terms: %d' % len(vectorizer.vocabulary_)\n", "for key in vectorizer.vocabulary_.keys():\n", " if re.search('\\W', key) and not re.search(r'[@#\\']', key) and re.search('\\w', key):\n", " print key, vectorizer.vocabulary_[key]" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Apply TF-IDF\n", "from sklearn.feature_extraction.text import TfidfTransformer\n", "transformer = TfidfTransformer()\n", "tfidf = transformer.fit_transform(X)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "6 [u'sorry'] @b50 oops...sorry typo. 'Than'\n", "7 [u'place'] 9h09 place au someil maintenant\n", "24 [u'GM'] @satish_bsk GM\n", "25 [u'org'] @mrlumpyU_U xbek menindas org yg xblik msia. Ngagaha\n", "26 [u'Hey'] Hey evrybuddy http://t.co/vH89PFhyYg\n", "35 [u'ha'] @gauthamvarma04 ha ha\n", "85 [u'ma'] @bindeshpandya gm$... Jay ma bharat..vande mataram.. 
Namo namah...@BJYM @BJP_Gujarat\n" ] } ], "source": [ "# Let's see the unusual terms\n", "import numpy as np\n", "terms = np.array(vectorizer.get_feature_names())\n", "for index in range(100):\n", " t = terms[(tfidf[index] >= 0.99).toarray()[0]]\n", " if len(t):\n", " print index, t, data['text'][index]" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Segment by those with above median followers\n", "followers_count = series.map(lambda v: v['user']['followers_count'])\n", "segment = followers_count.values > followers_count.median()\n", "count1 = X[segment].sum(axis=0)\n", "count2 = X[~segment].sum(axis=0)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b term\n", "0 261 242 &\n", "1 74 53 >\n", "2 48 44 <\n", "3 4 7 's\n", "4 4 16 --" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Count of term in each segment\n", "df = pd.DataFrame(np.concatenate([count1, count2]).T).astype(float)\n", "df.columns = ['a', 'b']\n", "df['term'] = terms\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [], "source": [ "total = df['a'] + df['b']\n", "contrast = df['a'] / total - 0.5\n", "freq = total.rank() / len(df)\n", "df['significance'] = freq / 2 + contrast.abs()" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b term significance\n", "664 290 1 Property 0.985282\n", "370 239 0 ForSale 0.985093\n", "365 232 0 Flat 0.983783\n", "252 0 189 BappaMorya 0.980459\n", "688 222 2 Residential 0.973747" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sort_values('significance', ascending=False).head()" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def termdiff(terms, counts, segment):\n", " df = pd.DataFrame(np.concatenate([\n", " counts[segment].sum(axis=0),\n", " counts[~segment].sum(axis=0)\n", " ]).T).astype(float)\n", " df.columns = ['a', 'b']\n", " df['term'] = terms\n", " total = df['a'] + df['b']\n", " df['contrast'] = 2 * (df['a'] / total - 0.5)\n", " df['freq'] = total.rank() / len(df)\n", " df['significance'] = (df['freq'] + df['contrast'].abs()) / 2\n", " return df.sort_values('significance', ascending=False)" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b term contrast freq significance\n", "664 290 1 Property 0.993127 0.977438 0.985282\n", "370 239 0 ForSale 1.000000 0.970185 0.985093\n", "365 232 0 Flat 1.000000 0.967566 0.983783\n", "252 0 189 BappaMorya -1.000000 0.960919 0.980459\n", "688 222 2 Residential 0.982143 0.965351 0.973747" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "termdiff(terms, X, segment).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There seem to be several influential people on Twitter -- accounts with above-median follower counts -- tweeting about properties for sale. Less influential accounts are tweeting about BappaMorya." ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": false }, "outputs": [], "source": [ "with_hashtags = series.apply(lambda v: len(v['entities']['hashtags']) > 0).values" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b term contrast freq significance\n", "2419 0 135 टन -1.000000 0.943191 0.971595\n", "451 35 946 I'm -0.928644 0.995568 0.962106\n", "551 2 134 Maharashtra -0.970588 0.943795 0.957192\n", "2399 3 98 और -0.940594 0.922643 0.931619\n", "1163 2 76 dear -0.948718 0.891620 0.920169\n", "2409 10 156 के -0.879518 0.956285 0.917902\n", "1604 0 56 lessons -1.000000 0.835012 0.917506\n", "2465 19 246 है -0.856604 0.974416 0.915510\n", "697 0 54 Rumi -1.000000 0.829371 0.914686\n", "2468 1 63 है। -0.968750 0.860596 0.914673" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tdiff = termdiff(terms, X, with_hashtags)\n", "tdiff[tdiff['b'] > tdiff['a']].head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tweets without hashtags tend to be Hindi tweets.\n", "\n", "The word \"I'm\" often is used without hashtags. (These are typically tweets that say \"I'm at\".)" ] }, { "cell_type": "code", "execution_count": 84, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([u\"I'm at LINK (Mumbai, Maharashtra) http://t.co/ComXHpCbua\",\n", " u\"I'm getting fragrance of a dish being cooked in pure ghee... seems yum\",\n", " u\"I'm at Le Meridien - @spg (Bangalore, Karnataka) http://t.co/GhzDYpdTRu\",\n", " u\"I'm at Lajpat Nagar Metro Station (New Delhi, new delhi) http://t.co/MNEHQ9Qesg\",\n", " u\"I'm at Godrej Memorial Hospital (Mumbai, Maharashtra) http://t.co/8lieJFiZH5\"], dtype=object)" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.ix[X.T[451].toarray()[0] > 0]['text'].values[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The word \"dear\" is often used in hashtags. These are typically replies." ] }, { "cell_type": "code", "execution_count": 85, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([u\"@Ghislainemonie hi dear what's new for dinner today. I can't decide..\",\n", " u'@bbhhaappyy So nice of you dear! Trust me and have a great day ahead! Stay blessed and keep connected. Rabh rakha hai!',\n", " u'@MJCfan keep smiling dear..have a great day ahead..',\n", " u'@sonamakapoor @PerniaQureshi. Hi good morning dear',\n", " u'@2ps664 @skelkar07 @keerti07 @TahminaJaved @sheetal3176 @Jyoramesh10 hi dear how r u'], dtype=object)" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.ix[X.T[1163].toarray()[0] > 0]['text'].values[:5]" ] }, { "cell_type": "code", "execution_count": 92, "metadata": { "collapsed": false }, "outputs": [], "source": [ "tdiff = termdiff(terms, X, series.map(lambda v: v['user']['location'].lower().startswith('bangalore')))" ] }, { "cell_type": "code", "execution_count": 96, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b term contrast freq significance\n", "435 842 18155 Hi -0.911354 0.999799 0.955576\n", "1921 842 18155 re -0.911354 0.999799 0.955576\n", "0 0 0 & NaN 0.499799 NaN\n", "1 0 0 > NaN 0.499799 NaN\n", "2 0 0 < NaN 0.499799 NaN" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tdiff.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lessons learnt\n", "\n", "1. Tokenisation and filtering of words *always* have a manual element -- so make that easy.\n", "    - But are there some robust English tokenisation patterns?\n", "1. Have a single function that tells me what token is unusual about a group\n", "1. For each token, show the concordance for context" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# spaCy\n", "\n", "Install [spaCy](https://spacy.io/)\n", "\n", "    conda config --add channels spacy\n", "    conda install spacy\n", "    python -m spacy.en.download all\n", "\n", "If you get an SSL error, run:\n", "\n", "    conda config --set ssl_verify False\n", "\n", "and re-run the above commands. This adds an `ssl_verify: False` line to `~/.condarc`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.11" } }, "nbformat": 4, "nbformat_minor": 0 }