{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Mining Twitter\n", "\n", "Twitter implements OAuth 1.0A as its standard authentication mechanism. To use it to make requests to Twitter's API, you'll need to go to https://dev.twitter.com/apps and create a sample application. There are four primary identifiers you'll need to note for an OAuth 1.0A workflow: consumer key, consumer secret, access token, and access token secret. Note that you will need an ordinary Twitter account in order to log in, create an app, and get these credentials.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you are using the Vagrant-powered virtual machine for this chapter, you should be able to execute the code in this notebook without installing any dependencies. If you are running the code from your own development environment, however, be advised that the examples in this chapter use a Python package called [twitter](https://github.com/sixohsix/twitter) to make API calls. You can install this package from a terminal using [pip](https://pypi.python.org/pypi/pip) with the command `pip install twitter`, preferably from within a [Python virtual environment](https://pypi.python.org/pypi/virtualenv). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once installed, you should be able to open up a Python interpreter (or better yet, your [IPython](http://ipython.org/) interpreter) and get rolling." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Authorizing an application to access Twitter account data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import twitter\n", "\n", "# Go to http://dev.twitter.com/apps/new to create an app and get values\n", "# for these credentials, which you'll need to provide in place of these\n", "# empty string values that are defined as placeholders.\n", "# See https://developer.twitter.com/en/docs/basics/authentication/overview/oauth\n", "# for more information on Twitter's OAuth implementation.\n", "\n", "CONSUMER_KEY = ''\n", "CONSUMER_SECRET = ''\n", "OAUTH_TOKEN = ''\n", "OAUTH_TOKEN_SECRET = ''\n", "\n", "auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,\n", " CONSUMER_KEY, CONSUMER_SECRET)\n", "\n", "twitter_api = twitter.Twitter(auth=auth)\n", "\n", "# Nothing to see by displaying twitter_api except that it's now a\n", "# defined variable\n", "\n", "print(twitter_api)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Retrieving trends" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The Yahoo! 
Where On Earth ID for the entire world is 1.\n", "# See https://dev.twitter.com/docs/api/1.1/get/trends/place and\n", "# http://developer.yahoo.com/geo/geoplanet/\n", "\n", "WORLD_WOE_ID = 1\n", "US_WOE_ID = 23424977\n", "\n", "# Prefix ID with the underscore for query string parameterization.\n", "# Without the underscore, the twitter package appends the ID value\n", "# to the URL itself as a special case keyword argument.\n", "\n", "world_trends = twitter_api.trends.place(_id=WORLD_WOE_ID)\n", "us_trends = twitter_api.trends.place(_id=US_WOE_ID)\n", "\n", "print(world_trends)\n", "print()\n", "print(us_trends)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for trend in world_trends[0]['trends']:\n", " print(trend['name'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for trend in us_trends[0]['trends']:\n", " print(trend['name'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "world_trends_set = set([trend['name'] \n", " for trend in world_trends[0]['trends']])\n", "\n", "us_trends_set = set([trend['name'] \n", " for trend in us_trends[0]['trends']]) \n", "\n", "common_trends = world_trends_set.intersection(us_trends_set)\n", "\n", "print(common_trends)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Anatomy of a Tweet" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "\n", "# Set this variable to a trending topic, \n", "# or anything else for that matter. 
The example query below\n", "# was a trending topic when this content was being developed\n", "# and is used throughout the remainder of this chapter.\n", "\n", "q = '#MothersDay' \n", "\n", "count = 100\n", "\n", "# Import unquote to prevent url encoding errors in next_results\n", "from urllib.parse import unquote\n", "\n", "# See https://dev.twitter.com/rest/reference/get/search/tweets\n", "\n", "search_results = twitter_api.search.tweets(q=q, count=count)\n", "\n", "statuses = search_results['statuses']\n", "\n", "\n", "# Iterate through 5 more batches of results by following the cursor\n", "for _ in range(5):\n", " print('Length of statuses', len(statuses))\n", " try:\n", " next_results = search_results['search_metadata']['next_results']\n", " except KeyError as e: # No more results when next_results doesn't exist\n", " break\n", " \n", " # Create a dictionary from next_results, which has the following form:\n", " # ?max_id=847960489447628799&q=%23RIPSelena&count=100&include_entities=1\n", " kwargs = dict([ kv.split('=') for kv in unquote(next_results[1:]).split(\"&\") ])\n", " \n", " search_results = twitter_api.search.tweets(**kwargs)\n", " statuses += search_results['statuses']\n", "\n", "# Show one sample search result by slicing the list...\n", "print(json.dumps(statuses[0], indent=1))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for i in range(10):\n", " print()\n", " print(statuses[i]['text'])\n", " print('Favorites: ', statuses[i]['favorite_count'])\n", " print('Retweets: ', statuses[i]['retweet_count'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extracting text, screen names, and hashtags from tweets" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "status_texts = [ status['text'] \n", " for status in statuses ]\n", "\n", "screen_names = [ user_mention['screen_name'] \n", " for status in statuses\n", " for user_mention in 
status['entities']['user_mentions'] ]\n", "\n", "hashtags = [ hashtag['text'] \n", " for status in statuses\n", " for hashtag in status['entities']['hashtags'] ]\n", "\n", "# Compute a collection of all words from all tweets\n", "words = [ w \n", " for t in status_texts \n", " for w in t.split() ]\n", "\n", "# Explore the first 5 items for each...\n", "\n", "print(json.dumps(status_texts[0:5], indent=1))\n", "print(json.dumps(screen_names[0:5], indent=1) )\n", "print(json.dumps(hashtags[0:5], indent=1))\n", "print(json.dumps(words[0:5], indent=1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating a basic frequency distribution from the words in tweets" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from collections import Counter\n", "\n", "for item in [words, screen_names, hashtags]:\n", " c = Counter(item)\n", " print(c.most_common()[:10]) # top 10\n", " print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using prettytable to display tuples in a nice tabular format" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from prettytable import PrettyTable\n", "\n", "for label, data in (('Word', words), \n", " ('Screen Name', screen_names), \n", " ('Hashtag', hashtags)):\n", " pt = PrettyTable(field_names=[label, 'Count']) \n", " c = Counter(data)\n", " [ pt.add_row(kv) for kv in c.most_common()[:10] ]\n", " pt.align[label], pt.align['Count'] = 'l', 'r' # Set column alignment\n", " print(pt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Calculating lexical diversity for tweets" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A function for computing lexical diversity\n", "def lexical_diversity(tokens):\n", " return len(set(tokens))/len(tokens) \n", "\n", "# A function for computing the average number of words per tweet\n", "def average_words(statuses):\n", " 
total_words = sum([ len(s.split()) for s in statuses ]) \n", " return total_words/len(statuses)\n", "\n", "print(lexical_diversity(words))\n", "print(lexical_diversity(screen_names))\n", "print(lexical_diversity(hashtags))\n", "print(average_words(status_texts))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Finding the most popular retweets" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "retweets = [\n", " # Store a tuple of these four values ...\n", " (status['retweet_count'], \n", " status['retweeted_status']['user']['screen_name'],\n", " status['retweeted_status']['id'],\n", " status['text']) \n", " \n", " # ... for each status ...\n", " for status in statuses \n", " \n", " # ... so long as the status is itself a retweet.\n", " if 'retweeted_status' in status\n", " ]\n", "\n", "# Sort by retweet count (descending), keep the top 5, and display each item in the tuple\n", "\n", "pt = PrettyTable(field_names=['Count', 'Screen Name', 'Tweet ID', 'Text'])\n", "[ pt.add_row(row) for row in sorted(retweets, reverse=True)[:5] ]\n", "pt.max_width['Text'] = 50\n", "pt.align = 'l'\n", "print(pt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Looking up users who have retweeted a status" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get the original tweet id for a tweet from its retweeted_status node \n", "# and insert it here\n", "\n", "_retweets = twitter_api.statuses.retweets(id=862359093398261760)\n", "print([r['user']['screen_name'] for r in _retweets])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plotting frequencies of words" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "word_counts = sorted(Counter(words).values(), reverse=True)\n", "\n", "plt.loglog(word_counts)\n", "plt.ylabel(\"Freq\")\n", 
"plt.xlabel(\"Word Rank\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generating histograms of words, screen names, and hashtags" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for label, data in (('Words', words), \n", " ('Screen Names', screen_names), \n", " ('Hashtags', hashtags)):\n", "\n", " # Build a frequency map for each set of data\n", " # and plot the values\n", " c = Counter(data)\n", " plt.hist(list(c.values()))\n", " \n", " # Add a title and y-label ...\n", " plt.title(label)\n", " plt.ylabel(\"Number of items in bin\")\n", " plt.xlabel(\"Bins (number of times an item appeared)\")\n", " \n", " # ... and display as a new figure\n", " plt.figure()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generating a histogram of retweet counts" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Using underscores while unpacking values in\n", "# a tuple is idiomatic for discarding them\n", "\n", "counts = [count for count, _, _, _ in retweets]\n", "\n", "plt.hist(counts)\n", "plt.title('Retweets')\n", "plt.xlabel('Bins (number of times retweeted)')\n", "plt.ylabel('Number of tweets in bin')" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Sentiment Analysis" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# pip install nltk\n", "import nltk\n", "nltk.download('vader_lexicon')\n", "\n", "import numpy as np\n", "from nltk.sentiment.vader import SentimentIntensityAnalyzer" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "twitter_stream = twitter.TwitterStream(auth=auth)\n", "iterator = twitter_stream.statuses.sample()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tweets = []\n", "for tweet in iterator:\n", " try:\n", " if tweet['lang'] == 'en':\n", " 
tweets.append(tweet)\n", " except (KeyError, TypeError): # Skip stream messages that are not tweets (no 'lang' key)\n", " pass\n", " if len(tweets) == 100:\n", " break" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "analyzer = SentimentIntensityAnalyzer()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "analyzer.polarity_scores('Hello')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "analyzer.polarity_scores('I really enjoy this video series.')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "analyzer.polarity_scores('I REALLY enjoy this video series.')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "analyzer.polarity_scores('I REALLY enjoy this video series!!!')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "analyzer.polarity_scores('I REALLY did not enjoy this video series!!!')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "scores = np.zeros(len(tweets))\n", "\n", "for i, t in enumerate(tweets):\n", " # Extract the text portion of the tweet\n", " text = t['text']\n", " \n", " # Measure the polarity of the tweet\n", " polarity = analyzer.polarity_scores(text)\n", " \n", " # Store the normalized, weighted composite score\n", " scores[i] = polarity['compound']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "most_positive = np.argmax(scores)\n", "most_negative = np.argmin(scores)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('{0:6.3f} : \"{1}\"'.format(scores[most_positive], tweets[most_positive]['text']))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('{0:6.3f} : \"{1}\"'.format(scores[most_negative], tweets[most_negative]['text']))" ] } ], 
"metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 1 }