{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Using Scattertext to Examine President Trump's Tweets\n", "### Jason S. Kessler: http://www.jasonkessler.com\n", "\n", "David Robinson presented a fanstitic analysis of President Trump's tweets the Variance Explained blog: http://varianceexplained.org/r/trump-followup/ .\n", "\n", "The word-scatter plot in the analysis, however, was a bit crowded and difficult to read (included at the bottom of the notebook).\n", "\n", "My Python library Scattertext provides and easy way to make legible, interative scatter plots for text visualiztion. This notebook walks you through the process of creating a similar plot using Scattertext and the PyData ecosystem.\n", "\n", "Please check out Scattertext on Github at https://github.com/JasonKessler/scattertext for documentation, and see the PyData Seattle talk introducing its usage at https://www.youtube.com/watch?v=H7X9CA2pWKo .\n", "\n", "If you are academically inclined, you can cite the accompanying technical article as\n", "\n", "Jason S. Kessler. Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. ACL System Demonstrations. Vancouver, BC. 2017. https://arxiv.org/abs/1703.00565\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "import scattertext as st\n", "import re, io, itertools\n", "from pprint import pprint\n", "import pandas as pd\n", "import numpy as np\n", "import spacy.en\n", "import os, pkgutil, json, urllib, datetime\n", "from urllib.request import urlopen\n", "from IPython.display import IFrame\n", "from IPython.core.display import display, HTML\n", "display(HTML(\"\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download the database of tweets, parse them, filter out RT's and tweets by devices that Trump probably wasn't using. Label them as before or after election" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df = pd.concat([pd.read_json('http://www.trumptwitterarchive.com/data/realdonaldtrump/%s.json' % (year))\n", " for year in range(2009, 2018)])" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Twitter for Android 14545\n", "Twitter Web Client 12144\n", "Twitter for iPhone 3986\n", "TweetDeck 483\n", "TwitLonger Beta 405\n", "Instagram 133\n", "Facebook 105\n", "Media Studio 98\n", "Twitter Ads 97\n", "Twitter for BlackBerry 97\n", "Mobile Web (M5) 56\n", "Twitlonger 23\n", "Twitter for iPad 22\n", "Vine - Make a Scene 10\n", "Twitter QandA 10\n", "Periscope 7\n", "Neatly For BlackBerry 10 5\n", "Twitter Mirror for iPad 1\n", "Twitter for Websites 1\n", "Name: source, dtype: int64" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['source'].value_counts()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "nlp = spacy.en.English()\n", "df['parsed'] = df.text.apply(nlp)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df['before_or_after_election'] = df['created_at'].apply(lambda x: 'after' \n", " if x > datetime.datetime(2016,11,9) \n", " else 'before')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df_trump_device_non_retweets = df[(df.is_retweet == False) \n", " & (((df.source == 'Twitter for Android') & (df.created_at < datetime.datetime(2017,4,1)))\n", " | ((df.source == 'Twitter for iPhone') & (df.created_at > datetime.datetime(2017,3,1))))\n", " & df.text.apply(lambda x: ('RT ' not in x \n", " and 'RT:' not in x \n", " and not x.strip().startswith('\"')))]" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "before 4223\n", "after 1653\n", "Name: before_or_after_election, dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_trump_device_non_retweets['before_or_after_election'].value_counts()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Timestamp('2017-10-20 18:50:21')" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_trump_device_non_retweets.created_at.max()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "corpus = st.CorpusFromParsedDocuments(df_trump_device_non_retweets, \n", " category_col='before_or_after_election', \n", " parsed_col='parsed').build()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0, 0, 2, 9, 11]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "st.version" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create the plot and display it\n", "\n", "We can can make some interesting obsverations beyond what we could see in the Scatterplot below.\n", "\n", "* He has tweeted a lot of about \"fake news\" after the election, but never before. \n", "* He tweeted extensively about climate change (\"warming\", \"climate\", \"freezing\", \"ice\") before the election, but never after (!)\n", "* He only tweeted about \"workers\" once before the election, but has multiple times afterward. In a similar vein, the word \"jobs\" occured much more often after the election than before.\n" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "html = st.produce_scattertext_explorer(corpus,\n", " category='after',\n", " category_name='After Election',\n", " not_category_name='Before Election',\n", " use_full_doc=True,\n", " minimum_term_frequency=5,\n", " pmi_filter_thresold=10,\n", " term_ranker=st.OncePerDocFrequencyRanker,\n", " width_in_pixels=1000,\n", " sort_by_dist=False,\n", " metadata=df_trump_device_non_retweets['created_at'].astype(str))\n", "file_name = 'output/trump_before_after_election.html'\n", "open(file_name, 'wb').write(html.encode('utf-8'))\n", "IFrame(src=file_name, width = 1300, height=700)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "html = st.produce_fightin_words_explorer(corpus,\n", " category='after',\n", " category_name='After Election',\n", " not_category_name='Before Election',\n", " use_full_doc=True,\n", " minimum_term_frequency=5,\n", " pmi_filter_thresold=10,\n", " term_ranker=st.OncePerDocFrequencyRanker,\n", " width_in_pixels=1000,\n", " metadata=df_trump_device_non_retweets['created_at'].astype(str))\n", "file_name = 'output/trump_before_after_election.html'\n", "open(file_name, 'wb').write(html.encode('utf-8'))\n", "IFrame(src=file_name, width = 1300, height=700)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## " ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "The original chart: (created August 9, 2017)\n", "\n", "![ggplot2 scatter plot](http://varianceexplained.org/figs/2017-08-09-trump-followup/before_after_scatter-1.png)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [py36]", "language": "python", "name": "Python [py36]" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 2 }