{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using Scattertext to Examine President Trump's Tweets\n",
    "### Jason S. Kessler: http://www.jasonkessler.com\n",
    "\n",
    "David Robinson presented a fantastic analysis of President Trump's tweets on the Variance Explained blog: http://varianceexplained.org/r/trump-followup/ .\n",
    "\n",
    "The word scatter plot in that analysis (included at the bottom of this notebook), however, was a bit crowded and difficult to read.\n",
    "\n",
    "My Python library Scattertext provides an easy way to make legible, interactive scatter plots for text visualization.  This notebook walks you through the process of creating a similar plot using Scattertext and the PyData ecosystem.\n",
    "\n",
    "Please check out Scattertext on GitHub at https://github.com/JasonKessler/scattertext for documentation, and see the PyData Seattle talk introducing its usage at https://www.youtube.com/watch?v=H7X9CA2pWKo .\n",
    "\n",
    "If you are academically inclined, you can cite the accompanying technical article as\n",
    "\n",
    "Jason S. Kessler. Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. ACL System Demonstrations. Vancouver, BC. 2017. https://arxiv.org/abs/1703.00565\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>.container { width:98% !important; }</style>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%matplotlib inline\n",
    "import scattertext as st\n",
    "import re, io, itertools\n",
    "from pprint import pprint\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import spacy.en\n",
    "import os, pkgutil, json, urllib, datetime\n",
    "from urllib.request import urlopen\n",
    "from IPython.display import IFrame\n",
    "from IPython.core.display import display, HTML\n",
    "display(HTML(\"<style>.container { width:98% !important; }</style>\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Download the database of tweets, parse them, and filter out retweets and tweets from devices that Trump probably wasn't using. Label each tweet as before or after the election."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "df = pd.concat([pd.read_json('http://www.trumptwitterarchive.com/data/realdonaldtrump/%s.json' % (year))\n",
    "                for year in range(2009, 2018)])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Twitter for Android         14545\n",
       "Twitter Web Client          12144\n",
       "Twitter for iPhone           3986\n",
       "TweetDeck                     483\n",
       "TwitLonger Beta               405\n",
       "Instagram                     133\n",
       "Facebook                      105\n",
       "Media Studio                   98\n",
       "Twitter Ads                    97\n",
       "Twitter for BlackBerry         97\n",
       "Mobile Web (M5)                56\n",
       "Twitlonger                     23\n",
       "Twitter for iPad               22\n",
       "Vine - Make a Scene            10\n",
       "Twitter QandA                  10\n",
       "Periscope                       7\n",
       "Neatly For BlackBerry 10        5\n",
       "Twitter Mirror for iPad         1\n",
       "Twitter for Websites            1\n",
       "Name: source, dtype: int64"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['source'].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "nlp = spacy.en.English()\n",
    "df['parsed'] = df.text.apply(nlp)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "df['before_or_after_election'] = df['created_at'].apply(lambda x: 'after' \n",
    "                                                        if x > datetime.datetime(2016,11,9) \n",
    "                                                        else 'before')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "df_trump_device_non_retweets = df[(df.is_retweet == False) \n",
    "                                & (((df.source == 'Twitter for Android') & (df.created_at < datetime.datetime(2017,4,1)))\n",
    "                                   | ((df.source == 'Twitter for iPhone') & (df.created_at > datetime.datetime(2017,3,1))))\n",
    "                                & df.text.apply(lambda x: ('RT ' not in x \n",
    "                                                           and 'RT:' not in x \n",
    "                                                           and not x.strip().startswith('\"')))]"
   ]
  },
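  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The labeling and filtering rules in the cells above can also be read as three small predicates. The following is a minimal sketch on made-up records, not the archive data: the function names (`label_period`, `from_trump_device`, `looks_original`) and the toy tweets are hypothetical, while the cutoff dates and source strings are copied from the cells above.\n",
    "\n",
    "```python\n",
    "from datetime import datetime\n",
    "\n",
    "ELECTION_CUTOFF = datetime(2016, 11, 9)  # same cutoff as the labeling cell\n",
    "\n",
    "\n",
    "def label_period(created_at):\n",
    "    # Tweets strictly after the cutoff count as 'after'\n",
    "    return 'after' if created_at > ELECTION_CUTOFF else 'before'\n",
    "\n",
    "\n",
    "def from_trump_device(source, created_at):\n",
    "    # Android before April 2017, iPhone after March 1, 2017\n",
    "    android = source == 'Twitter for Android' and created_at < datetime(2017, 4, 1)\n",
    "    iphone = source == 'Twitter for iPhone' and created_at > datetime(2017, 3, 1)\n",
    "    return android or iphone\n",
    "\n",
    "\n",
    "def looks_original(text):\n",
    "    # Drop manual retweets and tweets that open with a quotation mark\n",
    "    return ('RT ' not in text and 'RT:' not in text\n",
    "            and not text.strip().startswith('\"'))\n",
    "\n",
    "\n",
    "# Made-up records: (source, created_at, is_retweet, text)\n",
    "toy_tweets = [\n",
    "    ('Twitter for Android', datetime(2016, 7, 1), False, 'Big rally tonight!'),\n",
    "    ('Twitter for iPhone', datetime(2016, 7, 2), False, 'Thank you Ohio!'),\n",
    "    ('Twitter for iPhone', datetime(2017, 6, 1), False, 'Jobs are coming back!'),\n",
    "    ('Twitter for Android', datetime(2017, 1, 15), False, '\"@fan: Great speech!\"'),\n",
    "    ('Twitter for Android', datetime(2017, 2, 1), True, 'RT @x: hello'),\n",
    "]\n",
    "\n",
    "kept = [(label_period(created), text)\n",
    "        for source, created, is_retweet, text in toy_tweets\n",
    "        if not is_retweet and from_trump_device(source, created) and looks_original(text)]\n",
    "print(kept)\n",
    "```\n",
    "\n",
    "Only the first and third toy tweets should survive: the second comes from an iPhone before March 2017, the fourth starts with a quotation mark, and the fifth is flagged as a retweet."
   ]
  },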
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "before    4223\n",
       "after     1653\n",
       "Name: before_or_after_election, dtype: int64"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_trump_device_non_retweets['before_or_after_election'].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Timestamp('2017-10-20 18:50:21')"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_trump_device_non_retweets.created_at.max()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "corpus = st.CorpusFromParsedDocuments(df_trump_device_non_retweets, \n",
    "                                      category_col='before_or_after_election', \n",
    "                                      parsed_col='parsed').build()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[0, 0, 2, 9, 11]"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "st.version"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create the plot and display it\n",
    "\n",
    "We can make some interesting observations beyond what we could see in the original scatter plot (shown at the bottom of the notebook).\n",
    "\n",
    "* He has tweeted a lot about \"fake news\" after the election, but never before.\n",
    "* He tweeted extensively about climate change (\"warming\", \"climate\", \"freezing\", \"ice\") before the election, but never after (!)\n",
    "* He tweeted about \"workers\" only once before the election, but multiple times afterward. In a similar vein, the word \"jobs\" occurred much more often after the election than before.\n"
   ]
  },
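  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Counts such as 'never' and 'once' in the observations above are presumably counts of tweets rather than of word occurrences: the plotting call in this notebook passes `term_ranker=st.OncePerDocFrequencyRanker`, which is assumed here to count a term at most once per document. A minimal sketch of that kind of counting follows, using a naive lowercase regex tokenizer instead of spaCy and made-up miniature corpora. `doc_frequencies` is a hypothetical helper, not part of Scattertext.\n",
    "\n",
    "```python\n",
    "import re\n",
    "from collections import Counter\n",
    "\n",
    "\n",
    "def doc_frequencies(docs):\n",
    "    # Count each term at most once per document\n",
    "    counts = Counter()\n",
    "    for doc in docs:\n",
    "        counts.update(set(re.findall(r'[a-z]+', doc.lower())))\n",
    "    return counts\n",
    "\n",
    "\n",
    "# Made-up miniature corpora standing in for the two categories\n",
    "before = ['Global warming? It is freezing!',\n",
    "          'Freezing and snowing, so much for warming']\n",
    "after = ['Fake news again',\n",
    "         'More fake news, fake fake fake',\n",
    "         'Jobs, jobs, jobs']\n",
    "\n",
    "before_df = doc_frequencies(before)\n",
    "after_df = doc_frequencies(after)\n",
    "for term in ['fake', 'warming', 'jobs']:\n",
    "    print(term, before_df[term], after_df[term])\n",
    "```\n",
    "\n",
    "Because each tweet contributes a set of terms, a tweet that repeats a word several times still adds only one to that word's count."
   ]
  },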
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "        <iframe\n",
       "            width=\"1300\"\n",
       "            height=\"700\"\n",
       "            src=\"output/trump_before_after_election.html\"\n",
       "            frameborder=\"0\"\n",
       "            allowfullscreen\n",
       "        ></iframe>\n",
       "        "
      ],
      "text/plain": [
       "<IPython.lib.display.IFrame at 0x1559f8eb8>"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "html = st.produce_scattertext_explorer(corpus,\n",
    "                                       category='after',\n",
    "                                       category_name='After Election',\n",
    "                                       not_category_name='Before Election',\n",
    "                                       use_full_doc=True,\n",
    "                                       minimum_term_frequency=5,\n",
    "                                       pmi_filter_thresold=10,\n",
    "                                       term_ranker=st.OncePerDocFrequencyRanker,\n",
    "                                       width_in_pixels=1000,\n",
    "                                       sort_by_dist=False,\n",
    "                                       metadata=df_trump_device_non_retweets['created_at'].astype(str))\n",
    "file_name = 'output/trump_before_after_election.html'\n",
    "with open(file_name, 'wb') as f:  # close the file handle once the HTML is written\n",
    "    f.write(html.encode('utf-8'))\n",
    "IFrame(src=file_name, width = 1300, height=700)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "        <iframe\n",
       "            width=\"1300\"\n",
       "            height=\"700\"\n",
       "            src=\"output/trump_before_after_election.html\"\n",
       "            frameborder=\"0\"\n",
       "            allowfullscreen\n",
       "        ></iframe>\n",
       "        "
      ],
      "text/plain": [
       "<IPython.lib.display.IFrame at 0x15ae87860>"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "html = st.produce_fightin_words_explorer(corpus,\n",
    "                                         category='after',\n",
    "                                         category_name='After Election',\n",
    "                                         not_category_name='Before Election',\n",
    "                                         use_full_doc=True,\n",
    "                                         minimum_term_frequency=5,\n",
    "                                         pmi_filter_thresold=10,\n",
    "                                         term_ranker=st.OncePerDocFrequencyRanker,\n",
    "                                         width_in_pixels=1000,\n",
    "                                         metadata=df_trump_device_non_retweets['created_at'].astype(str))\n",
    "file_name = 'output/trump_before_after_election.html'\n",
    "with open(file_name, 'wb') as f:  # close the file handle once the HTML is written\n",
    "    f.write(html.encode('utf-8'))\n",
    "IFrame(src=file_name, width = 1300, height=700)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "The original chart (created August 9, 2017):\n",
    "\n",
    "![ggplot2 scatter plot](http://varianceexplained.org/figs/2017-08-09-trump-followup/before_after_scatter-1.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [py36]",
   "language": "python",
   "name": "Python [py36]"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}