{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using Scattertext to Examine President Trump's Tweets\n",
    "### Jason S. Kessler: http://www.jasonkessler.com\n",
    "\n",
    "David Robinson presented a fanstitic analysis of President Trump's tweets the Variance Explained blog: http://varianceexplained.org/r/trump-followup/ .\n",
    "\n",
    "He presented an intersting scatter plot relating frequency of word use among the president's tweets before and after his election. Due to ggplot2's limitations, the scatter plot was a bit hard to read.  Luckily, Python's Scattertext provides and easy way to make legible, interative scatter plots for text visualiztion.  See how the same tweets were made into a Scattertext scatter plot below using Python.\n",
    "\n",
    "Please check out Scattertext on Github at https://github.com/JasonKessler/scattertext for documentation, and see the PyData Seattle talk introducing it's usage at https://www.youtube.com/watch?v=H7X9CA2pWKo .\n",
    "\n",
    "If you are academically inclined, you can cite the accompanying technical article as\n",
    "\n",
    "Jason S. Kessler. Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. ACL System Demonstrations. Vancouver, BC. 2017. https://arxiv.org/abs/1703.00565\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>.container { width:98% !important; }</style>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%matplotlib inline\n",
    "import scattertext as st\n",
    "import re, io, itertools\n",
    "from pprint import pprint\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import spacy.en\n",
    "import os, pkgutil, json, urllib, datetime\n",
    "from urllib.request import urlopen\n",
    "from IPython.display import IFrame\n",
    "from IPython.core.display import display, HTML\n",
    "display(HTML(\"<style>.container { width:98% !important; }</style>\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Download the database of tweets, parse them, filter out RT's and non Android tweets, and label them as before or after election"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.concat([pd.read_json('http://www.trumptwitterarchive.com/data/realdonaldtrump/%s.json' % (year))\n",
    "                for year in range(2009, 2018)])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "nlp = spacy.en.English()\n",
    "df['parsed'] = df.text.apply(nlp)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['before_or_after_election'] = df['created_at'].apply(lambda x: 'after' \n",
    "                                                        if x > datetime.datetime(2016,11,9) \n",
    "                                                        else 'before')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_android_non_retweets = df[(df.is_retweet == False) \n",
    "                                & (df.source == 'Twitter for Android')\n",
    "                                & df.text.apply(lambda x: 'RT ' not in x and 'RT:' not in x)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "before    13989\n",
       "after       435\n",
       "Name: before_or_after_election, dtype: int64"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_android_non_retweets['before_or_after_election'].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "corpus = st.CorpusFromParsedDocuments(df_android_non_retweets, \n",
    "                                      category_col='before_or_after_election', \n",
    "                                      parsed_col='parsed').build()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create the plot and display it"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "        <iframe\n",
       "            width=\"1200\"\n",
       "            height=\"700\"\n",
       "            src=\"trump_before_after_election.html\"\n",
       "            frameborder=\"0\"\n",
       "            allowfullscreen\n",
       "        ></iframe>\n",
       "        "
      ],
      "text/plain": [
       "<IPython.lib.display.IFrame at 0x1081dd940>"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "html = st.produce_scattertext_explorer(corpus,\n",
    "                                       category='after',\n",
    "                                       category_name='After Election',\n",
    "                                       not_category_name='Before Election',\n",
    "                                       use_full_doc=True,\n",
    "                                       minimum_term_frequency=2,\n",
    "                                       pmi_filter_thresold=10,\n",
    "                                       minimum_not_category_term_frequency=10,\n",
    "                                       width_in_pixels=1000,\n",
    "                                       metadata=df_android_non_retweets['created_at'])\n",
    "file_name = 'trump_before_after_election.html'\n",
    "open(file_name, 'wb').write(html.encode('utf-8'))\n",
    "IFrame(src=file_name, width = 1200, height=700)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [Root]",
   "language": "python",
   "name": "Python [Root]"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}