## Using Scattertext to Examine President Trump's Tweets
### Jason S. Kessler: http://www.jasonkessler.com

David Robinson presented a fantastic analysis of President Trump's tweets on the Variance Explained blog: http://varianceexplained.org/r/trump-followup/ .

He presented an interesting scatter plot relating the frequency of word use in the president's tweets before and after his election. Due to ggplot2's limitations, the scatter plot was a bit hard to read. Luckily, Python's Scattertext provides an easy way to make legible, interactive scatter plots for text visualization. See below how the same tweets were made into a Scattertext scatter plot using Python.

Please check out Scattertext on GitHub at https://github.com/JasonKessler/scattertext for documentation, and see the PyData Seattle talk introducing its usage at https://www.youtube.com/watch?v=H7X9CA2pWKo .

If you are academically inclined, you can cite the accompanying technical article as

Jason S. Kessler. Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. ACL System Demonstrations. Vancouver, BC. 2017. https://arxiv.org/abs/1703.00565


In [20]:
%matplotlib inline
import scattertext as st
import re, io, itertools
from pprint import pprint
import pandas as pd
import numpy as np
import spacy.en
import os, pkgutil, json, urllib, datetime
from urllib.request import urlopen
from IPython.display import IFrame
from IPython.core.display import display, HTML
display(HTML(""))

## Download the database of tweets, parse them, filter out retweets and non-Android tweets, and label them as before or after the election

In [16]:
df = pd.concat([pd.read_json('http://www.trumptwitterarchive.com/data/realdonaldtrump/%s.json' % (year))
 for year in range(2009, 2018)])

In [18]:
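# spacy.en.English() is the spaCy 1.x API; spaCy 2 and later load a model with spacy.load('en_core_web_sm') instead
nlp = spacy.en.English()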
df['parsed'] = df.text.apply(nlp)

In [26]:
df['before_or_after_election'] = df['created_at'].apply(lambda x: 'after' 
 if x > datetime.datetime(2016,11,9) 
 else 'before')
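The rule above uses a strict comparison against midnight at the start of 2016-11-09, so anything timestamped on election day itself (2016-11-08) is labeled 'before'. A minimal standalone sketch of the same rule follows; the function name `label_tweet` and the sample timestamps are invented for illustration and are not part of the original notebook.

```python
import datetime

# Cutoff used in the notebook: midnight at the start of 9 November 2016.
CUTOFF = datetime.datetime(2016, 11, 9)

def label_tweet(created_at):
    """Return 'after' for timestamps strictly later than the cutoff, else 'before'."""
    return 'after' if created_at > CUTOFF else 'before'

# Hypothetical timestamps, not taken from the tweet archive.
samples = [
    datetime.datetime(2016, 11, 8, 23, 59),
    datetime.datetime(2016, 11, 9, 0, 0),
    datetime.datetime(2016, 11, 9, 0, 1),
]
labels = [label_tweet(ts) for ts in samples]
print(labels)
```

Note that a timestamp exactly equal to the cutoff falls on the 'before' side because the comparison is `>` rather than `>=`.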

In [46]:
df_android_non_retweets = df[(df.is_retweet == False) 
 & (df.source == 'Twitter for Android')
 & df.text.apply(lambda x: 'RT ' not in x and 'RT:' not in x)]
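One thing to be aware of with the substring test above: `'RT ' not in x` also drops tweets in which "RT " merely ends a longer word, such as "SMART " or "START ". If that is not the intent, one possible alternative is to require "RT" to appear as a standalone token. The sketch below is only an illustration of that idea; the pattern, the helper name `looks_like_manual_retweet`, and the sample texts are assumptions, not part of the original analysis.

```python
import re

# Assumed pattern: "RT" starting at a word boundary, followed by a space or colon.
MANUAL_RT = re.compile(r'\bRT[ :]')

def looks_like_manual_retweet(text):
    """True if the text contains a standalone 'RT ' or 'RT:' marker."""
    return MANUAL_RT.search(text) is not None

# Hypothetical tweet texts for illustration only.
samples = [
    'RT @user: great!',
    'A SMART move by our team',
    'Thank you! RT: @user wow',
    'Big crowd tonight',
]

# Filter used in the notebook (plain substring test).
substring_kept = [t for t in samples if 'RT ' not in t and 'RT:' not in t]

# Token-based alternative.
kept = [t for t in samples if not looks_like_manual_retweet(t)]

print(substring_kept)
print(kept)
```

With these samples, the substring filter discards the "SMART move" text while the token-based version keeps it.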

In [47]:
df_android_non_retweets['before_or_after_election'].value_counts()

before 13989
after 435
Name: before_or_after_election, dtype: int64

In [48]:
corpus = st.CorpusFromParsedDocuments(df_android_non_retweets, 
 category_col='before_or_after_election', 
 parsed_col='parsed').build()
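The corpus groups the parsed tweets by category so that each term's usage can be compared between the two periods. As a rough illustration of the quantity being compared, the sketch below counts how often each word occurs in each category using plain Python. It is not how Scattertext tokenizes, scores, or positions terms; the toy documents and whitespace tokenization are assumptions made for the example.

```python
from collections import Counter

# Hypothetical (category, text) pairs standing in for labeled tweets.
docs = [
    ('before', 'crooked media very unfair'),
    ('before', 'media so unfair'),
    ('after', 'fake news media'),
    ('after', 'fake news again'),
]

# Count term occurrences separately for each category.
counts = {'before': Counter(), 'after': Counter()}
for category, text in docs:
    counts[category].update(text.lower().split())

# Each term becomes a point: (frequency before, frequency after).
vocabulary = sorted(set(counts['before']) | set(counts['after']))
points = {term: (counts['before'][term], counts['after'][term])
          for term in vocabulary}

for term, (n_before, n_after) in points.items():
    print(term, n_before, n_after)
```

Terms used mostly in one period end up with one coordinate near zero, which is the kind of contrast a before/after frequency plot makes visible.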

## Create the plot and display it

In [49]:
html = st.produce_scattertext_explorer(corpus,
 category='after',
 category_name='After Election',
 not_category_name='Before Election',
 use_full_doc=True,
 minimum_term_frequency=2,
 pmi_filter_thresold=10,
 minimum_not_category_term_frequency=10,
 width_in_pixels=1000,
 metadata=df_android_non_retweets['created_at'])
file_name = 'trump_before_after_election.html'
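# Write the HTML inside a context manager so the file is closed before it is displayed
with open(file_name, 'wb') as f:
    f.write(html.encode('utf-8'))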
IFrame(src=file_name, width = 1200, height=700)