# warc-twarc


This notebook demonstrates how to use [Webrecorder](http://webrecorder.io) to record a set of search results from Twitter and then extract the Twitter data for those results using [twarc](https://github.com/docnow/twarc). The notebook will show how to record a search for tweets from Donald Trump that mention "fake news" and print out the time that the tweet was sent and the text of the tweet.

If you'd like to skip the explanation and just use a script from the command line you can find that [here](https://github.com/edsu/warc-twarc/blob/master/warc-twarc.py). If you'd like to interact with this notebook online without running it locally, try it on [Binder](https://mybinder.org/):

[![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/edsu/warc-twarc/master?filepath=warc-twarc.ipynb)



---

## Step 1: Recording a Twitter Search

First go to Twitter and [search](https://twitter.com/search/) for something that's of interest. For example you can search for all of Donald Trump's tweets that mention "fake news" with the query [from:realDonaldTrump "fake news"](https://twitter.com/search?q=from%3ArealDonaldTrump%20%22fake%20news%22&src=typd). Check out the [advanced search](https://twitter.com/search-advanced) page for all the ways of fine tuning a search. Note that you can limit by time, which we will return to later.
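The search URL is simply the query text percent-encoded into the `q` parameter. As a minimal sketch using only Python's standard library, the same URL can be built from the query text. The `search_url` helper is a hypothetical name used for illustration and is not part of any library used in this notebook:

```python
from urllib.parse import quote

def search_url(query):
    # Hypothetical helper: percent-encode the query and place it in the
    # `q` parameter, mirroring the URL copied from the browser above.
    return 'https://twitter.com/search?q=' + quote(query) + '&src=typd'

url = search_url('from:realDonaldTrump "fake news"')
print(url)
```

Pasting the printed URL into Webrecorder should be equivalent to copying it from the browser's address bar.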

Now copy the URL in your browser and head over to [Webrecorder](https://webrecorder.io). Webrecorder lets you record the results of a series of interactions in your browser and download the data as a [WARC](https://en.wikipedia.org/wiki/Web_ARChive) file (more on WARC shortly). You'll probably want to register for an account since anonymous recording sessions will expire after some period of time.

Paste the URL for your Twitter search into the box and create a collection using the drop down box:

![Webrecorder homepage](images/webrecorder-01.png)

Once you do that you should see a screen something like this:

![Webrecorder homepage](images/webrecorder-02.png)

If you scroll to the bottom of the screen you should see that the page automatically loads more tweets that match the search query. This is the so-called *infinite scroll* behavior, where the web browser fetches more tweets from Twitter and puts them into the page automatically.

Fortunately Webrecorder records all of these background interactions, and even provides an *autoscroll* function (a button at the top left of the screen). When you click on `autoscroll` Webrecorder will automatically scroll to the bottom of each page, triggering more results. You can let this run until it stops or until you've gotten enough results.

Once you are done, hover over the *record* button just above *autoscroll* and click to *stop* the recording. You will then see a page that lists the recordings in your collection, along with the option to download the data (it's a little cloud with an arrow). Move the downloaded file into the directory containing this Jupyter notebook and set the `warc_file` variable to its name.

In [1]:
warc_file = "recording-session-20171220102048.warc.gz"

## Step 2: Extracting the Tweet Identifiers

Now that we have our recording, our next task is to read it and find all the tweet ids that were mentioned in the search results. The WARC file is a specialized file that contains all the [HTTP](https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol) transactions performed by your web browser, with some additional metadata. Fortunately the Webrecorder project has made a small Python library called [warcio](https://github.com/webrecorder/warcio) which makes it much easier to read (and write) WARC data.

Go and install warcio and a couple other modules we'll be needing:

    pip install warcio twarc BeautifulSoup4
 
The WARC file is broken up into chunks called *records*, which can be of different types (for example request, response and warcinfo). Each HTTP request results in an HTTP response, and warcinfo records carry metadata describing the records that follow them. In our case we are interested in the *response* records because we want to look for tweet identifiers in the responses that came back from Twitter.
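To get a feel for the mix of record types, here is a small sketch that tallies records by type. It relies only on the `rec_type` attribute that warcio records expose. The `count_record_types` name and the stand-in records are made up for illustration so that the example runs without a WARC file:

```python
from collections import Counter
from types import SimpleNamespace

def count_record_types(records):
    # Tally records by their `rec_type` attribute, the attribute warcio
    # records expose (e.g. 'request', 'response', 'warcinfo').
    return Counter(record.rec_type for record in records)

# Stand-in records so the sketch runs on its own; with a real recording
# you would pass warcio's ArchiveIterator(open(warc_file, 'rb')) instead.
sample = [SimpleNamespace(rec_type=t)
          for t in ('warcinfo', 'request', 'response', 'request', 'response')]

counts = count_record_types(sample)
print(counts)
```

In a real recording you would expect the request and response counts to be roughly equal, since each request the browser made is paired with the response that came back.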

Let's read through our file and extract the URLs that have been recorded (from the WARC record headers), and also look at the type of content being returned (using the Content-Type HTTP header of the response):
 

In [2]:
from warcio.archiveiterator import ArchiveIterator

for record in ArchiveIterator(open(warc_file, 'rb')):
    if record.rec_type == 'response':
        url = record.rec_headers.get_header('WARC-Target-URI')
        content_type = record.http_headers.get_header('Content-Type')
        print(content_type, url)
 

text/html;charset=utf-8 https://twitter.com/search?q=from%3ArealDonaldTrump%20%22fake%20news%22&src=typd
image/jpeg https://pbs.twimg.com/profile_images/874276197357596672/kUuht00m_bigger.jpg
image/png https://abs.twimg.com/emoji/v2/72x72/1f1fa-1f1f8.png
text/css https://abs.twimg.com/a/1513630383/css/t1/twitter_more_1.bundle.css
text/css https://abs.twimg.com/a/1513630383/css/t1/twitter_core.bundle.css
image/png https://abs.twimg.com/a/1513630383/img/search/ic_places_foursquare_logo.png
image/jpeg https://pbs.twimg.com/profile_images/907173400245915654/dyAyfiBr_bigger.jpg
image/png https://abs.twimg.com/a/1513630383/img/search/ic_places_yelp_logo.png
text/css https://abs.twimg.com/a/1513630383/css/t1/twitter_more_2.bundle.css
text/javascript; charset=utf-8 https://twitter.com/i/js_inst?c_name=ui_metrics
application/javascript; charset=utf-8 https://abs.twimg.com/k/en/init.en.3e62c1938034c4d4c7b5.js
text/javascript; charset=utf-8 https://twitter.com/i/js_inst?c_name=ui_metrics
image/jp

At the top of the output you should see a *text/html;charset=utf-8* response to a URL like:

 https://twitter.com/search?q=from%3ArealDonaldTrump%20%22fake%20news%22&src=typd

That's the initial HTML response we got from Twitter for our search. After that you should see a bunch of images, CSS, fonts and JavaScript being fetched. That's all the stuff the browser found in the HTML that was needed to render the page.

If you scroll down a little bit further you should see a `text/javascript; charset=utf-8` response for a long URL like this:

 https://twitter.com/i/search/timeline?vertical=default&q=from%3ArealDonaldTrump%20%22fake%20news%22&src=typd&composed_count=0&include_available_features=1&include_entities=1&include_new_items_bar=true&interval=240000&latent_count=0&min_position=TWEET-943135588496093190-943135588496093190-BD1UO2FFu9QAAAAAAAAVfAAAAAcAAABWAAAQAAAAAAAAEAAAAAAAAAAAQAAABAAAAAAAAAAAAAAAAAAABAAAAQAAAAQCAAAQAAAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAQAEAAAAQAAAAAAAEAAABAABEAABIAAAAAAAACQAAAABAQAgAQAAAAAAAAAAAAAgAAAAAACAAAAAAAAAAAAAAhCAEAEAACABAEAAAAAAAAAAIAAAgAABAAgAAAAAAAAAAAAAAAAAAAAgAAAAAACABAAAAAlAAAAQAAAACBEAAAAAAAgAAAAEAAAAAAAAEAABAAAAAAAAAAAAAAAEAAAAAABAAAAAABAAAAAAAEAAAAAAAAAAAARAAAAAAAAAAAAAAIAAAAAAAQAAQQAAEAAAQAAAAAQAECAAAAwAgAAAAAAEAAAQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAgAAAAAAAAAAAAAKAQAAAAAAAAAAAAAAgAAAAAAQAAAAAEgAAgAAAAAAEAAAAAAAAAAAAAKAEAAAAAAADAAACAgAAAAAAAAAAAAAAAAAAAAAAAAIAAAAAAAAAAAAAAAAAAAAIAAAgAAAAAAAAAACAAAgAAAAAgBAAAAAAAAAAQACAAAAAAAAAAAAAgAAAIAAAAAAAAAAAAAAIAAAAAAAAAAAgAAAAAAAABAAgAIAAAAAAAAAAAAACAEAAAAAAAAAAAAAAAAAAAAAAgAACAAgAAAABAAAAAAACAAAAAAAAAAAAAAAAAQAAAAgABAAAAAAAAEAAAAAAAAAAAAAQCABAAABAAAABAAAAAAAAAAAAAAAAAAQAAAAAAAAAAAAAAAAAACAAAAAAAAIAEAAAAAAAAAIAAAAAAAAAAAAAgAAAAAAAAAAAAAAAQACAIAAgBAAAAAAQFAAAAAAAAAAAAAAAAAA%3D%3D-R-0
 
That is the infinite scroll behavior where JavaScript goes and fetches more results to interleave into the page. So we need to extract tweet identifiers from two types of responses, the initial HTML page response, and the subsequent calls to fetch more data. Let's create two functions for doing that.
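Before writing those functions, it can help to see what the timeline URL is made of. The sketch below takes it apart with the standard library. The URL is a shortened stand-in for the one shown above: the very long `min_position` value and several other parameters are left out for readability.

```python
from urllib.parse import urlsplit, parse_qs

# A shortened stand-in for the timeline URL above; the real URL carries
# more parameters, including a very long min_position value.
timeline_url = ('https://twitter.com/i/search/timeline?vertical=default'
                '&q=from%3ArealDonaldTrump%20%22fake%20news%22&src=typd'
                '&include_entities=1')

parts = urlsplit(timeline_url)
params = parse_qs(parts.query)

print(parts.path)        # /i/search/timeline
print(params['q'][0])    # from:realDonaldTrump "fake news"
```

The path is what distinguishes these background requests from the initial `/search` page, while the `q` parameter carries the same query in both.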

### Extract Tweet Identifiers from HTML

Fortunately the tweet identifiers in the HTML are fairly easy to find. Each tweet is represented as a `<div>` that has a class *tweet*. This `<div>` has an attribute *data-tweet-id* which contains the tweet identifier. We can use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) to easily parse and work with the HTML:

In [3]:
from bs4 import BeautifulSoup

def extract_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    for div in soup.find_all('div', class_='tweet'):
        yield div.attrs.get('data-tweet-id')
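To make the markup concrete, here is a sketch that applies the same idea to a hand-written HTML fragment. It uses the standard library's `html.parser` rather than BeautifulSoup so that it runs with no extra installs. The fragment and the `TweetIdParser` class are made up for illustration, and the fragment is not Twitter's real markup:

```python
from html.parser import HTMLParser

class TweetIdParser(HTMLParser):
    """Collect data-tweet-id values from <div> elements with class 'tweet'."""

    def __init__(self):
        super().__init__()
        self.tweet_ids = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # A class attribute can list several classes separated by spaces.
        classes = (attrs.get('class') or '').split()
        if tag == 'div' and 'tweet' in classes and 'data-tweet-id' in attrs:
            self.tweet_ids.append(attrs['data-tweet-id'])

# Illustrative fragment only: two tweet divs and one unrelated div.
sample_html = '''
<div class="tweet js-stream-tweet" data-tweet-id="935838073618870272">...</div>
<div class="promo">not a tweet</div>
<div class="tweet" data-tweet-id="921726400922619904">...</div>
'''

parser = TweetIdParser()
parser.feed(sample_html)
print(parser.tweet_ids)
```

The `extract_html` function above does the same job with BeautifulSoup, and it is the one used in the rest of this notebook.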

Let's try it out by reading our WARC file again, but this time looking for HTTP responses from `https://twitter.com/search`. If we find any we'll get the response content and send it to our new `extract_html` function:

In [4]:
for record in ArchiveIterator(open(warc_file, 'rb')):
    if record.rec_type == 'response':
        url = record.rec_headers.get_header('WARC-Target-URI')
        if url.startswith('https://twitter.com/search?q='):
            for tweet_id in extract_html(record.content_stream().read()):
                print(tweet_id)

935838073618870272
921726400922619904
899623926082535425
921829947093733376
913034591879024640
929509950811881472
940223974985871360
914094625488502784
939634404267380736
939967625362276354
921207772233990144
917921548677328896
937279001684598784
920406959320371200
915539424406114304
918457595618365441
935844881825763328
922072236592435200
936688444046266368
943135588496093190


You should see 20 tweet identifiers, which are the first 20 returned when you do a Twitter search.
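As a side note that isn't needed for the rest of the notebook: tweet identifiers like these are "snowflake" ids, which carry the creation time in their upper bits. The sketch below decodes that time as a quick sanity check that the ids fall in the period you expected. The `tweet_time` helper is a hypothetical name, and the approach only applies to snowflake-era ids:

```python
from datetime import datetime, timezone

TWITTER_EPOCH_MS = 1288834974657  # start of the snowflake clock, in Unix ms

def tweet_time(tweet_id):
    # The bits above the lowest 22 hold milliseconds since the Twitter epoch.
    ms = (int(tweet_id) >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

when = tweet_time('921207772233990144')
print(when.strftime('%Y-%m-%d %H:%M:%S'))
```

This only recovers the timestamp. Everything else about the tweet still has to come from Twitter's API.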

### Extract Tweet Identifiers from JavaScript

Now let's create a function that can extract the tweet identifiers from the JavaScript responses that are triggered by the infinite scroll behavior:


In [5]:
import json

def extract_javascript(content):
    # make sure the content is decoded or else json.loads complains
    content = content.decode('utf-8')
    data = json.loads(content)
    for tweet_id in extract_html(data.get('items_html', '')):
        yield tweet_id
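To illustrate the shape of the data this function expects, here is a sketch with a made-up payload: a JSON object whose `items_html` value is a fragment of HTML. The payload is invented for illustration, and real responses contain other keys as well:

```python
import json

# A made-up payload shaped like the responses described above: a JSON
# object whose `items_html` value is a fragment of HTML.
raw = b'{"items_html": "<div class=\\"tweet\\" data-tweet-id=\\"940930017365778432\\"></div>"}'

data = json.loads(raw.decode('utf-8'))
fragment = data.get('items_html', '')
print(fragment)

# A response without the key falls back to an empty string, so there is
# simply nothing to parse.
empty = json.loads('{}').get('items_html', '')
```

Because the fragment is ordinary HTML, `extract_javascript` can hand it straight to `extract_html` instead of needing its own parsing logic.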

Let's try that function out too by reading our WARC file again and looking for `https://twitter.com/i/search/timeline` URLs, and handing them off to our new `extract_javascript` function:

In [6]:
for record in ArchiveIterator(open(warc_file, 'rb')):
    if record.rec_type == 'response':
        url = record.rec_headers.get_header('WARC-Target-URI')
        if url.startswith('https://twitter.com/i/search/timeline?'):
            for tweet_id in extract_javascript(record.content_stream().read()):
                print(tweet_id)

940930017365778432
907588803161939968
934563828834164739
925364408364171265
898964640817983488
918061437750267904
915907150333009920
901031532164468736
940554567414091776
900706146943717377
925333956110757888
926481563214376961
934551607596986368
939616077356642304
899411254061694979
918112884630093825
935874566701842434
937145025359761408
924251519121346560
898130328916824064
879682547235651584
894984126582972416
879648931172556802
894518002795900928
880771685460344832
894512983384129536
880015261004435456
868810522942164993
892383242535481344
874609480301936640
888575966259314691
914099295963553792
887477071160762369
869509894688387072
915894251967385600
914189344533024768
884020939264073728
884378624660582405
923147501418446849
868810404335673344
939485131693322240
890568797941362690
891437168798965761
879678356450676736
877372660455546880
889675644396867584
875690204564258816
887475373981696000
880017678978736129
872064426568036353
894514535062790144
886544734788997125
900352052068

Now we can combine these two functions to compose a function that takes a WARC file as a parameter and returns all the tweet ids contained in the WARC file!

In [7]:
def tweet_ids(warc_file):
    for record in ArchiveIterator(open(warc_file, 'rb')):
        if record.rec_type == 'response':

            url = record.rec_headers.get_header('WARC-Target-URI')
            content = record.content_stream().read()

            if url.startswith('https://twitter.com/search?q='):
                for tweet_id in extract_html(content):
                    yield tweet_id

            elif url.startswith('https://twitter.com/i/search/timeline?'):
                for tweet_id in extract_javascript(content):
                    yield tweet_id

We've seen our functions work before, but let's make sure the composed function works:

In [8]:
for tweet_id in tweet_ids(warc_file):
    print(tweet_id)

935838073618870272
921726400922619904
899623926082535425
921829947093733376
913034591879024640
929509950811881472
940223974985871360
914094625488502784
939634404267380736
939967625362276354
921207772233990144
917921548677328896
937279001684598784
920406959320371200
915539424406114304
918457595618365441
935844881825763328
922072236592435200
936688444046266368
943135588496093190
940930017365778432
907588803161939968
934563828834164739
925364408364171265
898964640817983488
918061437750267904
915907150333009920
901031532164468736
940554567414091776
900706146943717377
925333956110757888
926481563214376961
934551607596986368
939616077356642304
899411254061694979
918112884630093825
935874566701842434
937145025359761408
924251519121346560
898130328916824064
879682547235651584
894984126582972416
879648931172556802
894518002795900928
880771685460344832
894512983384129536
880015261004435456
868810522942164993
892383242535481344
874609480301936640
888575966259314691
914099295963553792
887477071160
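If a recording happens to contain the same tweet more than once (for example if the search page was loaded twice during the session, which is an assumption and not something shown above), you may want to drop the repeats before going further. A minimal sketch, where `unique` is a hypothetical helper name:

```python
def unique(ids):
    # Yield each identifier the first time it is seen, preserving order,
    # and skip None values (a div without a data-tweet-id attribute).
    seen = set()
    for tweet_id in ids:
        if tweet_id is not None and tweet_id not in seen:
            seen.add(tweet_id)
            yield tweet_id

sample_ids = ['935838073618870272', '921726400922619904',
              '935838073618870272', None]
deduped = list(unique(sample_ids))
print(deduped)
```

With the function defined above you would wrap the generator as `unique(tweet_ids(warc_file))`.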

## Step 3: Get the Twitter Data

Now that we have the tweet identifiers we can use [twarc](https://github.com/docnow/twarc) to fetch the Twitter JSON data for each tweet from Twitter's API. You will need to go to [apps.twitter.com](https://apps.twitter.com) to create an application, and get your API keys to use twarc.

Tell twarc about your keys by running this at the command line:

 twarc configure

Now you should be ready to use twarc to hydrate the tweet ids to get the rich structured data for a tweet:

In [9]:
import twarc

# If you haven't configured twarc from the command line you can also set the keys manually
# by uncommenting and filling in the following text:
#
# CONSUMER_KEY = ""
# CONSUMER_SECRET = ""
# ACCESS_TOKEN = ""
# ACCESS_TOKEN_SECRET = ""
#
# twitter = twarc.Twarc(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

twitter = twarc.Twarc()

for tweet in twitter.hydrate(tweet_ids(warc_file)):
    url = 'https://twitter.com/' + tweet['user']['screen_name'] + '/status/' + tweet['id_str']
    print(url, tweet['created_at'], tweet['full_text'])

https://twitter.com/realDonaldTrump/status/921207772233990144 Fri Oct 20 02:53:42 +0000 2017 The Fake News is going crazy with wacky Congresswoman Wilson(D), who was SECRETLY on a very personal call, and gave a total lie on content!
https://twitter.com/realDonaldTrump/status/900352052068401154 Wed Aug 23 13:40:31 +0000 2017 Last night in Phoenix I read the things from my statements on Charlottesville that the Fake News Media didn't cover fairly. People got it!
https://twitter.com/realDonaldTrump/status/874576057579565056 Tue Jun 13 10:35:55 +0000 2017 The Fake News Media has never been so wrong or so dirty. Purposely incorrect stories and phony sources to meet their agenda of hate. Sad!
https://twitter.com/realDonaldTrump/status/887477071160762369 Wed Jul 19 00:59:56 +0000 2017 The Fake News is becoming more and more dishonest! Even a dinner arranged for top 20 leaders in Germany is made to look sinister!
https://twitter.com/realDonaldTrump/status/880017678978736129 Wed Jun 28 10:58:59 +000
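The `created_at` values printed above are plain strings. If you want to sort or filter tweets by time, one option is to parse them into `datetime` objects. This is a sketch using only the standard library. The `parse_created_at` name is hypothetical, and the format string assumes an English (C) locale for the day and month abbreviations:

```python
from datetime import datetime, timezone

def parse_created_at(value):
    # created_at strings look like 'Fri Oct 20 02:53:42 +0000 2017'.
    return datetime.strptime(value, '%a %b %d %H:%M:%S %z %Y')

sent = parse_created_at('Fri Oct 20 02:53:42 +0000 2017')
print(sent.isoformat())
```

The result is timezone-aware, so values parsed this way can be compared with each other directly.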

If it helps I've bundled all this into a single little utility here as [warc-twarc.py](https://github.com/edsu/warc-twarc/blob/master/warc-twarc.py). Maybe it would be fun to do something a bit more interesting with the tweet metadata. I leave that for a future notebook, and to you!