{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# warc-twarc\n", "\n", "\n", "This notebook demonstrates how to use [Webrecorder](http://webrecorder.io) to record a set of search results from Twitter and then extract the Twitter data for those results using [twarc](https://github.com/docnow/twarc). The notebook will show how to record a search for tweets from Donald Trump that mention \"fake news\" and print out the time each tweet was sent along with its text.\n", "\n", "If you'd like to skip the explanation and just use a script from the command line you can find that [here](https://github.com/edsu/warc-twarc/blob/master/warc-twarc.py). If you'd like to interact with this notebook online without running it locally, try it on [Binder](https://mybinder.org/):\n", "\n", "[![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/edsu/warc-twarc/master?filepath=warc-twarc.ipynb)\n", "\n", "\n", "\n", "---\n", "\n", "## Step 1: Recording a Twitter Search\n", "\n", "First go to Twitter and [search](https://twitter.com/search/) for something that's of interest. For example you can search for all of Donald Trump's tweets that mention \"fake news\" with the query [from:realDonaldTrump \"fake news\"](https://twitter.com/search?q=from%3ArealDonaldTrump%20%22fake%20news%22&src=typd). Check out the [advanced search](https://twitter.com/search-advanced) page for all the ways of fine-tuning a search. Note that you can limit by time, which we will return to later.\n", "\n", "Now copy the URL in your browser and head over to [Webrecorder](https://webrecorder.io). Webrecorder lets you record the results of a series of interactions in your browser and download the data as a [WARC](https://en.wikipedia.org/wiki/Web_ARChive) file (more on WARC shortly). 
You'll probably want to register for an account since anonymous recording sessions will expire after some period of time.\n", "\n", "Paste the URL for your Twitter search into the box and create a collection using the drop-down box:\n", "\n", "![Webrecorder homepage](images/webrecorder-01.png)\n", "\n", "Once you do that you should see a screen something like this:\n", "\n", "![Webrecorder homepage](images/webrecorder-02.png)\n", "\n", "If you scroll to the bottom of the screen you should see that the page automatically loads more tweets that match the search query. This is the so-called *infinite scroll* behavior, where the web browser goes and fetches more tweets from Twitter and puts them into the page automatically.\n", "\n", "Fortunately Webrecorder records all of these background interactions, and even provides an *autoscroll* function (a button at the top left of the screen). When you click on `autoscroll` Webrecorder will automatically scroll to the bottom of each page, triggering more results. You can let this run until it stops or until you've gotten enough results.\n", "\n", "Once you are done, hover over the *record* button just above *autoscroll* and click to *stop* the recording. You will then see a page that lists the recordings in your collection. You should also see the option to download the data (it's a little cloud with an arrow). Move the downloaded file into the directory containing this Jupyter notebook and set the `warc_file` variable." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "warc_file = \"recording-session-20171220102048.warc.gz\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Extracting the Tweet Identifiers\n", "\n", "Now that we have our recording, our next task is to read it and find all the tweet ids that were mentioned in the search results. 
The WARC file is a specialized file that contains all the [HTTP](https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol) transactions performed by your web browser with some additional metadata. Fortunately the Webrecorder project has made a small Python library called [warcio](https://github.com/webrecorder/warcio) which makes it much easier to read (and write) WARC data.\n", "\n", "Go and install warcio and a couple of other modules we'll be needing:\n", "\n", "    pip install warcio twarc BeautifulSoup4\n", "    \n", "The WARC file is broken up into chunks called *records* which can be of different types (request, response and warcinfo, among others). Each HTTP request and the HTTP response it produced are stored as separate records, while a warcinfo record carries metadata about the WARC file itself. In our case we are interested in the *response* records because we want to look for tweet identifiers in the responses that came back from Twitter.\n", "\n", "Let's read through our file and extract the URLs that have been recorded (from the WARC record headers), and also look at the type of content being returned (using the Content-Type HTTP header of the response):\n", " " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "text/html;charset=utf-8 https://twitter.com/search?q=from%3ArealDonaldTrump%20%22fake%20news%22&src=typd\n", "image/jpeg https://pbs.twimg.com/profile_images/874276197357596672/kUuht00m_bigger.jpg\n", "image/png https://abs.twimg.com/emoji/v2/72x72/1f1fa-1f1f8.png\n", "text/css https://abs.twimg.com/a/1513630383/css/t1/twitter_more_1.bundle.css\n", "text/css https://abs.twimg.com/a/1513630383/css/t1/twitter_core.bundle.css\n", "image/png https://abs.twimg.com/a/1513630383/img/search/ic_places_foursquare_logo.png\n", "image/jpeg https://pbs.twimg.com/profile_images/907173400245915654/dyAyfiBr_bigger.jpg\n", "image/png https://abs.twimg.com/a/1513630383/img/search/ic_places_yelp_logo.png\n", "text/css 
https://abs.twimg.com/a/1513630383/css/t1/twitter_more_2.bundle.css\n", "text/javascript; charset=utf-8 https://twitter.com/i/js_inst?c_name=ui_metrics\n", "application/javascript; charset=utf-8 https://abs.twimg.com/k/en/init.en.3e62c1938034c4d4c7b5.js\n", "text/javascript; charset=utf-8 https://twitter.com/i/js_inst?c_name=ui_metrics\n", "image/jpeg https://pbs.twimg.com/profile_banners/789177639558713344/1513292647/600x200\n", "image/gif https://abs.twimg.com/a/1513630383/img/t1/spinners/spinner-rosetta-gray-32x32.gif\n", "application/font-woff https://abs.twimg.com/a/1513630383/font/edge-icons-Regular.woff\n", "image/png https://abs.twimg.com/a/1513630383/img/animations/web_heart_animation_edge.png\n", "image/jpeg https://pbs.twimg.com/profile_banners/25073877/1513397710/600x200\n", "application/javascript; charset=utf-8 https://abs.twimg.com/k/en/10.pages_search.en.a3ea2e4006f9168b0b78.js\n", "application/javascript; charset=utf-8 https://abs.twimg.com/k/en/0.commons.en.98cf8691152ce7d80c4d.js\n", "application/json;charset=utf-8 https://analytics.twitter.com/tpm/p?_=1513763894666\n", "text/javascript; charset=utf-8 https://twitter.com/i/trends?k=&pc=true&query=from%3ArealDonaldTrump%20%22fake%20news%22&show_context=true&src=module\n", "text/javascript http://www.google-analytics.com/analytics.js\n", "image/gif;charset=utf-8 https://syndication.twitter.com/i/jot/syndication?l=%7B%22_category_%22%3A%22syndicated_impression%22%2C%22event_namespace%22%3A%7B%22client%22%3A%22web%22%2C%22page%22%3A%22search%22%2C%22action%22%3A%22impression%22%7D%2C%22triggered_on%22%3A1513763895552%7D\n", "application/javascript;charset=utf-8 https://twitter.com/push_service_worker.js\n", "text/javascript; charset=utf-8 https://twitter.com/i/jot\n", "image/gif 
https://www.google-analytics.com/r/collect?v=1&_v=j66&aip=1&a=1920068369&t=pageview&_s=1&dl=https%3A%2F%2Ftwitter.com%2Fsearch%3Fq%3Dfrom%253ArealDonaldTrump%2520%2522fake%2520news%2522%26src%3Dtypd&dp=%2Fanon%2Fsearch%2Fdefault&ul=en-us&de=UTF-8&dt=from%3ArealDonaldTrump%20%22fake%20news%22%20-%20Twitter%20Search&sd=24-bit&sr=1440x900&vp=1032x656&je=0&_u=YEBAAQAB~&jid=535701800&gjid=854188133&cid=652698663.1513763896&tid=UA-30775-6&_gid=457140905.1513763896&_r=1&z=1283758507\n", "text/javascript; charset=utf-8 https://twitter.com/i/profiles/popup?user_id=25073877&wants_hovercard=true&_=1513763894669\n", "text/javascript; charset=utf-8 https://twitter.com/i/search/timeline?vertical=default&q=from%3ArealDonaldTrump%20%22fake%20news%22&src=typd&composed_count=0&include_available_features=1&include_entities=1&include_new_items_bar=true&interval=240000&latent_count=0&min_position=TWEET-943135588496093190-943135588496093190-BD1UO2FFu9QAAAAAAAAVfAAAAAcAAABWAAAQAAAAAAAAEAAAAAAAAAAAQAAABAAAAAAAAAAAAAAAAAAABAAAAQAAAAQCAAAQAAAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAQAEAAAAQAAAAAAAEAAABAABEAABIAAAAAAAACQAAAABAQAgAQAAAAAAAAAAAAAgAAAAAACAAAAAAAAAAAAAAhCAEAEAACABAEAAAAAAAAAAIAAAgAABAAgAAAAAAAAAAAAAAAAAAAAgAAAAAACABAAAAAlAAAAQAAAACBEAAAAAAAgAAAAEAAAAAAAAEAABAAAAAAAAAAAAAAAEAAAAAABAAAAAABAAAAAAAEAAAAAAAAAAAARAAAAAAAAAAAAAAIAAAAAAAQAAQQAAEAAAQAAAAAQAECAAAAwAgAAAAAAEAAAQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAgAAAAAAAAAAAAAKAQAAAAAAAAAAAAAAgAAAAAAQAAAAAEgAAgAAAAAAEAAAAAAAAAAAAAKAEAAAAAAADAAACAgAAAAAAAAAAAAAAAAAAAAAAAAIAAAAAAAAAAAAAAAAAAAAIAAAgAAAAAAAAAACAAAgAAAAAgBAAAAAAAAAAQACAAAAAAAAAAAAAgAAAIAAAAAAAAAAAAAAIAAAAAAAAAAAgAAAAAAAABAAgAIAAAAAAAAAAAAACAEAAAAAAAAAAAAAAAAAAAAAAgAACAAgAAAABAAAAAAACAAAAAAAAAAAAAAAAAQAAAAgABAAAAAAAAEAAAAAAAAAAAAAQCABAAABAAAABAAAAAAAAAAAAAAAAAAQAAAAAAAAAAAAAAAAAACAAAAAAAAIAEAAAAAAAAAIAAAAAAAAAAAAAgAAAAAAAAAAAAAAAQACAIAAgBAAAAAAQFAAAAAAAAAAAAAAAAAA%3D%3D-R-0\n", "text/javascript; charset=utf-8 
https://twitter.com/i/search/timeline?vertical=default&q=from%3ArealDonaldTrump%20%22fake%20news%22&src=typd&composed_count=0&include_available_features=1&include_entities=1&include_new_items_bar=true&interval=30000&latent_count=0&min_position=TWEET-943135588496093190-943135588496093190-BD1UO2FFu9QAAAAAAAAVfAAAAAcAAABWAAAQAAAAAAAAEAAAAAAAAAAAQAAABAAAAAAAAAAAAAAAAAAABAAAAQAAAAQCAAAQAAAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAQAEAAAAQAAAAAAAEAAABAABEAABIAAAAAAAACQAAAABAQAgAQAAAAAAAAAAAAAgAAAAAACAAAAAAAAAAAAAAhCAEAEAACABAEAAAAAAAAAAIAAAgAABAAgAAAAAAAAAAAAAAAAAAAAgAAAAAACABAAAAAlAAAAQAAAACBEAAAAAAAgAAAAEAAAAAAAAEAABAAAAAAAAAAAAAAAEAAAAAABAAAAAABAAAAAAAEAAAAAAAAAAAARAAAAAAAAAAAAAAIAAAAAAAQAAQQAAEAAAQAAAAAQAECAAAAwAgAAAAAAEAAAQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAgAAAAAAAAAAAAAKAQAAAAAAAAAAAAAAgAAAAAAQAAAAAEgAAgAAAAAAEAAAAAAAAAAAAAKAEAAAAAAADAAACAgAAAAAAAAAAAAAAAAAAAAAAAAIAAAAAAAAAAAAAAAAAAAAIAAAgAAAAAAAAAACAAAgAAAAAgBAAAAAAAAAAQACAAAAAAAAAAAAAgAAAIAAAAAAAAAAAAAAIAAAAAAAAAAAgAAAAAAAABAAgAIAAAAAAAAAAAAACAEAAAAAAAAAAAAAAAAAAAAAAgAACAAgAAAABAAAAAAACAAAAAAAAAAAAAAAAAQAAAAgABAAAAAAAAEAAAAAAAAAAAAAQCABAAABAAAABAAAAAAAAAAAAAAAAAAQAAAAAAAAAAAAAAAAAACAAAAAAAAIAEAAAAAAAAAIAAAAAAAAAAAAAgAAAAAAAAAAAAAAAQACAIAAgBAAAAAAQFAAAAAAAAAAAAAAAAAA%3D%3D-R-0\n", "text/javascript; charset=utf-8 https://twitter.com/i/trends?k=&pc=true&query=from%3ArealDonaldTrump%20%22fake%20news%22&show_context=true&src=module\n", "text/javascript; charset=utf-8 
https://twitter.com/i/search/timeline?vertical=default&q=from%3ArealDonaldTrump%20%22fake%20news%22&src=typd&include_available_features=1&include_entities=1&max_position=TWEET-943135588496093190-943135588496093190-BD1UO2FFu9QAAAAAAAAVfAAAAAcAAABWAAAQAAAAAAAAEAAAAAAAAAAAQAAABAAAAAAAAAAAAAAAAAAABAAAAQAAAAQCAAAQAAAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAQAEAAAAQAAAAAAAEAAABAABEAABIAAAAAAAACQAAAABAQAgAQAAAAAAAAAAAAAgAAAAAACAAAAAAAAAAAAAAhCAEAEAACABAEAAAAAAAAAAIAAAgAABAAgAAAAAAAAAAAAAAAAAAAAgAAAAAACABAAAAAlAAAAQAAAACBEAAAAAAAgAAAAEAAAAAAAAEAABAAAAAAAAAAAAAAAEAAAAAABAAAAAABAAAAAAAEAAAAAAAAAAAARAAAAAAAAAAAAAAIAAAAAAAQAAQQAAEAAAQAAAAAQAECAAAAwAgAAAAAAEAAAQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAgAAAAAAAAAAAAAKAQAAAAAAAAAAAAAAgAAAAAAQAAAAAEgAAgAAAAAAEAAAAAAAAAAAAAKAEAAAAAAADAAACAgAAAAAAAAAAAAAAAAAAAAAAAAIAAAAAAAAAAAAAAAAAAAAIAAAgAAAAAAAAAACAAAgAAAAAgBAAAAAAAAAAQACAAAAAAAAAAAAAgAAAIAAAAAAAAAAAAAAIAAAAAAAAAAAgAAAAAAAABAAgAIAAAAAAAAAAAAACAEAAAAAAAAAAAAAAAAAAAAAAgAACAAgAAAABAAAAAAACAAAAAAAAAAAAAAAAAQAAAAgABAAAAAAAAEAAAAAAAAAAAAAQCABAAABAAAABAAAAAAAAAAAAAAAAAAQAAAAAAAAAAAAAAAAAACAAAAAAAAIAEAAAAAAAAAIAAAAAAAAAAAAAgAAAAAAAAAAAAAAAQACAIAAgBAAAAAAQFAAAAAAAAAAAAAAAAAA%3D%3D-R-0&reset_error_state=false\n", "image/jpeg https://pbs.twimg.com/media/DQov0hEWsAAtwFq.jpg\n", "image/jpeg https://pbs.twimg.com/media/DQov0hCWkAATjA8.jpg\n", "image/jpeg https://pbs.twimg.com/media/DQov0hEXkAAyg3m.jpg\n", "image/jpeg https://pbs.twimg.com/media/DQov0hCWkAIry5w.jpg\n", "text/javascript; charset=utf-8 
https://twitter.com/i/search/timeline?vertical=default&q=from%3ArealDonaldTrump%20%22fake%20news%22&src=typd&include_available_features=1&include_entities=1&max_position=TWEET-940930017365778432-943135588496093190-BD1UO2FFu9QAAAAAAAAVfAAAAAcAAABWAAAQggAAJAAAEAAAAAAAAAAAQAAABBkAAAgAAABAAAAAIAACBAAAkQAAAAQCAAAQAAAAEAAAAAAAEAAAAAAAgEAABCAAgAAAAAAAAQAMAQAAQAEAAAAAEAACBAABEAABIDAAQAQBACQRAAABgwAgAQAAAAAAAAEAAAAggAAAAACAAAAAAAAAAAAAghCAEAEAACABAEAQAACAAAAAIABAgAABACgAAAQAAAAAAAAAAAgABACgAAAAAACGBAAAAAtCAAAQAAAQCBEAAAAAAAgAAAAEAAAAAACAEAABAAAIAAAgAAAAAAAEAAgBAAhAAAAAABAAAAAAAEAQAAIIAAAAAARAAAAAAAAAAAAAAIAAAEAAEQAASQAAFAAAQAAAAAQAFCEAQAwAgAAAAAAEAAAQAAAQAACAAABAAAAAAAAAAAAAABCAIACCAAAAAAAAAAAgAAAAAAAAAAAAAKAQAAAAAAAAAAAAAAgAAAAAEQABgAAEmgAgAABAAAEBAAAAAEAAAAAAKAEACAAIAADAAQCAgAAAAAAAAAAAAAAAAAAAAAgIAIAAAAAAAAgAAAAIAAAAACIAAAgAAAAAEUAAICAAAgAAAAAgBAAEAIAAAAAQACAAACAIJIAAAAAgAAAIAIAAAAAQQAAAAAIAAAYAAAAAAAgAACAAAEABAAgAIAQAAAAAAAAAAACAEAQgAAAACAAAAAAQAACAAAAgAAiAAgAAAARAAAAAAAGAAAAAAAAAAAABAAAASAAAAgABAACAAAAAEAAAAAAAAAAAAAQCABAAiFAAQABBAAAAAAAAAAAgAAAAQQAAACAQAACAAAAAAAAACAAAAAAAAIAEAAAAAABAAIAEAAABEAAQAAAgAAAAAAAAAAEAAAAQEGAIAAgBAAAAAAQFAAAAAAAAAACAAAAAAA%3D%3D-R-0&reset_error_state=false\n", "text/javascript; charset=utf-8 
https://twitter.com/i/search/timeline?vertical=default&q=from%3ArealDonaldTrump%20%22fake%20news%22&src=typd&include_available_features=1&include_entities=1&max_position=TWEET-923147501418446849-943135588496093190-BD1UO2FFu9QAAAAAAAAVfAAAAAcAAABWAAAQggAAJAAAEAAAAAAAAAAAQAAABB0AAgoAAABAAAAAIAECBAgAmSAAABRCAAAQAgAAFAAAAAAAUAAAAEAAgEAAhCAAgAAAAQAAAQANAQAAQAEAQAAAEAAChAUBEAAJIDAAQCQBgCQRAAABhwAgAQAAAAAAAAEEAEAggAAAAACAAABQAAAAAAAAghCQEAEAACABAMAQAACAAAAAIABAgAABACgAAAQAAAAAAIAAAAgABACgAAAAAACGJAAAAE9CAACQAgAQCBEAAAAACAgAAACEBAAgAACAEAABAAAIABAgAAAAAAAEAAoBAAhAAAAAABAAAAAAAEAUAAIIAAAAAARAAAAAAAAEAFAAAIQAAEACEQAASQAAFAIAQBAAAAQAFCEAQEwAgAAAAAAEAAAQAAAQgACAAABAAAAEECAACACBABCIIACCEAAAAAAAAAAgAAAABAAEAAAQAKAQAEAAAAAAAASAAAgAAABAEYABgBAEmgAgAABAAAEBAAAAAGAAACAAKAEAGAAIAADAAQCAgAACAAAAAAAAAAAAEAAACEgMAIAAAQEACAkAAAAICAAAACIAAAgCEAAAEUAAICAAEgAAAAAgBAAEgIAAAAAQACAAACAIJIACAAAlBAgJAIAAAAAQQAAAAAIAIAYAAEAAAAgAAKAIAUABAAgAIgQAIAARAAASAACAEAQgAAgACAAACAAQEACAAAggAAiAAggAAARAAAAAABGAAAAAAAAAAAABAAAASAAAAoABAACAAAAAUAAAAAAAAAAAAAUCABAAiFCAQgBBAAABAAAAAAAgAAgAQQAAACAQAACAAAIAAAQACAAAAAAAAKAEQAAAAAhBAIAEAAABEAAQAAAhAAAAEAAAAAEAAAAQMGAIABgBAAAAABQFAAAAAAAAAACAgAAAAA%3D%3D-R-0&reset_error_state=false\n", "text/javascript; charset=utf-8 
https://twitter.com/i/search/timeline?vertical=default&q=from%3ArealDonaldTrump%20%22fake%20news%22&src=typd&include_available_features=1&include_entities=1&max_position=TWEET-939485131693322240-943135588496093190-BD1UO2FFu9QAAAAAAAAVfAAAAAcAAABWAAAQwgAAJAAAEEAAAAAAAAAAQIABBB0AAgoAAABAgAAAIAECBAgAmSAAABRCAAQQAwAAFAAAAAIAUAAAAEEAiEQAhCAAgAAAAQEAAQANEQAAQAMAQQgAEEAChAUBEAANIDAAQCQDgCQVAAABhwIgASAAgAACAAEEAkAggIAAAACAAABQAAAAAAAgghCQECEAACABAMAQQACAAAAAIAhAgAABACwAACQAAAAAQIAABAhABICgAAAAAACGJAAAAU9CAAGQAhAQCBEAAAQACEgAIACEBAAgIACAEAABAAAIABAgQAAAAAAEAAoBAAhCAAAAABAAAAAEAEAUAAIIAAAAgBRAAEAAQAoEAFBAAIQAEEAKEQAgSQAAFAIAQBAAAEQIFCEAQEwAiAAAAAAEAAAQAAAwgACQAABAQAAEESAACACBQBCIYACSEAAAAAAAAAAgAAAABQAEAgAQAKAQAEAAAAAABASAAAgAAQBAEYABgBEEmgCgAADAAAEBAAAAAGAAACAAKEEAGAAIIIDAAQCAgAACAUAAAAAAAAAAUAAACEgMAIAAAQEACAkAAAAICAAAICIAgAgCEAAAEUAAICBAEgAAABAgBAQEiIAAAAAQACAAgCAIJIQCAAAlBAgJAJCAAAAQQAAAAAIAIAaAAEAAAQgAAKAIAUABAAgAIgQQIEARAAASAACAEAZgEIgACAEACAAQECqAAAggAAiAAggAAAZAEAAAgBGAAAAAAAAAAAIBAAAASBAAAoABAwCAABAAUAEAIAAgQBAAAAUCABAAmFCAQgFDIAABEAAAAAAgAAgAQQAgACCQAACAABIAAAQACAAAAIAAAKAEQAAAQAhBAIAEQAABEAAQAAAhAAQBEAAAAAEAAAAQMGAIABhBAAAgABQFAAAAAIAAAACAkAAAAA%3D%3D-R-0&reset_error_state=false\n", "text/javascript; charset=utf-8 
https://twitter.com/i/search/timeline?vertical=default&q=from%3ArealDonaldTrump%20%22fake%20news%22&src=typd&include_available_features=1&include_entities=1&max_position=TWEET--943135588496093190-BD1UO2FFu9QAAAAAAAAVfAAAAAcAAABWAAAQwiAAJAAAEEAAAAAAAAAAQIABBB0ABgqAAABAgAAAKAECBAgAmSAAABRCAAQQAwAEFIAAAAIAUAAAAEEAiEQAhCAgwAAAAQEAAQANEREAQAMAQQgAEUADhAUBEAANIDAARCQTgCQVAAABhwIgISAAgABCAAEEAkAggIACAACgAABQBAAAAAAgghCQECEAACABAMAQQgCAAAAAIAhAgAABQCwAACUAAAAAUIAABAhABICgAAAAAACGJEAAAU9CAAmQghAQCBEAAAQACEoAIACEBAAgIACAEAABAAAIABAgRAAAAQAEAAoBCAhCAAAAABAAAAAOAMAUAAIIAAAAgBRAAEQAQAoEAFBAAIQAEEIKEQAgSQIAlAIAQBAAAEQIFCEEQEwAiAAAEEAEAAAQAAAygACQAABAQAAEESAACICBQBCIYACSEAAAAAAAAAAgAgIABQAEAgAQAKAQAEAAAAAABASAAAiAAQRAEYABgBEEmgCgAADAAEEBAAAAAGAAACIAKEEAGAIIIIDAAQCAgAICAUAAAAAAAAIgUAAACEmMAIAAAQEACAkAAAAoCAAAICIAgggCEAAAMUABICBAEgAAABAgBAQEiIAAAIAQACAAgCAYJIQCAAAlBAgJAJCAAAAQQAABgAIAIAaAAEAAAQgEAKAMAUABAAgQIgQQIUARgAASAACAEEZgEIgACAEACAAQECqAAAggAAiAAgggAAZAEAAAgBGAAAAAAAAAAAIBAAABSRAAAoABAwCAABACUEEAIAAgQBAAAAUCABAEmFCAQgFDIBABEAAAAAAgAAwAQQAgACCQAACAAhIAAAQACAAAAIAAAKCEQAAAQIhBAIBEQAABEAAQAAAhACQBEABAAAEAAAASMGAIABhBAAAgABUFAAABAIAAAACImKAAAA%3D%3D-T-0&reset_error_state=false\n", "text/javascript; charset=utf-8 
https://twitter.com/i/search/timeline?vertical=default&q=from%3ArealDonaldTrump%20%22fake%20news%22&src=typd&include_available_features=1&include_entities=1&max_position=TWEET-914269704440737792-943135588496093190-BD1UO2FFu9QAAAAAAAAVfAAAAAcAAABWAAAQwiAAJAAAEEAAAAAAAAAAQIABBB0ABgqAAABAgAAAKAECBAgAmSAAABRCAAQQAwAEFIAAAAIAUAAAAEEAiEQAhCAgwAAAAQEAAQANEREAQAMAQQgAEUADhAUBEAANIDAARCQTgCQVAAABhwIgISAAgABCAAEEAkAggIACAACgAABQBAAAAAAgghCQECEAACABAMAQQgCAAAAAIAhAgAABQCwAACUAAAAAUIAABAhABICgAAAAAACGJEAAAU9CAAmQghAQCBEAAAQACEoAIACEBAAgIACAEAABAAAIABAgRAAAAQAEAAoBCAhCAAAAABAAAAAOAMAUAAIIAAAAgBRAAEQAQAoEAFBAAIQAEEIKEQAgSQIAlAIAQBAAAEQIFCEEQEwAiAAAEEAEAAAQAAAygACQAABAQAAEESAACICBQBCIYACSEAAAAAAAAAAgAgIABQAEAgAQAKAQAEAAAAAABASAAAiAAQRAEYABgBEEmgCgAADAAEEBAAAAAGAAACIAKEEAGAIIIIDAAQCAgAICAUAAAAAAAAIgUAAACEmMAIAAAQEACAkAAAAoCAAAICIAgggCEAAAMUABICBAEgAAABAgBAQEiIAAAIAQACAAgCAYJIQCAAAlBAgJAJCAAAAQQAABgAIAIAaAAEAAAQgEAKAMAUABAAgQIgQQIUARgAASAACAEEZgEIgACAEACAAQECqAAAggAAiAAgggAAZAEAAAgBGAAAAAAAAAAAIBAAABSRAAAoABAwCAABACUEEAIAAgQBAAAAUCABAEmFCAQgFDIBABEAAAAAAgAAwAQQAgACCQAACAAhIAAAQACAAAAIAAAKCEQAAAQIhBAIBEQAABEAAQAAAhACQBEABAAAEAAAASMGAIABhBAAAgABUFAAABAIAAAACImKAAAA%3D%3D-T-0-1&reset_error_state=false\n", "text/javascript; charset=utf-8 https://twitter.com/i/jot\n" ] } ], "source": [ "from warcio.archiveiterator import ArchiveIterator\n", "\n", "for record in ArchiveIterator(open(warc_file, 'rb')):\n", " if record.rec_type == 'response':\n", " url = record.rec_headers.get_header('WARC-Target-URI')\n", " content_type = record.http_headers.get_header('Content-Type')\n", " print(content_type, url)\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At the top of the output you should see *text/html;charset=utf-8* response to a URL like:\n", "\n", " https://twitter.com/search?q=from%3ArealDonaldTrump%20%22fake%20news%22&src=typd\n", "\n", "That's our intial HTML response we got from Twitter for our search. 
After that you should see a bunch of images, css, fonts and javascript being fetched. That's all the stuff the browser found in the HTML that was needed to render the page. \n", "\n", "If you scroll down a little bit further you should see a `text/javascript; charset=utf-8` response for a long URL like this:\n", "\n", " https://twitter.com/i/search/timeline?vertical=default&q=from%3ArealDonaldTrump%20%22fake%20news%22&src=typd&composed_count=0&include_available_features=1&include_entities=1&include_new_items_bar=true&interval=240000&latent_count=0&min_position=TWEET-943135588496093190-943135588496093190-BD1UO2FFu9QAAAAAAAAVfAAAAAcAAABWAAAQAAAAAAAAEAAAAAAAAAAAQAAABAAAAAAAAAAAAAAAAAAABAAAAQAAAAQCAAAQAAAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAQAEAAAAQAAAAAAAEAAABAABEAABIAAAAAAAACQAAAABAQAgAQAAAAAAAAAAAAAgAAAAAACAAAAAAAAAAAAAAhCAEAEAACABAEAAAAAAAAAAIAAAgAABAAgAAAAAAAAAAAAAAAAAAAAgAAAAAACABAAAAAlAAAAQAAAACBEAAAAAAAgAAAAEAAAAAAAAEAABAAAAAAAAAAAAAAAEAAAAAABAAAAAABAAAAAAAEAAAAAAAAAAAARAAAAAAAAAAAAAAIAAAAAAAQAAQQAAEAAAQAAAAAQAECAAAAwAgAAAAAAEAAAQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAgAAAAAAAAAAAAAKAQAAAAAAAAAAAAAAgAAAAAAQAAAAAEgAAgAAAAAAEAAAAAAAAAAAAAKAEAAAAAAADAAACAgAAAAAAAAAAAAAAAAAAAAAAAAIAAAAAAAAAAAAAAAAAAAAIAAAgAAAAAAAAAACAAAgAAAAAgBAAAAAAAAAAQACAAAAAAAAAAAAAgAAAIAAAAAAAAAAAAAAIAAAAAAAAAAAgAAAAAAAABAAgAIAAAAAAAAAAAAACAEAAAAAAAAAAAAAAAAAAAAAAgAACAAgAAAABAAAAAAACAAAAAAAAAAAAAAAAAQAAAAgABAAAAAAAAEAAAAAAAAAAAAAQCABAAABAAAABAAAAAAAAAAAAAAAAAAQAAAAAAAAAAAAAAAAAACAAAAAAAAIAEAAAAAAAAAIAAAAAAAAAAAAAgAAAAAAAAAAAAAAAQACAIAAgBAAAAAAQFAAAAAAAAAAAAAAAAAA%3D%3D-R-0\n", " \n", "That is the infinite scroll behavior where JavaScript goes and fetches more results to interleave into the page. So we need to extract tweet identifiers from two types of responses, the initial HTML page response, and the subsequent calls to fetch more data. 
Let's create two functions for doing that.\n", "\n", "### Extract Tweet Identifiers from HTML\n", "\n", "Fortunately the tweet identifiers in the HTML are fairly easy to find. Each tweet is represented as a `<div>` that has a class *tweet*. This `<div>` has an attribute *data-tweet-id* which contains the tweet identifier. We can use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) to easily parse and work with the HTML:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "def extract_html(html):\n", "    soup = BeautifulSoup(html, 'html.parser')\n", "    for div in soup.find_all('div', class_='tweet'):\n", "        yield div.attrs.get('data-tweet-id')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try it out by reading our WARC file again, but this time looking for HTTP responses from `https://twitter.com/search`. If we find any we'll get the response content and send it to our new `extract_html` function:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "935838073618870272\n", "921726400922619904\n", "899623926082535425\n", "921829947093733376\n", "913034591879024640\n", "929509950811881472\n", "940223974985871360\n", "914094625488502784\n", "939634404267380736\n", "939967625362276354\n", "921207772233990144\n", "917921548677328896\n", "937279001684598784\n", "920406959320371200\n", "915539424406114304\n", "918457595618365441\n", "935844881825763328\n", "922072236592435200\n", "936688444046266368\n", "943135588496093190\n" ] } ], "source": [ "for record in ArchiveIterator(open(warc_file, 'rb')):\n", "    if record.rec_type == 'response':\n", "        url = record.rec_headers.get_header('WARC-Target-URI')\n", "        if url.startswith('https://twitter.com/search?q='):\n", "            for tweet_id in extract_html(record.content_stream().read()):\n", "                print(tweet_id)" ] }, { "cell_type": "markdown", 
"metadata": {}, "source": [ "You should see 20 tweet identifiers, which are the first 20 returned when you do a Twitter search." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extract Tweet Identifiers from JavaScript\n", "\n", "Now let's create a function that can extract the tweet identifiers from the JavaScript responses that are triggered by the infinite scroll behavior:\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "import json\n", "\n", "def extract_javascript(content):\n", "    # make sure the content is decoded or else json.loads complains\n", "    content = content.decode('utf-8')\n", "    data = json.loads(content)\n", "    for tweet_id in extract_html(data.get('items_html', '')):\n", "        yield tweet_id" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try that function out too by reading our WARC file again and looking for `https://twitter.com/i/search/timeline` URLs, and handing them off to our new `extract_javascript` function:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "940930017365778432\n", "907588803161939968\n", "934563828834164739\n", "925364408364171265\n", "898964640817983488\n", "918061437750267904\n", "915907150333009920\n", "901031532164468736\n", "940554567414091776\n", "900706146943717377\n", "925333956110757888\n", "926481563214376961\n", "934551607596986368\n", "939616077356642304\n", "899411254061694979\n", "918112884630093825\n", "935874566701842434\n", "937145025359761408\n", "924251519121346560\n", "898130328916824064\n", "879682547235651584\n", "894984126582972416\n", "879648931172556802\n", "894518002795900928\n", "880771685460344832\n", "894512983384129536\n", "880015261004435456\n", "868810522942164993\n", "892383242535481344\n", "874609480301936640\n", "888575966259314691\n", "914099295963553792\n", "887477071160762369\n", "869509894688387072\n", "915894251967385600\n", 
"914189344533024768\n", "884020939264073728\n", "884378624660582405\n", "923147501418446849\n", "868810404335673344\n", "939485131693322240\n", "890568797941362690\n", "891437168798965761\n", "879678356450676736\n", "877372660455546880\n", "889675644396867584\n", "875690204564258816\n", "887475373981696000\n", "880017678978736129\n", "872064426568036353\n", "894514535062790144\n", "886544734788997125\n", "900352052068401154\n", "868985285207629825\n", "886534810575020032\n", "889435104841523201\n", "911189860769255424\n", "888724194820857857\n", "874576057579565056\n", "881983493533822976\n", "880049704620494848\n", "892920397162848257\n", "918796079243677696\n", "889673743873843200\n", "899625157421039616\n", "939480342779580416\n", "935147410472480769\n", "939849867438034944\n", "914465475777695744\n", "897223558073602049\n", "894653195112378368\n", "894367017054208001\n", "883230130885324802\n", "881847676232503297\n", "921709468055896064\n", "914269704440737792\n" ] } ], "source": [ "for record in ArchiveIterator(open(warc_file, 'rb')):\n", "    if record.rec_type == 'response':\n", "        url = record.rec_headers.get_header('WARC-Target-URI')\n", "        if url.startswith('https://twitter.com/i/search/timeline?'):\n", "            for tweet_id in extract_javascript(record.content_stream().read()):\n", "                print(tweet_id)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can combine these two functions to compose a function that takes a WARC file as a parameter and returns all the tweet ids contained in the WARC file!" 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def tweet_ids(warc_file):\n", "    for record in ArchiveIterator(open(warc_file, 'rb')):\n", "        if record.rec_type == 'response':\n", "\n", "            url = record.rec_headers.get_header('WARC-Target-URI')\n", "            content = record.content_stream().read()\n", "\n", "            if url.startswith('https://twitter.com/search?q='):\n", "                for tweet_id in extract_html(content):\n", "                    yield tweet_id\n", "\n", "            elif url.startswith('https://twitter.com/i/search/timeline?'):\n", "                for tweet_id in extract_javascript(content):\n", "                    yield tweet_id" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We've seen our functions work before, but let's make sure the composed function works:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "935838073618870272\n", "921726400922619904\n", "899623926082535425\n", "921829947093733376\n", "913034591879024640\n", "929509950811881472\n", "940223974985871360\n", "914094625488502784\n", "939634404267380736\n", "939967625362276354\n", "921207772233990144\n", "917921548677328896\n", "937279001684598784\n", "920406959320371200\n", "915539424406114304\n", "918457595618365441\n", "935844881825763328\n", "922072236592435200\n", "936688444046266368\n", "943135588496093190\n", "940930017365778432\n", "907588803161939968\n", "934563828834164739\n", "925364408364171265\n", "898964640817983488\n", "918061437750267904\n", "915907150333009920\n", "901031532164468736\n", "940554567414091776\n", "900706146943717377\n", "925333956110757888\n", "926481563214376961\n", "934551607596986368\n", "939616077356642304\n", "899411254061694979\n", "918112884630093825\n", "935874566701842434\n", "937145025359761408\n", "924251519121346560\n", "898130328916824064\n", "879682547235651584\n", "894984126582972416\n", "879648931172556802\n", "894518002795900928\n", "880771685460344832\n", "894512983384129536\n", 
"880015261004435456\n", "868810522942164993\n", "892383242535481344\n", "874609480301936640\n", "888575966259314691\n", "914099295963553792\n", "887477071160762369\n", "869509894688387072\n", "915894251967385600\n", "914189344533024768\n", "884020939264073728\n", "884378624660582405\n", "923147501418446849\n", "868810404335673344\n", "939485131693322240\n", "890568797941362690\n", "891437168798965761\n", "879678356450676736\n", "877372660455546880\n", "889675644396867584\n", "875690204564258816\n", "887475373981696000\n", "880017678978736129\n", "872064426568036353\n", "894514535062790144\n", "886544734788997125\n", "900352052068401154\n", "868985285207629825\n", "886534810575020032\n", "889435104841523201\n", "911189860769255424\n", "888724194820857857\n", "874576057579565056\n", "881983493533822976\n", "880049704620494848\n", "892920397162848257\n", "918796079243677696\n", "889673743873843200\n", "899625157421039616\n", "939480342779580416\n", "935147410472480769\n", "939849867438034944\n", "914465475777695744\n", "897223558073602049\n", "894653195112378368\n", "894367017054208001\n", "883230130885324802\n", "881847676232503297\n", "921709468055896064\n", "914269704440737792\n" ] } ], "source": [ "for tweet_id in tweet_ids(warc_file):\n", " print(tweet_id)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Get the Twitter Data\n", "\n", "Now that we have the tweet identifiers we can use [twarc](https://github.com/docnow/twarc) to fetch the Twitter JSON data for each tweet from Twitter's API. 
You will need to go to [apps.twitter.com](https://apps.twitter.com) to create an application, and get your API keys to use twarc.\n", "\n", "Tell twarc about your keys by running this at the command line:\n", "\n", " twarc configure\n", "\n", "Now you should be ready to use twarc to hydrate the tweet ids to get the rich structured data for a tweet:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://twitter.com/realDonaldTrump/status921207772233990144 Fri Oct 20 02:53:42 +0000 2017 The Fake News is going crazy with wacky Congresswoman Wilson(D), who was SECRETLY on a very personal call, and gave a total lie on content!\n", "https://twitter.com/realDonaldTrump/status900352052068401154 Wed Aug 23 13:40:31 +0000 2017 Last night in Phoenix I read the things from my statements on Charlottesville that the Fake News Media didn't cover fairly. People got it!\n", "https://twitter.com/realDonaldTrump/status874576057579565056 Tue Jun 13 10:35:55 +0000 2017 The Fake News Media has never been so wrong or so dirty. Purposely incorrect stories and phony sources to meet their agenda of hate. Sad!\n", "https://twitter.com/realDonaldTrump/status887477071160762369 Wed Jul 19 00:59:56 +0000 2017 The Fake News is becoming more and more dishonest! Even a dinner arranged for top 20 leaders in Germany is made to look sinister!\n", "https://twitter.com/realDonaldTrump/status880017678978736129 Wed Jun 28 10:58:59 +0000 2017 Some of the Fake News Media likes to say that I am not totally engaged in healthcare. Wrong, I know the subject well & want victory for U.S.\n", "https://twitter.com/realDonaldTrump/status939616077356642304 Sat Dec 09 22:01:44 +0000 2017 .@DaveWeigel @WashingtonPost put out a phony photo of an empty arena hours before I arrived @ the venue, w/ thousands of people outside, on their way in. Real photos now shown as I spoke. Packed house, many people unable to get in. 
Demand apology & retraction from FAKE NEWS WaPo! https://t.co/XAblFGh1ob\n", "https://twitter.com/realDonaldTrump/status918061437750267904 Wed Oct 11 10:31:18 +0000 2017 It would be really nice if the Fake News Media would report the virtually unprecedented Stock Market growth since the election.Need tax cuts\n", "https://twitter.com/realDonaldTrump/status899411254061694979 Sun Aug 20 23:22:07 +0000 2017 Heading back to Washington after working hard and watching some of the worst and most dishonest Fake News reporting I have ever seen!\n", "https://twitter.com/realDonaldTrump/status892920397162848257 Thu Aug 03 01:29:46 +0000 2017 I love the White House, one of the most beautiful buildings (homes) I have ever seen. But Fake News said I called it a dump - TOTALLY UNTRUE\n", "https://twitter.com/realDonaldTrump/status940554567414091776 Tue Dec 12 12:10:58 +0000 2017 Despite thousands of hours wasted and many millions of dollars spent, the Democrats have been unable to show any collusion with Russia - so now they are moving on to the false accusations and fabricated stories of women who I don’t know and/or have never met. FAKE NEWS!\n", "https://twitter.com/realDonaldTrump/status922072236592435200 Sun Oct 22 12:08:47 +0000 2017 It is finally sinking through. 46% OF PEOPLE BELIEVE MAJOR NATIONAL NEWS ORGS FABRICATE STORIES ABOUT ME. FAKE NEWS, even worse! Lost cred.\n", "https://twitter.com/realDonaldTrump/status921726400922619904 Sat Oct 21 13:14:33 +0000 2017 Stock Market hits another all time high on Friday. 5.3 trillion dollars up since Election. Fake News doesn't spent much time on this!\n", "https://twitter.com/realDonaldTrump/status939634404267380736 Sat Dec 09 23:14:34 +0000 2017 .@daveweigel of the Washington Post just admitted that his picture was a FAKE (fraud?) showing an almost empty arena last night for my speech in Pensacola when, in fact, he knew the arena was packed (as shown also on T.V.). 
FAKE NEWS, he should be fired.\n", "https://twitter.com/realDonaldTrump/status915539424406114304 Wed Oct 04 11:29:43 +0000 2017 Wow, so many Fake News stories today. No matter what I do or say, they will not write or speak truth. The Fake News Media is out of control!\n", "https://twitter.com/realDonaldTrump/status869509894688387072 Tue May 30 11:04:48 +0000 2017 Russian officials must be laughing at the U.S. & how a lame excuse for why the Dems lost the election has taken over the Fake News.\n", "https://twitter.com/realDonaldTrump/status921709468055896064 Sat Oct 21 12:07:16 +0000 2017 I hope the Fake News Media keeps talking about Wacky Congresswoman Wilson in that she, as a representative, is killing the Democrat Party!\n", "https://twitter.com/realDonaldTrump/status889673743873843200 Tue Jul 25 02:28:44 +0000 2017 So many stories about me in the @washingtonpost are Fake News. They are as bad as ratings challenged @CNN. Lobbyist for Amazon and taxes?\n", "https://twitter.com/realDonaldTrump/status918457595618365441 Thu Oct 12 12:45:29 +0000 2017 The Fake News Is going all out in order to demean and denigrate! Such hatred!\n", "https://twitter.com/realDonaldTrump/status879678356450676736 Tue Jun 27 12:30:38 +0000 2017 Fake News CNN is looking at big management changes now that they got caught falsely pushing their phony Russian stories. Ratings way down!\n", "https://twitter.com/realDonaldTrump/status939849867438034944 Sun Dec 10 13:30:44 +0000 2017 Things are going really well for our economy, a subject the Fake News spends as little time as possible discussing! Stock Market hit another RECORD HIGH, unemployment is now at a 17 year low and companies are coming back into the USA. Really good news, and much more to come!\n", "https://twitter.com/realDonaldTrump/status940223974985871360 Mon Dec 11 14:17:18 +0000 2017 Another false story, this time in the Failing @nytimes, that I watch 4-8 hours of television a day - Wrong! 
Also, I seldom, if ever, watch CNN or MSNBC, both of which I consider Fake News. I never watch Don Lemon, who I once called the “dumbest man on television!” Bad Reporting.\n", "https://twitter.com/realDonaldTrump/status929509950811881472 Sun Nov 12 00:43:36 +0000 2017 Does the Fake News Media remember when Crooked Hillary Clinton, as Secretary of State, was begging Russia to be our friend with the misspelled reset button? Obama tried also, but he had zero chemistry with Putin.\n", "https://twitter.com/realDonaldTrump/status898964640817983488 Sat Aug 19 17:47:26 +0000 2017 Steve Bannon will be a tough and smart new voice at @BreitbartNews...maybe even better than ever before. Fake News needs the competition!\n", "https://twitter.com/realDonaldTrump/status881983493533822976 Mon Jul 03 21:10:25 +0000 2017 Dow hit a new intraday all-time high! I wonder whether or not the Fake News Media will so report?\n", "https://twitter.com/realDonaldTrump/status880771685460344832 Fri Jun 30 12:55:08 +0000 2017 Watched low rated @Morning_Joe for first time in long time. FAKE NEWS. He called me to stop a National Enquirer article. I said no! Bad show\n", "https://twitter.com/realDonaldTrump/status880015261004435456 Wed Jun 28 10:49:22 +0000 2017 The failing @nytimes writes false story after false story about me. They don't even call to verify the facts of a story. A Fake News Joke!\n", "https://twitter.com/realDonaldTrump/status939480342779580416 Sat Dec 09 13:02:23 +0000 2017 Fake News CNN made a vicious and purposeful mistake yesterday. They were caught red handed, just like lonely Brian Ross at ABC News (who should be immediately fired for his “mistake”). Watch to see if @CNN fires those responsible, or was it just gross incompetence?\n", "https://twitter.com/realDonaldTrump/status939485131693322240 Sat Dec 09 13:21:24 +0000 2017 CNN’S slogan is CNN, THE MOST TRUSTED NAME IN NEWS. Everyone knows this is not true, that this could, in fact, be a fraud on the American Public. 
There are many outlets that are far more trusted than Fake News CNN. Their slogan should be CNN, THE LEAST TRUSTED NAME IN NEWS!\n", "https://twitter.com/realDonaldTrump/status937145025359761408 Sun Dec 03 02:22:40 +0000 2017 Congratulations to @ABC News for suspending Brian Ross for his horrendously inaccurate and dishonest report on the Russia, Russia, Russia Witch Hunt. More Networks and “papers” should do the same with their Fake News!\n", "https://twitter.com/realDonaldTrump/status937279001684598784 Sun Dec 03 11:15:02 +0000 2017 I never asked Comey to stop investigating Flynn. Just more Fake News covering another Comey lie!\n", "https://twitter.com/realDonaldTrump/status935838073618870272 Wed Nov 29 11:49:18 +0000 2017 Great, and we should boycott Fake News CNN. Dealing with them is a total waste of time! https://t.co/8zJ3j7g5el\n", "https://twitter.com/realDonaldTrump/status884378624660582405 Mon Jul 10 11:47:49 +0000 2017 If Chelsea Clinton were asked to hold the seat for her mother,as her mother gave our country away, the Fake News would say CHELSEA FOR PRES!\n", "https://twitter.com/realDonaldTrump/status899623926082535425 Mon Aug 21 13:27:12 +0000 2017 Jerry Falwell of Liberty University was fantastic on @foxandfriends. The Fake News should listen to what he had to say. Thanks Jerry!\n", "https://twitter.com/realDonaldTrump/status907588803161939968 Tue Sep 12 12:56:47 +0000 2017 Fascinating to watch people writing books and major articles about me and yet they know nothing about me & have zero access. #FAKE NEWS!\n", "https://twitter.com/realDonaldTrump/status886544734788997125 Sun Jul 16 11:15:10 +0000 2017 With all of its phony unnamed sources & highly slanted & even fraudulent reporting, #Fake News is DISTORTING DEMOCRACY in our country!\n", "https://twitter.com/realDonaldTrump/status891437168798965761 Sat Jul 29 23:15:57 +0000 2017 I love reading about all of the \"geniuses\" who were so instrumental in my election success. 
Problem is, most don't exist. #Fake News! MAGA\n", "https://twitter.com/realDonaldTrump/status894653195112378368 Mon Aug 07 20:15:18 +0000 2017 The Fake News Media will not talk about the importance of the United Nations Security Council's 15-0 vote in favor of sanctions on N. Korea!\n", "https://twitter.com/realDonaldTrump/status915907150333009920 Thu Oct 05 11:50:56 +0000 2017 Rex Tillerson never threatened to resign. This is Fake News put out by @NBCNews. Low news and reporting standards. No verification from me.\n", "https://twitter.com/realDonaldTrump/status918796079243677696 Fri Oct 13 11:10:30 +0000 2017 Sadly, they and others are Fake News, and the public is just beginning to figure it out! https://t.co/8B8AyA7V1s\n", "https://twitter.com/realDonaldTrump/status901031532164468736 Fri Aug 25 10:40:32 +0000 2017 General John Kelly is doing a fantastic job as Chief of Staff. There is tremendous spirit and talent in the W.H. Don't believe the Fake News\n", "https://twitter.com/realDonaldTrump/status936688444046266368 Fri Dec 01 20:08:22 +0000 2017 The media has been speculating that I fired Rex Tillerson or that he would be leaving soon - FAKE NEWS! He’s not leaving and while we disagree on certain subjects, (I call the final shots) we work well together and America is highly respected again!\n", "https://t.co/FrqiPLFJ1E\n", "https://twitter.com/realDonaldTrump/status888724194820857857 Sat Jul 22 11:35:34 +0000 2017 While all agree the U. S. 
President has the complete power to pardon, why think of that when only crime so far is LEAKS against us.FAKE NEWS\n", "https://twitter.com/realDonaldTrump/status880049704620494848 Wed Jun 28 13:06:14 +0000 2017 The #AmazonWashingtonPost, sometimes referred to as the guardian of Amazon not paying internet taxes (which they should) is FAKE NEWS!\n", "https://twitter.com/realDonaldTrump/status886534810575020032 Sun Jul 16 10:35:44 +0000 2017 HillaryClinton can illegally get the questions to the Debate & delete 33,000 emails but my son Don is being scorned by the Fake News Media?\n", "https://twitter.com/realDonaldTrump/status879648931172556802 Tue Jun 27 10:33:42 +0000 2017 Wow, CNN had to retract big story on \"Russia,\" with 3 employees forced to resign. What about all the other phony stories they do? FAKE NEWS!\n", "https://twitter.com/realDonaldTrump/status898130328916824064 Thu Aug 17 10:32:11 +0000 2017 The public is learning (even more so) how dishonest the Fake News is. They totally misrepresent what I say about hate, bigotry etc. Shame!\n", "https://twitter.com/realDonaldTrump/status924251519121346560 Sat Oct 28 12:28:28 +0000 2017 Just read the nice remarks by President Jimmy Carter about me and how badly I am treated by the press (Fake News). Thank you Mr. President!\n", "https://twitter.com/realDonaldTrump/status943135588496093190 Tue Dec 19 15:07:01 +0000 2017 A story in the @washingtonpost that I was close to “rescinding” the nomination of Justice Gorsuch prior to confirmation is FAKE NEWS. I never even wavered and am very proud of him and the job he is doing as a Justice of the U.S. Supreme Court. 
The unnamed sources don’t exist!\n", "https://twitter.com/realDonaldTrump/status889435104841523201 Mon Jul 24 10:40:28 +0000 2017 Drain the Swamp should be changed to Drain the Sewer - it's actually much worse than anyone ever thought, and it begins with the Fake News!\n", "https://twitter.com/realDonaldTrump/status894984126582972416 Tue Aug 08 18:10:18 +0000 2017 After 200 days, rarely has any Administration achieved what we have achieved..not even close! Don't believe the Fake News Suppression Polls!\n", "https://twitter.com/realDonaldTrump/status934551607596986368 Sat Nov 25 22:37:21 +0000 2017 .@FoxNews is MUCH more important in the United States than CNN, but outside of the U.S., CNN International is still a major source of (Fake) news, and they represent our Nation to the WORLD very poorly. The outside world does not see the truth from them!\n", "https://twitter.com/realDonaldTrump/status872064426568036353 Tue Jun 06 12:15:36 +0000 2017 Sorry folks, but if I would have relied on the Fake News of CNN, NBC, ABC, CBS, washpost or nytimes, I would have had ZERO chance winning WH\n", "https://twitter.com/realDonaldTrump/status935844881825763328 Wed Nov 29 12:16:21 +0000 2017 Wow, Matt Lauer was just fired from NBC for “inappropriate sexual behavior in the workplace.” But when will the top executives at NBC & Comcast be fired for putting out so much Fake News. Check out Andy Lack’s past!\n", "https://twitter.com/realDonaldTrump/status925333956110757888 Tue Oct 31 12:09:41 +0000 2017 The Fake News is working overtime. As Paul Manaforts lawyer said, there was \"no collusion\" and events mentioned took place long before he...\n", "https://twitter.com/realDonaldTrump/status918112884630093825 Wed Oct 11 13:55:44 +0000 2017 With all of the Fake News coming out of NBC and the Networks, at what point is it appropriate to challenge their License? 
Bad for country!\n", "https://twitter.com/realDonaldTrump/status894512983384129536 Mon Aug 07 10:58:09 +0000 2017 The Trump base is far bigger & stronger than ever before (despite some phony Fake News polling). Look at rallies in Penn, Iowa, Ohio.......\n", "https://twitter.com/realDonaldTrump/status874609480301936640 Tue Jun 13 12:48:44 +0000 2017 Fake News is at an all time high. Where is their apology to me for all of the incorrect stories???\n", "https://twitter.com/realDonaldTrump/status894518002795900928 Mon Aug 07 11:18:05 +0000 2017 Hard to believe that with 24/7 #Fake News on CNN, ABC, NBC, CBS, NYTIMES & WAPO, the Trump base is getting stronger!\n", "https://twitter.com/realDonaldTrump/status879682547235651584 Tue Jun 27 12:47:17 +0000 2017 So they caught Fake News CNN cold, but what about NBC, CBS & ABC? What about the failing @nytimes & @washingtonpost? They are all Fake News!\n", "https://twitter.com/realDonaldTrump/status915894251967385600 Thu Oct 05 10:59:40 +0000 2017 Why Isn't the Senate Intel Committee looking into the Fake News Networks in OUR country to see why so much of our news is just made up-FAKE!\n", "https://twitter.com/realDonaldTrump/status914189344533024768 Sat Sep 30 18:04:59 +0000 2017 Despite the Fake News Media in conjunction with the Dems, an amazing job is being done in Puerto Rico. 
Great people!\n", "https://twitter.com/realDonaldTrump/status888575966259314691 Sat Jul 22 01:46:33 +0000 2017 Sean Spicer is a wonderful person who took tremendous abuse from the Fake News Media - but his future is bright!\n", "https://twitter.com/realDonaldTrump/status914269704440737792 Sat Sep 30 23:24:18 +0000 2017 In analyzing the Alabama Primary race,FAKE NEWS always fails to mention that the candidate I endorsed went up MANY points after endorsement!\n", "https://twitter.com/realDonaldTrump/status868985285207629825 Mon May 29 00:20:11 +0000 2017 The Fake News Media works hard at disparaging & demeaning my use of social media because they don't want America to hear the real story!\n", "https://twitter.com/realDonaldTrump/status875690204564258816 Fri Jun 16 12:23:08 +0000 2017 The Fake News Media hates when I use what has turned out to be my very powerful Social Media - over 100 million people! I can go around them\n", "https://twitter.com/realDonaldTrump/status900706146943717377 Thu Aug 24 13:07:34 +0000 2017 The Fake News is now complaining about my different types of back to back speeches. Well, there was Afghanistan (somber), the big Rally.....\n", "https://twitter.com/realDonaldTrump/status877372660455546880 Wed Jun 21 03:48:37 +0000 2017 Well, the Special Elections are over and those that want to MAKE AMERICA GREAT AGAIN are 5 and O! All the Fake News, all the money spent = 0\n", "https://twitter.com/realDonaldTrump/status887475373981696000 Wed Jul 19 00:53:12 +0000 2017 Fake News story of secret dinner with Putin is \"sick.\" All G 20 leaders, and spouses, were invited by the Chancellor of Germany. 
Press knew!\n", "https://twitter.com/realDonaldTrump/status894367017054208001 Mon Aug 07 01:18:08 +0000 2017 The Fake News refuses to report the success of the first 6 months: S.C., surging economy & jobs,border & military security,ISIS & MS-13 etc.\n", "https://twitter.com/realDonaldTrump/status881847676232503297 Mon Jul 03 12:10:44 +0000 2017 At some point the Fake News will be forced to discuss our great jobs numbers, strong economy, success with ISIS, the border & so much else!\n", "https://twitter.com/realDonaldTrump/status935874566701842434 Wed Nov 29 14:14:19 +0000 2017 So now that Matt Lauer is gone when will the Fake News practitioners at NBC be terminating the contract of Phil Griffin? And will they terminate low ratings Joe Scarborough based on the “unsolved mystery” that took place in Florida years ago? Investigate!\n", "https://twitter.com/realDonaldTrump/status889675644396867584 Tue Jul 25 02:36:17 +0000 2017 Is Fake News Washington Post being used as a lobbyist weapon against Congress to keep Politicians from looking into Amazon no-tax monopoly?\n", "https://twitter.com/realDonaldTrump/status868810522942164993 Sun May 28 12:45:45 +0000 2017 Does anyone notice how the Montana Congressional race was such a big deal to Dems & Fake News until the Republican won? V was poorly covered\n", "https://twitter.com/realDonaldTrump/status890568797941362690 Thu Jul 27 13:45:22 +0000 2017 ...about then candidate Trump.\" Catherine Herridge @FoxNews. So why doesn't Fake News report this? Witch Hunt! 
Purposely phony reporting.\n", "https://twitter.com/realDonaldTrump/status917921548677328896 Wed Oct 11 01:15:26 +0000 2017 The Fake News is at it again, this time trying to hurt one of the finest people I know, General John Kelly, by saying he will soon be.....\n", "https://twitter.com/realDonaldTrump/status940930017365778432 Wed Dec 13 13:02:52 +0000 2017 Wow, more than 90% of Fake News Media coverage of me is negative, with numerous forced retractions of untrue stories. Hence my use of Social Media, the only way to get the truth out. Much of Mainstream Meadia has become a joke! @foxandfriends\n", "https://twitter.com/realDonaldTrump/status868810404335673344 Sun May 28 12:45:16 +0000 2017 ....it is very possible that those sources don't exist but are made up by fake news writers. #FakeNews is the enemy!\n", "https://twitter.com/realDonaldTrump/status897223558073602049 Mon Aug 14 22:29:00 +0000 2017 Made additional remarks on Charlottesville and realize once again that the #Fake News Media will never be satisfied...truly bad people!\n", "https://twitter.com/realDonaldTrump/status911189860769255424 Fri Sep 22 11:26:06 +0000 2017 The greatest influence over our election was the Fake News Media \"screaming\" for Crooked Hillary Clinton. Next, she was a bad candidate!\n", "https://twitter.com/realDonaldTrump/status923147501418446849 Wed Oct 25 11:21:30 +0000 2017 \"Clinton campaign & DNC paid for research that led to the anti-Trump Fake News Dossier. The victim here is the President.\" @FoxNews\n", "https://twitter.com/realDonaldTrump/status914094625488502784 Sat Sep 30 11:48:36 +0000 2017 Fake News CNN and NBC are going out of their way to disparage our great First Responders as a way to \"get Trump.\" Not fair to FR or effort!\n", "https://twitter.com/realDonaldTrump/status920406959320371200 Tue Oct 17 21:51:34 +0000 2017 So much Fake News being put in dying magazines and newspapers. Only place worse may be @NBCNews, @CBSNews, @ABC and @CNN. 
Fiction writers!\n", "https://twitter.com/realDonaldTrump/status926481563214376961 Fri Nov 03 16:09:52 +0000 2017 The rigged Dem Primary, one of the biggest political stories in years, got ZERO coverage on Fake News Network TV last night. Disgraceful!\n", "https://twitter.com/realDonaldTrump/status899625157421039616 Mon Aug 21 13:32:06 +0000 2017 Thank you, the very dishonest Fake News Media is out of control! https://t.co/8J7y900VGK\n", "https://twitter.com/realDonaldTrump/status894514535062790144 Mon Aug 07 11:04:19 +0000 2017 ...and West Virginia. The fact is the Fake News Russian collusion story, record Stock Market, border security, military strength, jobs.....\n", "https://twitter.com/realDonaldTrump/status934563828834164739 Sat Nov 25 23:25:54 +0000 2017 Wow, even I didn’t realize we did so much. Wish the Fake News would report! Thank you. https://t.co/ApVbu2b0Jd\n", "https://twitter.com/realDonaldTrump/status914099295963553792 Sat Sep 30 12:07:09 +0000 2017 The Fake News Networks are working overtime in Puerto Rico doing their best to take the spirit away from our soldiers and first R's. Shame!\n", "https://twitter.com/realDonaldTrump/status921829947093733376 Sat Oct 21 20:06:00 +0000 2017 Keep hearing about \"tiny\" amount of money spent on Facebook ads. What about the billions of dollars of Fake News on CNN, ABC, NBC & CBS?\n", "https://twitter.com/realDonaldTrump/status939967625362276354 Sun Dec 10 21:18:40 +0000 2017 Very little discussion of all the purposely false and defamatory stories put out this week by the Fake News Media. They are out of control - correct reporting means nothing to them. Major lies written, then forced to be withdrawn after they are exposed...a stain on America!\n", "https://twitter.com/realDonaldTrump/status883230130885324802 Fri Jul 07 07:44:07 +0000 2017 I will represent our country well and fight for its interests! Fake News Media will never cover me accurately but who cares! 
We will #MAGA!\n", "https://twitter.com/realDonaldTrump/status914465475777695744 Sun Oct 01 12:22:14 +0000 2017 We have done a great job with the almost impossible situation in Puerto Rico. Outside of the Fake News or politically motivated ingrates,...\n", "https://twitter.com/realDonaldTrump/status884020939264073728 Sun Jul 09 12:06:30 +0000 2017 ...have it. Fake News said 17 intel agencies when actually 4 (had to apologize). Why did Obama do NOTHING when he had info before election?\n", "https://twitter.com/realDonaldTrump/status892383242535481344 Tue Aug 01 13:55:19 +0000 2017 Only the Fake News Media and Trump enemies want me to stop using Social Media (110 million people). Only way for me to get the truth out!\n", "https://twitter.com/realDonaldTrump/status935147410472480769 Mon Nov 27 14:04:51 +0000 2017 We should have a contest as to which of the Networks, plus CNN and not including Fox, is the most dishonest, corrupt and/or distorted in its political coverage of your favorite President (me). They are all bad. Winner to receive the FAKE NEWS TROPHY!\n", "https://twitter.com/realDonaldTrump/status913034591879024640 Wed Sep 27 13:36:24 +0000 2017 Facebook was always anti-Trump.The Networks were always anti-Trump hence,Fake News, @nytimes(apologized) & @WaPo were anti-Trump. Collusion?\n", "https://twitter.com/realDonaldTrump/status925364408364171265 Tue Oct 31 14:10:42 +0000 2017 ....earth shattering. He and his brother could Drain The Swamp, which would be yet another campaign promise fulfilled. 
Fake News weak!\n" ] } ], "source": [ "import twarc\n", "\n", "# If you haven't configured twarc from the command line you can also set the keys manually\n", "# by uncommenting and filling in the following text:\n", "#\n", "# CONSUMER_KEY = \"\"\n", "# CONSUMER_SECRET = \"\"\n", "# ACCESS_TOKEN = \"\"\n", "# ACCESS_TOKEN_SECRET = \"\"\n", "#\n", "# twitter = twarc.Twarc(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_TOKEN_SECRET)\n", "\n", "twitter = twarc.Twarc()\n", "\n", "# Hydrate each tweet id and print its URL, creation time and full text\n", "for tweet in twitter.hydrate(tweet_ids(warc_file)):\n", "    url = 'https://twitter.com/' + tweet['user']['screen_name'] + '/status/' + tweet['id_str']\n", "    print(url, tweet['created_at'], tweet['full_text'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If it helps, I've bundled all of this into a single little utility, [warc-twarc.py](https://github.com/edsu/warc-twarc/blob/master/warc-twarc.py). It would be fun to do something more interesting with the tweet metadata, but I leave that for a future notebook, and to you!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }