{
"cells": [
{
"cell_type": "markdown",
"id": "artificial-trauma",
"metadata": {},
"source": [
"# Capturing Twitter data with Twarc\n",
"\n",
"\n",
"# Background\n",
"During the July 2019 protests calling for the resignation of the Governor of Puerto Rico, Joel Blanco-Rivera began capturing tweets with twarc, a command line tool and Python library for archiving Twitter data. This tool was selected after reading about its use by members of Documenting the Now to archive Twitter data from other events. The goal was to capture tweets about the protests that used the hashtag #RickyRenuncia, which became the protest slogan on social media.\n",
"\n",
"This module uses the experience of archiving #RickyRenuncia Twitter data to explain how to capture Twitter data with twarc, generate reports, and export the data to CSV and GeoJSON files. By working through the module, you will be able to create and curate your own dataset.\n",
"\n",
"# Objectives\n",
"\n",
"## Learning goals\n",
"\n",
"## Expected Interaction\n",
"The main function of this module is to provide information about installing and running twarc. To use twarc, users will need to install and set up the tool on their computers.\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "dense-registrar",
"metadata": {},
"source": [
"## About Twarc\n",
"Twarc is a command line tool and Python library for archiving Twitter data. It uses the Twitter API to search for and capture tweets according to the parameters given in the search command. The captured tweets are saved as a JSON file. It was developed by Ed Summers from the Maryland Institute for Technology in the Humanities. It is also part of Documenting the Now, a project that creates tools for social media archiving to chronicle significant events, prioritizing ethical practices in the collection and preservation of social media content (https://www.docnow.io)."
]
},
{
"cell_type": "markdown",
"id": "ecological-cloud",
"metadata": {},
"source": [
"## Installing twarc\n",
"There are two things to do before installing twarc. First, register an application at https://developer.twitter.com. Second, install Python if you don't already have it. After these two steps, follow the installation instructions at https://twarc-project.readthedocs.io/en/latest, which cover both Mac and Windows. For this tutorial, twarc was used on a Mac and the commands are run in the Terminal application. The program is stored in a \"twarc-master\" folder. Once twarc is set up, we can begin collecting tweets.\n"
]
},
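{
"cell_type": "markdown",
"id": "twarc-install-sketch",
"metadata": {},
"source": [
"As a rough sketch of the setup described above, and assuming that Python and pip are already available and that twarc is installed from the Python Package Index (the linked instructions remain the authoritative reference), the steps in Terminal might look like this:\n",
"\n",
"```shell\n",
"# Assumption: twarc is installed with pip; follow the official instructions if your setup differs\n",
"pip install --upgrade twarc\n",
"\n",
"# Prompts for the API keys of the application registered at developer.twitter.com\n",
"twarc configure\n",
"```\n",
"\n",
"The configure step stores the keys so that they do not have to be typed with every command."
]
},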
{
"cell_type": "markdown",
"id": "royal-athens",
"metadata": {},
"source": [
"### Capturing and saving tweets\n",
"Open Terminal and change to the directory where the twarc-master folder is located:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "innocent-question",
"metadata": {},
"outputs": [],
"source": [
"from IPython.display import Image\n",
"\n",
"print('Changing the directory in Terminal')\n",
"\n",
"local_image = Image(filename='./images/twarc-rickyrenuncia-1.gif')\n",
"\n",
"local_image"
]
},
{
"cell_type": "markdown",
"id": "conditional-imaging",
"metadata": {},
"source": [
"The basic command for capturing tweets is \"twarc search\" followed by the search query. In this example, the query searches for tweets with #RickyRenuncia and saves the results as a JSON file.\n",
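"\n",
"A minimal sketch of that command, assuming twarc is already configured with your API keys; the output file name is only an illustration:\n",
"\n",
"```shell\n",
"# Search for tweets containing the hashtag and redirect the results to a file\n",
"twarc search '#RickyRenuncia' > tweetsRickyRenuncia-1.json\n",
"```\n",
"\n",
"The quotes ensure that the shell passes the # character to twarc literally, and twarc writes one tweet per line as JSON.\n",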
"\n",
"Since the Twitter API only returns tweets from the previous seven to nine days, you will very likely end up with multiple JSON files from the same search parameters. In this case, you can combine all the files and save them into one JSON file."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "vietnamese-chest",
"metadata": {},
"outputs": [],
"source": [
"from IPython.display import Image\n",
"\n",
"print('Combine multiple files')\n",
"\n",
"local_image = Image(filename='./images/twarc-rickyrenuncia-2.gif')\n",
"\n",
"local_image"
]
},
{
"cell_type": "markdown",
"id": "revised-belief",
"metadata": {},
"source": [
"The above clip shows the use of the cat command, where you can list the names of all the files you want to combine. You can also shorten the file name, as shown in the clip. In that case, all the files whose names begin with tweetsRickyRenuncia are identified and saved into one JSON file. Make sure that all the files are stored in the same folder."
]
},
{
"cell_type": "markdown",
"id": "premier-surrey",
"metadata": {},
"source": [
"Since the tweets stored in the individual files might overlap, the combined JSON file will have duplicates. To remove them, you can use one of the twarc utilities, which are stored in the /utils folder. To eliminate duplicates, use the deduplicate utility.\n",
"To create a GeoJSON file, use the geojson.py utility.\n",
"\n",
"The command to generate a report is reportprofile.py."
]
},
{
"cell_type": "markdown",
"id": "located-rapid",
"metadata": {},
"source": [
"For more information on the twarc-report utilities, visit https://github.com/pbinkley/twarc-report."
]
},
{
"cell_type": "markdown",
"id": "bba31e5f-53e4-4401-9f18-5e560eaaf303",
"metadata": {},
"source": [
"### Activity\n",
"Create a report of your dataset. \n",
"