{ "cells": [ { "cell_type": "markdown", "id": "e20d58b8-4915-4c41-bd49-c7fa4f3fa27b", "metadata": {}, "source": [ "# **Evaluating Content**\n", "\n", "Although the search processed used by the team to retrieve the Tweet ID list was rigurous, some materials in the list may be unrelated. It is possible some posts match a hashtag but discuss subjects unrelated to the historical events of #RickyRenuncia movement from summer 2019.\n", "\n", "\n", "\n", "## Objectives\n", "\n", "This notebook presents a minimal IPython graphical user interface (GUI) where participants, researchers and members of the original team could interact with content and classify it.\n", "\n", "### Learning Goals\n", "- Interact with twitter embedings.\n", "- Update SQLite3 database.\n", "- Classify content (minimum of 20 tweets).\n", "- Visualize state of the database.\n", "\n", "## Requirements\n", "\n", "**Tweeter API Credentials**\n", "\n", "The user will need to have created the `twitter_secrets.py` file based on `twitter_secrets_example.py` and set the variables to his API specifications. See [Twitter API Credentials](./Developer_Registration.ipynb) section." ] }, { "cell_type": "markdown", "id": "fcf73d3a-e4b7-4e8e-ae1b-924aa44b016e", "metadata": {}, "source": [ "## Optional Requirements\n", "\n", "**OPTIONAL**\n", "\n", "**Google API Credentials**\n", "\n", "A `google.oauth2.service_account.Credentials` object is required to interact with the google translate API to automatically see translations of text. This should help non-Spanish speakers interact with content in Spanish.\n", "\n", "The user will need to have created/edited the `google_translate_keys.json` following the [Google API Credentials](./Developer_Registration.ipynb#Google-API-Credentials) section. This is **optional**, but will offer the user automatic translation of tweet text content to english (or other language)." 
] }, { "cell_type": "markdown", "id": "14658cf3-dbb8-4cdb-9e4f-19865a7cecf1", "metadata": {}, "source": [ "## Import Libraries\n", "\n", "- `ipywidgets` and `IPython.display` provide the building blocks of the notebook GUI.\n", "- `json`, `os` and `pickle` handle serialization and file management.\n", "- `random` supports sampling tweets for classification.\n", "- `tweet_requester.display` provides the interactive classifier, the Twitter session helper (`TSess`) and the Google credentials helper.\n", "- `twitter_secrets` holds the user's Twitter API credentials." ] }, { "cell_type": "code", "execution_count": 2, "id": "34f2339e-93eb-4b7f-97c0-373f750b8568", "metadata": {}, "outputs": [], "source": [ "import ipywidgets as widgets\n", "from IPython.display import display, HTML, update_display\n", "import json, os, pickle\n", "from random import seed, randint\n", "# from tweet_requester.analysis import TweetJLAnalyzer, TweetAnalyzer\n", "from tweet_requester.display import TweetInteractiveClassifier, \\\n", "    JsonLInteractiveClassifier, TSess, prepare_google_credentials, PROCESSING_STAGES\n", "from twitter_secrets import C_BEARER_TOKEN\n", "\n", "JL_DATA = \"./tweetsRickyRenuncia-final.jsonl\"\n", "CACHE_DIR = \"tweet_cache\"\n", "SQLite_DB = \"tweets.db\"" ] }, { "cell_type": "markdown", "id": "7b8d5abf-cfaf-4648-9e00-a8376fe8b4e9", "metadata": {}, "source": [ "## Download Data Set\n", "\n", "A [dataset](https://ia601005.us.archive.org/31/items/tweetsRickyRenuncia-final/tweetsRickyRenuncia-final.txt) of the tweets related to the investigation is publicly available online. 
This list will be used as a basis for the analysis." ] }, { "cell_type": "code", "execution_count": 3, "id": "492268e2-481a-41ce-b455-02ce893a9288", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Data available at ./tweetsRickyRenuncia-final.jsonl.\n" ] } ], "source": [ "import requests\n", "from os.path import isfile, isdir\n", "\n", "tweet_list_url = \"https://ia601005.us.archive.org/31/items/tweetsRickyRenuncia-final/tweetsRickyRenuncia-final.txt\"\n", "\n", "# Download the dataset if not present\n", "if not isfile(JL_DATA):\n", "    response = requests.get(tweet_list_url)\n", "    with open(JL_DATA, 'wb') as handler:\n", "        handler.write(response.content)\n", "    print(f\"Data downloaded at {JL_DATA}.\")\n", "else:\n", "    print(f\"Data available at {JL_DATA}.\")" ] }, { "cell_type": "markdown", "id": "aee6766c-3b84-45d6-a068-21c6a995afd8", "metadata": {}, "source": [ "## Download Database and Cache\n", "\n", "The code below combines Python logic and terminal commands to download the *compressed* **database** and **cache**, extracting them only if needed. Terminal commands can be identified by the `!` symbol at the beginning of the line. The commands `wget`, `tar` and `gzip` are [Bash](https://en.wikipedia.org/wiki/Bash_(Unix_shell)) commands available outside of Python.\n", "\n", "A version of the database and cache is shared publicly through the [#RickyRenuncia Project Scalar Book](https://libarchivist.com/rrp/rickyrenuncia/index). 
The database and cache should make the experience faster by avoiding repeated requests to the Twitter API." ] }, { "cell_type": "code", "execution_count": 5, "id": "cb516037-2015-46d9-8453-0d46a0c40196", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Database already available\n", "Database already extracted.\n", "Compressed Cache already available!\n", "Cache already extracted!\n" ] } ], "source": [ "from os.path import isfile, isdir\n", "\n", "tmp_file = \"tweet_cache.tar.gz\"\n", "tweets_db_url = \"https://libarchivist.com/rrp/rickyrenuncia//tweets.db.gz\"\n", "tweet_cache_url = \"https://libarchivist.com/rrp/rickyrenuncia//tweet_cache.tar.gz\"\n", "\n", "# Download the database if not present\n", "if not isfile(\"tweets.db.gz\"):\n", "    !wget \"$tweets_db_url\"\n", "else:\n", "    print(\"Database already available\")\n", "\n", "# Extract the database if not present\n", "if not isfile(\"tweets.db\"):\n", "    !gzip -kd \"tweets.db.gz\"\n", "else:\n", "    print(\"Database already extracted.\")\n", "\n", "# Download the cache if not present\n", "if not isfile(tmp_file):\n", "    !wget \"$tweet_cache_url\"\n", "else:\n", "    print(\"Compressed Cache already available!\")\n", "\n", "# Extract the cache if not present\n", "if not isdir(CACHE_DIR):\n", "    !tar -xf $tmp_file\n", "    print(\"Cache extracted!\")\n", "else:\n", "    print(\"Cache already extracted!\")\n" ] }, { "cell_type": "markdown", "id": "69a9c916-785b-49d2-8160-db18af35b795", "metadata": {}, "source": [ "## Create a TSess\n", "The `TSess` object stores configuration and controls the connection used to retrieve content from the Twitter API. It is this object that requires your Twitter credentials to create a connection.\n", "\n", "**Twitter API Credentials** are required to create the session."
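, "\n", "As a quick sanity check on the `sleep_time` value used in the next cell: the Twitter API allows 300 tweet lookups per 15-minute window, so the minimum safe delay between requests is:\n", "\n", "```python\n", "window_seconds = 15 * 60  # length of a rate-limit window (900 seconds)\n", "max_lookups = 300         # tweet lookups allowed per window\n", "min_sleep = window_seconds / max_lookups\n", "print(min_sleep)  # 3.0 seconds per tweet\n", "```"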
] }, { "cell_type": "code", "execution_count": 4, "id": "98ae133a-8172-4ef1-845c-36bc53415b4b", "metadata": {}, "outputs": [], "source": [ "tweet_session = TSess(\n", "    C_BEARER_TOKEN,\n", "    compression_level=5,\n", "    sleep_time=3,  # Minimal sleep between requests to avoid hitting rate limits\n", "    cache_dir=CACHE_DIR,\n", "    hash_split=True\n", ")" ] }, { "cell_type": "markdown", "id": "6102afde-5195-4737-a40d-70a4c6b6bc58", "metadata": {}, "source": [ "The session also includes rate limiting for requests. For bearer-token app authentication the limit is 300 tweet lookups per 15 minutes (900 seconds), i.e. one tweet every 3 seconds. Read more at \"[Rate limits | Docs | Twitter Developer Platform](https://developer.twitter.com/en/docs/twitter-api/rate-limits)\"." ] }, { "cell_type": "markdown", "id": "88b6d5e0-1b44-4d25-a80e-e4ffbf458004", "metadata": {}, "source": [ "## Create Google Translate Credentials\n", "\n", "After following the **optional** instructions from [Google API Credentials](./Developer_Registration.ipynb#Google-API-Credentials), run the code below. If the user did not acquire any credentials, the code will default to no credentials." ] }, { "cell_type": "code", "execution_count": 5, "id": "714020af-b6cb-475a-afcd-622e4478a42e", "metadata": {}, "outputs": [], "source": [ "google_credentials = prepare_google_credentials(\n", "    credentials_file=\"./google_translate_keys.json\"\n", ")" ] }, { "cell_type": "markdown", "id": "15b5905b-dfe9-4a0b-9d03-c492d3fd4b29", "metadata": {}, "source": [ "## Create a JsonLInteractiveClassifier\n", "\n", "A `JsonLInteractiveClassifier` object handles interactions with a local SQLite database, the Twitter API (through a `TSess`) and a GUI for early data curation. 
\n", "\n", "In terms of the data integration process , the process of capturing the data from the API would fall under **Extraction**, the GUI for additional metadata and the SQLite database would fall under **Transform** as information is being stored in a different format for easier analysis. Any methods used to visualize the data or access attributes with less effort would fall under **Load**.\n", "\n", "In that sense the `JsonLInteractiveClassifier` handles multiple stages on the **ETL**. \n", "\n", "
" ] }