{ "cells": [ { "cell_type": "markdown", "id": "733d45f6-032c-42e2-83b6-0a160ebc9c5b", "metadata": {}, "source": [ "# Media Rating by Retweet and Quote Count\n", "\n", "This notebook is meant to follow [Evaluating Content](./Evaluating_Content.ipynb). The minimal requirements for this notebook are met by utilizing [this notebook](./Evaluating_Content.ipynb) first. \n", "\n", "The SQLite3 database created in that notebook will be accessed using the Pandas library as an alternative means of accessing the data.\n", "\n", "## Objectives\n", "\n", "Create big data visualizations using Pandas, Seaborn and Matplotlib packages. Interact with data from an SQLite3 database using Pandas.\n", "\n", "### Learning Goals\n", "- Use Pandas to extract SQLite3 database data.\n", "- Become familiar with Pandas Dataframes.\n", "- Utilize Seaborn package to create visualizations.\n", "- Recognize different types of graphs that can be used to represent multivariate datasets.\n", "\n", "## Requirements\n", "\n", "* Visit [Evaluating Content](./Evaluating_Content.ipynb) to prepare the environment for this notebook.\n", "\n" ] }, { "cell_type": "markdown", "id": "74652a69-084f-46a4-baab-8531736023e3", "metadata": {}, "source": [ "## Prepare the environment\n", "- Load Packages\n", "- Create a copy of the original database\n", "- Open a connection." ] }, { "cell_type": "code", "execution_count": null, "id": "01973e3e-fac0-4952-a0c1-01dd17f00acd", "metadata": {}, "outputs": [], "source": [ "# Load Packages\n", "\n", "# Enable Matplotlib Juupyter Widget Backend\n", "%matplotlib widget\n", "import sqlite3\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "from mpl_toolkits.mplot3d import Axes3D\n", "from matplotlib.colors import ListedColormap\n", "from shutil import copyfile" ] }, { "cell_type": "code", "execution_count": null, "id": "6112503d-b372-4ad0-bb50-acd5a5157a97", "metadata": {}, "outputs": [], "source": [ "# Copy Database\n", "DB_FILE = \"./pandas_tweets.db\"\n", "copyfile(\"tweets.db\", DB_FILE)" ] }, { "cell_type": "code", "execution_count": null, "id": "f87ac0fe-6a27-4342-a278-7c2718957688", "metadata": {}, "outputs": [], "source": [ "# Connect to database copy\n", "connection = sqlite3.connect(DB_FILE)" ] }, { "cell_type": "markdown", "id": "a9d1446c-cc16-4f2f-b14e-b0bc9f501560", "metadata": {}, "source": [ "## **Access Database Query**\n", "\n", "Pandas `pd` includes a method called `pd.read_sql_query` that given an SQL query and a database connection will generate a Dataframe.\n", "\n", "### **The `sqlite_master` Table**\n", "\n", "Lets try it getting the table names from the database copy using the `connection` and this query:\n", "```\n", "SELECT name, sql \n", "FROM sqlite_master \n", "WHERE type='table';\n", "```\n", "\n", "The table `sqlite_master` is part of the SQLite3 structure and can be used to get information about the structure of the database. This query in particular retrieves the tables and the commands used to create them." ] }, { "cell_type": "code", "execution_count": null, "id": "eb2a78c9-53e2-47a1-94f8-326266822dc2", "metadata": {}, "outputs": [], "source": [ "tables = pd.read_sql_query(\n", " \"\"\"SELECT name, sql\n", " FROM sqlite_master \n", " WHERE type='table';\"\"\",\n", " connection\n", ")\n", "tables.head(15)" ] }, { "cell_type": "markdown", "id": "b1fa453c-d19f-4c33-87b2-356b2d8c6e61", "metadata": {}, "source": [ "### **Retrieve a comple table**\n", "A complete table can be retrieved by using a simmilar method `pd.read_sql_table` that takes a table name and the connection." ] }, { "cell_type": "code", "execution_count": null, "id": "bebf2de6-81a4-4bd6-bd2a-1038b4f731d0", "metadata": {}, "outputs": [], "source": [ "auto_detail = pd.read_sql_query(\"SELECT * FROM tweet_auto_detail;\", connection)\n", "\n", "auto_detail.loc[ auto_detail.has_media == 0, \"has_media_label\"] = \"Media\"\n", "auto_detail.loc[ auto_detail.has_media == 1, \"has_media_label\"] = \"No Media\"\n", "\n", "auto_detail.head(10)" ] }, { "cell_type": "markdown", "id": "6449b3a5-6c9e-49ad-88ee-a1aef91d65e3", "metadata": {}, "source": [ "# Describing the data\n", "\n", "## Date range\n", "\n", "The `DatePublished` column holds timestamps of when the data was published. This format does not make it easy to know the actual dates." ] }, { "cell_type": "code", "execution_count": null, "id": "6327703c-87bf-4c48-9973-dc4192ad1677", "metadata": {}, "outputs": [], "source": [ "auto_detail[\"datePublished\"].describe()" ] }, { "cell_type": "markdown", "id": "dc07589b-116b-4c74-981d-e9f9d9908774", "metadata": {}, "source": [ "### Transform timestamps to dates \n", "\n", "A transformation is required to change this `float64` values to date and time. The function `pd.Timestamp` accepts float value timestamps and transforms them into readable dates. The method `apply` allows performing a transformation to values in a `dataframe`, this transformations most accept a single input and return a single output.\n", "\n", "The function `timestamp2DateTimeBySegment` will be our transformation in this occasion." ] }, { "cell_type": "code", "execution_count": null, "id": "763b7643-2623-418a-b604-62010950252e", "metadata": {}, "outputs": [], "source": [ "def timestamp2DateTimeBySegment(ts: float):\n", " \"\"\"Transformation from float to DateTime by segment_size\"\"\"\n", " segment_size = 3600.0*24\n", " minus4_TZ = 3600.0*-4.0\n", " return pd.Timestamp(int((ts)/segment_size)*segment_size, unit='s', tz='UTC')\n", "\n", "# auto_detail[\"datePublished_DT\"] = auto_detail[\"datePublished\"].apply(lambda x: pd.Timestamp(int((x+minus4_TZ)/seconds_in_hour)*seconds_in_hour, unit='s'))\n", "auto_detail[\"datePublished_DT\"] = auto_detail[\"datePublished\"].apply(timestamp2DateTimeBySegment)\n", "print(\"Original Float Timestamps:\")\n", "print(auto_detail[\"datePublished\"].describe(datetime_is_numeric=True))\n", "print(\"\\nTransformed into DateTime by the hour:\")\n", "print(auto_detail[\"datePublished_DT\"].describe(datetime_is_numeric=True))" ] }, { "cell_type": "markdown", "id": "61396712-4f90-40f7-b068-7b58bf08ca88", "metadata": {}, "source": [ "# Exercise 1\n", "\n", "What is the earliest date of a tweet captured in this data set?" ] }, { "cell_type": "code", "execution_count": null, "id": "1c4ca3aa-f5e7-4e18-b234-2323f320a4f8", "metadata": {}, "outputs": [], "source": [ "# Answer here with code\n", "\n", "#Example latest date would be:\n", "print(auto_detail[\"datePublished_DT\"].max())\n", "\n", "# Response:\n" ] }, { "cell_type": "markdown", "id": "99e3f29d-b6dc-470b-b00d-3e8392cc60c8", "metadata": {}, "source": [ "## Visualizing Time data\n", "\n", "**Histograms** are a great tool to visualize frequency over any one variable.\n", "\n", "The code bellow produces a histogram of the `datePublished_DT`." ] }, { "cell_type": "code", "execution_count": null, "id": "7d22e2d3-8c2f-4394-a5ee-bde5ac22ff36", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# Create a Mask to limit date range\n", "start_date = \"2019-07-11\"\n", "end_date = \"2019-08-01\"\n", "mask=(auto_detail[\"datePublished_DT\"] >= start_date) & (auto_detail[\"datePublished_DT\"] < end_date)\n", "\n", "# Take sample\n", "sample_dates = auto_detail[mask]\n", "\n", "#Generate histogram bin limits with numpy\n", "edges=np.histogram_bin_edges(sample_dates[\"datePublished\"])\n", "edges=pd.DataFrame(edges, columns=(\"datePublished\",))\n", "edges[\"datePublished_DT\"] = edges[\"datePublished\"].apply(timestamp2DateTimeBySegment)\n", "\n", "# Visualize a histogram\n", "sample_dates.hist(column=\"datePublished_DT\", xrot=25, bins=edges[\"datePublished_DT\"], figsize=(7,7), backend=\"matplotlib\")\n", "plt.title(\"Histogram: Date Published\")\n", "plt.ylabel(\"Frequency\")\n", "plt.xlabel(\"Date\")" ] }, { "cell_type": "markdown", "id": "2e8ee407-e87d-4034-bdc3-19ac7215d136", "metadata": {}, "source": [ "## Frequency by Language and Date\n", "\n", "It is posible to generate multiple histograms for different groups of the data. In particular we can see a time distribution of number of tweets over time for each language.\n", "\n", "In this occasion the `column` is set to `datePublished_DTdatePublished_DT` and we also use the `by` aparameter to set the grouping column." ] }, { "cell_type": "code", "execution_count": null, "id": "bab91164-2df7-4485-90da-94181c3fa873", "metadata": {}, "outputs": [], "source": [ "auto_detail.hist(column='datePublished_DT', by='language', bins=45, figsize=(8,9), sharex=True, sharey=True);" ] }, { "cell_type": "markdown", "id": "bf38138e-7d61-4680-b371-458b7e6a8ff9", "metadata": {}, "source": [ "From the histograms above it is very clear that most activity was in Spanish followed by English. It is one of the reasons the team integrated the work with the Google Translation API for the visualizations." ] }, { "cell_type": "markdown", "id": "ed2616d1-9b5e-49b5-a2fe-f1d581f48c2f", "metadata": {}, "source": [ "## Generate a table of frequencies\n", "\n", "The column language is currently of type object as this values are separate strings. Pandas offers a different data type that reduces memory usage called `Category`. `Categories` are particularly useful when a few values will be repeated many times." ] }, { "cell_type": "code", "execution_count": null, "id": "eb89c89c-8681-4348-a278-452bb37d9d46", "metadata": {}, "outputs": [], "source": [ "auto_detail.dtypes" ] }, { "cell_type": "code", "execution_count": null, "id": "d2fb9b69-0e80-4010-b998-7220d4ddd0b5", "metadata": {}, "outputs": [], "source": [ "auto_detail['language']=auto_detail['language'].astype('category')\n", "print(auto_detail['language'].describe())" ] }, { "cell_type": "markdown", "id": "12217df0-02ec-419b-b523-93c55d8fe108", "metadata": {}, "source": [ "We can verify that the datatype has changed from `object` to `category`." ] }, { "cell_type": "code", "execution_count": null, "id": "d142acba-2bfe-462a-b129-5379ddfc75e4", "metadata": {}, "outputs": [], "source": [ "auto_detail[['language']].dtypes" ] }, { "cell_type": "markdown", "id": "aedd3d94-97e6-4f21-b1bd-11f2610ea511", "metadata": {}, "source": [ "### Group By & Count\n", "To get the **totals** per language we can use a group by statement with a count operation. as shown bellow." ] }, { "cell_type": "code", "execution_count": null, "id": "5759d527-60a9-4301-b18b-a16305109631", "metadata": {}, "outputs": [], "source": [ "count_by_lang = auto_detail[[\"language\", \"tweet_id\"]].groupby(\n", " [\"language\"]\n", ").count()\n", "count_by_lang=count_by_lang.rename(\n", " columns = {\"tweet_id\":\"Total\"}, inplace = False)\n", "count_by_lang.transpose()" ] }, { "cell_type": "markdown", "id": "93f88347-fd7d-4bf9-b77b-15e6cf09b576", "metadata": {}, "source": [ "Most user activity appears to be in Spanish followed by English." ] }, { "cell_type": "markdown", "id": "c0b8cf79-16d3-447e-a6df-ad7665fcefaf", "metadata": {}, "source": [ "# Exercise 2\n", "\n", "A. How many tweets have multimedia?\n", "A. Transform the `has_media` column into a categorical column to enhance efficiency:" ] }, { "cell_type": "code", "execution_count": null, "id": "d4326e4e-f42b-4355-8e9f-e99d62f3402e", "metadata": {}, "outputs": [], "source": [ "# Using Group By & Count\n", "# Display how many tweets have media and how many don't\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "39395e98-1123-4f8d-98c4-86c0441413c7", "metadata": {}, "outputs": [], "source": [ "# Transform the column here\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "# Print dtype of the columns to display the updated format\n", "auto_detail.dtypes" ] }, { "cell_type": "markdown", "id": "4f4704ea-93d2-4a49-b8ef-a50049e3bbc7", "metadata": {}, "source": [ "# Ranking Tweets with Multimedia by User Interaction \n", "\n", "Tweets can be ranked by two readily available metrics `Retweet Count` and `Favorite Count`. Other metrics such as `Comment Count` and `Quote Count` require access to API other than the standard API. However they could be used in a similar way to the ones shown bellow.\n", "\n", "We are particularly intrested in discovering relevant multimedia shared through social media. This will require filtering our dataset to only the tweets with multimedia. \n", "\n", "The easiest way to filter a dataframe on a categorical column would be to use the [`.loc` method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html). Using this method it is possible to select only the rows that comply with a specific condition.\n", "\n", "The `sort_values` method allows arranging the dataframe in descending or ascending order, read more on the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html).\n", "\n", "Using this `pandas` methods allows us to quickly identify popular tweets with multimedia." ] }, { "cell_type": "code", "execution_count": null, "id": "0a3dc032-f5c0-4b17-ad1b-c1a372ffee36", "metadata": {}, "outputs": [], "source": [ "# Get Top 20 Tweets with Media by Retweet Count\n", "from tweet_requester.display import prettyPrintDataFrame\n", "\n", "\n", "# page skips to the next N\n", "page=1\n", "N = 20\n", "\n", "top_N_retweet = auto_detail.loc[auto_detail[\"has_media\"]==1].sort_values(\"retweetCount\", ascending=False).head(N*page).tail(N)[[\"retweetCount\", \"tweet_id\", \"url\"]]\n", "prettyPrintDataFrame(top_N_retweet, max_column=60)" ] }, { "cell_type": "code", "execution_count": null, "id": "4a0a6250-20a2-4acd-bb58-d7259e7f964d", "metadata": {}, "outputs": [], "source": [ "# Get Top 20 Tweets with Media by Favorite(❤️) Count\n", "\n", "# page skips to the next N\n", "page=1\n", "N = 20\n", "\n", "top_N_favorite = auto_detail.loc[auto_detail[\"has_media\"]==1].sort_values(\"favoriteCount\", ascending=False).head(N*page).tail(N)[[\"favoriteCount\", \"tweet_id\", \"url\"]]\n", "prettyPrintDataFrame(top_N_favorite, max_column=60)" ] }, { "cell_type": "code", "execution_count": null, "id": "f52c7dfc-dae7-40c4-b69e-faf73817a9ef", "metadata": {}, "outputs": [], "source": [ "# Graph Date vs\n", "# auto_detail.datePublished.apply(lambda dt: int(str(dt.year)+ str(dt.month) + str(dt.day) +str(dt.hour) + str(dt.minute)))\n", "# auto_detail.datePublished.apply(lambda dt: \"{:02}{:02}{:02}\".format(dt.day, dt.hour, dt.minute))" ] }, { "cell_type": "markdown", "id": "f5372902-3a97-4654-8b5f-686c4a00dc3f", "metadata": {}, "source": [ "## Multi Variable Relations\n", "\n", "It is possible to create multivariate relationship visualizations both in 2D and 3D format using the `matplolib` library.\n", "\n", "### 3D Scatter Plot\n", "\n", "This plot displays both the correlation between retweets and favorites, but also displays how most of the shares were concentrated in time.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "a477aabe-5aff-42be-8cf2-2295b9dc6c9d", "metadata": {}, "outputs": [], "source": [ "import matplotlib.dates as mdates\n", "\n", "fig4 = plt.figure(figsize=(8,6))\n", "ax = Axes3D(fig4, auto_add_to_figure=False)\n", "fig4.add_axes(ax)\n", "\n", "# get colormap from seaborn\n", "cmap = ListedColormap(sns.color_palette(\"husl\", 2).as_hex())\n", "dates = auto_detail[\"datePublished_DT\"]\n", "\n", "sc = ax.scatter(\n", " auto_detail[\"retweetCount\"], # X\n", " auto_detail[\"favoriteCount\"], # Y\n", " auto_detail[\"datePublished\"], # Z\n", " s=40,\n", " c=auto_detail[[\"has_media\"]],\n", " cmap=cmap,\n", " alpha=1,\n", " marker=\"o\"\n", ")\n", "ax.set_xlabel(\"Retweet Count\")\n", "ax.set_ylabel(\"Favorite Count\")\n", "ax.set_zlabel(\"Date Published\")\n", "# ax.zaxis.set_major_formatter(mdates.DayLocator(interval=1))\n", "\n", "# Legend\n", "plt.legend(*sc.legend_elements(), bbox_to_anchor=(1., 1), loc=2)\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "274d05b2-d05c-43e4-b2c7-3d3dfcbbe50d", "metadata": {}, "source": [ "### FacetGrid\n", "\n", "The `FacetGrid` is particularly useful to demonstrate different patterns for in different categories.\n", "\n", "The grid of graphs bellow displays how languages other than Spanish and English didn't show as much user activity in terms of retweets and favorites." ] }, { "cell_type": "code", "execution_count": null, "id": "9d8317ab-8b2b-4a87-87c9-c0852c28b480", "metadata": {}, "outputs": [], "source": [ "groups = sns.FacetGrid(auto_detail, col=\"has_media\", row=\"language\", hue=\"datePublished_DT\", legend_out=False)\n", "groups.map(sns.scatterplot, \"favoriteCount\", \"retweetCount\")" ] }, { "cell_type": "markdown", "id": "a5f3d402-bc51-4b1d-9289-f71e57b345c7", "metadata": {}, "source": [ "# Conclusion\n", "\n", "The notebooks of [Evaluating Content](./Evaluating_Content.ipynb) and this notebook, Media_Rating, offer an example of how multiple notebooks can be used in a Curatorial and Analytical environment. In particualr the notebooks demonstrate that custom web interfaces can be developed inside Jupyter Notebooks to process the data in different stages and generate consistent reports.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 5 }