{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "source": [ "# Introduction" ], "metadata": { "id": "qC4Ah_PsaClJ" } }, { "cell_type": "markdown", "source": [ "In this notebook, we plot a choropleth map showing which regions of the world are most biased towards posting tweets with a negative sentiment.\n", "\n", "The dataset used for this project is publicly available on Kaggle. Here is the link: https://www.kaggle.com/datasets/vivekchary/sentiment-with-16-million-tweets-with-locations\n", "\n", "If you are running this in a Jupyter Notebook, make sure you download the CSV file and place it in your working directory before you run the cells." ], "metadata": { "id": "TT64pSOILbNw" } }, { "cell_type": "markdown", "source": [ "# Preparing The Data" ], "metadata": { "id": "_dngbG76Ztmc" } }, { "cell_type": "markdown", "source": [ "Import the necessary Python libraries" ], "metadata": { "id": "QigW4DzzGCCa" } }, { "cell_type": "code", "execution_count": 21, "metadata": { "id": "7qOV_QywFKJO" }, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import folium" ] }, { "cell_type": "markdown", "source": [ "Load the dataset into a pandas DataFrame" ], "metadata": { "id": "dxE8JWQmGIuL" } }, { "cell_type": "code", "source": [ "df = pd.read_csv('sentiment140_with_location.csv', header=None, names=['Sentiment Target', 'Tweet ID', 'Date', 'Query Flag', 'User', 'Text', 'Location'], encoding='latin1')\n" ], "metadata": { "id": "LEfK8g8OGL5n" }, "execution_count": 3, "outputs": [] }, { "cell_type": "markdown", "source": [ "Filter the DataFrame to only include tweets with negative sentiment" ], "metadata": { "id": "5TOxie2vGWpR" } }, { "cell_type": "code", "source": [ "# In Sentiment140, a target of 0 marks a negative tweet\n", "negative_tweets = df[df['Sentiment Target'] == 0]" ], "metadata": { "id": 
"THYBIxi9Ge93" }, "execution_count": 4, "outputs": [] }, { "cell_type": "markdown", "source": [ "Group the negative tweets by location to get a count of negative tweets for each location" ], "metadata": { "id": "VxS-4DThHSMB" } }, { "cell_type": "code", "source": [ "negative_tweets_count = negative_tweets.groupby(['Location']).size().reset_index(name='count')\n" ], "metadata": { "id": "oelIb-iVHXU9" }, "execution_count": 5, "outputs": [] }, { "cell_type": "markdown", "source": [ "# Plotting Our Data" ], "metadata": { "id": "scxF6ttlZjNR" } }, { "cell_type": "markdown", "source": [ "Next, Let's create a map for visualisation of the data we are trying to represent" ], "metadata": { "id": "k8oQYdXzR41w" } }, { "cell_type": "code", "source": [ "chloro_map = folium.Map()" ], "metadata": { "id": "piKE8PKXSJuJ" }, "execution_count": 6, "outputs": [] }, { "cell_type": "markdown", "source": [ "We need GeoJSON data of the countries in the world.\n", "\n", "**GeoJSON** data is a data format representing geographical features, such as country borders.\n", "\n", "For our case here, Folium already has this GeoJSON data that we could use." 
], "metadata": { "id": "t2_fMimdSOIC" } }, { "cell_type": "code", "source": [ "# Set up the URL of the world countries GeoJSON file\n", "\n", "url = 'https://raw.githubusercontent.com/python-visualization/folium/master/examples/data'\n", "country_shapes = f'{url}/world-countries.json'" ], "metadata": { "id": "76vqQXuLSDFb" }, "execution_count": 7, "outputs": [] }, { "cell_type": "markdown", "source": [ "Here, we populate our map with the necessary data" ], "metadata": { "id": "7fla20PTSiz2" } }, { "cell_type": "code", "source": [ "# Add the Choropleth layer to our base map\n", "folium.Choropleth(\n", "    # GeoJSON data describing the world's country borders\n", "    geo_data=country_shapes,\n", "    name='Negative Tweet Bias Choropleth',\n", "    data=negative_tweets_count,\n", "    # Two columns: the key (country name) and the numerical value to plot\n", "    columns=['Location', 'count'],\n", "    # GeoJSON property that is matched against the key column\n", "    key_on='feature.properties.name',\n", "    fill_color='PuRd',\n", "    nan_fill_color='white'\n", ").add_to(chloro_map)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "_vNjN85YSlvf", "outputId": "efef9f0f-ff6c-4dc8-de7c-eed4e2e56c6a" }, "execution_count": 8, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "" ] }, "metadata": {}, "execution_count": 8 } ] }, { "cell_type": "markdown", "source": [ "Display the map" ], "metadata": { "id": "26z6qaJUWTj9" } }, { "cell_type": "code", "source": [ "chloro_map" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 1000 }, "id": "fJ5LgRTYWVss", "outputId": "3e46cf91-894d-4bed-99c7-f9e323645d15" }, "execution_count": 27, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "" ], "text/html": [ "
" ] }, "metadata": {}, "execution_count": 27 } ] }, { "cell_type": "markdown", "source": [ "The areas in the map with a darkest shade of pink are shown to be the ones that produce the highest numbers of tweets with a percieved negative sentiment.\n", "\n", "Since this is a relativley small dataset compared to the number of tweets available, the data may be a little bit skewed." ], "metadata": { "id": "hDKYBWrkYyMf" } }, { "cell_type": "markdown", "source": [ "# Revision\n", "\n" ], "metadata": { "id": "olure7ydsmKz" } }, { "cell_type": "markdown", "source": [ "Looking at the absolute number of negative tweets can be misleading, since most tweets come from a handful of countries. \n", "\n", "A better way to determine countries with a higher bias towards posting negative tweets is to plot the fraction of negative tweets by country." ], "metadata": { "id": "NSf-UYVQtCuo" } }, { "cell_type": "markdown", "source": [ "To do this we are going to need to prepare our data a little bit differently:" ], "metadata": { "id": "hhwqXMo0trej" } }, { "cell_type": "markdown", "source": [ "First we are going to aggregate the tweets by country" ], "metadata": { "id": "49Kn9FL21VkE" } }, { "cell_type": "code", "source": [ "country_counts = df.groupby('Location').size().reset_index(name='Total Count')\n" ], "metadata": { "id": "nru8Bp2vtpE9" }, "execution_count": 9, "outputs": [] }, { "cell_type": "markdown", "source": [ "Next, we aggregate the negative sentiment tweets by country" ], "metadata": { "id": "Uw0451Gct0Vt" } }, { "cell_type": "code", "source": [ "neg_country_counts = negative_tweets.groupby('Location').size().reset_index(name='Negative Count')\n" ], "metadata": { "id": "FLaNzvT1uElQ" }, "execution_count": 10, "outputs": [] }, { "cell_type": "markdown", "source": [ "We can then merge the 2 dataframes by the respective country" ], "metadata": { "id": "zWhbiP1MuFhU" } }, { "cell_type": "code", "source": [ "country_counts = country_counts.merge(neg_country_counts, 
on='Location', how='left')\n" ], "metadata": { "id": "XNVbjI5_uJgF" }, "execution_count": 11, "outputs": [] }, { "cell_type": "markdown", "source": [ "We can now calculate the fraction of negative tweets for each country and store it in a new column of the merged DataFrame" ], "metadata": { "id": "58wuPj4xuLkN" } }, { "cell_type": "code", "source": [ "# Countries with no negative tweets get NaN from the left merge; treat them as 0\n", "country_counts['Negative Count'] = country_counts['Negative Count'].fillna(0)\n", "country_counts['Negative Fraction'] = country_counts['Negative Count'] / country_counts['Total Count']\n" ], "metadata": { "id": "SsP-6JbFuaJS" }, "execution_count": 12, "outputs": [] }, { "cell_type": "code", "source": [ "# Display the columns of the merged DataFrame\n", "country_counts.columns" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "JyyQaB6gxL46", "outputId": "cd0e68cb-0aab-4190-b1df-278d5f6555e4" }, "execution_count": 22, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "Index(['Location', 'Total Count', 'Negative Count', 'Negative Fraction'], dtype='object')" ] }, "metadata": {}, "execution_count": 22 } ] }, { "cell_type": "markdown", "source": [ "Next, we create a new base map and add a choropleth layer that shades each country by its fraction of negative tweets" ], "metadata": { "id": "iHnhfRuqu9HW" } }, { "cell_type": "code", "source": [ "chloro_map_fractions = folium.Map()" ], "metadata": { "id": "hIu0hmrWxpbg" }, "execution_count": 17, "outputs": [] }, { "cell_type": "code", "source": [ "# Add the Choropleth layer to our base map\n", "folium.Choropleth(\n", "    # GeoJSON data describing the world's country borders\n", "    geo_data=country_shapes,\n", "    name='Negative Tweet Bias Choropleth',\n", "    data=country_counts,\n", "    # Two columns: the key (country name) and the numerical value to plot\n", "    columns=['Location', 'Negative Fraction'],\n", "    # GeoJSON property that is matched against the key column\n", "    key_on='feature.properties.name',\n", "    fill_color='PuRd',\n", "    nan_fill_color='white'\n", ").add_to(chloro_map_fractions)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, 
"id": "-H8-DjGaugEq", "outputId": "99da936c-046f-4f11-ca72-50109a1ac54a" }, "execution_count": 25, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "" ] }, "metadata": {}, "execution_count": 25 } ] }, { "cell_type": "markdown", "source": [ "Display the map" ], "metadata": { "id": "IYerVAqkzIB1" } }, { "cell_type": "code", "source": [ "# display the chloropleth map\n", "chloro_map_fractions" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 1000 }, "id": "CI-DdOwAxZbY", "outputId": "1501808c-83b4-4bea-a6e2-8ec399e63bb4" }, "execution_count": 26, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "" ], "text/html": [ "
" ] }, "metadata": {}, "execution_count": 26 } ] }, { "cell_type": "markdown", "source": [ "# Interpretation" ], "metadata": { "id": "nifN9OGO05YG" } }, { "cell_type": "markdown", "source": [ "The data here is not perfect and only comes from a relatively small dataset. This could explain why some countries you'd expect to show higher bias appear to have a lower proportion. " ], "metadata": { "id": "EvodDXq8zPNM" } }, { "cell_type": "markdown", "source": [ "Another issue with fractions is that the denominator should be reasonably big for the fraction to make sense.\n", "For example: if a country has only 5 tweets in the dataset and one of them is marked as negative, it's fraction for bias comes to 0.2 which when plotted as shown above, would present the same result as that of a country that has 1,000,000 tweets and 200,000 of them are marked negative." ], "metadata": { "id": "6c_zJfK1zkqg" } }, { "cell_type": "markdown", "source": [ "A proposed solution for this is to skip a country if the total number of tweets from the country is below some arbitrary threshold.\n", "\n", "It is up to you to determine this threshold depending on your goals with your specific project." ], "metadata": { "id": "8rWztGgZ0ifp" } }, { "cell_type": "markdown", "source": [ "# Potential Next steps" ], "metadata": { "id": "NvbgoZwtZ4X0" } }, { "cell_type": "markdown", "source": [ "The chloropleth map above is generated using already existing data. The sentiment of each tweet in the original dataset, [Sentiment 140](https://www.kaggle.com/datasets/kazanova/sentiment140), was generated by running the dataset through a pretrained model that could detect if a tweet was positive, neutral or negative" ], "metadata": { "id": "dKw1VkghJ3iz" } }, { "cell_type": "markdown", "source": [ "The next steps for this experiment would involve:\n", "\n", "\n", "1. Using this data to finetune a model like BERT with the data from either the original dataset or the dataset used here\n", "2. 
Generate a new dataset of tweets and their geolocations, for example through the [Twitter API](https://developer.twitter.com/en/docs/twitter-api). Alternatively, you can find an existing dataset that has more recent data and use that instead\n", "3. Run the new dataset through the fine-tuned model from Step 1 and obtain the sentiment target using a labelling scheme similar to the one used in the Sentiment 140 dataset\n", "4. Plot the map for the new data and compare it with the previous maps to see how regional bias toward negative content changes over time" ], "metadata": { "id": "-tqBFSrxNOIR" } }, { "cell_type": "markdown", "source": [ "# Acknowledgements\n", "\n", "Dr. Kishore Papinei: for the recommendations that led to the Revision section, which displays the proportion of negative tweets by country" ], "metadata": { "id": "5tdP3_oDsySf" } } ] }