# Introduction

In this notebook, we plot a choropleth map showing which regions of the world are most biased towards posting tweets with a negative sentiment.

The dataset used for this project is publicly available on Kaggle. Here is the link: https://www.kaggle.com/datasets/vivekchary/sentiment-with-16-million-tweets-with-locations

If you are running this in a Jupyter Notebook, make sure you download the CSV file and have it in your environment before you run the snippets.

# Preparing The Data

Import the necessary Python libraries

In [21]:
import pandas as pd 
import matplotlib.pyplot as plt
import folium

Load the dataset into a pandas dataframe. The CSV has no header row, so we supply the column names ourselves

In [3]:
df = pd.read_csv('sentiment140_with_location.csv', header=None, names=['Sentiment Target', 'Tweet ID', 'Date', 'Query Flag', 'User', 'Text', 'Location'], encoding='latin1')


Filter the dataframe to only include tweets with negative sentiment

In [4]:
negative_tweets = df[df['Sentiment Target'] == 0]

Group the negative tweets by location to get a count of negative tweets for each location

In [5]:
negative_tweets_count = negative_tweets.groupby(['Location']).size().reset_index(name='count')


# Plotting Our Data

Next, let's create a base map on which to visualise the data

In [6]:
chloro_map = folium.Map()

We need GeoJSON data for the countries of the world.

**GeoJSON** is a data format for representing geographical features, such as country borders.
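As a rough illustration of the format, the sketch below builds a tiny GeoJSON `FeatureCollection` as a Python dictionary and reads back the `properties.name` field of each feature. The country name and coordinates are made-up placeholders, not real borders.

```python
import json

# A hand-written, hypothetical FeatureCollection with one square "country".
# Real country files use the same overall layout with far more detailed geometry.
toy_geojson = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"name": "Exampleland"},
            "geometry": {
                "type": "Polygon",
                # One linear ring; the first and last points are identical.
                "coordinates": [
                    [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 0.0]]
                ],
            },
        }
    ],
}

# Round-trip through JSON text to show that GeoJSON is plain JSON.
parsed = json.loads(json.dumps(toy_geojson))

# Collect the name stored on each feature.
feature_names = [feature["properties"]["name"] for feature in parsed["features"]]
print(feature_names)
```

The `properties.name` field is the one that matters later in this notebook, since it is what the country names in our dataframe are matched against.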

In our case, the Folium repository already hosts GeoJSON data that we can use.

In [7]:
#Setting up the world countries data URL

url = 'https://raw.githubusercontent.com/python-visualization/folium/master/examples/data'
country_shapes = f'{url}/world-countries.json'

Here, we populate our map with the necessary data

In [8]:
# Add the choropleth layer to our base map
folium.Choropleth(
    # The GeoJSON data describing the world's country borders
    geo_data=country_shapes,
    name='Negative Tweet Bias Choropleth',
    data=negative_tweets_count,
    # Two columns: the key (country name) and the numerical value to plot
    columns=['Location', 'count'],
    # The GeoJSON property matched against the 'Location' column
    key_on='feature.properties.name',
    fill_color='PuRd',
    nan_fill_color='white'
).add_to(chloro_map)

<folium.features.Choropleth at 0x7f6a3e5c5960>
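The choropleth layer joins each row to a country shape by comparing the `Location` value with the feature's `properties.name`, so a location spelled differently from the GeoJSON name will not be coloured. The sketch below is one possible way to list such mismatches. The helper `unmatched_locations` and the toy data are hypothetical and only illustrate the idea.

```python
import pandas as pd


def unmatched_locations(counts, geojson, column="Location"):
    """Return the values in `column` that match no feature name in `geojson`."""
    geo_names = {feature["properties"]["name"] for feature in geojson["features"]}
    return sorted(set(counts[column].dropna()) - geo_names)


# Toy stand-ins for the real GeoJSON and for negative_tweets_count.
toy_geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"name": "United States of America"}, "geometry": None},
        {"type": "Feature", "properties": {"name": "Kenya"}, "geometry": None},
    ],
}
toy_counts = pd.DataFrame(
    {"Location": ["Kenya", "USA", "Nairobi"], "count": [12, 340, 5]}
)

# "USA" and "Nairobi" do not appear as feature names, so they are reported.
missing = unmatched_locations(toy_counts, toy_geojson)
print(missing)
```

To run the same check on the real data, the GeoJSON file would first need to be loaded into a dictionary, for example with `json.load` on a downloaded copy.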

Display the map

In [27]:
chloro_map

The areas of the map with the darkest shade of pink are the ones that produce the highest number of tweets with a perceived negative sentiment.

Since this is a relatively small dataset compared to the total number of tweets in existence, the data may be somewhat skewed.

# Revision



Looking at the absolute number of negative tweets can be misleading, since most tweets come from a handful of countries. 

A better way to determine countries with a higher bias towards posting negative tweets is to plot the fraction of negative tweets by country.

To do this we are going to need to prepare our data a little bit differently:

First we are going to aggregate the tweets by country

In [9]:
country_counts = df.groupby('Location').size().reset_index(name='Total Count')


Next, we aggregate the negative sentiment tweets by country

In [10]:
neg_country_counts = negative_tweets.groupby('Location').size().reset_index(name='Negative Count')


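We can then merge the two dataframes on the country column. Countries with no negative tweets would otherwise get a missing negative count, so we set it to 0

In [11]:
# Left merge keeps every country; missing negative counts become 0
country_counts = country_counts.merge(neg_country_counts, on='Location', how='left').fillna({'Negative Count': 0})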


We can proceed to calculate the fraction of negative tweets by country and store it in a new column in our new dataframe

In [12]:
country_counts['Negative Fraction'] = country_counts['Negative Count'] / country_counts['Total Count']
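To make the arithmetic concrete, here is the same aggregate, merge and divide sequence on a tiny, made-up dataframe. The column names follow this notebook, while the rows are invented. Filling missing negative counts with 0 is an assumption about how countries without any negative tweets should be treated.

```python
import pandas as pd

# Invented rows: 0 marks a negative tweet, 4 a positive one.
toy = pd.DataFrame(
    {
        "Sentiment Target": [0, 4, 4, 4, 4, 0, 4, 4],
        "Location": ["Kenya", "Kenya", "Kenya", "Kenya", "India", "India", "Brazil", "Brazil"],
    }
)

# Total tweets per country.
total = toy.groupby("Location").size().reset_index(name="Total Count")

# Negative tweets per country.
negative = (
    toy[toy["Sentiment Target"] == 0]
    .groupby("Location")
    .size()
    .reset_index(name="Negative Count")
)

# A left merge keeps Brazil even though it has no negative tweets.
merged = total.merge(negative, on="Location", how="left")

# Assumption: a country absent from the negative counts has 0 negative tweets.
merged["Negative Count"] = merged["Negative Count"].fillna(0).astype(int)

merged["Negative Fraction"] = merged["Negative Count"] / merged["Total Count"]
print(merged)
```

Kenya ends up with 1 negative tweet out of 4 (0.25), India with 1 out of 2 (0.5), and Brazil with 0 out of 2 (0.0).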


In [22]:
# display our columns in our new dataframe
country_counts.columns

Index(['Location', 'Total Count', 'Negative Count', 'Negative Fraction'], dtype='object')

Next, we can add a choropleth layer to a new base map to highlight the countries with a higher fraction of negative tweets

In [17]:
chloro_map_fractions = folium.Map()

In [25]:
# Add the choropleth layer to our base map
folium.Choropleth(
    # The GeoJSON data describing the world's country borders
    geo_data=country_shapes,
    name='Negative Tweet Bias Choropleth',
    data=country_counts,
    # Two columns: the key (country name) and the numerical value to plot
    columns=['Location', 'Negative Fraction'],
    # The GeoJSON property matched against the 'Location' column
    key_on='feature.properties.name',
    fill_color='PuRd',
    nan_fill_color='white'
).add_to(chloro_map_fractions)

<folium.features.Choropleth at 0x7f6a3d161270>

Display the map

In [26]:
# display the choropleth map
chloro_map_fractions

# Interpretation

The data here is not perfect and only comes from a relatively small dataset. This could explain why some countries you'd expect to show higher bias appear to have a lower proportion. 

Another issue with fractions is that the denominator should be reasonably large for the fraction to be meaningful.
For example, if a country has only 5 tweets in the dataset and one of them is marked as negative, its negative fraction is 0.2. When plotted as shown above, that country looks identical to one with 1,000,000 tweets of which 200,000 are marked negative.

A proposed solution for this is to skip a country if the total number of tweets from the country is below some arbitrary threshold.

It is up to you to determine this threshold depending on your goals with your specific project.
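A minimal sketch of that filter is shown below. The threshold `MIN_TWEETS = 100` and the toy rows are arbitrary placeholders chosen for illustration, not recommended values.

```python
import pandas as pd

# Hypothetical cut-off; choose a value that suits your own project.
MIN_TWEETS = 100

# Invented rows with the same columns as country_counts.
toy_counts = pd.DataFrame(
    {
        "Location": ["Smallland", "Bigland", "Midland"],
        "Total Count": [5, 1_000_000, 250],
        "Negative Count": [1, 200_000, 100],
    }
)
toy_counts["Negative Fraction"] = toy_counts["Negative Count"] / toy_counts["Total Count"]

# Keep only countries with enough tweets for the fraction to be meaningful.
reliable = toy_counts[toy_counts["Total Count"] >= MIN_TWEETS].reset_index(drop=True)
print(reliable)
```

Passing the filtered dataframe as `data` to `folium.Choropleth` should leave the dropped countries in the `nan_fill_color`, so they appear as having no data rather than a misleading fraction.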

# Potential Next Steps

The choropleth map above is generated using already existing data. The sentiment labels in the original dataset, [Sentiment 140](https://www.kaggle.com/datasets/kazanova/sentiment140), were assigned automatically rather than by hand: tweets were labelled positive or negative based on the emoticons they contained.

The next steps for this experiment would involve:


1. Fine-tune a model like BERT on either the original dataset or the dataset used here.
2. Generate a new dataset of tweets and their geolocation by collecting them from Twitter. Check out how to use the [Twitter API](https://developer.twitter.com/en/docs/twitter-api). Alternatively, you can find an existing dataset that has more recent data and use that instead.
3. Run the new dataset through the fine-tuned model from Step 1 and obtain the sentiment target, using a labelling scheme similar to the one used in the Sentiment 140 dataset.
4. Plot the map for the new data and compare it to the previous maps to see how the regions biased towards negative content change over time.
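For the comparison in Step 4, one possible approach is to compute the negative fraction per country for each dataset and subtract one from the other. The sketch below assumes two dataframes shaped like `country_counts`. Both are filled with invented numbers here, and the suffixes and the `Change` column are illustrative names.

```python
import pandas as pd

# Invented negative fractions for an older and a newer dataset.
old = pd.DataFrame(
    {"Location": ["Kenya", "India", "Brazil"], "Negative Fraction": [0.25, 0.5, 0.375]}
)
new = pd.DataFrame(
    {"Location": ["Kenya", "India", "Peru"], "Negative Fraction": [0.5, 0.25, 0.125]}
)

# Keep only countries present in both datasets so the difference is defined.
comparison = old.merge(new, on="Location", how="inner", suffixes=(" (old)", " (new)"))

# A positive change means a larger share of negative tweets in the newer data.
comparison["Change"] = (
    comparison["Negative Fraction (new)"] - comparison["Negative Fraction (old)"]
)
print(comparison)
```

The `Change` column could then be plotted as its own choropleth layer, ideally with a diverging colour scale so that increases and decreases are easy to tell apart.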

# Acknowledgements

Dr. Kishore Papinei: for the recommendations that led to the revised section, which displays the proportion of negative tweets by country.