---
addenda:
- '[code](https://github.com/alexklapheke/covid-sentiment)'
- |
    [slide
    deck](https://github.com/alexklapheke/covid-sentiment/blob/master/presentation/Covid%2019%20Sentiment%20Analysis.pdf)
date: 1592942292
title: 'COVID-19 sentiment analysis'
---
This project is joint work with Luken Weaver, Jon Godin, and Reza
Farrokhi.
::: {.epigraph}
The Master said, "When the multitude hate a man, it is necessary to
examine into the case. When the multitude like a man, it is necessary to
examine into the case."
:::
# Background & problem statement
The COVID-19 crisis officially began in the US on January 20, 2020, when
a traveler who had flown home to Washington State from Wuhan was
confirmed to be infected [@holshue2020first]. In the months since, in
the absence of a unified national policy, states have engaged in a
complicated fandango of business closures, mask orders, and quarantines
for visitors. While
it would take a much more sophisticated data analysis to gauge their
effectiveness, we can look at how these policies were popularly received
using social media sentiment as a proxy.
We chose to compare sentiment between two cities in which the saga took
quite different turns: New York, which suffered both an alarming
outbreak and an austere lockdown (ongoing at time of writing), and
Houston, which had a much easier time, with fewer infections than New
York City had deaths, and a lockdown lasting barely a month
(@fig:nychou).
![Case and death counts in New York City (all five boroughs) and in
Harris County, Texas (which contains Houston) as of June 22. Data
courtesy of the [*New York
Times*](https://github.com/nytimes/covid-19-data). [Source
code](https://gist.github.com/alexklapheke/5077fe330ca6a307b8700d0af4c4adf9).
Legend: cases, deaths, lockdown
(statewide).](images/5c1e70817d8798d15de76e48631cc077cf36194b.svg){#fig:nychou}
A naïve hypothesis would be that Houstonians, faring better overall,
would speak with more positive affect. Complicating this, of course, is
the profusion of topics on which people might converse, including:
China, Donald Trump, Anthony Fauci, the CDC and WHO, governors Andrew
Cuomo[^1] and Greg Abbott[^2], cruise ships, quarantines, face masks,
and the cancellations of events, not to mention the pathogen itself and
its physiological effects. Our goal, therefore, was to isolate topics of
conversation, and examine the sentiment surrounding particular public
figures.
# Data collection
Our preferred data source was Twitter, which, with a third of a billion
users, captures an enormous fraction of the public discourse. In
addition, there is an up-to-date, curated
[dataset](https://github.com/thepanacealab/covid19_twitter) of COVID-19
related tweets [@banda2020a]. However, time and budget allowed few
tweets to be captured---either on the order of 10^3^ total, or
stretching back only seven days. In addition, the paucity of
geotags---by [some
estimates](https://www.forbes.com/sites/kalevleetaru/2019/03/04/visualizing-seven-years-of-twitters-evolution-2012-2018/),
on the order of 1%---would make filtering by city nearly impossible.
Reddit turned out to be somewhat more hospitable, as
[pushshift](https://github.com/pushshift/api) provides free access to
the Reddit API, as well as a [full
dataset](http://files.pushshift.io/reddit/comments/) of Reddit comments
stretching back to the site's founding [@baumgartner2020pushshift],
although the latter source had not been updated recently enough for our
purpose. Reddit users subscribe to interest communities called
"subreddits", prefixed with "/r/", some of which are for cities and
other locales. While it can't be verified that a particular subscriber
is a resident, we assume that a large proportion of them are. One
downside of this structure is that subreddits can become [echo
chambers](https://en.wikipedia.org/wiki/Echo_chamber_(media)) which do
not represent the larger population.
Another difficulty is that although Reddit ranks in the [top 20
websites](https://www.alexa.com/siteinfo/reddit.com) by traffic, the
volume of posts pales next to Twitter's. Although each metropolitan
subreddit records hundreds of thousands of subscribers, by the [90--9--1
rule of
thumb](https://en.wikipedia.org/wiki/1%25_rule_(Internet_culture)), we
guessed that in a subreddit with on the order of 10^5^ subscribers, only
about 10^3^ would be active commenters, as shown in @tbl:commenters.
Subreddit    City pop.   Subscribers   Casual commenters   Active commenters
------------ ----------- ------------- ------------------- -------------------
/r/nyc       8,300,000   225,000       ≈20,000             ≈2,000
/r/houston   2,300,000   150,000       ≈13,500             ≈1,500

: City and subreddit populations and commenter estimates
{#tbl:commenters}
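The estimates in @tbl:commenters are plain arithmetic on the subscriber counts. A minimal sketch, assuming the rule's nominal shares (90% lurkers, 9% casual commenters, 1% active commenters) apply uniformly to each subreddit; the function name is ours, for illustration only:

```python
def estimate_commenters(subscribers):
    """Split a subscriber count according to the 90-9-1 rule of thumb."""
    return {
        "lurkers": subscribers * 90 // 100,
        "casual": subscribers * 9 // 100,
        "active": subscribers * 1 // 100,
    }


# Subscriber counts taken from the table above
for name, subscribers in [("/r/nyc", 225_000), ("/r/houston", 150_000)]:
    print(name, estimate_commenters(subscribers))
```

The table rounds the /r/nyc figures down to the nearest round number.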
We focused on comments, since most of the substantive (read:
sentiment-laden) discussion happens there; however, these are not
guaranteed to contain topic-relevant keywords, so we searched for posts
whose titles contained any of the keywords "COVID", "coronavirus",
"quarantine", or "pandemic". Reddit
[generates](https://github.com/reddit-archive/reddit/blob/master/r2/r2/lib/utils/_utils.pyx#L39-L40)
a base 36 ID for each post and comment on the site, so we collated post
IDs with comment IDs, as the simplified code snippet below illustrates.
``` {.python}
import requests
import pandas as pd

comments = []

for subreddit in ["nyc", "houston"]:
    # Get posts with the most comments
    res = requests.get(
        "https://api.pushshift.io/reddit/search/submission",
        params={
            "subreddit": subreddit,
            "size": 400,
            "sort_type": "num_comments",
            "sort": "desc",
        })

    # Parse JSON
    posts = pd.DataFrame(res.json()["data"])

    # Get comments from posts (the "t3_" prefix marks an ID as a post ID)
    res = requests.get(
        "https://api.pushshift.io/reddit/search/comment",
        params={
            "subreddit": subreddit,
            "link_id": ["t3_" + n for n in posts["id"]],
            "size": 1000,
        })

    # Collect parsed output
    comments.append(pd.DataFrame(res.json()["data"]))

# Combine both subreddits into a single data frame
comments = pd.concat(comments, ignore_index=True)
```
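Because the IDs are base 36 numbers, they convert to and from integers in a few lines. The sketch below is illustrative only: the helper names and the sample ID are made up, and the only detail taken from the snippet above is the `t3_` prefix attached to post IDs.

```python
import string

# Digits 0-9 followed by lowercase a-z: 36 symbols in total
ALPHABET = string.digits + string.ascii_lowercase


def to_base36(n):
    """Encode a non-negative integer as a base 36 string."""
    if n == 0:
        return "0"
    digits = []
    while n:
        n, remainder = divmod(n, 36)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def from_base36(s):
    """Decode a base 36 string; Python's int() handles this natively."""
    return int(s, 36)


post_id = "hb3x9k"                 # hypothetical post ID
number = from_base36(post_id)      # integer form of the ID
fullname = "t3_" + post_id         # prefixed form used when requesting comments
print(number, to_base36(number), fullname)
```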
In toto, we collected 150,000 comments from /r/nyc and 85,000 from
/r/houston. To keep the two samples the same size, and so avoid
comparing averages with unequal variance, we randomly sampled 85,000 of
the New York comments for the analysis.
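Downsampling the larger group to match the smaller one can be sketched as follows. This uses the standard library on toy lists rather than the project's data frames; the function name, seed, and sample sizes are assumptions for illustration.

```python
import random


def downsample(items, target_size, seed=0):
    """Draw target_size items without replacement, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(items, target_size)


# Toy stand-ins for the two comment collections (same 150:85 ratio)
nyc = [f"nyc comment {i}" for i in range(150)]
houston = [f"houston comment {i}" for i in range(85)]

nyc_sample = downsample(nyc, len(houston))
print(len(nyc_sample), len(houston))
```

With a pandas data frame, `DataFrame.sample(n=..., random_state=...)` does the same job.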
# Sentiment analysis
The sentiment analysis itself was done with the VADER sentiment parser
[@vader], a sophisticated rule-based parser that incorporates contextual
information such as negations ("not"), intensifiers ("very"), hedges
("somewhat"), and even emoticons (":)").[^3] For these reasons, it works
best on unprocessed text:
``` {.python}
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# Instantiate sentiment analyzer & compute sentiments
analyzer = SentimentIntensityAnalyzer()
sentiments = pd.DataFrame(map(analyzer.polarity_scores, comments["text"]))
# Append to original data frame
comments = comments.join(sentiments.set_index(comments.index))
```
VADER provides a "sentiment intensity" score (the `compound` value,
which ranges from −1 to +1) whose sign gives the polarity (positive
good, negative bad). We took the 5-day rolling mean of the sentiments of
all comments, filtering for various keywords.
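The exact windowing is not spelled out here, so the following is one plausible reading rather than the project's code: for each calendar day, pool every keyword-matching comment from that day and the four days before it, and average their scores. The case-insensitive substring match, the function name, and the sample data are all assumptions.

```python
from datetime import date, timedelta


def rolling_mean_by_day(comments, keyword, window_days=5):
    """Trailing-window mean sentiment of comments that mention keyword.

    comments is a list of (day, text, sentiment) tuples. Returns a dict
    mapping each day to the mean over the window ending on that day;
    days whose window holds no matching comment are omitted.
    """
    keyword = keyword.lower()
    matches = [(d, s) for d, text, s in comments if keyword in text.lower()]
    if not matches:
        return {}
    day = min(d for d, _ in matches)
    last = max(d for d, _ in matches)
    result = {}
    while day <= last:
        start = day - timedelta(days=window_days - 1)
        window = [s for d, s in matches if start <= d <= day]
        if window:
            result[day] = sum(window) / len(window)
        day += timedelta(days=1)
    return result


# Made-up comments: (date, text, sentiment score)
sample = [
    (date(2020, 4, 1), "Cuomo extended the order", 0.5),
    (date(2020, 4, 2), "Thanks, Cuomo", 0.25),
    (date(2020, 4, 3), "nice weather today", 0.75),
    (date(2020, 4, 7), "cuomo briefing again", -0.5),
]
trend = rolling_mean_by_day(sample, "Cuomo")
for day, value in trend.items():
    print(day, value)
```

Note how a single comment can carry the whole window when data are sparse, which is why thinly covered periods produce exaggerated swings.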
Two important caveats apply, the first of which is that knowing *what*
the sentiment is doesn't tell you *why*. For example, in late April,
/r/nyc responded negatively to the keyword "Fauci", but this turned out
to be displeasure not with Fauci himself, but rather with Trump, who had
considered firing him. Thus, sentiment *around* a public figure should
not be conflated with sentiment *toward* that figure. The second caveat
is that positive and negative sentiments can cancel; so although a
highly positive or negative overall sentiment score reflects unanimity
in these feelings, a neutral score does not represent apathy, but simply
lack of consensus.
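The second caveat is easy to see numerically. In this sketch, with made-up scores, a sharply divided group and an indifferent one have the same mean, and only the spread tells them apart. Reporting a spread alongside the mean is a suggestion here, not something done in this analysis.

```python
from statistics import mean, pstdev

# Made-up sentiment scores for two hypothetical groups of comments
divided = [0.8, -0.8, 0.75, -0.75]      # strong feelings on both sides
indifferent = [0.0, 0.05, -0.05, 0.0]   # nobody feels strongly

for label, scores in [("divided", divided), ("indifferent", indifferent)]:
    print(label, "mean:", mean(scores), "spread:", round(pstdev(scores), 3))
```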
One way consensus is achieved on Reddit is by voting: users vote for
("upvote") or against ("downvote") comments, and the comment's score is
the number of votes for minus the number against. If voting is a way of
agreeing with the sentiment of a comment, then it can be seen as tacitly
expressing the same sentiment. We accounted for this by eliminating
comments that scored less than one, and weighting the sentiment of the
remainder by their scores:
``` {.python}
import numpy as np

comments["weighted"] = (
    # Sentiment score
    comments["compound"] *
    # Reddit score, floored at zero so comments scoring below one drop out
    np.maximum(0, comments["score"]) *
    # Scaling factor (for convenience; does not change analysis)
    0.75)
```
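A worked example on made-up values shows the effect of the weighting as described in the text: comments scoring below one contribute nothing, and the rest are scaled by their score. The helper `weight` is hypothetical and operates on plain numbers rather than data-frame columns.

```python
def weight(compound, score, scale=0.75):
    """Scale a sentiment by its Reddit score; scores below one weigh zero."""
    return compound * max(0, score) * scale


# (sentiment, Reddit score) pairs, invented for illustration
examples = [
    (0.5, 10),    # positive, heavily upvoted: dominates the total
    (-0.5, 2),    # negative, mildly upvoted: keeps its sign
    (0.5, -4),    # positive but downvoted: dropped
    (-0.25, 0),   # score below one: dropped
]
weighted = [weight(compound, score) for compound, score in examples]
print(weighted)
```

The heavily upvoted comment outweighs the others combined, which is the intended behavior but also what makes the scheme sensitive to a few popular comments.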
One potential pitfall is that since the highest-scoring comments appear
highest on the page, this can be subject to the [bandwagon
effect](https://en.wikipedia.org/wiki/Bandwagon_effect), by which users
see, and then upvote, comments that are already popular.
We graphed the weighted sentiments of COVID-related posts in 2020
(@fig:2020). To establish a baseline for comparison (maybe sentiment is
more positive in haler times, or maybe New Yorkers really are [less
positive](https://www.youtube.com/watch?v=lX4MoaTFp9g) in general), we
also looked at comments from the same months in the previous year,
without filtering by keyword (@fig:2019).
![COVID-related sentiment across cities, 2020 (5-day rolling mean).
Legend: New York,
Houston.](images/2da9d373c128936e9fc288dbdf73f52c007e9989.svg){#fig:2020}
The overall results were unilluminating; the only particularly positive
or negative periods, in February 2020, are essentially artifacts due to
insufficient data (the virus had not yet seized the nation's attention).
![General sentiment across cities, 2019 (5-day rolling mean).
Legend: New York,
Houston.](images/9c8e4a73f52e3fcbe1adcc0bee7eaa6498ffac1a.svg){#fig:2019}
A clearer picture arose around public figures. After filtering by the
surname of each state's governor, we see a positive spike in New York in
the early days of shutdowns, whereas Houston takes a swift negative turn
when Abbott declares lockdown, and swiftly becomes positive after he
orders its end (@fig:gov).
![Sentiment around governors across cities, 2020 (5-day rolling mean).
Legend: New York (keyword "Cuomo"), Houston (keyword
"Abbott").](images/6544dc85384235aacfcc35bd24e6686b3ab8c9d6.svg){#fig:gov}
This seems to imply that New Yorkers, having been much harder hit, were
more sanguine about curtailments of public life than the relatively
freewheeling Texans.
A similarly revealing pattern is shown in response to the keyword
"Trump" (@fig:trump). While comments about the president tended negative
overall, they dipped steeply in late April, possibly in response to his
defunding the WHO.
![Sentiment around President Trump across cities, 2020 (5-day rolling
mean). Legend: New York (keyword "Trump"), Houston (keyword
"Trump").](images/66ef7a0fe8d3a4f8bc57cb95cb52fdefb573d8aa.svg){#fig:trump}
# Conclusions
We have shown that public attitudes as expressed on social media can be
tied to world events, and analyzed attitudes toward events during the
COVID-19 crisis. This analysis could be expanded with, e.g., a more
sophisticated weighting scheme, more precise filtering, or a larger
dataset.
[^1]: Dem., NY, in office since 2011
[^2]: Rep., TX, in office since 2015
[^3]: Some sample inputs and scores are shown below:

    Text            Score
    --------------- ---------
    great!          +0.6588
    great           +0.6249
    very good       +0.4927
    good            +0.4404
    not bad         +0.4310
    somewhat good   +0.3832
    okay            +0.2263
    meh             −0.0772
    not good        −0.3412
    bad             −0.5423