{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "BaVNFpZJgECX"
},
"source": [
"# COVID-19 Data Collection and Analysis"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "dLS-vnAJkZRc",
"outputId": "59cc19f8-6b60-4e19-8d95-de0aff4428ba"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount(\"/content/drive\", force_remount=True).\n"
]
}
],
"source": [
"from google.colab import drive\n",
"drive.mount(\"/content/drive\")"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"id": "7PmTSUTFhqg6"
},
"outputs": [],
"source": [
"import re\n",
"import requests\n",
"import pandas as pd\n",
"import numpy as np\n",
"from time import sleep\n",
"from bs4 import BeautifulSoup\n",
"import pickle \n",
"from typing import Optional"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "cmZHCw15fh_O"
},
"source": [
"# 1. Data Collection\n",
"\n",
"We are interested in the degree to which the SARS-CoV-2 virus has affected United States citizens (SARS-CoV-2 is the virus that causes the COVID-19 disease). The Centers for Disease Control and Prevention (CDC) provides relevant data from USAFacts.org that includes the number of confirmed COVID-19 cases on a per-county basis. At the bottom of the web page (https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/), a blue table provides a list of states, each of which has its own web page displaying the reported numbers of cases and deaths.\n",
"\n",
"We automatically downloaded each state's data with Requests and then manipulated it with BeautifulSoup. Specifically, we first fetched the web page located at `base_url` and save the request's returned object (a respond object) to `home_page`. We then used the BeautifulSoup object to parse the home page as an HTML document in order to extract the link for every state. With these extracted URLs, we populated a `state_urls` dictionary by setting each key to be the state name and the value to be the full URL. To avoid download state web pages multiple times frequently, we iterated through the `state_urls`, make a web request for each URL, and save the contents out to a file on the hard drive."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"id": "28WF-Ye9kzsh"
},
"outputs": [],
"source": [
"# Every state's url begins with this prefix\n",
"base_url = 'https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/'\n",
"# Datasets will be saved to this directory\n",
"state_dir = \"./drive/MyDrive/state_data/\""
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "JdtBOR97ffGn",
"outputId": "47f3d8e9-c96d-4283-9bff-8dd62f295bee"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"200\n",
"
US COVID-19 cases and deaths by state | USAFacts Optional[int]:\n",
" try:\n",
" return population_dict.get(state).get(county)\n",
" except AttributeError:\n",
" print('incorrect state name!')\n",
" return None\n",
"\n",
"def load_covid_data(state_info):\n",
" covid_data = {}\n",
" for (state, state_path) in state_info:\n",
" covid_data[state] = []\n",
" with open(state_path, 'r') as f:\n",
" soup = BeautifulSoup(f.read(), 'html.parser')\n",
" counties = soup.find_all('a', href=re.compile('county/'))\n",
" for c in counties:\n",
" row = c.find_parent('tr')\n",
" cols = [col.text.replace(',','') for col in row.find_all('td')]\n",
"\n",
" county_name = c.text\n",
" pop = get_pop(state, county_name)\n",
" if ((get_pop(state, county_name)) is None) or (pop == 0):\n",
" continue\n",
" covid_data[state].append({'county_name': county_name,\n",
" 'population': pop,\n",
" '7_day_avg_cases': float(cols[0]),\n",
" '7_date_ave_deaths': float(cols[1]),\n",
" 'cases': int(cols[2]),\n",
" 'deaths': int(cols[3])})\n",
" return covid_data"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"id": "Ckoiwt0m2KKQ"
},
"outputs": [],
"source": [
"state_info = [(state, state_dir + state) for state in state_urls.keys()]\n",
"covid_data = load_covid_data(state_info)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "WSEPIePT6U62"
},
"source": [
"# 3. Exploratory Data Analysis (EDA) I \n",
"\n",
"We first observed the single-most extreme counties and states, then inspected all states, after having sorted the data by some features."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "d3KPJed-3XX4"
},
"source": [
"\n",
"We computed \n",
"1. The single county (and the state to which it belongs) that has the lowest rate of COVID cases per 100k people.\n",
"1. The single county (and the state to which it belongs) that has the highest rate of COVID cases per 100k people."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "8IOTzSlb3bw3",
"outputId": "87f9c267-73fd-4b7b-d979-e782a4b31653"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Hoonah-Angoon Census Area (Alaska) has the lowest COVID cases per 100k: 0.0\n",
"Loving County (Texas) has the highest COVID cases per 100k: 115976.33\n"
]
}
],
"source": [
"def calculate_county_stats(covid_data):\n",
" \n",
" min_county_count = 999999\n",
" min_county_name = \"\"\n",
" max_county_count = -1\n",
" max_county_name = \"\"\n",
" \n",
" # looks through every county in every state, while checking if we have a new low or high\n",
" for state in covid_data.keys():\n",
" for county in covid_data[state]:\n",
" pop = county['population']\n",
" if (pop is None) or (pop == 0):\n",
" continue\n",
" covid_rate = round(county['cases'] / (pop/100000),2)\n",
" if covid_rate < min_county_count:\n",
" min_county_count = covid_rate\n",
" min_county_name = county['county_name'] + \" (\" + state + \")\"\n",
" if covid_rate > max_county_count:\n",
" max_county_count = covid_rate\n",
" max_county_name = county['county_name'] + \" (\" + state + \")\"\n",
"\n",
" print(min_county_name + \" has the lowest COVID cases per 100k: \" + str(float(min_county_count)))\n",
" print(max_county_name + \" has the highest COVID cases per 100k: \" + str(float(max_county_count))) \n",
"\n",
"calculate_county_stats(covid_data)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "uMLxRaVT5WQf"
},
"source": [
"We calculated\n",
"1. The state that has the lowest number of deaths\n",
"1. The state that has the highest number of deaths\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "1PrUM8B05crZ",
"outputId": "0a8b05fd-cec5-46b4-c8f6-a7296efcfa1b"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Vermont has the fewest COVID deaths: 646\n",
"California has the most COVID deaths: 89931\n"
]
}
],
"source": [
"def calculate_state_deaths(covid_data):\n",
" \n",
" min_state_deaths = 999999\n",
" min_state_name = \"\"\n",
" max_state_deaths = -1\n",
" max_state_name = \"\"\n",
" for state in covid_data.keys():\n",
" cur_state_count = 0\n",
" for county in covid_data[state]:\n",
" cur_state_count += county['deaths']\n",
" \n",
" if cur_state_count < min_state_deaths:\n",
" min_state_deaths = cur_state_count\n",
" min_state_name = state\n",
" if cur_state_count > max_state_deaths:\n",
" max_state_deaths = cur_state_count\n",
" max_state_name = state\n",
"\n",
" print(min_state_name + \" has the fewest COVID deaths: \" + str(min_state_deaths))\n",
" print(max_state_name + \" has the most COVID deaths: \" + str(max_state_deaths)) \n",
"\n",
"calculate_state_deaths(covid_data)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Trsm-caH59xZ"
},
"source": [
"We calculated\n",
"1. The state that has the lowest rate of deaths based on its entire population\n",
"1. The state that has the highest rate of deaths based on its entire population\n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "7uxLpphX59jM",
"outputId": "7ce6b554-af5c-4794-8f2d-3e62fec13975"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Hawaii has the lowest COVID death rate; 1 out of every 996 people has died\n",
"Mississippi has the highest COVID death rate; 1 out of every 239 people has died\n"
]
}
],
"source": [
"def calculate_state_deathrate(covid_data):\n",
" \n",
" min_state_death_rate = -1\n",
" min_state_name = \"\"\n",
" max_state_death_rate = 9999999\n",
" max_state_name = \"\"\n",
" \n",
" for state in covid_data.keys():\n",
" cur_state_deaths = 0\n",
" cur_state_population = 0\n",
" for county in covid_data[state]:\n",
" pop = county['population']\n",
" if (county['cases'] > 0) and (pop is not None):\n",
" cur_state_population += pop\n",
" cur_state_deaths += county['deaths']\n",
" \n",
" cur_state_deathrate = float(cur_state_population) / cur_state_deaths\n",
" \n",
" if cur_state_deathrate > min_state_death_rate:\n",
" min_state_death_rate = cur_state_deathrate\n",
" min_state_name = state\n",
" if cur_state_deathrate < max_state_death_rate:\n",
" max_state_death_rate = cur_state_deathrate\n",
" max_state_name = state\n",
" \n",
" print(min_state_name + \" has the lowest COVID death rate; 1 out of every \" + str(round(min_state_death_rate)) + \" people has died\")\n",
" print(max_state_name + \" has the highest COVID death rate; 1 out of every \" + str(round(max_state_death_rate)) + \" people has died\")\n",
"\n",
"calculate_state_deathrate(covid_data)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "x7Qo8Ozv7eXz"
},
"source": [
"Complicated analysis requires a better data structure like pandas dataframe. We now convert the previous dictionary of lists of dictionaries to a pandas dataframe. Each row corresponds to a unique county. Five columns are county, state, # total covid cases (integer), # covid case per 100k (float), and # covid deaths (integer)."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 224
},
"id": "52E_9h4Z7SMv",
"outputId": "2df05e69-e74c-4351-9d50-a4110a5e31fd"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(3118, 5)\n"
]
},
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
county
\n",
"
state
\n",
"
# total covid cases
\n",
"
# covid cases per 100k
\n",
"
# covid deaths
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
Autauga County
\n",
"
Alabama
\n",
"
15863
\n",
"
28393.205534
\n",
"
216
\n",
"
\n",
"
\n",
"
1
\n",
"
Baldwin County
\n",
"
Alabama
\n",
"
55862
\n",
"
25023.965883
\n",
"
681
\n",
"
\n",
"
\n",
"
2
\n",
"
Barbour County
\n",
"
Alabama
\n",
"
5681
\n",
"
23013.043831
\n",
"
98
\n",
"
\n",
"
\n",
"
3
\n",
"
Bibb County
\n",
"
Alabama
\n",
"
6457
\n",
"
28833.616147
\n",
"
105
\n",
"
\n",
"
\n",
"
4
\n",
"
Blount County
\n",
"
Alabama
\n",
"
15005
\n",
"
25948.535261
\n",
"
243
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" county state # total covid cases # covid cases per 100k \\\n",
"0 Autauga County Alabama 15863 28393.205534 \n",
"1 Baldwin County Alabama 55862 25023.965883 \n",
"2 Barbour County Alabama 5681 23013.043831 \n",
"3 Bibb County Alabama 6457 28833.616147 \n",
"4 Blount County Alabama 15005 25948.535261 \n",
"\n",
" # covid deaths \n",
"0 216 \n",
"1 681 \n",
"2 98 \n",
"3 105 \n",
"4 243 "
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def convert_to_pandas(covid_data):\n",
" \n",
" covid_data_flipped = []\n",
" for state, counties in covid_data.items():\n",
" for county in counties: \n",
" pop = county['population']\n",
" if (pop is None) or (pop == 0):\n",
" continue\n",
" cases = county['cases']\n",
" cur_dict = {\"county\":county['county_name'], \"state\":state,\n",
" \"# total covid cases\": cases,\n",
" \"# covid cases per 100k\": cases/(pop/100000),\n",
" \"# covid deaths\": county['deaths']}\n",
" covid_data_flipped.append(cur_dict)\n",
" covid_df = pd.json_normalize(covid_data_flipped)\n",
" return covid_df\n",
"\n",
"covid_df = convert_to_pandas(covid_data)\n",
"print(covid_df.shape)\n",
"covid_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"covid_df.to_csv('./combined_data/covid_df.csv', index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "vMVBrhvS8qE0"
},
"source": [
"We can use this dataframe to compute same quantities as done above more easily\n",
"\n",
"1. the single county (and the state to which it belongs) that has the lowest rate of COVID cases per 100k people\n",
"1. the single county (and the state to which it belongs) that has the highest rate of COVID cases per 100k people\n",
"\n",
"\n",
"1. the state that has the lowest number of deaths\n",
"1. the state that has the highest number of deaths\n",
"\n",
"\n",
"1. The state that has the lowest rate of deaths based on its entire population\n",
"1. The state that has the highest rate of deaths based on its entire population\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "BEsrXQAU8pP-",
"outputId": "1e97dd6f-5195-43cd-8b01-a5f79ae017de"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Kalawao County (Hawaii) has the lowest rate of confirmed COVID cases per 100k: 0.00\n",
"Loving County (Texas) has the highest rate of confirmed COVID cases per 100k: 113,017.75\n"
]
}
],
"source": [
"def calculate_county_stats2(covid_df):\n",
"\n",
" sorted_df = covid_df.sort_values(by=['# covid cases per 100k'])\n",
" lowest = sorted_df.iloc[0]\n",
" highest = sorted_df.iloc[-1]\n",
"\n",
" print(f\"{lowest['county']} ({lowest['state']}) has the lowest rate of confirmed COVID cases per 100k: {lowest['# covid cases per 100k']:,.2f}\")\n",
" print(f\"{highest['county']} ({highest['state']}) has the highest rate of confirmed COVID cases per 100k: {highest['# covid cases per 100k']:,.2f}\")\n",
" \n",
"calculate_county_stats2(covid_df)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Nhh5OeuB9Fao",
"outputId": "3d19ce1c-7892-4c5a-c876-30386aaf95f4"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Vermont has the fewest COVID deaths: 640.0\n",
"California has the most COVID deaths: 89667.0\n"
]
}
],
"source": [
"def calculate_state_deaths2(covid_df):\n",
" \n",
" state_deaths = covid_df.groupby('state').sum().sort_values(by=['# covid deaths'])\n",
" lowest = state_deaths.iloc[0]\n",
" highest = state_deaths.iloc[-1]\n",
"\n",
" print(lowest.name + \" has the fewest COVID deaths: \" + str(lowest['# covid deaths']))\n",
" print(highest.name + \" has the most COVID deaths: \" + str(highest['# covid deaths']))\n",
"\n",
"calculate_state_deaths2(covid_df)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "PukG7pDl9Uw_",
"outputId": "1c0bcc88-abfa-41c2-9a5e-547fc53f9206"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Hawaii has the lowest COVID death rate; 1 out of every 995 people has died\n",
"Mississippi has the highest COVID death rate; 1 out of every 238 people has died\n"
]
}
],
"source": [
"def calculate_state_deathrate2(covid_df):\n",
" \n",
" covid_df2 = covid_df\n",
" covid_df2['population'] = 100000*covid_df2['# total covid cases'] / covid_df2['# covid cases per 100k']\n",
" covid_df2 = covid_df2.groupby('state').sum()\n",
" covid_df2['death_rate'] = covid_df2['population'] / covid_df2['# covid deaths']\n",
" covid_df2 = covid_df2.sort_values(by=['death_rate'])\n",
"\n",
" print(covid_df2.iloc[-1].name + \" has the lowest COVID death rate; 1 out of every \" + str(int(covid_df2.iloc[-1]['death_rate'])) + \" people has died\")\n",
" print(covid_df2.iloc[0].name + \" has the highest COVID death rate; 1 out of every \" + str(int(covid_df2.iloc[0]['death_rate'])) + \" people has died\")\n",
"\n",
"calculate_state_deathrate2(covid_df)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Y9KrpyIp8_Ze"
},
"source": [
"Furthermore, considering that the data is messy and some are not reliable, we attempted to understand some of the uncertainty around COVID data. We consider that false negatives of deaths of COVID-19 is minimal. Every disease has a mortality rate and we can consider it's constant throughout all people in the US. Although some are at highe risk (e.g. older folks, people with pre-existing conditions, etc), we can imagine that this variance in the population to be fairly uniform throughout the USA. Therefore, if all counties were equal in their testing, we are supposed to see a consistent ratio between # people who died from COVID and # of people who tested positive for COVID, which is called 'case fatality rate'."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 502
},
"id": "de9sEr-hBqN1",
"outputId": "0bca88da-5f15-4d33-aa49-e1016b452bd2"
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
county
\n",
"
state
\n",
"
# total covid cases
\n",
"
# covid cases per 100k
\n",
"
# covid deaths
\n",
"
population
\n",
"
# covid deaths per 100k
\n",
"
case_fatality_rate
\n",
"
\n",
" \n",
" \n",
"
\n",
"
1702
\n",
"
Loup County
\n",
"
Nebraska
\n",
"
87
\n",
"
13102.409639
\n",
"
0
\n",
"
663
\n",
"
0
\n",
"
0.000000
\n",
"
\n",
"
\n",
"
1696
\n",
"
Keya Paha County
\n",
"
Nebraska
\n",
"
118
\n",
"
14640.198511
\n",
"
0
\n",
"
805
\n",
"
0
\n",
"
0.000000
\n",
"
\n",
"
\n",
"
544
\n",
"
Kalawao County
\n",
"
Hawaii
\n",
"
0
\n",
"
0.000000
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0.000000
\n",
"
\n",
"
\n",
"
87
\n",
"
Skagway Municipality
\n",
"
Alaska
\n",
"
30
\n",
"
2535.925613
\n",
"
0
\n",
"
1182
\n",
"
0
\n",
"
0.000000
\n",
"
\n",
"
\n",
"
269
\n",
"
Jackson County
\n",
"
Colorado
\n",
"
166
\n",
"
11925.287356
\n",
"
0
\n",
"
1391
\n",
"
0
\n",
"
0.000000
\n",
"
\n",
"
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
\n",
"
\n",
"
2715
\n",
"
Sabine County
\n",
"
Texas
\n",
"
1254
\n",
"
11895.276039
\n",
"
89
\n",
"
10541
\n",
"
844
\n",
"
0.070973
\n",
"
\n",
"
\n",
"
427
\n",
"
Dodge County
\n",
"
Georgia
\n",
"
2110
\n",
"
10240.232953
\n",
"
154
\n",
"
20604
\n",
"
747
\n",
"
0.072986
\n",
"
\n",
"
\n",
"
1752
\n",
"
Storey County
\n",
"
Nevada
\n",
"
133
\n",
"
3225.806452
\n",
"
11
\n",
"
4122
\n",
"
266
\n",
"
0.082707
\n",
"
\n",
"
\n",
"
538
\n",
"
Wilcox County
\n",
"
Georgia
\n",
"
828
\n",
"
9588.882455
\n",
"
71
\n",
"
8634
\n",
"
822
\n",
"
0.085749
\n",
"
\n",
"
\n",
"
444
\n",
"
Glascock County
\n",
"
Georgia
\n",
"
269
\n",
"
9054.190508
\n",
"
25
\n",
"
2970
\n",
"
841
\n",
"
0.092937
\n",
"
\n",
" \n",
"
\n",
"
3118 rows × 8 columns
\n",
"
"
],
"text/plain": [
" county state # total covid cases \\\n",
"1702 Loup County Nebraska 87 \n",
"1696 Keya Paha County Nebraska 118 \n",
"544 Kalawao County Hawaii 0 \n",
"87 Skagway Municipality Alaska 30 \n",
"269 Jackson County Colorado 166 \n",
"... ... ... ... \n",
"2715 Sabine County Texas 1254 \n",
"427 Dodge County Georgia 2110 \n",
"1752 Storey County Nevada 133 \n",
"538 Wilcox County Georgia 828 \n",
"444 Glascock County Georgia 269 \n",
"\n",
" # covid cases per 100k # covid deaths population \\\n",
"1702 13102.409639 0 663 \n",
"1696 14640.198511 0 805 \n",
"544 0.000000 0 0 \n",
"87 2535.925613 0 1182 \n",
"269 11925.287356 0 1391 \n",
"... ... ... ... \n",
"2715 11895.276039 89 10541 \n",
"427 10240.232953 154 20604 \n",
"1752 3225.806452 11 4122 \n",
"538 9588.882455 71 8634 \n",
"444 9054.190508 25 2970 \n",
"\n",
" # covid deaths per 100k case_fatality_rate \n",
"1702 0 0.000000 \n",
"1696 0 0.000000 \n",
"544 0 0.000000 \n",
"87 0 0.000000 \n",
"269 0 0.000000 \n",
"... ... ... \n",
"2715 844 0.070973 \n",
"427 747 0.072986 \n",
"1752 266 0.082707 \n",
"538 822 0.085749 \n",
"444 841 0.092937 \n",
"\n",
"[3118 rows x 8 columns]"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def add_death_stats(covid_df):\n",
" \n",
" # can add an infintesimal or fillna after the fact to handle nans from divide by 0.\n",
" \n",
" covid_df['population'] = 100000*covid_df['# total covid cases'] / (covid_df['# covid cases per 100k']+0.0001)\n",
"# covid_df.fillna(0, inplace=True)\n",
" covid_df[\"population\"] = covid_df[\"population\"].astype('int32')\n",
" \n",
" covid_df['# covid deaths per 100k'] = 100000*covid_df['# covid deaths'] / (covid_df['population']+0.0001)\n",
"# covid_df.fillna(0, inplace=True)\n",
" covid_df[\"# covid deaths per 100k\"] = covid_df[\"# covid deaths per 100k\"].astype('int32')\n",
" \n",
" covid_df['case_fatality_rate'] = covid_df['# covid deaths'] / (covid_df['# total covid cases']+0.0001)\n",
"# covid_df.fillna(0, inplace=True)\n",
" covid_df = covid_df.sort_values(by=['case_fatality_rate'])\n",
"\n",
" return covid_df\n",
"\n",
"covid_updated = add_death_stats(covid_df)\n",
"covid_updated"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"covid_updated.to_csv('./combined_data/covid_updated.csv', index = False)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2gVw9sYECdR1"
},
"source": [
"From all those analyses above, we learned that states vary wildly in their death rate (e.g., The number of deaths in New Jersey or California is orders of magnitude higher than those in Hawaii or Alaska) and COVID testing. States also fluctuate a lot amongst their counties, as some counties with very bad statistics are within states with good statistics. \n",
"\n",
"When it comes to data reliability, some states and counties are probably more proactive when it comes to testing, so they could have higher cases counts. Other counties might have a similar number of cases or higher, but they are just not being represented in the data due to lower testing. Deaths are thus harder to overlook, so states with lax testing policies may have inflated deaths per case metrics. Perhaps we could supplement the data with some measure of testing rates in the county or state."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "RbR70HJmEm6a"
},
"source": [
"# 4. Incorporate More Data\n",
"\n",
"We are also interested in how COVID has impacted our world. We can better understand this by looking at how it relates to demographics, income, education, health, and politicala voting. \n",
"\n",
"Our `case_fatality_rate` column can be viewed as an approximation of how effective and thorough COVID testing is for a given county. Our `# covid deaths` column can be viewed as an extreme indication of how severe COVID has impacted a given county. Our `# covid cases per 100k` column be viewed as middle-ground between the two aforementioned features. That is, it measures the impact of the disease and is influenced by the thoroughness of COVID testing. \n",
"\n",
"Using these three informative features, we can inspect how impacted each county is, while correlating this with other features of each county, such as income-level, health metrics, demographics, etc. \n",
"\n",
"In this project, we merged our COVID case data with 'election2020_by_county.csv' dataset. We only care about 15 columns which are hispanic, minority, female, unemployed, income, nodegree, bachelor, inactivity, obesity, desity, cancer, voter_turnout, voter_gap, trump, biden. We droppde fipscode and population columns.\n",
"\n",
"A data description is as follows:\n",
"\n",
"- state: the state in which the county lies\n",
"- fipscode: an ID to identify each county\n",
"- county: the name of each county\n",
"- population: total population\n",
"- hispanic: percent of adults that are hispanic\n",
"- minority: percent of adults that are nonwhite\n",
"- female: percent of adults that are female\n",
"- unemployed: unemployment rate, as a percent\n",
"- income: median income\n",
"- nodegree: percent of adults who have not completed high school\n",
"- bachelor: percent of adults with a bachelor’s degree\n",
"- inactive: percent of adults who do not exercise in their leisure time\n",
"- obesity: percent of adults with BMI > 30\n",
"- density: population density, persons per square mile of land\n",
"- cancer: prevalence of cancer per 100,000 individuals\n",
"- voter_turnout: percentage of voting age population that voted\n",
"- voter_gap: percentage point gap in 2020 presidential voting: trump-briden\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 473
},
"id": "Za799tkjB4Hs",
"outputId": "3fba53e1-a3a9-4f28-a50d-f86f68e811ca"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(3044, 23)\n"
]
},
{
"data": {
"text/html": [
"
\n",
" "
],
"text/plain": [
" trump income # total covid cases # covid cases per 100k \\\n",
"count 1005.000000 1005.000000 1005.000000 1005.000000 \n",
"mean 80.786965 43878.374129 5059.204975 24627.390904 \n",
"std 4.289503 9102.326701 6019.201974 6383.615594 \n",
"min 74.500000 23047.000000 32.000000 4916.885879 \n",
"25% 77.500000 37417.000000 1364.000000 20633.467594 \n",
"50% 80.100000 42310.000000 3257.000000 24538.953707 \n",
"75% 83.600000 49184.000000 6730.000000 28499.655884 \n",
"max 96.200000 86354.000000 57505.000000 113017.751479 \n",
"\n",
" # covid deaths population # covid deaths per 100k \\\n",
"count 1005.000000 1005.000000 1005.000000 \n",
"mean 85.427861 20120.834826 422.151244 \n",
"std 103.655065 23113.800617 162.744774 \n",
"min 1.000000 168.000000 27.000000 \n",
"25% 22.000000 5925.000000 327.000000 \n",
"50% 54.000000 13287.000000 422.000000 \n",
"75% 113.000000 25924.000000 525.000000 \n",
"max 1435.000000 223233.000000 1212.000000 \n",
"\n",
" case_fatality_rate obesity inactivity \n",
"count 1005.000000 1005.000000 1005.000000 \n",
"mean 0.017918 31.537114 28.320597 \n",
"std 0.008465 3.592986 4.534053 \n",
"min 0.001300 19.400000 13.600000 \n",
"25% 0.012937 28.900000 25.000000 \n",
"50% 0.016549 31.500000 27.900000 \n",
"75% 0.021352 33.900000 31.900000 \n",
"max 0.092937 43.200000 41.400000 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"view_partitions(merged, 'trump',\n",
" cols=['trump','income','# total covid cases', '# covid cases per 100k', '# covid deaths',\n",
" 'population', '# covid deaths per 100k', 'case_fatality_rate',\n",
" 'obesity', 'inactivity',])"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8kcwLx7LKIhU"
},
"source": [
"We split the data into equal lower, middle, and upper quantiles based on first obeity and then inactivity. We can see that the the average death rates of counties in these partitions is positivly correlated with both of these features. This was expected as preexisting health conditions (obescity) and heath risks (inactivity) increase all cause mortality but also have a strong effect on how serious a covid infection can be. Finally we see that income has an even stronger relationship with the death rate, though here the correlation is a negative one. Obesity and inactivity are both negatively correlated with income as well. The relationship between voting for Trump and income is not a string one strong, though there is a positive correlation between Trump voting and obesity, inactivity, and covid death rate. "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "onZEIEqYKVAn"
},
"source": [
"# 6. Data Weakness\n",
"\n",
"We can tell from comparing the populations between the groups that this data is not treated granularly as would be ideal. Very small population counties that get weighted the same as very large population counties in regards to the mean. So rural areas get over represented in the averages nationwide. This also explains why the Trump vs Biden is so far skewed from the actual well known national average based on popular votes.\n",
"\n",
"Also, the difference between income correlates more with population density than it might with an individual socio economic status. First, a higher income might not go as far towards standard of living in the city as it does in rural areas. Second, by using the average income over the whole county, income inequality in that county is not factored in. There could be many low income individuals living with many high income individuals in the same county. "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "oWmVkVxeK3yd"
},
"source": [
"# 7. Future Work\n",
"\n",
"We have done data gathing, parsing, and exploring data. We can continue to predict some behavior of the data (e.g. how a particular county will respond to COVID on a weekly basis).\n",
"\n",
"Alternatively, we could be interested in inference, whereby we are more concerned with trying to understand why and how a system behaves the way it does. We might wish to understand which factors most correlate and cause a certain event to happen. This could give us insights into where certain inequalities persist."
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [],
"name": "project-notebook.ipynb",
"provenance": [],
"toc_visible": true
},
"interpreter": {
"hash": "6136f57e9522e1f7c9a1d00a6950a9c42424b667ef418888919cda0b3f236728"
},
"kernelspec": {
"display_name": "Python 3.8.8 ('base')",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
}
},
"nbformat": 4,
"nbformat_minor": 0
}