{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Estimation of COVID-19 pandemic\n", "\n", "## Loading data\n", "\n", "We will use data on COVID-19 infected individuals, provided by the [Center for Systems Science and Engineering](https://systems.jhu.edu/) (CSSE) at [Johns Hopkins University](https://jhu.edu/). The dataset is available in [this GitHub Repository](https://github.com/CSSEGISandData/COVID-19)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pytest\n", "import ipytest\n", "import unittest\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "from pandas.testing import assert_frame_equal\n", "from pandas.testing import assert_series_equal\n", "\n", "ipytest.autoconfig()\n", "plt.rcParams[\"figure.figsize\"] = (10, 3) # make figures larger" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can load the most recent data directly from GitHub using `pd.read_csv`. 
If for some reason the data is not available online, you can use a local copy instead. The cell below loads the local copy by default; to load the latest data from GitHub, uncomment the first `base_url` line and comment out the second:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# base_url = \"https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/\" # loading from Internet\n", "base_url = \"../../assets/data/estimation-covid-19/\" # loading from disk\n", "infected_dataset_url = base_url + \"time_series_covid19_confirmed_global.csv\"\n", "recovered_dataset_url = base_url + \"time_series_covid19_recovered_global.csv\"\n", "deaths_dataset_url = base_url + \"time_series_covid19_deaths_global.csv\"\n", "countries_dataset_url = base_url + \"UID_ISO_FIPS_LookUp_Table.csv\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now load the data for infected individuals and see what the data looks like:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "infected = pd.read_csv(infected_dataset_url)\n", "infected.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each row of the table gives the number of infected individuals for one country and/or province, and the columns correspond to dates. Similar tables can be loaded for other data, such as the number of recovered individuals and the number of deaths." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "recovered = pd.read_csv(recovered_dataset_url)\n", "deaths = pd.read_csv(deaths_dataset_url)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Making sense of the data\n", "\n", "From the table above, the role of the `Province/State` column is not clear. 
Let's see the different values that are present in the `Province/State` column:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "infected[\"Province/State\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the names we can deduce that countries like Australia and China have a more detailed breakdown by province. Let's look at the data for China as an example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def column_filter(df, column_name, column_value):\n", "    \"\"\"\n", "    Filters a pandas DataFrame based on a column value.\n", "\n", "    Returns:\n", "        pandas.DataFrame: The filtered DataFrame.\n", "    \"\"\"\n", "    if df is None or not isinstance(df, pd.DataFrame) or df.empty:\n", "        raise Exception(\"df is not a valid DataFrame\")\n", "    if column_name not in df.columns:\n", "        raise Exception(f\"{column_name} does not exist in df\")\n", "    return df[df[_______] == ______]\n", "\n", "column_filter(infected, \"Country/Region\", \"China\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
column_name and column_value.\n",
"\n",
"pandas.DataFrame.groupby() and aggregation function sum().\n",
"\n",
"pandas.DataFrame.drop(columns=columns, inplace=True).\n",
"\n",
"pandas.DataFrame.loc[].\n",
"\n",
"pandas.DataFrame.diff().\n",
"\n",
"pandas.DataFrame.index.\n",
"\n",
"pandas.DataFrame.rolling(window).\n",
"\n",
"pandas.DataFrame.isna().\n",
"\n",
"filter_by_country_region() and use pandas.DataFrame.iloc[] to select the first series number.\n",
"\n",
"lambda and sum().\n",
"\n",
"pandas.DataFrame.replace().\n",
"\n",
"pandas.DataFrame.diff().\n",
"\n",
"pandas.DataFrame.index and mean().\n",
"\n",
"