{ "cells": [ { "cell_type": "markdown", "id": "3dbf4b0c-a305-48f3-b31b-34b71242dd61", "metadata": {}, "source": [ "# Johns Hopkins COVID-19 Dataset in Pandas\n", "\n", "COVID-19 is ravaging the globe. Let's look at the excellent Johns Hopkins\n", "dataset using Pandas. This will serve both as a guideline for getting the data\n", "and exploring on your own, as well as an example of Pandas multi-indexing in an\n", "easy to understand situation." ] }, { "cell_type": "code", "execution_count": null, "id": "42355661-3b32-4221-9fdc-3bfbe534cea2", "metadata": {}, "outputs": [], "source": [ "# Display final value in cell even if assigned\n", "%config InteractiveShell.ast_node_interactivity='last_expr_or_assign'" ] }, { "cell_type": "code", "execution_count": null, "id": "a5aa090f-3c57-412a-bac6-521477d8d85a", "metadata": {}, "outputs": [], "source": [ "from typing import Optional\n", "from urllib.error import HTTPError\n", "\n", "import matplotlib.dates as mdates\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "\n", "plt.style.use(\"ggplot\")" ] }, { "cell_type": "markdown", "id": "7063f140-fc5a-4366-b9bf-c8859a264840", "metadata": {}, "source": [ "Now that we've made some basic imports, let's write a function that can\n", "read in a datafile from GitHub:" ] }, { "cell_type": "code", "execution_count": null, "id": "620612f7-d807-4227-9e28-14b55e1fa1c8", "metadata": {}, "outputs": [], "source": [ "def get_day(day: pd.Timestamp) -> pd.DataFrame:\n", " # Read in a datafile from GitHub\n", " try:\n", " table = pd.read_csv(\n", " \"https://raw.githubusercontent.com/CSSEGISandData/COVID-19/\"\n", " \"master/csse_covid_19_data/csse_covid_19_daily_reports/\"\n", " f\"{day:%m-%d-%Y}.csv\",\n", " )\n", " except HTTPError:\n", " return pd.DataFrame()\n", "\n", " # Cleanup - sadly, the format has changed a bit over time - we can normalize that here\n", " table.columns = [\n", " f.replace(\"/\", \"_\")\n", " .replace(\" \", \"_\")\n", " .replace(\"Latitude\", \"Lat\")\n", " .replace(\"Longitude\", \"Long_\")\n", " for f in table.columns\n", " ]\n", "\n", " # This column is new in recent datasets\n", " if \"Admin2\" not in table.columns:\n", " table[\"Admin2\"] = None\n", "\n", " # New datasets have these, but they are not very useful for now\n", " table = table.drop(\n", " columns=[\"FIPS\", \"Combined_Key\", \"Lat\", \"Long_\"], errors=\"ignore\"\n", " )\n", "\n", " # If the last update time was useful, we would make this day only, rather than day + time\n", " # table[\"Last_Update\"] = pd.to_datetime(table[\"Last_Update\"]).dt.normalize()\n", " #\n", " # However, last update is odd, let's just make this the current day\n", " table[\"Last_Update\"] = day\n", "\n", " # Make sure indexes are not NaN, which causes later bits to not work. 0 isn't\n", " # perfect, but good enough.\n", " # Return as a multindex\n", " return table.fillna(0).set_index(\n", " [\"Last_Update\", \"Country_Region\", \"Province_State\", \"Admin2\"], drop=True\n", " )" ] }, { "cell_type": "markdown", "id": "faf76b61-89d5-4402-8fec-471fd681d3f7", "metadata": {}, "source": [ "Now let's loop over all days and build a multi-index DataFrame with the whole\n", "dataset. We'll be doing quite a bit of cleanup here as well. If you do this\n", "outside of a function, you should never modify an object in multiple cells;\n", "ideally you create an object like `df`, and make any modifications and\n", "replacements in the same cell. That way, running any cell again or running a\n", "cell multiple times will not cause unusual errors and problems to show up." ] }, { "cell_type": "code", "execution_count": null, "id": "c013d07c-6174-47e2-b428-5d1543a6ebcf", "metadata": {}, "outputs": [], "source": [ "def get_all_days(\n", " *, start_day: Optional[str] = None, end_day: Optional[str] = None\n", ") -> pd.DataFrame:\n", " # Make a list of all dates\n", " date_range = pd.date_range(\n", " start_day or \"2020-01-22\", end_day or pd.Timestamp.now().normalize()\n", " )\n", "\n", " # Create a generator that returns each day's dataframe\n", " day_gen = (get_day(day) for day in date_range)\n", "\n", " # Make a big dataframe, NaN is 0\n", " df = pd.concat(day_gen).fillna(0).astype(int)\n", "\n", " # Remove a few duplicate keys\n", " df = df.groupby(level=df.index.names).sum()\n", "\n", " # Sometimes active is not filled in; we can compute easily\n", " df[\"Active\"] = np.clip(df[\"Confirmed\"] - df[\"Deaths\"] - df[\"Recovered\"], 0, None)\n", "\n", " # Change in confirmed cases (placed in a pleasing location in the table)\n", " df.insert(\n", " 1,\n", " \"ΔConfirmed\",\n", " df.groupby(level=(\"Country_Region\", \"Province_State\", \"Admin2\"))[\"Confirmed\"]\n", " .diff()\n", " .fillna(0)\n", " .astype(int),\n", " )\n", "\n", " # Change in deaths\n", " df.insert(\n", " 3,\n", " \"ΔDeaths\",\n", " df.groupby(level=(\"Country_Region\", \"Province_State\", \"Admin2\"))[\"Deaths\"]\n", " .diff()\n", " .fillna(0)\n", " .astype(int),\n", " )\n", "\n", " return df" ] }, { "cell_type": "markdown", "id": "0d6ddef1-21fc-4e2b-b034-66524e08e13a", "metadata": {}, "source": [ "If this were a larger/real project, it would be time to bundle up the functions\n", "above and put them into a `.py` file - notebooks are for experimentation,\n", "teaching, and high level manipulation. Functions and classes should normally\n", "move to normal Python files when ready.\n", "\n", "Let's look at a few lines of this DataFrame to see what we have:" ] }, { "cell_type": "code", "execution_count": null, "id": "5443d282-f3d1-4499-ba4d-fcedb43193e1", "metadata": {}, "outputs": [], "source": [ "%%time\n", "df = get_all_days(end_day=\"2020-08-01\")" ] }, { "cell_type": "markdown", "id": "83367b8c-1b89-4f85-83f2-876f87a00050", "metadata": {}, "source": [ "The benefit of doing this all at once, in one DataFrame, should quickly become\n", "apparent. We can now use simple selection and grouping to \"ask\" almost anything\n", "about our dataset.\n", "\n", "As an example, let's look at just the US portion of the dataset. We'll use the\n", "pandas selection `.xs`:" ] }, { "cell_type": "code", "execution_count": null, "id": "888f5508-9784-44a5-8206-ad04844865fc", "metadata": {}, "outputs": [], "source": [ "us = df.xs(\"US\", level=\"Country_Region\")" ] }, { "cell_type": "markdown", "id": "c2814db8-b36b-4db7-b27c-c8f19f586446", "metadata": {}, "source": [ "Notice we have counties (early datasets just have one \"county\" called `\"0\"`). If\n", "we were only interested in states, we can group by the remaining levels and sum\n", "out the `\"Admin2\"` (county and similar) dimension:" ] }, { "cell_type": "code", "execution_count": null, "id": "107fd0a0-a8c8-4c76-b731-c56233b064a7", "metadata": {}, "outputs": [], "source": [ "by_state = us.groupby(level=(\"Last_Update\", \"Province_State\")).sum()" ] }, { "cell_type": "markdown", "id": "db492713-9b02-403c-8f60-59905b9f0e03", "metadata": {}, "source": [ "Using the same selector as before, we can pick out North Carolina:" ] }, { "cell_type": "code", "execution_count": null, "id": "e141948f-7336-4c9e-a61d-550be7a99a72", "metadata": {}, "outputs": [], "source": [ "by_state.xs(\"North Carolina\", level=\"Province_State\")" ] }, { "cell_type": "markdown", "id": "238742c2-5f62-4681-bfb2-37a70d2553d9", "metadata": {}, "source": [ "We can look at all of US, as well:" ] }, { "cell_type": "code", "execution_count": null, "id": "3189f9e1-0ae0-4968-822b-77fbfd71ac7d", "metadata": {}, "outputs": [], "source": [ "all_states = by_state.groupby(level=\"Last_Update\").sum()" ] }, { "cell_type": "markdown", "id": "68a20b3a-6acc-415f-8705-334de7b780db", "metadata": {}, "source": [ "#### US total cases\n", "\n", "Let's try a simple plot first; this is the one you see quite often." ] }, { "cell_type": "code", "execution_count": null, "id": "b9054b72-e49a-4d9d-96dc-1f543f064ca2", "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(10, 5))\n", "all_states.Confirmed.plot(logy=True, style=\"o\");" ] }, { "cell_type": "markdown", "id": "624936a4-e403-4d1f-bf40-d2c63be2cb71", "metadata": {}, "source": [ "#### Italy, new cases per day\n", "\n", "As another example, let's view the new cases per day for Italy. We will add a\n", "rolling mean, just to help guide the eye through the fluctuations - it is not a\n", "fit or anything fancy." ] }, { "cell_type": "code", "execution_count": null, "id": "319bb91e-a4ab-487d-8ccc-3839ae8cce42", "metadata": {}, "outputs": [], "source": [ "interesting = df.xs(\"Italy\", level=\"Country_Region\").groupby(level=\"Last_Update\").sum()\n", "\n", "plt.figure(figsize=(10, 5))\n", "interesting.ΔConfirmed.rolling(5, center=True).mean().plot(\n", " style=\"-\", label=\"Rolling mean\"\n", ")\n", "interesting.ΔConfirmed.plot(style=\"o\", label=\"Data\")\n", "plt.ylabel(\"New cases per day\")\n", "plt.legend();" ] }, { "cell_type": "markdown", "id": "bf7eab76-3caf-43ae-8620-a1d636ac0862", "metadata": {}, "source": [ "#### Italy, transmission rate\n", "\n", "It's more interesting to instead look at the transmission rate per day, which is\n", "new cases / active cases. The colors in the plot start changing when Italy\n", "implemented a lockdown on the 11th, and change over 14 days, which is roughly 1x\n", "the time to first symptoms. The lockdown make take longer than that to take full\n", "effect. There were several partial steps taken before the full lockdown on the\n", "4th and 9th. Notice the transmission is slowing noticeably!" ] }, { "cell_type": "code", "execution_count": null, "id": "70351a27-d680-491f-a24c-d3b0cfe7e395", "metadata": {}, "outputs": [], "source": [ "interesting = df.xs(\"Italy\", level=\"Country_Region\").groupby(level=\"Last_Update\").sum()\n", "growth = interesting.ΔConfirmed / interesting.Active\n", "growth = growth[\"2020-02-24\":]\n", "\n", "# Color based on lockdown (which happened in 3 stages, 4th, 9th, and 11th)\n", "lockdown = growth.index - pd.Timestamp(\"2020-03-11\")\n", "lockdown = np.clip(lockdown.days, 0, 14) / 14\n", "\n", "fix, ax = plt.subplots(figsize=(10, 5))\n", "ax.scatter(growth.index, growth, cmap=\"cool\", c=lockdown)\n", "\n", "ax.set_ylabel(\"new cases / active cases\")\n", "\n", "# set ticks every week\n", "ax.xaxis.set_major_locator(mdates.WeekdayLocator())\n", "# set major ticks format\n", "ax.xaxis.set_major_formatter(mdates.DateFormatter(\"%b %d\"));" ] }, { "cell_type": "markdown", "id": "6e9d292c-2e8b-4c96-b9ad-f1989bf92ea1", "metadata": {}, "source": [ "#### US, transmission rate\n", "\n", "Same plot for the US. The colors in the plot start changing when the US started\n", "the 15 plan to slow the spread, and change over 14 days, which is roughly 1x the\n", "time to first symptoms. Each state has implemented different guidelines, so the\n", "effect will be spread out even further. Again, we are see the effect of the\n", "lockdown!" ] }, { "cell_type": "code", "execution_count": null, "id": "1449072b-512b-426d-8bc1-40391a1905b9", "metadata": {}, "outputs": [], "source": [ "interesting = df.xs(\"US\", level=\"Country_Region\").groupby(level=\"Last_Update\").sum()\n", "growth = interesting.ΔConfirmed / interesting.Active\n", "growth = growth[\"2020-03-01\":]\n", "\n", "# Not really a full lockdown, just a distancing guideline + local lockdowns later\n", "lockdown = growth.index - pd.Timestamp(\"2020-03-15\")\n", "lockdown = np.clip(lockdown.days, 0, 14) / 14\n", "\n", "fix, ax = plt.subplots(figsize=(10, 5))\n", "ax.scatter(growth.index, growth, cmap=\"cool\", c=lockdown)\n", "\n", "ax.set_ylabel(\"new cases / active cases\")\n", "\n", "# set ticks every week\n", "ax.xaxis.set_major_locator(mdates.WeekdayLocator())\n", "# set major ticks format\n", "ax.xaxis.set_major_formatter(mdates.DateFormatter(\"%b %d\"));" ] }, { "cell_type": "code", "execution_count": null, "id": "ec516769-7f7f-4ad4-9162-d00570d0be93", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "7d281bb7-c46f-4222-a210-efd0deccaace", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "44c4bf65-5d1f-4274-afb5-a63ba877efcb", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:performance-minicourse] *", "language": "python", "name": "conda-env-performance-minicourse-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.9" } }, "nbformat": 4, "nbformat_minor": 5 }