{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Comparing, Reconciling, and Combining COVID-19 Data Sources" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Summary\n", "\n", "In this note we use `gs-quant` to compare COVID-19 data sources. To do this, we first retrieve COVID-19-related time \n", "series data and preprocess it, joining different sources together to analyze confirmed cases. \n", "\n", "The contents of this notebook are as follows:\n", "\n", "- [1 - Getting Started](#1---Getting-Started)\n", "- [2 - COVID-19 Data](#2---COVID-19-Data)\n", "- [3 - Comparing Global Sources](#3---Comparing-Global-Sources)\n", "- [4 - Comparing US Sources](#4---Comparing-US-Sources)\n", "- [5 - Comparing subregions, combining with mobility data](#5---Comparing-subregions,-combining-with-mobility-data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1 - Getting Started" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Start every session with authenticating with your unique client id and secret. For information on\n", "how to get setup on GS Quant, see [Getting Started](/covid/guides/getting-started). Below produced\n", "using gs-quant version 0.8.126" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "from gs_quant.session import GsSession, Environment\n", "\n", "GsSession.use(client_id=None, client_secret=None, scopes=('read_product_data',))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2 - COVID-19 Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll start by defining a general function to load various datasets, which includes regional data, for the past week:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "from gs_quant.data import Dataset\n", "import datetime\n", "\n", "# Note: There is data going back to 2019-12-31, you will need to write your own code to batch data fetching\n", "def get_datasets(datasets):\n", " ds_dict = {}\n", " end = datetime.date(2020, 7, 9)\n", " start = end - datetime.timedelta(weeks=1)\n", " \n", " for dataset in datasets:\n", " try:\n", " df = Dataset(dataset).get_data(start, end)\n", "\n", " keys = [x for x in ['countryId', 'subdivisionId'] if x in df.columns] + ['date']\n", " val_map = {'newConfirmed': 'totalConfirmed', 'newFatalities': 'totalFatalities'}\n", " vals = [x for x in list(val_map.keys()) if x in df.columns]\n", "\n", " df_t = df.groupby(keys).sum().groupby(level=0).cumsum().reset_index()[keys + vals].rename(columns=val_map)\n", " ds_dict[dataset] = df.reset_index().merge(df_t, on=keys, suffixes=('', '_y')).set_index('date')\n", "\n", " except Exception as err:\n", " print(f'Failed to obtain {dataset} with {getattr(err, \"message\", repr(err))}')\n", " return ds_dict" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We create a list of some of the available datasets, and fetch all of them, so that we can compare them." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "country_datasets = [\n", " 'COVID19_COUNTRY_DAILY_ECDC', \n", " 'COVID19_COUNTRY_DAILY_WHO',\n", " 'COVID19_COUNTRY_DAILY_WIKI', \n", " 'COVID19_US_DAILY_CDC'\n", "]\n", "df = get_datasets(country_datasets)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we look at the date ranges of each dataset to determine how much history they have, and ensure they are \n", "up-to-date:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false, "name": "#%%\n" } }, "outputs": [], "source": [ "for name, ds in df.items():\n", " print('{:<30} {} {}'.format(name, ds.index[0].date(), ds.index[-1].date())) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3 - Comparing Global Sources\n", "\n", "Below is a general function to compare the time series of certain columns across datasets:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "import re\n", "from typing import Union\n", "\n", "def compare_time_series(df, datasets, columns: Union[str, list], grouping: str = 'countryId', suffix_identifier: float = 1):\n", " columns = [columns] if isinstance(columns, str) else columns\n", " suffixes = list(map(lambda ds_name: '_' + re.findall('\\_([A-Z]+)', ds_name)[-suffix_identifier], datasets))\n", " df_combo = None\n", "\n", " for ds_name in datasets:\n", " ds = df[ds_name]\n", " df_combo = ds if df_combo is None else df_combo\n", " df_suffixes = ('', '_' + re.findall('\\_([A-Z]+)', ds_name)[-suffix_identifier])\n", " df_combo = df_combo.merge(ds, on=['date', grouping], suffixes=df_suffixes)\n", "\n", " return df_combo[[grouping] + [column + suffix for suffix in suffixes for column in columns]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, if we want to compare the time series for total confirmed cases across the WHO, ECDC, and Wikipedia \n", "datasets globally, we can do the following:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false, "name": "#%%\n" } }, "outputs": [], "source": [ "datasets = ['COVID19_COUNTRY_DAILY_ECDC', 'COVID19_COUNTRY_DAILY_WHO', 'COVID19_COUNTRY_DAILY_WIKI'] \n", "df_to_compare = compare_time_series(df, datasets, columns='totalConfirmed')\n", "\n", "df_to_compare.describe().style.background_gradient(cmap='Blues', axis=1).format('{:,.2f}')" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "This shows statistical properties for each dataset for all common countries and dates. As we can see, there's some \n", "variation in the data sources. Let's dig in a little further and plot the relationship between the WHO and ECDC for \n", "a number of countries: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import seaborn as sns\n", "\n", "select_countries = ['GB', 'DE', 'IT', 'ES', 'FR', 'RU']\n", "to_plot = df_to_compare[df_to_compare.countryId.isin(select_countries)]\n", "\n", "sns.lmplot(x=\"totalConfirmed_ECDC\", y=\"totalConfirmed_WHO\", col=\"countryId\", data=to_plot, col_wrap=3, height=3, fit_reg=False);" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "As we can see, there is some dispersion between sources for certain countries. 
For information on the various ISO\n", "country codes, see [this guide](https://developer.gs.com/docs/covid/guides/standards/iso-countries/).\n", "\n", "### 4 - Comparing US Sources\n", "\n", "Now let's take a closer look at the US data, adding in the CDC dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false, "name": "#%%\n" } }, "outputs": [], "source": [ "datasets = ['COVID19_US_DAILY_CDC', 'COVID19_COUNTRY_DAILY_ECDC', 'COVID19_COUNTRY_DAILY_WHO', 'COVID19_COUNTRY_DAILY_WIKI']\n", "df_to_compare = compare_time_series(df, datasets, columns='totalConfirmed')\n", "\n", "df_to_compare.describe().style.background_gradient(cmap='Blues', axis=1).format('{:,.2f}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As of 21 May 2020, the CDC had the most confirmed cases, followed by Wikipedia, then the ECDC and the WHO. This is not\n", "overly surprising given the information collection and validation flows." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's compare all the series side by side:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false, "name": "#%%\n" } }, "outputs": [], "source": [ "df_to_compare.plot(figsize=(10, 6), title='US')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "(df_to_compare['totalConfirmed_WHO'] - df_to_compare['totalConfirmed_ECDC']).plot(figsize=(10, 6), title='Differences vs WHO', label='ECDC')\n", "(df_to_compare['totalConfirmed_WHO'] - df_to_compare['totalConfirmed_CDC']).plot(label='CDC')\n", "(df_to_compare['totalConfirmed_WHO'] - df_to_compare['totalConfirmed_WIKI']).plot(label='WIKI')\n", "\n", "plt.legend()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This chart shows how the ECDC, CDC, and Wikipedia counts track the WHO figures. At the start of the epidemic these sources were\n", "much closer; they diverged over time, with the CDC leading US reporting relative to the ECDC and WHO." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5 - Comparing subregions, combining with mobility data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we illustrate how to compare datasets for a specific country (in this case, Italy) at different levels of granularity (region, province, etc.) and how to combine epidemic data with mobility data from Google.\n", "\n", "As before, we fetch data for Italy at three levels of granularity." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "datasets = ['COVID19_ITALY_DAILY_DPC', 'COVID19_REGION_DAILY_DPC', 'COVID19_PROVINCE_DAILY_DPC']\n", "df = get_datasets(datasets)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_to_compare = compare_time_series(df, datasets, columns='totalConfirmed', suffix_identifier=3)\n", "\n", "df_to_compare.describe().style.background_gradient(cmap='Blues', axis=1).format('{:,.2f}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that `suffix_identifier=3` is needed here: all three dataset names end in `_DPC`, so the token that distinguishes them (`ITALY`, `REGION` or `PROVINCE`) is the third underscore-delimited token from the end, as the quick check below shows. We then write a function to compare the data across different geographic subdivisions."
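] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is a quick, data-free check of the suffix extraction used by `compare_time_series`, applying the same regular expression from that function to the three Italian dataset names:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "# Show which token each suffix_identifier value picks out of the dataset names\n", "for name in ['COVID19_ITALY_DAILY_DPC', 'COVID19_REGION_DAILY_DPC', 'COVID19_PROVINCE_DAILY_DPC']:\n", "    tokens = re.findall(r'_([A-Z]+)', name)\n", "    print(f'{name:<30} identifier 1 -> {tokens[-1]}, identifier 3 -> {tokens[-3]}')"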
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from functools import reduce\n", "import pandas as pd\n", "\n", "def compare_totals_across_breakdowns(df, data1, data2, column_to_check):\n", " \n", " # pick the common indices between the data being compared\n", " common_idx_province = reduce(lambda x, y: x & y, \n", " df[data1[0]].groupby(data1[1]).apply(lambda x: x.index).tolist())\n", " common_idx_region = reduce(lambda x, y: x & y, \n", " df[data2[0]].groupby(data2[1]).apply(lambda x: x.index).tolist())\n", " idx = common_idx_province & common_idx_region\n", " \n", " # calculate the difference, and rename column\n", " diff = df[data1[0]].groupby(data1[1]).apply(lambda x : x.loc[idx][column_to_check]).T.apply(sum,axis=1) -\\\n", " df[data2[0]].groupby(data2[1]).apply(lambda x : x.loc[idx][column_to_check]).T.apply(sum,axis=1)\n", " diff = pd.DataFrame(diff).rename(columns={0: f'{data1[0]}-{data2[0]}'})\n", " return diff\n", "\n", "diff1 = compare_totals_across_breakdowns(df, ('COVID19_ITALY_DAILY_DPC','countryId'),\n", " ('COVID19_REGION_DAILY_DPC','subdivisionId'),'totalConfirmed')\n", "diff2 = compare_totals_across_breakdowns(df, ('COVID19_REGION_DAILY_DPC','subdivisionId'),\n", " ('COVID19_PROVINCE_DAILY_DPC','administrativeRegion'),'totalConfirmed')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We plot the discrepancies below..." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "to_plot = diff1.join(diff2)\n", "\n", "sns.lmplot(x=\"COVID19_ITALY_DAILY_DPC-COVID19_REGION_DAILY_DPC\", y=\"COVID19_REGION_DAILY_DPC-COVID19_PROVINCE_DAILY_DPC\",\n", " data=to_plot)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "... and interestingly, this indicates there is no discrepancy at all when we compare country-level aggregate data with region-level aggregate data, but we do see discrepancies when we compate province-level with region-level data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we illustrate how to join region-level data from Italy with mobility data from Google, which allows us to check, for example, how the increase in cases of COVID-19 affected mobility patterns in the population. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from datetime import datetime\n", "\n", "df_mob = Dataset('COVID19_SUBDIVISION_DAILY_GOOGLE').get_data(start_date=datetime(2020,2,1).date(), countryId='IT')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_mob.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now join mobility data with region-level data in Italy, for the subdivision of Liguria." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def join_dfs(subdivision, mobility, column_to_compare):\n", " df_red = df['COVID19_REGION_DAILY_DPC'][df['COVID19_REGION_DAILY_DPC']['subdivisionId'] == subdivision]\n", " subdivision_name = df_red['subdivisionName'].unique()[0]\n", " df1 = df_red[[column_to_compare]]\n", " df2 = df_mob[df_mob.subdivisionId.isin([subdivision]) & df_mob.group.isin([mobility])]\n", " df_joint = df1.merge(df2, on='date')\n", " return df_joint, mobility, subdivision_name\n", "\n", "df_joint, mobility, subdivision_name = join_dfs('IT-42', 'retailRecreation', 'newConfirmed')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we plot a chart comparing mobility data with data on the growth of the epidemic." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_joint['Change in Mobility'] = df_joint['value'].diff()\n", "df_joint[['newConfirmed','value']].rename(columns={'newConfirmed':'New Confirmed Cases',\n", " 'value':f'Change in {mobility} mobility'}).plot(figsize=(10, 7,), grid=True,\n", " title=f'Comparison of new confirmed cases and mobility in {subdivision_name}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can observe a dramatic drop in moblity as the rate of new cases began to increase, a pattern that persisted during the peak of the epidemic in the region of Liguria.\n", "\n", "Please reach out to `covid-data@gs.com` with any questions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Disclaimer\n", "This website may contain links to websites and the content of third parties (\"Third Party Content\"). We do not monitor, \n", "review or update, and do not have any control over, any Third Party Content or third party websites. We make no \n", "representation, warranty or guarantee as to the accuracy, completeness, timeliness or reliability of any \n", "Third Party Content and are not responsible for any loss or damage of any sort resulting from the use of, or for any \n", "failure of, products or services provided at or from a third party resource. If you use these links and the \n", "Third Party Content, you acknowledge that you are doing so entirely at your own risk." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" }, "pycharm": { "stem_cell": { "cell_type": "raw", "metadata": { "collapsed": false }, "source": [] } } }, "nbformat": 4, "nbformat_minor": 4 }