{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# C3.ai COVID-19 Data Lake Quickstart in Python  \n",
    "\n",
    "Version 5.0 (August 11, 2020).\n",
    "\n",
    "This Jupyter notebook shows some examples of how to access and use each of the [C3.ai COVID-19 Data Lake](https://c3.ai/covid/) APIs. These examples show only a small piece of what you can do with the C3.ai COVID-19 Data Lake, but will get you started with performing your own exploration. See the [API documentation](https://c3.ai/covid-19-api-documentation/) for more details.\n",
    "\n",
    "Please contribute your questions, answers and insights on [Stack Overflow](https://www.stackoverflow.com). Tag `c3ai-datalake` so that others can view and help build on your contributions. For support, please send email to: [covid@c3.ai](mailto:covid@c3.ai)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Table of Contents\n",
    "- [Helper methods for accessing the API](#helpers)\n",
    "- [Access OutbreakLocation data](#outbreaklocation)\n",
    "    - [Case counts](#outbreaklocation/casecounts)\n",
    "    - [Demographics](#outbreaklocation/demographics)\n",
    "    - [Mobility](#outbreaklocation/mobility)\n",
    "    - [Projections](#outbreaklocation/projections)\n",
    "    - [Economic indicators](#outbreaklocation/economics)\n",
    "- [Access LocationExposure data](#locationexposure)\n",
    "- [Access LineListRecord data](#linelistrecord)\n",
    "- [Join BiologicalAsset and Sequence data](#biologicalasset)\n",
    "- [Access BiblioEntry data](#biblioentry)\n",
    "- [Join TherapeuticAsset and ExternalLink data](#therapeuticasset)\n",
    "- [Join Diagnosis and DiagnosisDetail data](#diagnosis)\n",
    "- [Access VaccineCoverage data](#vaccinecoverage)\n",
    "- [Access Policy data](#policy)\n",
    "- [Access LaborDetail data](#labor)\n",
    "- [Access Survey data](#survey)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Import the [requests](https://requests.readthedocs.io/en/master/), [pandas>=1.0.0](https://pandas.pydata.org/), [matplotlib](https://matplotlib.org/3.2.1/index.html), [scipy](https://www.scipy.org/), and [numpy](https://numpy.org/) libraries before using this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "import pandas as pd\n",
    "from matplotlib import pyplot as plt\n",
    "from scipy.stats import gamma\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ensure that you have a recent version of pandas (>= 1.0.0)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"pandas version\", pd.__version__)\n",
    "assert int(pd.__version__.split(\".\")[0]) >= 1, \"To use this notebook, upgrade to the newest version of pandas. See https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html for details.\"\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"helpers\"></a>\n",
    "## Helper methods for accessing the API\n",
    "\n",
    "The helper methods in `c3aidatalake.py` convert a JSON response from the C3.ai APIs to a Pandas DataFrame. You may wish to view the code in `c3aidatalake.py` before using the quickstart examples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import c3aidatalake"
   ]
  },
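  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough picture of what these helpers do, the sketch below flattens a made-up JSON payload into a DataFrame with `pd.json_normalize`. The `objs` key and the field names and values in `sample_response` are assumptions for illustration, not a documented response format; see `c3aidatalake.py` for the actual logic."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical payload: the \"objs\" key and all field names and values are illustrative assumptions\n",
    "sample_response = {\n",
    "    \"objs\" : [\n",
    "        {\"id\" : \"Germany\", \"location\" : {\"value\" : {\"latitude\" : 51.0, \"longitude\" : 9.0}}},\n",
    "        {\"id\" : \"Spain\", \"location\" : {\"value\" : {\"latitude\" : 40.0, \"longitude\" : -4.0}}}\n",
    "    ]\n",
    "}\n",
    "\n",
    "# json_normalize flattens nested dictionaries into dot-separated column names\n",
    "sample_df = pd.json_normalize(sample_response[\"objs\"])\n",
    "\n",
    "sample_df"
   ]
  },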
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"outbreaklocation\"></a>\n",
    "## Access OutbreakLocation data\n",
    "\n",
    "`OutbreakLocation` stores data about locations, such as countries, provinces, and cities, where COVID-19 outbreaks are recorded. See the [API documentation](https://c3.ai/covid-19-api-documentation/#tag/OutbreakLocation) for more details and for a list of available locations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fetch facts about Germany\n",
    "locations = c3aidatalake.fetch(\n",
    "    \"outbreaklocation\",\n",
    "    {\n",
    "        \"spec\" : {\n",
    "            \"filter\" : \"id == 'Germany'\"\n",
    "        }\n",
    "    }\n",
    ")\n",
    "\n",
    "locations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"outbreaklocation/casecounts\"></a>\n",
    "### Case counts\n",
    "\n",
    "A variety of sources provide counts of cases, deaths, recoveries, and other statistics for counties, provinces, and countries worldwide."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Total number of confirmed cases, deaths, and recoveries in Santa Clara, California\n",
    "today = pd.Timestamp.now().strftime(\"%Y-%m-%d\")\n",
    "\n",
    "casecounts = c3aidatalake.evalmetrics(\n",
    "    \"outbreaklocation\",\n",
    "    {\n",
    "        \"spec\" : {\n",
    "            \"ids\" : [\"SantaClara_California_UnitedStates\"],\n",
    "            \"expressions\" : [\"JHU_ConfirmedCases\", \"JHU_ConfirmedDeaths\", \"JHU_ConfirmedRecoveries\"],\n",
    "            \"start\" : \"2020-01-01\",\n",
    "            \"end\" : today,\n",
    "            \"interval\" : \"DAY\",\n",
    "        }\n",
    "    }\n",
    ")\n",
    "\n",
    "casecounts"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plot these counts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize = (8, 6))\n",
    "plt.plot(\n",
    "    casecounts[\"dates\"],\n",
    "    casecounts[\"SantaClara_California_UnitedStates.JHU_ConfirmedCases.data\"],\n",
    "    label = \"JHU_ConfirmedCases\"\n",
    ")\n",
    "plt.plot(\n",
    "    casecounts[\"dates\"],\n",
    "    casecounts[\"SantaClara_California_UnitedStates.JHU_ConfirmedDeaths.data\"],\n",
    "    label = \"JHU_ConfirmedDeaths\"\n",
    ")\n",
    "plt.plot(\n",
    "    casecounts[\"dates\"],\n",
    "    casecounts[\"SantaClara_California_UnitedStates.JHU_ConfirmedRecoveries.data\"],\n",
    "    label = \"JHU_ConfirmedRecoveries\"\n",
    ")\n",
    "plt.legend()\n",
    "plt.xticks(rotation = 45, ha = \"right\")\n",
    "plt.ylabel(\"Count\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Export case counts as a .csv file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Uncomment the line below to export the DataFrame as a .csv file\n",
    "# casecounts.to_csv(\"casecounts.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"outbreaklocation/demographics\"></a>\n",
    "### Demographics\n",
    "\n",
    "Demographic and economic data from the US Census Bureau and The World Bank allow demographic comparisons across locations. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "population = c3aidatalake.fetch(\n",
    "    \"populationdata\",\n",
    "    {\n",
    "        \"spec\" : {\n",
    "            \"filter\" : \"!contains(parent, '_') && (populationAge == '>=65' || populationAge == 'Total') && gender == 'Male/Female' && year == '2018' && estimate == 'False' && percent == 'False'\"\n",
    "        }\n",
    "    },\n",
    "    get_all = True\n",
    ")\n",
    "\n",
    "population"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "population_age_distribution = population.loc[\n",
    "    :, \n",
    "    [\"populationAge\", \"parent.id\", \"value\"]\n",
    "].pivot(index = \"parent.id\", columns = \"populationAge\")['value']\n",
    "population_age_distribution[\"proportion_over_65\"] = population_age_distribution[\">=65\"] / population_age_distribution[\"Total\"]\n",
    "\n",
    "population_age_distribution"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Access global death counts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "global_deaths = c3aidatalake.evalmetrics(\n",
    "    \"outbreaklocation\",\n",
    "    {\n",
    "        \"spec\" : {\n",
    "            \"ids\" : list(population_age_distribution.index),\n",
    "            \"expressions\" : [\"JHU_ConfirmedDeaths\"],\n",
    "            \"start\" : \"2020-05-01\",\n",
    "            \"end\" : \"2020-05-01\",\n",
    "            \"interval\" : \"DAY\",\n",
    "        }\n",
    "    },\n",
    "    get_all = True\n",
    ")\n",
    "\n",
    "global_deaths"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "global_deaths_by_country = global_deaths.filter(regex = r\"\\.data\").melt()\n",
    "global_deaths_by_country[\"country\"] = global_deaths_by_country[\"variable\"].str.replace(r\"\\..*\", \"\", regex = True)\n",
    "\n",
    "global_comparison = global_deaths_by_country.set_index(\"country\").join(population_age_distribution)\n",
    "global_comparison[\"deaths_per_million\"] = 1e6 * global_comparison[\"value\"] / global_comparison[\"Total\"] \n",
    "global_comparison"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plot the results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize = (8, 6))\n",
    "plt.scatter(\n",
    "    global_comparison[\"proportion_over_65\"],\n",
    "    global_comparison[\"deaths_per_million\"]\n",
    ")\n",
    "plt.xlabel(\"Proportion of population over 65\")\n",
    "plt.ylabel(\"Confirmed COVID-19 deaths\\nper million people\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"outbreaklocation/mobility\"></a>\n",
    "### Mobility\n",
    "\n",
    "Mobility data from Apple and Google provide a view of the impact of COVID-19 and social distancing on mobility trends."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mobility_trends = c3aidatalake.evalmetrics(\n",
    "    \"outbreaklocation\",\n",
    "    {\n",
    "        \"spec\" : {\n",
    "            \"ids\" : [\"DistrictofColumbia_UnitedStates\"],\n",
    "            \"expressions\" : [\n",
    "                \"Apple_WalkingMobility\", \n",
    "                \"Apple_DrivingMobility\",\n",
    "                \"Google_ParksMobility\",\n",
    "                \"Google_ResidentialMobility\"\n",
    "              ],\n",
    "            \"start\" : \"2020-03-01\",\n",
    "            \"end\" : \"2020-04-01\",\n",
    "            \"interval\" : \"DAY\",\n",
    "        }\n",
    "    },\n",
    "    get_all = True\n",
    ")\n",
    "\n",
    "mobility_trends"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plot these mobility trends."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize = (8, 6))\n",
    "plt.plot(\n",
    "    mobility_trends[\"dates\"],\n",
    "    [100 for d in mobility_trends[\"dates\"]],\n",
    "    label = \"Baseline\",\n",
    "    linestyle = \"dashed\",\n",
    "    color = \"black\"\n",
    ")\n",
    "plt.plot(\n",
    "    mobility_trends[\"dates\"],\n",
    "    mobility_trends[\"DistrictofColumbia_UnitedStates.Apple_WalkingMobility.data\"],\n",
    "    label = \"Apple_WalkingMobility\"\n",
    ")\n",
    "plt.plot(\n",
    "    mobility_trends[\"dates\"],\n",
    "    mobility_trends[\"DistrictofColumbia_UnitedStates.Apple_DrivingMobility.data\"],\n",
    "    label = \"Apple_DrivingMobility\"\n",
    ")\n",
    "plt.plot(\n",
    "    mobility_trends[\"dates\"],\n",
    "    mobility_trends[\"DistrictofColumbia_UnitedStates.Google_ParksMobility.data\"],\n",
    "    label = \"Google_ParksMobility\"\n",
    ")\n",
    "plt.plot(\n",
    "    mobility_trends[\"dates\"],\n",
    "    mobility_trends[\"DistrictofColumbia_UnitedStates.Google_ResidentialMobility.data\"],\n",
    "    label = \"Google_ResidentialMobility\"\n",
    ")\n",
    "plt.legend()\n",
    "plt.xticks(rotation = 45, ha = \"right\")\n",
    "plt.ylabel(\"Mobility compared to baseline (%)\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"outbreaklocation/projections\"></a>\n",
    "### Projections\n",
    "\n",
    "Use the `GetProjectionHistory` API to retrieve versioned time series projections for specific metrics made at specific points in time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Retrieve projections made between April 13 and May 1 of mean total cumulative deaths in Spain from April 13 to May 13\n",
    "projections = c3aidatalake.getprojectionhistory(\n",
    "    {\n",
    "        \"outbreakLocation\": \"Spain\", \n",
    "        \"metric\": \"UniversityOfWashington_TotdeaMean_Hist\",\n",
    "        \"metricStart\": \"2020-04-13\", \n",
    "        \"metricEnd\": \"2020-05-13\",\n",
    "        \"observationPeriodStart\": \"2020-04-13\",\n",
    "        \"observationPeriodEnd\": \"2020-05-01\"\n",
    "    }\n",
    ")\n",
    "\n",
    "projections"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Retrieve actual total cumulative deaths in Spain from April 1 to May 13\n",
    "deaths = c3aidatalake.evalmetrics(\n",
    "    \"outbreaklocation\",\n",
    "    {\n",
    "        \"spec\" : {\n",
    "            \"ids\" : [\"Spain\"],\n",
    "            \"expressions\" : [\"JHU_ConfirmedDeaths\"],\n",
    "            \"start\" : \"2020-04-01\",\n",
    "            \"end\" : \"2020-05-13\",\n",
    "            \"interval\" : \"DAY\",\n",
    "        }\n",
    "    }\n",
    ")\n",
    "\n",
    "deaths"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plot the results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize = (8, 6))\n",
    "plt.plot(\n",
    "    deaths[\"dates\"],\n",
    "    deaths[\"Spain.JHU_ConfirmedDeaths.data\"],\n",
    "    label = \"JHU_ConfirmedDeaths\",\n",
    "    color = \"black\"\n",
    ")\n",
    "for col in projections.columns:\n",
    "    if 'data' in col:\n",
    "        expr = projections[col.replace(\"data\", \"expr\")].iloc[0]\n",
    "        projection_date = pd.to_datetime(expr.split(\" \")[-1])\n",
    "        plt.plot(\n",
    "            projections.loc[projections[\"dates\"] >= projection_date, \"dates\"],\n",
    "            projections.loc[projections[\"dates\"] >= projection_date, col],\n",
    "            label = expr\n",
    "        )\n",
    "\n",
    "plt.legend()\n",
    "plt.xticks(rotation = 45, ha = \"right\")\n",
    "plt.ylabel(\"Count\")\n",
    "plt.title(\"Cumulative death count projections versus actual count\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"outbreaklocation/economics\"></a>\n",
    "### Economic indicators\n",
    "\n",
    "GDP and employment statistics by business sector from the US Bureau of Economic Analysis enable comparisons of the drivers of local economies. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Real GDP for AccommodationAndFoodServices and FinanceAndInsurance in Alameda County, California\n",
    "\n",
    "realgdp = c3aidatalake.evalmetrics(\n",
    "    \"outbreaklocation\",\n",
    "    {\n",
    "        \"spec\": {\n",
    "              \"ids\": [\"Alameda_California_UnitedStates\"], \n",
    "              \"expressions\": [\n",
    "                \"BEA_RealGDP_AccommodationAndFoodServices_2012Dollars\",\n",
    "                \"BEA_RealGDP_FinanceAndInsurance_2012Dollars\"\n",
    "\n",
    "              ], \n",
    "              \"start\": \"2000-01-01\", \n",
    "              \"end\": \"2020-01-01\", \n",
    "              \"interval\":\"YEAR\"\n",
    "        }\n",
    "    }\n",
    ")\n",
    "\n",
    "realgdp"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "High frequency spending and earnings data from Opportunity Insights allow tracking of near real-time economic trends."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Access consumer spending in healthcare and low income earnings in the healthcare and social assistance sector in California\n",
    "opportunityinsights = c3aidatalake.evalmetrics(\n",
    "    \"outbreaklocation\",\n",
    "    {\n",
    "        \"spec\": {\n",
    "            \"ids\": [\"California_UnitedStates\"], \n",
    "            \"expressions\": [\n",
    "                \"OIET_Affinity_SpendHcs\",\n",
    "                \"OIET_LowIncEmpAllBusinesses_Emp62\"\n",
    "            ], \n",
    "            \"start\": \"2020-01-01\", \n",
    "            \"end\": \"2020-06-01\", \n",
    "            \"interval\":\"DAY\"\n",
    "        }\n",
    "    }\n",
    ")\n",
    "    \n",
    "opportunityinsights"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plot the results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize = (8, 6))\n",
    "\n",
    "plt.plot(\n",
    "    opportunityinsights.dates,\n",
    "    opportunityinsights['California_UnitedStates.OIET_Affinity_SpendHcs.data'] * 100,\n",
    "    label = 'Consumer spending in healthcare'\n",
    ")\n",
    "\n",
    "plt.plot(\n",
    "    opportunityinsights.dates,\n",
    "    opportunityinsights['California_UnitedStates.OIET_LowIncEmpAllBusinesses_Emp62.data'] * 100,\n",
    "    label = 'Low income earnings in\\nhealthcare & social assistance '\n",
    ")\n",
    "\n",
    "plt.legend()\n",
    "plt.title(\"California low-income earnings and consumer spending in healthcare\")\n",
    "plt.xlabel(\"Date\")\n",
    "plt.xticks(rotation = 45, ha = \"right\")\n",
    "plt.ylabel(\"Change relative to January 4-31 (%)\")\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"locationexposure\"></a>\n",
    "## Access LocationExposure data\n",
    "\n",
    "`LocationExposure` stores information based on the movement of people's mobile devices across locations over time. It stores the following:  \n",
    "* Location exposure index (LEX) for a pair of locations (`locationTarget`, `locationVisited`): the fraction of mobile devices that pinged in `locationTarget` on a date that also pinged in `locationVisited` at least once during the previous 14 days. The pair (`locationTarget`, `locationVisited`) can be two county locations or two state locations.\n",
    "* Device count: the number of distinct mobile devices that pinged at `locationTarget` on the date.\n",
    "\n",
    "See the [API documentation](https://c3.ai/covid-19-api-documentation/#tag/LocationExposures) for more details. "
   ]
  },
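  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the LEX definition concrete, the sketch below computes a device count and a LEX value from a tiny made-up ping log. The `pings` table, its column names, and the exact boundaries of the 14-day window are assumptions for illustration; the API calls that follow return precomputed values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Made-up ping log, one row per (device, location, date); every value is an illustrative assumption\n",
    "pings = pd.DataFrame({\n",
    "    \"device\" : [\"a\", \"a\", \"b\", \"c\", \"c\", \"d\"],\n",
    "    \"location\" : [\"Nevada\", \"California\", \"California\", \"Nevada\", \"California\", \"California\"],\n",
    "    \"date\" : pd.to_datetime([\n",
    "        \"2020-03-10\", \"2020-03-15\", \"2020-03-15\", \"2020-02-20\", \"2020-03-15\", \"2020-03-15\"\n",
    "    ])\n",
    "})\n",
    "target, visited = \"California\", \"Nevada\"\n",
    "target_date = pd.Timestamp(\"2020-03-15\")\n",
    "\n",
    "# Device count: distinct devices that pinged in the target location on the date\n",
    "on_date = (pings[\"location\"] == target) & (pings[\"date\"] == target_date)\n",
    "target_devices = set(pings.loc[on_date, \"device\"])\n",
    "\n",
    "# Devices that pinged in the visited location during the 14 days up to the date\n",
    "# (the exact window boundaries are an assumption)\n",
    "in_window = (pings[\"date\"] > target_date - pd.Timedelta(days = 14)) & (pings[\"date\"] <= target_date)\n",
    "visited_devices = set(pings.loc[(pings[\"location\"] == visited) & in_window, \"device\"])\n",
    "\n",
    "# LEX: share of the target-location devices that also appear in the visited set\n",
    "device_count = len(target_devices)\n",
    "toy_lex = len(target_devices & visited_devices) / device_count\n",
    "print(\"Device count:\", device_count, \"LEX:\", toy_lex)"
   ]
  },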
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "exposure = c3aidatalake.read_data_json(\n",
    "    \"locationexposure\",\n",
    "    \"getlocationexposures\",\n",
    "    {\n",
    "        \"spec\":\n",
    "        {\n",
    "            \"locationTarget\": \"California_UnitedStates\",\n",
    "            \"locationVisited\": \"Nevada_UnitedStates\",\n",
    "            \"start\": \"2020-01-20\",\n",
    "            \"end\": \"2020-04-25\"\n",
    "        }\n",
    "    }\n",
    "    \n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Access daily LEX where `locationTarget` is California and `locationVisited` is Nevada with the `locationExposures` field."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lex = pd.json_normalize(exposure[\"locationExposures\"][\"value\"])\n",
    "\n",
    "lex"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plot the LEX data to see the proportion of devices in California on each date that pinged in Nevada over the previous 14 days."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize = (8, 6))\n",
    "plt.plot(\n",
    "    pd.to_datetime(lex[\"timestamp\"]),\n",
    "    lex[\"value\"]\n",
    ")\n",
    "plt.ylabel(\"Location exposure index (LEX)\")\n",
    "plt.title(\"Location exposure for target location California and visited location Nevada\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Access daily device counts with the `deviceCounts` field."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "pd.json_normalize(exposure[\"deviceCounts\"][\"value\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"linelistrecord\"></a>\n",
    "## Access LineListRecord data\n",
    "\n",
    "`LineListRecord` stores individual-level crowdsourced information from laboratory-confirmed COVID-19 patients. Information includes gender, age, symptoms, travel history, location, reported onset, confirmation dates, and discharge status. See the [API documentation](https://c3.ai/covid-19-api-documentation/#tag/LineListRecord) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fetch the line list records tracked by MOBS Lab (stored with lineListSource 'DXY')\n",
    "records = c3aidatalake.fetch(\n",
    "    \"linelistrecord\",\n",
    "    {\n",
    "        \"spec\" : {\n",
    "            \"filter\" : \"lineListSource == 'DXY'\"\n",
    "        }\n",
    "    },\n",
    "    get_all = True\n",
    ")\n",
    "\n",
    "records"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What are the most common symptoms in this dataset?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get all the symptoms, which are initially comma-separated\n",
    "symptom_df = records.copy()\n",
    "symptom_df[\"symptoms\"] = symptom_df[\"symptoms\"].str.split(\", \")\n",
    "symptom_df = symptom_df.explode(\"symptoms\")\n",
    "symptom_df = symptom_df.dropna(subset = [\"symptoms\"])\n",
    "symptom_freq = symptom_df.groupby([\"symptoms\"]).agg(\"count\")[[\"id\"]].sort_values(\"id\")\n",
    "\n",
    "# Plot the data\n",
    "plt.figure(figsize = (10, 6))\n",
    "plt.bar(symptom_freq.index, symptom_freq[\"id\"])\n",
    "plt.xticks(rotation = 90)\n",
    "plt.xlabel(\"Symptom\")\n",
    "plt.ylabel(\"Number of patients\")\n",
    "plt.title(\"Common COVID-19 symptoms\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If a patient is symptomatic and later hospitalized, how long does it take for them to become hospitalized after developing symptoms?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get the number of days from development of symptoms to hospitalization for each patient\n",
    "hospitalized = records.dropna(subset = [\"hospitalAdmissionDate\", \"symptomStartDate\"])\n",
    "hospitalization_time = np.array(\n",
    "    pd.to_datetime(hospitalized['hospitalAdmissionDate']) - pd.to_datetime(hospitalized['symptomStartDate'])\n",
    ").astype('timedelta64[D]').astype('float')\n",
    "hospitalization_time = hospitalization_time[hospitalization_time >= 0]\n",
    "\n",
    "# Hospitalization time of 0 days is replaced with 0.1 to indicate near-immediate hospitalization\n",
    "hospitalization_time[hospitalization_time <= 0.1] = 0.1\n",
    "\n",
    "# Fit a gamma distribution\n",
    "a, loc, scale = gamma.fit(hospitalization_time, floc = 0)\n",
    "dist = gamma(a, loc, scale)\n",
    "\n",
    "# Plot the results\n",
    "x = np.linspace(0, np.max(hospitalization_time), 1000)\n",
    "n_bins = int(np.max(hospitalization_time) + 1)\n",
    "print(n_bins)\n",
    "\n",
    "plt.figure(figsize = (10, 6))\n",
    "plt.hist(\n",
    "    hospitalization_time, \n",
    "    bins = n_bins, \n",
    "    range = (0, np.max(hospitalization_time)), \n",
    "    density = True, \n",
    "    label = \"Observed\"\n",
    ")\n",
    "plt.plot(x, dist.pdf(x), 'r-', lw=5, alpha=0.6, label = 'Gamma distribution')\n",
    "plt.ylim(0, 0.5)\n",
    "plt.xlabel(\"Days from development of symptoms to hospitalization\")\n",
    "plt.ylabel(\"Proportion of patients\")\n",
    "plt.title(\"Distribution of time to hospitalization\")\n",
    "plt.legend()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"biologicalasset\"></a>\n",
    "## Join BiologicalAsset and Sequence data\n",
    "\n",
    "`BiologicalAsset` stores the metadata of the genome sequences collected from SARS-CoV-2 samples in the National Center for Biotechnology Information Virus Database. `Sequence` stores the genome sequences collected from SARS-CoV-2 samples in the National Center for Biotechnology Information Virus Database. See the API documentation for [BiologicalAsset](https://c3.ai/covid-19-api-documentation/#tag/BiologicalAsset) and [Sequence](https://c3.ai/covid-19-api-documentation/#tag/Sequence) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Join data from BiologicalAsset & Sequence\n",
    "sequences = c3aidatalake.fetch(\n",
    "  \"biologicalasset\",\n",
    "  {\n",
    "    \"spec\" : {\n",
    "      \"include\" : \"this, sequence.sequence\",\n",
    "      \"filter\" : \"exists(sequence.sequence)\"\n",
    "    }\n",
    "  }\n",
    ")\n",
    "\n",
    "sequences"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"biblioentry\"></a>\n",
    "## Access BiblioEntry data\n",
    "\n",
    "`BiblioEntry` stores the metadata about the journal articles in the CORD-19 Dataset. See the [API documentation](https://c3.ai/covid-19-api-documentation/#tag/BiblioEntry) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fetch metadata for the first 2000 BiblioEntry journal articles that have full text available\n",
    "# Only 2000 records are returned; pass get_all = True to fetch to retrieve the full dataset\n",
    "bibs = c3aidatalake.fetch(\n",
    "  \"biblioentry\",\n",
    "  {\n",
    "      \"spec\" : {\n",
    "          \"filter\" : \"hasFullText == true\"\n",
    "      }\n",
    "  }\n",
    ")\n",
    "\n",
    "# Sort them to get the most recent articles first\n",
    "bibs[\"publishTime\"] = pd.to_datetime(bibs[\"publishTime\"])\n",
    "bibs = bibs.sort_values(\"publishTime\", ascending = False)\n",
    "\n",
    "bibs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use `GetArticleMetadata` to access the full text of these articles, or in this case, the text of the first page."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bib_id = bibs.loc[0, \"id\"] \n",
    "print(bib_id)\n",
    "\n",
    "article_data = c3aidatalake.read_data_json(\n",
    "    \"biblioentry\",\n",
    "    \"getarticlemetadata\",\n",
    "    {\n",
    "        \"ids\" : [bib_id]\n",
    "    }\n",
    ")\n",
    "\n",
    "article_data[\"value\"][\"value\"][0][\"body_text\"][0][\"text\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"therapeuticasset\"></a>\n",
    "## Join TherapeuticAsset and ExternalLink data\n",
    "\n",
    "`TherapeuticAsset` stores details about the research and development (R&D) of coronavirus therapies, for example, vaccines, diagnostics, and antibodies. `ExternalLink` stores website URLs cited in the data sources containing the therapies stored in the `TherapeuticAsset` C3.ai Type. See the API documentation for [TherapeuticAsset](https://c3.ai/covid-19-api-documentation/#tag/TherapeuticAsset) and [ExternalLink](https://c3.ai/covid-19-api-documentation/#tag/ExternalLink) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Join data from TherapeuticAsset and ExternalLink (productType, description, origin, and URL links)\n",
    "assets = c3aidatalake.fetch(\n",
    "  \"therapeuticasset\",\n",
    "  {\n",
    "      \"spec\" : {\n",
    "          \"include\" : \"productType, description, origin, links.url\",\n",
    "          \"filter\" : \"origin == 'Milken'\"\n",
    "      }\n",
    "  }\n",
    ")\n",
    "\n",
    "assets = assets.explode(\"links\")\n",
    "assets[\"links\"] = [link[\"url\"] if type(link) == dict and \"url\" in link.keys() else None for link in assets[\"links\"]]\n",
    "assets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"diagnosis\"></a>\n",
    "## Join Diagnosis and DiagnosisDetail data\n",
    "\n",
    "`Diagnosis` stores basic clinical data (e.g. clinical notes, demographics, test results, x-ray or CT scan images) about individual patients tested for COVID-19, from research papers and healthcare institutions. \n",
    "\n",
    "\n",
    "`DiagnosisDetail` stores detailed clinical data (e.g. lab tests, pre-existing conditions, symptoms) about individual patients in key-value format. See the API documentation for [Diagnosis](https://c3.ai/covid-19-api-documentation/#tag/Diagnosis) and [DiagnosisDetail](https://c3.ai/covid-19-api-documentation/#tag/DiagnosisDetail) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "diagnoses = c3aidatalake.fetch(\n",
    "  \"diagnosis\",\n",
    "  {\n",
    "      \"spec\" : {\n",
    "          \"filter\" : \"contains(testResults, 'COVID-19')\", \n",
    "          \"include\" : \"this, diagnostics.source, diagnostics.key, diagnostics.value\"\n",
    "      }\n",
    "  }\n",
    ")\n",
    "\n",
    "diagnoses"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "diagnoses_long = diagnoses.explode(\"diagnostics\")\n",
    "diagnoses_long = pd.concat([\n",
    "    diagnoses_long.reset_index(),\n",
    "    pd.json_normalize(\n",
    "        diagnoses_long.loc[diagnoses_long.source != 'UCSD', \"diagnostics\"]\n",
    "    )[[\"key\", \"value\"]]\n",
    "], axis = 1).drop(columns = \"diagnostics\")\n",
    "diagnoses_long"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "diagnoses_wide = (\n",
    "    diagnoses_long\n",
    "    .loc[~diagnoses_long[['key', 'value']].isna().all(axis=1)]\n",
    "    .pivot(columns = \"key\", values = \"value\")\n",
    ")\n",
    "diagnoses_wide = pd.concat([diagnoses, diagnoses_wide], axis = 1).drop(columns = \"diagnostics\")\n",
    "diagnoses_wide"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use the `GetImageURLs` API to view the image associated with a diagnosis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "diagnosis_id = diagnoses_wide.loc[0, \"id\"] \n",
    "print(diagnosis_id)\n",
    "\n",
    "image_urls = c3aidatalake.read_data_json(\n",
    "    \"diagnosis\",\n",
    "    \"getimageurls\",\n",
    "    {\n",
    "        \"ids\" : [diagnosis_id]\n",
    "    }\n",
    ")\n",
    "\n",
    "print(image_urls[\"value\"][diagnosis_id][\"value\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"vaccinecoverage\"></a>\n",
    "## Access VaccineCoverage data\n",
    "\n",
    "`VaccineCoverage` stores historical vaccination rates for various demographic groups in US counties and states, based on data from the US Centers for Disease Control (CDC). See the [API documentation](https://c3.ai/covid-19-api-documentation/#tag/VaccineCoverage) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "vaccine_coverage = c3aidatalake.fetch(\n",
    "  \"vaccinecoverage\",\n",
    "  {\n",
    "      \"spec\" : {\n",
    "          \"filter\" : \"vaxView == 'Influenza' && contains(vaccineDetails, 'General Population') && (location == 'California_UnitedStates' || location == 'Texas_UnitedStates') && contains(demographicClass, 'Race/ethnicity') && year == 2018\"\n",
    "      }\n",
    "  }\n",
    ")\n",
    "\n",
    "vaccine_coverage"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How does vaccine coverage vary by race/ethnicity in these locations?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "vaccine_coverage[\"upperError\"] = vaccine_coverage[\"upperLimit\"] - vaccine_coverage[\"value\"]\n",
    "vaccine_coverage[\"lowerError\"] = vaccine_coverage[\"value\"] - vaccine_coverage[\"lowerLimit\"]\n",
    "\n",
    "plt.figure(figsize = (10, 6))\n",
    "\n",
    "plt.subplot(1, 2, 1)\n",
    "plt.bar(\n",
    "    vaccine_coverage.loc[vaccine_coverage[\"location.id\"] == \"California_UnitedStates\", \"demographicClassDetails\"], \n",
    "    vaccine_coverage.loc[vaccine_coverage[\"location.id\"] == \"California_UnitedStates\", \"value\"], \n",
    "    yerr = [\n",
    "        vaccine_coverage.loc[vaccine_coverage[\"location.id\"] == \"California_UnitedStates\", \"upperError\"], \n",
    "        vaccine_coverage.loc[vaccine_coverage[\"location.id\"] == \"California_UnitedStates\", \"lowerError\"], \n",
    "    ]\n",
    ")\n",
    "plt.ylabel(\"Vaccination rate (%)\")\n",
    "plt.xticks(rotation = 45, ha = \"right\")\n",
    "plt.title(\"California, United States\")\n",
    "\n",
    "plt.subplot(1, 2, 2)\n",
    "plt.bar(\n",
    "    vaccine_coverage.loc[vaccine_coverage[\"location.id\"] == \"Texas_UnitedStates\", \"demographicClassDetails\"], \n",
    "    vaccine_coverage.loc[vaccine_coverage[\"location.id\"] == \"Texas_UnitedStates\", \"value\"], \n",
    "    yerr = [\n",
    "        vaccine_coverage.loc[vaccine_coverage[\"location.id\"] == \"Texas_UnitedStates\", \"upperError\"], \n",
    "        vaccine_coverage.loc[vaccine_coverage[\"location.id\"] == \"Texas_UnitedStates\", \"lowerError\"], \n",
    "    ]\n",
    ")\n",
    "plt.ylabel(\"Vaccination rate (%)\")\n",
    "plt.xticks(rotation = 45, ha = \"right\")\n",
    "plt.title(\"Texas, United States\")\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"locationpolicysummary\"></a>\n",
    "## Access Policy data\n",
    "\n",
    "`LocationPolicySummary` stores COVID-19 social distancing and health policies and regulations enacted by US states. See the [API documentation](https://c3.ai/covid-19-api-documentation/#tag/LocationPolicySummary) for more details. \n",
    "<br />\n",
    "\n",
    "`PolicyDetail` stores country-level policy responses to COVID-19 including:  \n",
    "* Financial sector policies (from The World Bank: Finance Related Policy Responses to COVID-19), \n",
    "* Containment and closure, economic, and health system policies (from University of Oxford: Coronavirus Government Response Tracker, OxCGRT), and \n",
    "* Policies in South Korea (from Data Science for COVID-19: South Korea).\n",
    "      \n",
    "See the [API documentation](https://c3.ai/covid-19-api-documentation/#tag/PolicyDetail) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "policy_united_states = c3aidatalake.fetch(\n",
    "  \"locationpolicysummary\",\n",
    "  {\n",
    "      \"spec\" : {\n",
    "          \"filter\" : \"contains(location.id, 'UnitedStates')\",\n",
    "          \"limit\" : -1\n",
    "      }\n",
    "  }\n",
    ")\n",
    "\n",
    "policy_united_states"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use the `AllVersionsForPolicy` API of `LocationPolicySummary` to access historical and current versions of a policy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "versions = c3aidatalake.read_data_json(\n",
    "    \"locationpolicysummary\",\n",
    "    \"allversionsforpolicy\",\n",
    "    {\n",
    "        \"this\" : {\n",
    "            \"id\" : \"Wisconsin_UnitedStates_Policy\"\n",
    "        }\n",
    "    }\n",
    ")\n",
    "\n",
    "pd.json_normalize(versions)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Fetch all school closing policies that restrict gatherings between 11-100 people from OxCGRT dataset in `PolicyDetail`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "school_policy = c3aidatalake.fetch(\n",
    "  \"policydetail\",\n",
    "  {\n",
    "    \"spec\" : {\n",
    "        \"filter\": \"contains(lowerCase(name), 'school') && value == 3 && origin == 'University of Oxford'\",\n",
    "        \"limit\": -1\n",
    "    }\n",
    "  }\n",
    ")\n",
    "\n",
    "school_policy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"labor\"></a>\n",
    "## Access LaborDetail data\n",
    "\n",
    "`LaborDetail` stores historical monthly labor force and employment data for US counties and states from US Bureau of Labor Statistics. See the [API documentation](https://c3.ai/covid-19-api-documentation/#tag/LaborDetail) for more details. \n",
    "<br />"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fetch the unemployment rates of counties in California in March, 2020\n",
    "labordetail = c3aidatalake.fetch(\n",
    "    \"labordetail\",\n",
    "    {\n",
    "        \"spec\": {\n",
    "            \"filter\": \"year == 2020 && month == 3 && contains(parent, 'California_UnitedStates')\"\n",
    "        }\n",
    "    }\n",
    ")\n",
    "\n",
    "labordetail"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"survey\"></a>\n",
    "## Access Survey data\n",
    "\n",
    "`SurveyData` stores COVID-19-related public opinion, demographic, and symptom prevalence data collected from COVID-19 survey responses. See the [API documentation](https://c3.ai/covid-19-api-documentation/#tag/SurveyData) for more details. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "# Fetch participants who are located in California and who have a relatively strong intent to wear a mask in public because of COVID-19\n",
    "survey = c3aidatalake.fetch(\n",
    "    \"surveydata\",\n",
    "    {\n",
    "        \"spec\": {\n",
    "            \"filter\": \"location == 'California_UnitedStates' && coronavirusIntent_Mask >= 75\"\n",
    "        }\n",
    "    },\n",
    "    get_all = True\n",
    ")\n",
    "\n",
    "survey"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plot the results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "employment_df = survey.copy()\n",
    "employment_df[\"coronavirusEmployment\"] = employment_df[\"coronavirusEmployment\"].str.split(\", \")\n",
    "employment_df = employment_df.explode(\"coronavirusEmployment\")\n",
    "employment_df = employment_df.groupby([\"coronavirusEmployment\"]).agg(\"count\")[[\"id\"]].sort_values(\"id\")\n",
    "\n",
    "# Plot the data\n",
    "plt.figure(figsize = (10, 6))\n",
    "plt.bar(employment_df.index, 100 * employment_df[\"id\"] / len(survey))\n",
    "plt.xticks(rotation = 90)\n",
    "plt.xlabel(\"Response to employment status question\")\n",
    "plt.ylabel(\"Proportion of participants (%)\")\n",
    "plt.title(\"Employment status of CA participants with strong intent to wear mask\")\n",
    "plt.show()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}