{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# C3.ai COVID-19 Data Lake Quickstart in Python \n",
"\n",
"Version 5.0 (August 11, 2020).\n",
"\n",
"This Jupyter notebook shows some examples of how to access and use each of the [C3.ai COVID-19 Data Lake](https://c3.ai/covid/) APIs. These examples show only a small piece of what you can do with the C3.ai COVID-19 Data Lake, but will get you started with performing your own exploration. See the [API documentation](https://c3.ai/covid-19-api-documentation/) for more details.\n",
"\n",
"Please contribute your questions, answers and insights on [Stack Overflow](https://www.stackoverflow.com). Tag `c3ai-datalake` so that others can view and help build on your contributions. For support, please send email to: [covid@c3.ai](mailto:covid@c3.ai)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table of Contents\n",
"- [Helper methods for accessing the API](#helpers)\n",
"- [Access OutbreakLocation data](#outbreaklocation)\n",
" - [Case counts](#outbreaklocation/casecounts)\n",
" - [Demographics](#outbreaklocation/demographics)\n",
" - [Mobility](#outbreaklocation/mobility)\n",
" - [Projections](#outbreaklocation/projections)\n",
" - [Economic indicators](#outbreaklocation/economics)\n",
"- [Access LocationExposure data](#locationexposure)\n",
"- [Access LineListRecord data](#linelistrecord)\n",
"- [Join BiologicalAsset and Sequence data](#biologicalasset)\n",
"- [Access BiblioEntry data](#biblioentry)\n",
"- [Join TherapeuticAsset and ExternalLink data](#therapeuticasset)\n",
"- [Join Diagnosis and DiagnosisDetail data](#diagnosis)\n",
"- [Access VaccineCoverage data](#vaccinecoverage)\n",
"- [Access Policy data](#policy)\n",
"- [Access LaborDetail data](#labor)\n",
"- [Access Survey data](#survey)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Import the [requests](https://requests.readthedocs.io/en/master/), [pandas>=1.0.0](https://pandas.pydata.org/), [matplotlib](https://matplotlib.org/3.2.1/index.html), and [scipy](https://www.scipy.org/) libraries before using this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"import pandas as pd\n",
"from matplotlib import pyplot as plt\n",
"from scipy.stats import gamma\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ensure that you have a recent version of pandas (>= 1.0.0)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"pandas version\", pd.__version__)\n",
"assert pd.__version__[0] >= \"1\", \"To use this notebook, upgrade to the newest version of pandas. See https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html for details.\"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Helper methods for accessing the API\n",
"\n",
"The helper methods in `c3aidatalake.py` convert a JSON response from the C3.ai APIs to a Pandas DataFrame. You may wish to view the code in `c3aidatalake.py` before using the quickstart examples."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import c3aidatalake"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Access OutbreakLocation data\n",
"\n",
"`OutbreakLocation` stores location data such as countries, provinces, cities, where COVID-19 outbeaks are recorded. See the [API documentation](https://c3.ai/covid-19-api-documentation/#tag/OutbreakLocation) for more details and for a list of available locations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Fetch facts about Germany\n",
"locations = c3aidatalake.fetch(\n",
" \"outbreaklocation\",\n",
" {\n",
" \"spec\" : {\n",
" \"filter\" : \"id == 'Germany'\"\n",
" }\n",
" }\n",
")\n",
"\n",
"locations"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### Case counts\n",
"\n",
"A variety of sources provide counts of cases, deaths, recoveries, and other statistics for counties, provinces, and countries worldwide."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Total number of confirmed cases, deaths, and recoveries in Santa Clara, California\n",
"today = pd.Timestamp.now().strftime(\"%Y-%m-%d\")\n",
"\n",
"casecounts = c3aidatalake.evalmetrics(\n",
" \"outbreaklocation\",\n",
" {\n",
" \"spec\" : {\n",
" \"ids\" : [\"SantaClara_California_UnitedStates\"],\n",
" \"expressions\" : [\"JHU_ConfirmedCases\", \"JHU_ConfirmedDeaths\", \"JHU_ConfirmedRecoveries\"],\n",
" \"start\" : \"2020-01-01\",\n",
" \"end\" : today,\n",
" \"interval\" : \"DAY\",\n",
" }\n",
" }\n",
")\n",
"\n",
"casecounts"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plot these counts."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.figure(figsize = (8, 6))\n",
"plt.plot(\n",
" casecounts[\"dates\"],\n",
" casecounts[\"SantaClara_California_UnitedStates.JHU_ConfirmedCases.data\"],\n",
" label = \"JHU_ConfirmedCases\"\n",
")\n",
"plt.plot(\n",
" casecounts[\"dates\"],\n",
" casecounts[\"SantaClara_California_UnitedStates.JHU_ConfirmedDeaths.data\"],\n",
" label = \"JHU_ConfirmedDeaths\"\n",
")\n",
"plt.plot(\n",
" casecounts[\"dates\"],\n",
" casecounts[\"SantaClara_California_UnitedStates.JHU_ConfirmedRecoveries.data\"],\n",
" label = \"JHU_ConfirmedCases\"\n",
")\n",
"plt.legend()\n",
"plt.xticks(rotation = 45, ha = \"right\")\n",
"plt.ylabel(\"Count\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Export case counts as a .csv file."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Uncomment the line below to export the DataFrame as a .csv file\n",
"# casecounts.to_csv(\"casecounts.csv\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### Demographics\n",
"\n",
"Demographic and economic data from the US Census Bureau and The World Bank allow demographic comparisons across locations. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"population = c3aidatalake.fetch(\n",
" \"populationdata\",\n",
" {\n",
" \"spec\" : {\n",
" \"filter\" : \"!contains(parent, '_') && (populationAge == '>=65' || populationAge == 'Total') && gender == 'Male/Female' && year == '2018' && estimate == 'False' && percent == 'False'\"\n",
" }\n",
" },\n",
" get_all = True\n",
")\n",
"\n",
"population"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"population_age_distribution = population.loc[\n",
" :, \n",
" [\"populationAge\", \"parent.id\", \"value\"]\n",
"].pivot(index = \"parent.id\", columns = \"populationAge\")['value']\n",
"population_age_distribution[\"proportion_over_65\"] = population_age_distribution[\">=65\"] / population_age_distribution[\"Total\"]\n",
"\n",
"population_age_distribution"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Access global death counts."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"global_deaths = c3aidatalake.evalmetrics(\n",
" \"outbreaklocation\",\n",
" {\n",
" \"spec\" : {\n",
" \"ids\" : list(population_age_distribution.index),\n",
" \"expressions\" : [\"JHU_ConfirmedDeaths\"],\n",
" \"start\" : \"2020-05-01\",\n",
" \"end\" : \"2020-05-01\",\n",
" \"interval\" : \"DAY\",\n",
" }\n",
" },\n",
" get_all = True\n",
")\n",
"\n",
"global_deaths"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"global_deaths_by_country = global_deaths.filter(regex=(\"\\.data\")).melt()\n",
"global_deaths_by_country[\"country\"] = global_deaths_by_country[\"variable\"].str.replace(\"\\..*\", \"\")\n",
"\n",
"global_comparison = global_deaths_by_country.set_index(\"country\").join(population_age_distribution)\n",
"global_comparison[\"deaths_per_million\"] = 1e6 * global_comparison[\"value\"] / global_comparison[\"Total\"] \n",
"global_comparison"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plot the results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.figure(figsize = (8, 6))\n",
"plt.scatter(\n",
" global_comparison[\"proportion_over_65\"],\n",
" global_comparison[\"deaths_per_million\"]\n",
")\n",
"plt.xlabel(\"Proportion of population over 65\")\n",
"plt.ylabel(\"Confirmed COVID-19 deaths\\nper million people\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### Mobility\n",
"\n",
"Mobility data from Apple and Google provide a view of the impact of COVID-19 and social distancing on mobility trends."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mobility_trends = c3aidatalake.evalmetrics(\n",
" \"outbreaklocation\",\n",
" {\n",
" \"spec\" : {\n",
" \"ids\" : [\"DistrictofColumbia_UnitedStates\"],\n",
" \"expressions\" : [\n",
" \"Apple_WalkingMobility\", \n",
" \"Apple_DrivingMobility\",\n",
" \"Google_ParksMobility\",\n",
" \"Google_ResidentialMobility\"\n",
" ],\n",
" \"start\" : \"2020-03-01\",\n",
" \"end\" : \"2020-04-01\",\n",
" \"interval\" : \"DAY\",\n",
" }\n",
" },\n",
" get_all = True\n",
")\n",
"\n",
"mobility_trends"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plot these mobility trends."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.figure(figsize = (8, 6))\n",
"plt.plot(\n",
" mobility_trends[\"dates\"],\n",
" [100 for d in mobility_trends[\"dates\"]],\n",
" label = \"Baseline\",\n",
" linestyle = \"dashed\",\n",
" color = \"black\"\n",
")\n",
"plt.plot(\n",
" mobility_trends[\"dates\"],\n",
" mobility_trends[\"DistrictofColumbia_UnitedStates.Apple_WalkingMobility.data\"],\n",
" label = \"Apple_WalkingMobility\"\n",
")\n",
"plt.plot(\n",
" mobility_trends[\"dates\"],\n",
" mobility_trends[\"DistrictofColumbia_UnitedStates.Apple_DrivingMobility.data\"],\n",
" label = \"Apple_DrivingMobility\"\n",
")\n",
"plt.plot(\n",
" mobility_trends[\"dates\"],\n",
" mobility_trends[\"DistrictofColumbia_UnitedStates.Google_ParksMobility.data\"],\n",
" label = \"Google_ParksMobility\"\n",
")\n",
"plt.plot(\n",
" mobility_trends[\"dates\"],\n",
" mobility_trends[\"DistrictofColumbia_UnitedStates.Google_ResidentialMobility.data\"],\n",
" label = \"Google_ResidentialMobility\"\n",
")\n",
"plt.legend()\n",
"plt.xticks(rotation = 45, ha = \"right\")\n",
"plt.ylabel(\"Mobility compared to baseline (%)\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### Projections\n",
"\n",
"Use the `GetProjectionHistory` API to retrieve versioned time series projections for specific metrics made at specific points in time."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Retrieve projections made between April 13 and May 1 of mean total cumulative deaths in Spain from April 13 to May 13\n",
"projections = c3aidatalake.getprojectionhistory(\n",
" {\n",
" \"outbreakLocation\": \"Spain\", \n",
" \"metric\": \"UniversityOfWashington_TotdeaMean_Hist\",\n",
" \"metricStart\": \"2020-04-13\", \n",
" \"metricEnd\": \"2020-05-13\",\n",
" \"observationPeriodStart\": \"2020-04-13\",\n",
" \"observationPeriodEnd\": \"2020-05-01\"\n",
" }\n",
")\n",
"\n",
"projections"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Retrieve actual total cumulative deaths in Spain from April 1 to May 13\n",
"deaths = c3aidatalake.evalmetrics(\n",
" \"outbreaklocation\",\n",
" {\n",
" \"spec\" : {\n",
" \"ids\" : [\"Spain\"],\n",
" \"expressions\" : [\"JHU_ConfirmedDeaths\"],\n",
" \"start\" : \"2020-04-01\",\n",
" \"end\" : \"2020-05-13\",\n",
" \"interval\" : \"DAY\",\n",
" }\n",
" }\n",
")\n",
"\n",
"deaths"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plot the results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.figure(figsize = (8, 6))\n",
"plt.plot(\n",
" deaths[\"dates\"],\n",
" deaths[\"Spain.JHU_ConfirmedDeaths.data\"],\n",
" label = \"JHU_ConfirmedDeaths\",\n",
" color = \"black\"\n",
")\n",
"for col in projections.columns:\n",
" if 'data' in col:\n",
" expr = projections[col.replace(\"data\", \"expr\")].iloc[0]\n",
" projection_date = pd.to_datetime(expr.split(\" \")[-1])\n",
" plt.plot(\n",
" projections.loc[projections[\"dates\"] >= projection_date, \"dates\"],\n",
" projections.loc[projections[\"dates\"] >= projection_date, col],\n",
" label = expr\n",
" )\n",
"\n",
"plt.legend()\n",
"plt.xticks(rotation = 45, ha = \"right\")\n",
"plt.ylabel(\"Count\")\n",
"plt.title(\"Cumulative death count projections versus actual count\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### Economic indicators\n",
"\n",
"GDP and employment statistics by business sector from the US Bureau of Economic Analysis enable comparisons of the drivers of local economies. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Real GDP for AccommodationAndFoodServices and FinanceAndInsurance in Alameda County, California\n",
"\n",
"realgdp = c3aidatalake.evalmetrics(\n",
" \"outbreaklocation\",\n",
" {\n",
" \"spec\": {\n",
" \"ids\": [\"Alameda_California_UnitedStates\"], \n",
" \"expressions\": [\n",
" \"BEA_RealGDP_AccommodationAndFoodServices_2012Dollars\",\n",
" \"BEA_RealGDP_FinanceAndInsurance_2012Dollars\"\n",
"\n",
" ], \n",
" \"start\": \"2000-01-01\", \n",
" \"end\": \"2020-01-01\", \n",
" \"interval\":\"YEAR\"\n",
" }\n",
" }\n",
")\n",
"\n",
"realgdp"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"High frequency spending and earnings data from Opportunity Insights allow tracking of near real-time economic trends."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Access consumer spending in healthcare and low income earnings in the healthcare and social assistance sector in California\n",
"opportunityinsights = c3aidatalake.evalmetrics(\n",
" \"outbreaklocation\",\n",
" {\n",
" \"spec\": {\n",
" \"ids\": [\"California_UnitedStates\"], \n",
" \"expressions\": [\n",
" \"OIET_Affinity_SpendHcs\",\n",
" \"OIET_LowIncEmpAllBusinesses_Emp62\"\n",
" ], \n",
" \"start\": \"2020-01-01\", \n",
" \"end\": \"2020-06-01\", \n",
" \"interval\":\"DAY\"\n",
" }\n",
" }\n",
")\n",
" \n",
"opportunityinsights"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plot the results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.figure(figsize = (8, 6))\n",
"\n",
"plt.plot(\n",
" opportunityinsights.dates,\n",
" opportunityinsights['California_UnitedStates.OIET_Affinity_SpendHcs.data'] * 100,\n",
" label = 'Consumer spending in healthcare'\n",
")\n",
"\n",
"plt.plot(\n",
" opportunityinsights.dates,\n",
" opportunityinsights['California_UnitedStates.OIET_LowIncEmpAllBusinesses_Emp62.data'] * 100,\n",
" label = 'Low income earnings in\\nhealthcare & social assistance '\n",
")\n",
"\n",
"plt.legend()\n",
"plt.title(\"California low-income earnings and consumer spending in healthcare\")\n",
"plt.xlabel(\"Date\")\n",
"plt.xticks(rotation = 45, ha = \"right\")\n",
"plt.ylabel(\"Change relative to January 4-31 (%)\")\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Access Location Exposure data\n",
"\n",
"`LocationExposure` stores information based on the movement of people's mobile devices across locations over time. It stores the following: \n",
"* Location exposure index (LEX) for a pair of locations (`locationTarget`, `locationVisited`): the fraction of mobile devices that pinged in `locationTarget` on a date that also pinged in `locationVisited` at least once during the previous 14 days. The pair (`locationTarget`, `locationVisited`) can be two county locations or two state locations.\n",
"* Device count: the number of distinct mobile devices that pinged at `locationTarget` on the date.\n",
"\n",
"See the [API documentation](https://c3.ai/covid-19-api-documentation/#tag/LocationExposures) for more details. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"exposure = c3aidatalake.read_data_json(\n",
" \"locationexposure\",\n",
" \"getlocationexposures\",\n",
" {\n",
" \"spec\":\n",
" {\n",
" \"locationTarget\": \"California_UnitedStates\",\n",
" \"locationVisited\": \"Nevada_UnitedStates\",\n",
" \"start\": \"2020-01-20\",\n",
" \"end\": \"2020-04-25\"\n",
" }\n",
" }\n",
" \n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Access daily LEX where `locationTarget` is California and `locationVisited` is Nevada with the the `locationExposures` field."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"lex = pd.json_normalize(exposure[\"locationExposures\"][\"value\"])\n",
"\n",
"lex"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plot the LEX data to see the proportion of devices in California on each date that pinged in Nevada over the previous 14 days."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.figure(figsize = (8, 6))\n",
"plt.plot(\n",
" pd.to_datetime(lex[\"timestamp\"]),\n",
" lex[\"value\"]\n",
")\n",
"plt.ylabel(\"Location exposure index (LEX)\")\n",
"plt.title(\"Location exposure for target location California and visited location Nevada\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Access daily device counts with the `deviceCounts` field."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"pd.json_normalize(exposure[\"deviceCounts\"][\"value\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Access LineListRecord data\n",
"\n",
"`LineListRecord` stores individual-level crowdsourced information from laboratory-confirmed COVID-19 patients. Information includes gender, age, symptoms, travel history, location, reported onset, confirmation dates, and discharge status. See the [API documentation](https://c3.ai/covid-19-api-documentation/#tag/LineListRecord) for more details."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Fetch the line list records tracked by MOBS Lab\n",
"records = c3aidatalake.fetch(\n",
" \"linelistrecord\",\n",
" {\n",
" \"spec\" : {\n",
" \"filter\" : \"lineListSource == 'DXY'\"\n",
" }\n",
" },\n",
" get_all = True\n",
")\n",
"\n",
"records"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What are the most common symptoms in this dataset?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Get all the symptoms, which are initially comma-separated\n",
"symptom_df = records.copy()\n",
"symptom_df[\"symptoms\"] = symptom_df[\"symptoms\"].str.split(\", \")\n",
"symptom_df = symptom_df.explode(\"symptoms\")\n",
"symptom_df = symptom_df.dropna(subset = [\"symptoms\"])\n",
"symptom_freq = symptom_df.groupby([\"symptoms\"]).agg(\"count\")[[\"id\"]].sort_values(\"id\")\n",
"\n",
"# Plot the data\n",
"plt.figure(figsize = (10, 6))\n",
"plt.bar(symptom_freq.index, symptom_freq[\"id\"])\n",
"plt.xticks(rotation = 90)\n",
"plt.xlabel(\"Symptom\")\n",
"plt.ylabel(\"Number of patients\")\n",
"plt.title(\"Common COVID-19 symptoms\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If a patient is symptomatic and later hospitalized, how long does it take for them to become hospitalized after developing symptoms?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Get the number of days from development of symptoms to hospitalization for each patient\n",
"hospitalized = records.dropna(subset = [\"hospitalAdmissionDate\", \"symptomStartDate\"])\n",
"hospitalization_time = np.array(\n",
" pd.to_datetime(hospitalized['hospitalAdmissionDate']) - pd.to_datetime(hospitalized['symptomStartDate'])\n",
").astype('timedelta64[D]').astype('float')\n",
"hospitalization_time = hospitalization_time[hospitalization_time >= 0]\n",
"\n",
"# Hospitalization time of 0 days is replaced with 0.1 to indicate near-immediate hospitalization\n",
"hospitalization_time[hospitalization_time <= 0.1] = 0.1\n",
"\n",
"# Fit a gamma distribution\n",
"a, loc, scale = gamma.fit(hospitalization_time, floc = 0)\n",
"dist = gamma(a, loc, scale)\n",
"\n",
"# Plot the results\n",
"x = np.linspace(0, np.max(hospitalization_time), 1000)\n",
"n_bins = int(np.max(hospitalization_time) + 1)\n",
"print(n_bins)\n",
"\n",
"plt.figure(figsize = (10, 6))\n",
"plt.hist(\n",
" hospitalization_time, \n",
" bins = n_bins, \n",
" range = (0, np.max(hospitalization_time)), \n",
" density = True, \n",
" label = \"Observed\"\n",
")\n",
"plt.plot(x, dist.pdf(x), 'r-', lw=5, alpha=0.6, label = 'Gamma distribution')\n",
"plt.ylim(0, 0.5)\n",
"plt.xlabel(\"Days from development of symptoms to hospitalization\")\n",
"plt.ylabel(\"Proportion of patients\")\n",
"plt.title(\"Distribution of time to hospitalization\")\n",
"plt.legend()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Join BiologicalAsset and Sequence data\n",
"\n",
"`BiologicalAsset` stores the metadata of the genome sequences collected from SARS-CoV-2 samples in the National Center for Biotechnology Information Virus Database. `Sequence` stores the genome sequences collected from SARS-CoV-2 samples in the National Center for Biotechnology Information Virus Database. See the API documentation for [BiologicalAsset](https://c3.ai/covid-19-api-documentation/#tag/BiologicalAsset) and [Sequence](https://c3.ai/covid-19-api-documentation/#tag/Sequence) for more details."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Join data from BiologicalAsset & Sequence\n",
"sequences = c3aidatalake.fetch(\n",
" \"biologicalasset\",\n",
" {\n",
" \"spec\" : {\n",
" \"include\" : \"this, sequence.sequence\",\n",
" \"filter\" : \"exists(sequence.sequence)\"\n",
" }\n",
" }\n",
")\n",
"\n",
"sequences"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Access BiblioEntry data\n",
"\n",
"`BiblioEntry` stores the metadata about the journal articles in the CORD-19 Dataset. See the [API documentation](https://c3.ai/covid-19-api-documentation/#tag/BiblioEntry) for more details."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Fetch metadata for the first two thousand (2000) BiblioEntry journal articles approved for commercial use\n",
"# Note that 2000 records are returned; the full dataset can be accessed using the get_all = True argument in fetch\n",
"bibs = c3aidatalake.fetch(\n",
" \"biblioentry\",\n",
" {\n",
" \"spec\" : {\n",
" \"filter\" : \"hasFullText == true\"\n",
" }\n",
" }\n",
")\n",
"\n",
"# Sort them to get the most recent articles first\n",
"bibs[\"publishTime\"] = pd.to_datetime(bibs[\"publishTime\"])\n",
"bibs = bibs.sort_values(\"publishTime\", ascending = False)\n",
"\n",
"bibs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use `GetArticleMetadata` to access the full-text of these articles, or in this case, the first page text."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"bib_id = bibs.loc[0, \"id\"] \n",
"print(bib_id)\n",
"\n",
"article_data = c3aidatalake.read_data_json(\n",
" \"biblioentry\",\n",
" \"getarticlemetadata\",\n",
" {\n",
" \"ids\" : [bib_id]\n",
" }\n",
")\n",
"\n",
"article_data[\"value\"][\"value\"][0][\"body_text\"][0][\"text\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Join TherapeuticAsset and ExternalLink data\n",
"\n",
"`TherapeuticAsset` stores details about the research and development (R&D) of coronavirus therapies, for example, vaccines, diagnostics, and antibodies. `ExternalLink` stores website URLs cited in the data sources containing the therapies stored in the TherapeuticAssets C3.ai Type. See the API documentation for [TherapeuticAsset](https://c3.ai/covid-19-api-documentation/#tag/TherapeuticAsset) and [ExternalLink](https://c3.ai/covid-19-api-documentation/#tag/ExternalLink) for more details."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Join data from TherapeuticAsset and ExternalLink (productType, description, origin, and URL links)\n",
"assets = c3aidatalake.fetch(\n",
" \"therapeuticasset\",\n",
" {\n",
" \"spec\" : {\n",
" \"include\" : \"productType, description, origin, links.url\",\n",
" \"filter\" : \"origin == 'Milken'\"\n",
" }\n",
" }\n",
")\n",
"\n",
"assets = assets.explode(\"links\")\n",
"assets[\"links\"] = [link[\"url\"] if type(link) == dict and \"url\" in link.keys() else None for link in assets[\"links\"]]\n",
"assets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Join Diagnosis and DiagnosisDetail data\n",
"\n",
"`Diagnosis` stores basic clinical data (e.g. clinical notes, demographics, test results, x-ray or CT scan images) about individual patients tested for COVID-19, from research papers and healthcare institutions. \n",
"\n",
"\n",
"`DiagnosisDetail` stores detailed clinical data (e.g. lab tests, pre-existing conditions, symptoms) about individual patients in key-value format. See the API documentation for [Diagnosis](https://c3.ai/covid-19-api-documentation/#tag/Diagnosis) and [DiagnosisDetail](https://c3.ai/covid-19-api-documentation/#tag/DiagnosisDetail) for more details."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"diagnoses = c3aidatalake.fetch(\n",
" \"diagnosis\",\n",
" {\n",
" \"spec\" : {\n",
" \"filter\" : \"contains(testResults, 'COVID-19')\", \n",
" \"include\" : \"this, diagnostics.source, diagnostics.key, diagnostics.value\"\n",
" }\n",
" }\n",
")\n",
"\n",
"diagnoses"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"diagnoses_long = diagnoses.explode(\"diagnostics\")\n",
"diagnoses_long = pd.concat([\n",
" diagnoses_long.reset_index(),\n",
" pd.json_normalize(\n",
" diagnoses_long.loc[diagnoses_long.source != 'UCSD', \"diagnostics\"]\n",
" )[[\"key\", \"value\"]]\n",
"], axis = 1).drop(columns = \"diagnostics\")\n",
"diagnoses_long"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"diagnoses_wide = (\n",
" diagnoses_long\n",
" .loc[~diagnoses_long[['key', 'value']].isna().all(axis=1)]\n",
" .pivot(columns = \"key\", values = \"value\")\n",
")\n",
"diagnoses_wide = pd.concat([diagnoses, diagnoses_wide], axis = 1).drop(columns = \"diagnostics\")\n",
"diagnoses_wide"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the `GetImageURLs` API to view the image associated with a diagnosis."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"diagnosis_id = diagnoses_wide.loc[0, \"id\"] \n",
"print(diagnosis_id)\n",
"\n",
"image_urls = c3aidatalake.read_data_json(\n",
" \"diagnosis\",\n",
" \"getimageurls\",\n",
" {\n",
" \"ids\" : [diagnosis_id]\n",
" }\n",
")\n",
"\n",
"print(image_urls[\"value\"][diagnosis_id][\"value\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Access VaccineCoverage data\n",
"\n",
"`VaccineCoverage` stores historical vaccination rates for various demographic groups in US counties and states, based on data from the US Centers for Disease Control (CDC). See the [API documentation](https://c3.ai/covid-19-api-documentation/#tag/VaccineCoverage) for more details."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"vaccine_coverage = c3aidatalake.fetch(\n",
" \"vaccinecoverage\",\n",
" {\n",
" \"spec\" : {\n",
" \"filter\" : \"vaxView == 'Influenza' && contains(vaccineDetails, 'General Population') && (location == 'California_UnitedStates' || location == 'Texas_UnitedStates') && contains(demographicClass, 'Race/ethnicity') && year == 2018\"\n",
" }\n",
" }\n",
")\n",
"\n",
"vaccine_coverage"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How does vaccine coverage vary by race/ethnicity in these locations?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"vaccine_coverage[\"upperError\"] = vaccine_coverage[\"upperLimit\"] - vaccine_coverage[\"value\"]\n",
"vaccine_coverage[\"lowerError\"] = vaccine_coverage[\"value\"] - vaccine_coverage[\"lowerLimit\"]\n",
"\n",
"plt.figure(figsize = (10, 6))\n",
"\n",
"plt.subplot(1, 2, 1)\n",
"plt.bar(\n",
" vaccine_coverage.loc[vaccine_coverage[\"location.id\"] == \"California_UnitedStates\", \"demographicClassDetails\"], \n",
" vaccine_coverage.loc[vaccine_coverage[\"location.id\"] == \"California_UnitedStates\", \"value\"], \n",
" yerr = [\n",
" vaccine_coverage.loc[vaccine_coverage[\"location.id\"] == \"California_UnitedStates\", \"upperError\"], \n",
" vaccine_coverage.loc[vaccine_coverage[\"location.id\"] == \"California_UnitedStates\", \"lowerError\"], \n",
" ]\n",
")\n",
"plt.ylabel(\"Vaccination rate (%)\")\n",
"plt.xticks(rotation = 45, ha = \"right\")\n",
"plt.title(\"California, United States\")\n",
"\n",
"plt.subplot(1, 2, 2)\n",
"plt.bar(\n",
" vaccine_coverage.loc[vaccine_coverage[\"location.id\"] == \"Texas_UnitedStates\", \"demographicClassDetails\"], \n",
" vaccine_coverage.loc[vaccine_coverage[\"location.id\"] == \"Texas_UnitedStates\", \"value\"], \n",
" yerr = [\n",
" vaccine_coverage.loc[vaccine_coverage[\"location.id\"] == \"Texas_UnitedStates\", \"upperError\"], \n",
" vaccine_coverage.loc[vaccine_coverage[\"location.id\"] == \"Texas_UnitedStates\", \"lowerError\"], \n",
" ]\n",
")\n",
"plt.ylabel(\"Vaccination rate (%)\")\n",
"plt.xticks(rotation = 45, ha = \"right\")\n",
"plt.title(\"Texas, United States\")\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Access Policy data\n",
"\n",
"`LocationPolicySummary` stores COVID-19 social distancing and health policies and regulations enacted by US states. See the [API documentation](https://c3.ai/covid-19-api-documentation/#tag/LocationPolicySummary) for more details. \n",
"
\n",
"\n",
"`PolicyDetail` stores country-level policy responses to COVID-19 including: \n",
"* Financial sector policies (from The World Bank: Finance Related Policy Responses to COVID-19), \n",
"* Containment and closure, economic, and health system policies (from University of Oxford: Coronavirus Government Response Tracker, OxCGRT), and \n",
"* Policies in South Korea (from Data Science for COVID-19: South Korea).\n",
" \n",
"See the [API documentation](https://c3.ai/covid-19-api-documentation/#tag/PolicyDetail) for more details."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"policy_united_states = c3aidatalake.fetch(\n",
" \"locationpolicysummary\",\n",
" {\n",
" \"spec\" : {\n",
" \"filter\" : \"contains(location.id, 'UnitedStates')\",\n",
" \"limit\" : -1\n",
" }\n",
" }\n",
")\n",
"\n",
"policy_united_states"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the `AllVersionsForPolicy` API of `LocationPolicySummary` to access historical and current versions of a policy."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"versions = c3aidatalake.read_data_json(\n",
" \"locationpolicysummary\",\n",
" \"allversionsforpolicy\",\n",
" {\n",
" \"this\" : {\n",
" \"id\" : \"Wisconsin_UnitedStates_Policy\"\n",
" }\n",
" }\n",
")\n",
"\n",
"pd.json_normalize(versions)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Fetch all school closing policies that restrict gatherings between 11-100 people from OxCGRT dataset in `PolicyDetail`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"school_policy = c3aidatalake.fetch(\n",
" \"policydetail\",\n",
" {\n",
" \"spec\" : {\n",
" \"filter\": \"contains(lowerCase(name), 'school') && value == 3 && origin == 'University of Oxford'\",\n",
" \"limit\": -1\n",
" }\n",
" }\n",
")\n",
"\n",
"school_policy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Access LaborDetail data\n",
"\n",
"`LaborDetail` stores historical monthly labor force and employment data for US counties and states from US Bureau of Labor Statistics. See the [API documentation](https://c3.ai/covid-19-api-documentation/#tag/LaborDetail) for more details. \n",
"
"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Fetch the unemployment rates of counties in California in March, 2020\n",
"labordetail = c3aidatalake.fetch(\n",
" \"labordetail\",\n",
" {\n",
" \"spec\": {\n",
" \"filter\": \"year == 2020 && month == 3 && contains(parent, 'California_UnitedStates')\"\n",
" }\n",
" }\n",
")\n",
"\n",
"labordetail"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Access Survey data\n",
"\n",
"`SurveyData` stores COVID-19-related public opinion, demographic, and symptom prevalence data collected from COVID-19 survey responses. See the [API documentation](https://c3.ai/covid-19-api-documentation/#tag/SurveyData) for more details. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"# Fetch participants who are located in California and who have a relatively strong intent to wear a mask in public because of COVID-19\n",
"survey = c3aidatalake.fetch(\n",
" \"surveydata\",\n",
" {\n",
" \"spec\": {\n",
" \"filter\": \"location == 'California_UnitedStates' && coronavirusIntent_Mask >= 75\"\n",
" }\n",
" },\n",
" get_all = True\n",
")\n",
"\n",
"survey"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plot the results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"employment_df = survey.copy()\n",
"employment_df[\"coronavirusEmployment\"] = employment_df[\"coronavirusEmployment\"].str.split(\", \")\n",
"employment_df = employment_df.explode(\"coronavirusEmployment\")\n",
"employment_df = employment_df.groupby([\"coronavirusEmployment\"]).agg(\"count\")[[\"id\"]].sort_values(\"id\")\n",
"\n",
"# Plot the data\n",
"plt.figure(figsize = (10, 6))\n",
"plt.bar(employment_df.index, 100 * employment_df[\"id\"] / len(survey))\n",
"plt.xticks(rotation = 90)\n",
"plt.xlabel(\"Response to employment status question\")\n",
"plt.ylabel(\"Proportion of participants (%)\")\n",
"plt.title(\"Employment status of CA participants with strong intent to wear mask\")\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}