{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# C3.ai COVID-19 Data Lake Quickstart in Python \n", "\n", "Version 5.0 (August 11, 2020).\n", "\n", "This Jupyter notebook shows some examples of how to access and use each of the [C3.ai COVID-19 Data Lake](https://c3.ai/covid/) APIs. These examples show only a small piece of what you can do with the C3.ai COVID-19 Data Lake, but will get you started with performing your own exploration. See the [API documentation](https://c3.ai/covid-19-api-documentation/) for more details.\n", "\n", "Please contribute your questions, answers and insights on [Stack Overflow](https://www.stackoverflow.com). Tag `c3ai-datalake` so that others can view and help build on your contributions. For support, please send email to: [covid@c3.ai](mailto:covid@c3.ai)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Table of Contents\n", "- [Helper methods for accessing the API](#helpers)\n", "- [Access OutbreakLocation data](#outbreaklocation)\n", " - [Case counts](#outbreaklocation/casecounts)\n", " - [Demographics](#outbreaklocation/demographics)\n", " - [Mobility](#outbreaklocation/mobility)\n", " - [Projections](#outbreaklocation/projections)\n", " - [Economic indicators](#outbreaklocation/economics)\n", "- [Access LocationExposure data](#locationexposure)\n", "- [Access LineListRecord data](#linelistrecord)\n", "- [Join BiologicalAsset and Sequence data](#biologicalasset)\n", "- [Access BiblioEntry data](#biblioentry)\n", "- [Join TherapeuticAsset and ExternalLink data](#therapeuticasset)\n", "- [Join Diagnosis and DiagnosisDetail data](#diagnosis)\n", "- [Access VaccineCoverage data](#vaccinecoverage)\n", "- [Access Policy data](#policy)\n", "- [Access LaborDetail data](#labor)\n", "- [Access Survey data](#survey)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import the [requests](https://requests.readthedocs.io/en/master/), [pandas>=1.0.0](https://pandas.pydata.org/), 
[matplotlib](https://matplotlib.org/3.2.1/index.html), and [scipy](https://www.scipy.org/) libraries before using this notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests\n", "import pandas as pd\n", "from matplotlib import pyplot as plt\n", "from scipy.stats import gamma\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ensure that you have a recent version of pandas (>= 1.0.0)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"pandas version\", pd.__version__)\n", "assert int(pd.__version__.split(\".\")[0]) >= 1, \"To use this notebook, upgrade to the newest version of pandas. See https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html for details.\"\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Helper methods for accessing the API\n", "\n", "The helper methods in `c3aidatalake.py` convert a JSON response from the C3.ai APIs to a pandas DataFrame. You may wish to view the code in `c3aidatalake.py` before using the quickstart examples." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import c3aidatalake" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Access OutbreakLocation data\n", "\n", "`OutbreakLocation` stores data about locations, such as countries, provinces, and cities, where COVID-19 outbreaks are recorded. See the [API documentation](https://c3.ai/covid-19-api-documentation/#tag/OutbreakLocation) for more details and for a list of available locations." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Fetch facts about Germany\n", "locations = c3aidatalake.fetch(\n", " \"outbreaklocation\",\n", " {\n", " \"spec\" : {\n", " \"filter\" : \"id == 'Germany'\"\n", " }\n", " }\n", ")\n", "\n", "locations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Case counts\n", "\n", "A variety of sources provide counts of cases, deaths, recoveries, and other statistics for counties, provinces, and countries worldwide." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Total number of confirmed cases, deaths, and recoveries in Santa Clara, California\n", "today = pd.Timestamp.now().strftime(\"%Y-%m-%d\")\n", "\n", "casecounts = c3aidatalake.evalmetrics(\n", " \"outbreaklocation\",\n", " {\n", " \"spec\" : {\n", " \"ids\" : [\"SantaClara_California_UnitedStates\"],\n", " \"expressions\" : [\"JHU_ConfirmedCases\", \"JHU_ConfirmedDeaths\", \"JHU_ConfirmedRecoveries\"],\n", " \"start\" : \"2020-01-01\",\n", " \"end\" : today,\n", " \"interval\" : \"DAY\",\n", " }\n", " }\n", ")\n", "\n", "casecounts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot these counts." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize = (8, 6))\n", "plt.plot(\n", " casecounts[\"dates\"],\n", " casecounts[\"SantaClara_California_UnitedStates.JHU_ConfirmedCases.data\"],\n", " label = \"JHU_ConfirmedCases\"\n", ")\n", "plt.plot(\n", " casecounts[\"dates\"],\n", " casecounts[\"SantaClara_California_UnitedStates.JHU_ConfirmedDeaths.data\"],\n", " label = \"JHU_ConfirmedDeaths\"\n", ")\n", "plt.plot(\n", " casecounts[\"dates\"],\n", " casecounts[\"SantaClara_California_UnitedStates.JHU_ConfirmedRecoveries.data\"],\n", " label = \"JHU_ConfirmedRecoveries\"\n", ")\n", "plt.legend()\n", "plt.xticks(rotation = 45, ha = \"right\")\n", "plt.ylabel(\"Count\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Export case counts as a .csv file." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Uncomment the line below to export the DataFrame as a .csv file\n", "# casecounts.to_csv(\"casecounts.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Demographics\n", "\n", "Demographic and economic data from the US Census Bureau and The World Bank allow demographic comparisons across locations. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "population = c3aidatalake.fetch(\n", " \"populationdata\",\n", " {\n", " \"spec\" : {\n", " \"filter\" : \"!contains(parent, '_') && (populationAge == '>=65' || populationAge == 'Total') && gender == 'Male/Female' && year == '2018' && estimate == 'False' && percent == 'False'\"\n", " }\n", " },\n", " get_all = True\n", ")\n", "\n", "population" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "population_age_distribution = population.loc[\n", " :, \n", " [\"populationAge\", \"parent.id\", \"value\"]\n", "].pivot(index = \"parent.id\", columns = \"populationAge\")['value']\n", "population_age_distribution[\"proportion_over_65\"] = population_age_distribution[\">=65\"] / population_age_distribution[\"Total\"]\n", "\n", "population_age_distribution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Access global death counts." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "global_deaths = c3aidatalake.evalmetrics(\n", " \"outbreaklocation\",\n", " {\n", " \"spec\" : {\n", " \"ids\" : list(population_age_distribution.index),\n", " \"expressions\" : [\"JHU_ConfirmedDeaths\"],\n", " \"start\" : \"2020-05-01\",\n", " \"end\" : \"2020-05-01\",\n", " \"interval\" : \"DAY\",\n", " }\n", " },\n", " get_all = True\n", ")\n", "\n", "global_deaths" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Keep only the \".data\" columns, then strip everything after the first \".\" to recover the country id\n", "global_deaths_by_country = global_deaths.filter(regex = r\"\\.data\").melt()\n", "global_deaths_by_country[\"country\"] = global_deaths_by_country[\"variable\"].str.replace(r\"\\..*\", \"\", regex = True)\n", "\n", "global_comparison = global_deaths_by_country.set_index(\"country\").join(population_age_distribution)\n", "global_comparison[\"deaths_per_million\"] = 1e6 * global_comparison[\"value\"] / global_comparison[\"Total\"] \n", 
"global_comparison" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot the results." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize = (8, 6))\n", "plt.scatter(\n", " global_comparison[\"proportion_over_65\"],\n", " global_comparison[\"deaths_per_million\"]\n", ")\n", "plt.xlabel(\"Proportion of population over 65\")\n", "plt.ylabel(\"Confirmed COVID-19 deaths\\nper million people\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Mobility\n", "\n", "Mobility data from Apple and Google provide a view of the impact of COVID-19 and social distancing on mobility trends." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mobility_trends = c3aidatalake.evalmetrics(\n", " \"outbreaklocation\",\n", " {\n", " \"spec\" : {\n", " \"ids\" : [\"DistrictofColumbia_UnitedStates\"],\n", " \"expressions\" : [\n", " \"Apple_WalkingMobility\", \n", " \"Apple_DrivingMobility\",\n", " \"Google_ParksMobility\",\n", " \"Google_ResidentialMobility\"\n", " ],\n", " \"start\" : \"2020-03-01\",\n", " \"end\" : \"2020-04-01\",\n", " \"interval\" : \"DAY\",\n", " }\n", " },\n", " get_all = True\n", ")\n", "\n", "mobility_trends" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot these mobility trends." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize = (8, 6))\n", "plt.plot(\n", " mobility_trends[\"dates\"],\n", " [100 for d in mobility_trends[\"dates\"]],\n", " label = \"Baseline\",\n", " linestyle = \"dashed\",\n", " color = \"black\"\n", ")\n", "plt.plot(\n", " mobility_trends[\"dates\"],\n", " mobility_trends[\"DistrictofColumbia_UnitedStates.Apple_WalkingMobility.data\"],\n", " label = \"Apple_WalkingMobility\"\n", ")\n", "plt.plot(\n", " mobility_trends[\"dates\"],\n", " mobility_trends[\"DistrictofColumbia_UnitedStates.Apple_DrivingMobility.data\"],\n", " label = \"Apple_DrivingMobility\"\n", ")\n", "plt.plot(\n", " mobility_trends[\"dates\"],\n", " mobility_trends[\"DistrictofColumbia_UnitedStates.Google_ParksMobility.data\"],\n", " label = \"Google_ParksMobility\"\n", ")\n", "plt.plot(\n", " mobility_trends[\"dates\"],\n", " mobility_trends[\"DistrictofColumbia_UnitedStates.Google_ResidentialMobility.data\"],\n", " label = \"Google_ResidentialMobility\"\n", ")\n", "plt.legend()\n", "plt.xticks(rotation = 45, ha = \"right\")\n", "plt.ylabel(\"Mobility compared to baseline (%)\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Projections\n", "\n", "Use the `GetProjectionHistory` API to retrieve versioned time series projections for specific metrics made at specific points in time." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Retrieve projections made between April 13 and May 1 of mean total cumulative deaths in Spain from April 13 to May 13\n", "projections = c3aidatalake.getprojectionhistory(\n", " {\n", " \"outbreakLocation\": \"Spain\", \n", " \"metric\": \"UniversityOfWashington_TotdeaMean_Hist\",\n", " \"metricStart\": \"2020-04-13\", \n", " \"metricEnd\": \"2020-05-13\",\n", " \"observationPeriodStart\": \"2020-04-13\",\n", " \"observationPeriodEnd\": \"2020-05-01\"\n", " }\n", ")\n", "\n", "projections" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Retrieve actual total cumulative deaths in Spain from April 1 to May 13\n", "deaths = c3aidatalake.evalmetrics(\n", " \"outbreaklocation\",\n", " {\n", " \"spec\" : {\n", " \"ids\" : [\"Spain\"],\n", " \"expressions\" : [\"JHU_ConfirmedDeaths\"],\n", " \"start\" : \"2020-04-01\",\n", " \"end\" : \"2020-05-13\",\n", " \"interval\" : \"DAY\",\n", " }\n", " }\n", ")\n", "\n", "deaths" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot the results." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize = (8, 6))\n", "plt.plot(\n", " deaths[\"dates\"],\n", " deaths[\"Spain.JHU_ConfirmedDeaths.data\"],\n", " label = \"JHU_ConfirmedDeaths\",\n", " color = \"black\"\n", ")\n", "for col in projections.columns:\n", " if 'data' in col:\n", " expr = projections[col.replace(\"data\", \"expr\")].iloc[0]\n", " projection_date = pd.to_datetime(expr.split(\" \")[-1])\n", " plt.plot(\n", " projections.loc[projections[\"dates\"] >= projection_date, \"dates\"],\n", " projections.loc[projections[\"dates\"] >= projection_date, col],\n", " label = expr\n", " )\n", "\n", "plt.legend()\n", "plt.xticks(rotation = 45, ha = \"right\")\n", "plt.ylabel(\"Count\")\n", "plt.title(\"Cumulative death count projections versus actual count\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Economic indicators\n", "\n", "GDP and employment statistics by business sector from the US Bureau of Economic Analysis enable comparisons of the drivers of local economies. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Real GDP for AccommodationAndFoodServices and FinanceAndInsurance in Alameda County, California\n", "\n", "realgdp = c3aidatalake.evalmetrics(\n", " \"outbreaklocation\",\n", " {\n", " \"spec\": {\n", " \"ids\": [\"Alameda_California_UnitedStates\"], \n", " \"expressions\": [\n", " \"BEA_RealGDP_AccommodationAndFoodServices_2012Dollars\",\n", " \"BEA_RealGDP_FinanceAndInsurance_2012Dollars\"\n", "\n", " ], \n", " \"start\": \"2000-01-01\", \n", " \"end\": \"2020-01-01\", \n", " \"interval\":\"YEAR\"\n", " }\n", " }\n", ")\n", "\n", "realgdp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "High frequency spending and earnings data from Opportunity Insights allow tracking of near real-time economic trends." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Access consumer spending in healthcare and low income earnings in the healthcare and social assistance sector in California\n", "opportunityinsights = c3aidatalake.evalmetrics(\n", " \"outbreaklocation\",\n", " {\n", " \"spec\": {\n", " \"ids\": [\"California_UnitedStates\"], \n", " \"expressions\": [\n", " \"OIET_Affinity_SpendHcs\",\n", " \"OIET_LowIncEmpAllBusinesses_Emp62\"\n", " ], \n", " \"start\": \"2020-01-01\", \n", " \"end\": \"2020-06-01\", \n", " \"interval\":\"DAY\"\n", " }\n", " }\n", ")\n", " \n", "opportunityinsights" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot the results." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize = (8, 6))\n", "\n", "plt.plot(\n", " opportunityinsights.dates,\n", " opportunityinsights['California_UnitedStates.OIET_Affinity_SpendHcs.data'] * 100,\n", " label = 'Consumer spending in healthcare'\n", ")\n", "\n", "plt.plot(\n", " opportunityinsights.dates,\n", " opportunityinsights['California_UnitedStates.OIET_LowIncEmpAllBusinesses_Emp62.data'] * 100,\n", " label = 'Low income earnings in\\nhealthcare & social assistance '\n", ")\n", "\n", "plt.legend()\n", "plt.title(\"California low-income earnings and consumer spending in healthcare\")\n", "plt.xlabel(\"Date\")\n", "plt.xticks(rotation = 45, ha = \"right\")\n", "plt.ylabel(\"Change relative to January 4-31 (%)\")\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Access Location Exposure data\n", "\n", "`LocationExposure` stores information based on the movement of people's mobile devices across locations over time. 
It stores the following: \n", "* Location exposure index (LEX) for a pair of locations (`locationTarget`, `locationVisited`): the fraction of mobile devices that pinged in `locationTarget` on a date that also pinged in `locationVisited` at least once during the previous 14 days. The pair (`locationTarget`, `locationVisited`) can be two county locations or two state locations.\n", "* Device count: the number of distinct mobile devices that pinged at `locationTarget` on the date.\n", "\n", "See the [API documentation](https://c3.ai/covid-19-api-documentation/#tag/LocationExposures) for more details. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "exposure = c3aidatalake.read_data_json(\n", " \"locationexposure\",\n", " \"getlocationexposures\",\n", " {\n", " \"spec\":\n", " {\n", " \"locationTarget\": \"California_UnitedStates\",\n", " \"locationVisited\": \"Nevada_UnitedStates\",\n", " \"start\": \"2020-01-20\",\n", " \"end\": \"2020-04-25\"\n", " }\n", " }\n", " \n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Access the daily LEX, where `locationTarget` is California and `locationVisited` is Nevada, with the `locationExposures` field." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lex = pd.json_normalize(exposure[\"locationExposures\"][\"value\"])\n", "\n", "lex" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot the LEX data to see the proportion of devices in California on each date that pinged in Nevada over the previous 14 days." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize = (8, 6))\n", "plt.plot(\n", " pd.to_datetime(lex[\"timestamp\"]),\n", " lex[\"value\"]\n", ")\n", "plt.ylabel(\"Location exposure index (LEX)\")\n", "plt.title(\"Location exposure for target location California and visited location Nevada\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Access daily device counts with the `deviceCounts` field." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "pd.json_normalize(exposure[\"deviceCounts\"][\"value\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Access LineListRecord data\n", "\n", "`LineListRecord` stores individual-level crowdsourced information from laboratory-confirmed COVID-19 patients. Information includes gender, age, symptoms, travel history, location, reported onset, confirmation dates, and discharge status. See the [API documentation](https://c3.ai/covid-19-api-documentation/#tag/LineListRecord) for more details." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Fetch the line list records whose source is DXY\n", "records = c3aidatalake.fetch(\n", " \"linelistrecord\",\n", " {\n", " \"spec\" : {\n", " \"filter\" : \"lineListSource == 'DXY'\"\n", " }\n", " },\n", " get_all = True\n", ")\n", "\n", "records" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What are the most common symptoms in this dataset?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get all the symptoms, which are initially comma-separated\n", "symptom_df = records.copy()\n", "symptom_df[\"symptoms\"] = symptom_df[\"symptoms\"].str.split(\", \")\n", "symptom_df = symptom_df.explode(\"symptoms\")\n", "symptom_df = symptom_df.dropna(subset = [\"symptoms\"])\n", "symptom_freq = symptom_df.groupby([\"symptoms\"]).agg(\"count\")[[\"id\"]].sort_values(\"id\")\n", "\n", "# Plot the data\n", "plt.figure(figsize = (10, 6))\n", "plt.bar(symptom_freq.index, symptom_freq[\"id\"])\n", "plt.xticks(rotation = 90)\n", "plt.xlabel(\"Symptom\")\n", "plt.ylabel(\"Number of patients\")\n", "plt.title(\"Common COVID-19 symptoms\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If a patient is symptomatic and later hospitalized, how long does it take for them to become hospitalized after developing symptoms?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get the number of days from development of symptoms to hospitalization for each patient\n", "hospitalized = records.dropna(subset = [\"hospitalAdmissionDate\", \"symptomStartDate\"])\n", "hospitalization_time = np.array(\n", " pd.to_datetime(hospitalized['hospitalAdmissionDate']) - pd.to_datetime(hospitalized['symptomStartDate'])\n", ").astype('timedelta64[D]').astype('float')\n", "hospitalization_time = hospitalization_time[hospitalization_time >= 0]\n", "\n", "# Hospitalization time of 0 days is replaced with 0.1 to indicate near-immediate hospitalization\n", "hospitalization_time[hospitalization_time <= 0.1] = 0.1\n", "\n", "# Fit a gamma distribution\n", "a, loc, scale = gamma.fit(hospitalization_time, floc = 0)\n", "dist = gamma(a, loc, scale)\n", "\n", "# Plot the results\n", "x = np.linspace(0, np.max(hospitalization_time), 1000)\n", "n_bins = int(np.max(hospitalization_time) + 1)\n", "print(n_bins)\n", "\n", "plt.figure(figsize 
= (10, 6))\n", "plt.hist(\n", " hospitalization_time, \n", " bins = n_bins, \n", " range = (0, np.max(hospitalization_time)), \n", " density = True, \n", " label = \"Observed\"\n", ")\n", "plt.plot(x, dist.pdf(x), 'r-', lw=5, alpha=0.6, label = 'Gamma distribution')\n", "plt.ylim(0, 0.5)\n", "plt.xlabel(\"Days from development of symptoms to hospitalization\")\n", "plt.ylabel(\"Proportion of patients\")\n", "plt.title(\"Distribution of time to hospitalization\")\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Join BiologicalAsset and Sequence data\n", "\n", "`BiologicalAsset` stores the metadata of the genome sequences collected from SARS-CoV-2 samples in the National Center for Biotechnology Information Virus Database. `Sequence` stores the genome sequences collected from SARS-CoV-2 samples in the National Center for Biotechnology Information Virus Database. See the API documentation for [BiologicalAsset](https://c3.ai/covid-19-api-documentation/#tag/BiologicalAsset) and [Sequence](https://c3.ai/covid-19-api-documentation/#tag/Sequence) for more details." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Join data from BiologicalAsset & Sequence\n", "sequences = c3aidatalake.fetch(\n", " \"biologicalasset\",\n", " {\n", " \"spec\" : {\n", " \"include\" : \"this, sequence.sequence\",\n", " \"filter\" : \"exists(sequence.sequence)\"\n", " }\n", " }\n", ")\n", "\n", "sequences" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Access BiblioEntry data\n", "\n", "`BiblioEntry` stores the metadata about the journal articles in the CORD-19 Dataset. See the [API documentation](https://c3.ai/covid-19-api-documentation/#tag/BiblioEntry) for more details." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Fetch metadata for the first two thousand (2000) BiblioEntry journal articles approved for commercial use\n", "# Note that 2000 records are returned; the full dataset can be accessed using the get_all = True argument in fetch\n", "bibs = c3aidatalake.fetch(\n", " \"biblioentry\",\n", " {\n", " \"spec\" : {\n", " \"filter\" : \"hasFullText == true\"\n", " }\n", " }\n", ")\n", "\n", "# Sort them to get the most recent articles first\n", "bibs[\"publishTime\"] = pd.to_datetime(bibs[\"publishTime\"])\n", "bibs = bibs.sort_values(\"publishTime\", ascending = False)\n", "\n", "bibs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use `GetArticleMetadata` to access the full text of these articles or, as in this case, the text of the first page." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bib_id = bibs.loc[0, \"id\"] \n", "print(bib_id)\n", "\n", "article_data = c3aidatalake.read_data_json(\n", " \"biblioentry\",\n", " \"getarticlemetadata\",\n", " {\n", " \"ids\" : [bib_id]\n", " }\n", ")\n", "\n", "article_data[\"value\"][\"value\"][0][\"body_text\"][0][\"text\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Join TherapeuticAsset and ExternalLink data\n", "\n", "`TherapeuticAsset` stores details about the research and development (R&D) of coronavirus therapies, for example, vaccines, diagnostics, and antibodies. `ExternalLink` stores website URLs cited in the data sources containing the therapies stored in the `TherapeuticAsset` C3.ai Type. See the API documentation for [TherapeuticAsset](https://c3.ai/covid-19-api-documentation/#tag/TherapeuticAsset) and [ExternalLink](https://c3.ai/covid-19-api-documentation/#tag/ExternalLink) for more details." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Join data from TherapeuticAsset and ExternalLink (productType, description, origin, and URL links)\n", "assets = c3aidatalake.fetch(\n", " \"therapeuticasset\",\n", " {\n", " \"spec\" : {\n", " \"include\" : \"productType, description, origin, links.url\",\n", " \"filter\" : \"origin == 'Milken'\"\n", " }\n", " }\n", ")\n", "\n", "assets = assets.explode(\"links\")\n", "assets[\"links\"] = [link[\"url\"] if type(link) == dict and \"url\" in link.keys() else None for link in assets[\"links\"]]\n", "assets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Join Diagnosis and DiagnosisDetail data\n", "\n", "`Diagnosis` stores basic clinical data (e.g. clinical notes, demographics, test results, x-ray or CT scan images) about individual patients tested for COVID-19, from research papers and healthcare institutions. \n", "\n", "\n", "`DiagnosisDetail` stores detailed clinical data (e.g. lab tests, pre-existing conditions, symptoms) about individual patients in key-value format. See the API documentation for [Diagnosis](https://c3.ai/covid-19-api-documentation/#tag/Diagnosis) and [DiagnosisDetail](https://c3.ai/covid-19-api-documentation/#tag/DiagnosisDetail) for more details." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "diagnoses = c3aidatalake.fetch(\n", " \"diagnosis\",\n", " {\n", " \"spec\" : {\n", " \"filter\" : \"contains(testResults, 'COVID-19')\", \n", " \"include\" : \"this, diagnostics.source, diagnostics.key, diagnostics.value\"\n", " }\n", " }\n", ")\n", "\n", "diagnoses" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "diagnoses_long = diagnoses.explode(\"diagnostics\")\n", "diagnoses_long = pd.concat([\n", " diagnoses_long.reset_index(),\n", " pd.json_normalize(\n", " diagnoses_long.loc[diagnoses_long.source != 'UCSD', \"diagnostics\"]\n", " )[[\"key\", \"value\"]]\n", "], axis = 1).drop(columns = \"diagnostics\")\n", "diagnoses_long" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "diagnoses_wide = (\n", " diagnoses_long\n", " .loc[~diagnoses_long[['key', 'value']].isna().all(axis=1)]\n", " .pivot(columns = \"key\", values = \"value\")\n", ")\n", "diagnoses_wide = pd.concat([diagnoses, diagnoses_wide], axis = 1).drop(columns = \"diagnostics\")\n", "diagnoses_wide" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use the `GetImageURLs` API to view the image associated with a diagnosis." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "diagnosis_id = diagnoses_wide.loc[0, \"id\"] \n", "print(diagnosis_id)\n", "\n", "image_urls = c3aidatalake.read_data_json(\n", " \"diagnosis\",\n", " \"getimageurls\",\n", " {\n", " \"ids\" : [diagnosis_id]\n", " }\n", ")\n", "\n", "print(image_urls[\"value\"][diagnosis_id][\"value\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Access VaccineCoverage data\n", "\n", "`VaccineCoverage` stores historical vaccination rates for various demographic groups in US counties and states, based on data from the US Centers for Disease Control (CDC). 
See the [API documentation](https://c3.ai/covid-19-api-documentation/#tag/VaccineCoverage) for more details." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vaccine_coverage = c3aidatalake.fetch(\n", " \"vaccinecoverage\",\n", " {\n", " \"spec\" : {\n", " \"filter\" : \"vaxView == 'Influenza' && contains(vaccineDetails, 'General Population') && (location == 'California_UnitedStates' || location == 'Texas_UnitedStates') && contains(demographicClass, 'Race/ethnicity') && year == 2018\"\n", " }\n", " }\n", ")\n", "\n", "vaccine_coverage" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How does vaccine coverage vary by race/ethnicity in these locations?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vaccine_coverage[\"upperError\"] = vaccine_coverage[\"upperLimit\"] - vaccine_coverage[\"value\"]\n", "vaccine_coverage[\"lowerError\"] = vaccine_coverage[\"value\"] - vaccine_coverage[\"lowerLimit\"]\n", "\n", "plt.figure(figsize = (10, 6))\n", "\n", "plt.subplot(1, 2, 1)\n", "plt.bar(\n", " vaccine_coverage.loc[vaccine_coverage[\"location.id\"] == \"California_UnitedStates\", \"demographicClassDetails\"], \n", " vaccine_coverage.loc[vaccine_coverage[\"location.id\"] == \"California_UnitedStates\", \"value\"], \n", " # matplotlib expects yerr rows in the order [lower errors, upper errors]\n", " yerr = [\n", " vaccine_coverage.loc[vaccine_coverage[\"location.id\"] == \"California_UnitedStates\", \"lowerError\"], \n", " vaccine_coverage.loc[vaccine_coverage[\"location.id\"] == \"California_UnitedStates\", \"upperError\"], \n", " ]\n", ")\n", "plt.ylabel(\"Vaccination rate (%)\")\n", "plt.xticks(rotation = 45, ha = \"right\")\n", "plt.title(\"California, United States\")\n", "\n", "plt.subplot(1, 2, 2)\n", "plt.bar(\n", " vaccine_coverage.loc[vaccine_coverage[\"location.id\"] == \"Texas_UnitedStates\", \"demographicClassDetails\"], \n", " vaccine_coverage.loc[vaccine_coverage[\"location.id\"] == \"Texas_UnitedStates\", \"value\"], \n", " 
yerr = [\n", " vaccine_coverage.loc[vaccine_coverage[\"location.id\"] == \"Texas_UnitedStates\", \"lowerError\"], \n", " vaccine_coverage.loc[vaccine_coverage[\"location.id\"] == \"Texas_UnitedStates\", \"upperError\"], \n", " ]\n", ")\n", "plt.ylabel(\"Vaccination rate (%)\")\n", "plt.xticks(rotation = 45, ha = \"right\")\n", "plt.title(\"Texas, United States\")\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Access Policy data\n", "\n", "`LocationPolicySummary` stores COVID-19 social distancing and health policies and regulations enacted by US states. See the [API documentation](https://c3.ai/covid-19-api-documentation/#tag/LocationPolicySummary) for more details. \n", "
 \n", "\n", "`PolicyDetail` stores country-level policy responses to COVID-19, including: \n", "* Financial sector policies (from The World Bank: Finance Related Policy Responses to COVID-19), \n", "* Containment and closure, economic, and health system policies (from University of Oxford: Coronavirus Government Response Tracker, OxCGRT), and \n", "* Policies in South Korea (from Data Science for COVID-19: South Korea).\n", " \n", "See the [API documentation](https://c3.ai/covid-19-api-documentation/#tag/PolicyDetail) for more details." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "policy_united_states = c3aidatalake.fetch(\n", " \"locationpolicysummary\",\n", " {\n", " \"spec\" : {\n", " \"filter\" : \"contains(location.id, 'UnitedStates')\",\n", " \"limit\" : -1\n", " }\n", " }\n", ")\n", "\n", "policy_united_states" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use the `AllVersionsForPolicy` API of `LocationPolicySummary` to access historical and current versions of a policy." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "versions = c3aidatalake.read_data_json(\n", " \"locationpolicysummary\",\n", " \"allversionsforpolicy\",\n", " {\n", " \"this\" : {\n", " \"id\" : \"Wisconsin_UnitedStates_Policy\"\n", " }\n", " }\n", ")\n", "\n", "pd.json_normalize(versions)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fetch all school-related policies with a value of 3 from the OxCGRT dataset in `PolicyDetail`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "school_policy = c3aidatalake.fetch(\n", " \"policydetail\",\n", " {\n", " \"spec\" : {\n", " \"filter\": \"contains(lowerCase(name), 'school') && value == 3 && origin == 'University of Oxford'\",\n", " \"limit\": -1\n", " }\n", " }\n", ")\n", "\n", "school_policy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Access LaborDetail data\n", "\n", "`LaborDetail` stores historical monthly labor force and employment data for US counties and states from the US Bureau of Labor Statistics. See the [API documentation](https://c3.ai/covid-19-api-documentation/#tag/LaborDetail) for more details. \n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Fetch the unemployment rates of counties in California in March, 2020\n", "labordetail = c3aidatalake.fetch(\n", " \"labordetail\",\n", " {\n", " \"spec\": {\n", " \"filter\": \"year == 2020 && month == 3 && contains(parent, 'California_UnitedStates')\"\n", " }\n", " }\n", ")\n", "\n", "labordetail" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Access Survey data\n", "\n", "`SurveyData` stores COVID-19-related public opinion, demographic, and symptom prevalence data collected from COVID-19 survey responses. See the [API documentation](https://c3.ai/covid-19-api-documentation/#tag/SurveyData) for more details. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "# Fetch participants who are located in California and who have a relatively strong intent to wear a mask in public because of COVID-19\n", "survey = c3aidatalake.fetch(\n", " \"surveydata\",\n", " {\n", " \"spec\": {\n", " \"filter\": \"location == 'California_UnitedStates' && coronavirusIntent_Mask >= 75\"\n", " }\n", " },\n", " get_all = True\n", ")\n", "\n", "survey" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot the results." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "employment_df = survey.copy()\n", "employment_df[\"coronavirusEmployment\"] = employment_df[\"coronavirusEmployment\"].str.split(\", \")\n", "employment_df = employment_df.explode(\"coronavirusEmployment\")\n", "employment_df = employment_df.groupby([\"coronavirusEmployment\"]).agg(\"count\")[[\"id\"]].sort_values(\"id\")\n", "\n", "# Plot the data\n", "plt.figure(figsize = (10, 6))\n", "plt.bar(employment_df.index, 100 * employment_df[\"id\"] / len(survey))\n", "plt.xticks(rotation = 90)\n", "plt.xlabel(\"Response to employment status question\")\n", "plt.ylabel(\"Proportion of participants (%)\")\n", "plt.title(\"Employment status of CA participants with strong intent to wear mask\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 4 }