{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Altair basics with COVID-19 data\n",
    "\n",
    "This notebook is intended to show how to use some intermediate Altair functionality around filtering, tidying, and interactivity with an interesting public dataset.  It is explicitly _not_ intended to convey any epidemiological insight (raw case counts are not necessarily useful).\n",
    "\n",
    "## Data acquisition and basic manipulation in Pandas\n",
    "\n",
    "We're going to start with data from [covidtracking.com](https://covidtracking.com/).  This is an excellent resource that provides an API for curated, historical, state-by-state data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import altair as alt\n",
    "alt.data_transformers.enable('json')\n",
    "\n",
    "df = pd.read_csv(\"https://covidtracking.com/api/v1/states/daily.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We're first going to reformat the dates to add hyphens between the year, month, and day, so `20200228` becomes `2020-02-28`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def datemunge(di):\n",
    "    # Turn an integer date such as 20200228 into the string '2020-02-28'.\n",
    "    d = str(di)\n",
    "    return \"%s-%s-%s\" % (d[0:4], d[4:6], d[6:8])\n",
    "\n",
    "cleaned = df.copy()\n",
    "\n",
    "cleaned[\"date\"] = cleaned[\"date\"].apply(datemunge)"
   ]
  },
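  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an aside, the same reformatting can probably be done without a hand-written helper. The sketch below is not part of the original workflow: it assumes the dates are integers in `YYYYMMDD` form, uses a made-up two-row frame called `sample`, and relies on `pd.to_datetime` with an explicit `format`.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy frame with integer dates in the same YYYYMMDD layout.\n",
    "sample = pd.DataFrame({'date': [20200228, 20200409]})\n",
    "\n",
    "# Parse the integers as dates, then format them with hyphens.\n",
    "parsed = pd.to_datetime(sample['date'].astype(str), format='%Y%m%d')\n",
    "sample['date'] = parsed.dt.strftime('%Y-%m-%d')\n",
    "\n",
    "print(sample['date'].tolist())  # ['2020-02-28', '2020-04-09']\n",
    "```\n",
    "\n",
    "With an explicit format, `pd.to_datetime` raises an error on values that do not match, whereas plain string slicing would pass them through unchanged."
   ]
  },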
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
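    "We'll then `melt` the data frame so that each observation is in its own row. For example, a wide row with the columns `state, date, positive, negative, hospitalized, icu` becomes several long rows with the columns `state, date, observation_type, observation_value`, where `observation_type` is one of `positive`, `negative`, `hospitalized`, or `icu`. In the code below, the observation type column is named `case type` and the observation value column is named `cases`."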
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Metadata columns that should not be treated as observations.\n",
    "non_observation_cols = ['date', 'state', 'fips', 'hash', 'dateChecked', 'lastUpdateEt', 'dataQualityGrade']\n",
    "\n",
    "cleaned = pd.melt(cleaned,\n",
    "                  id_vars=['date', 'state', 'fips'],\n",
    "                  value_vars=[c for c in df.columns if c not in non_observation_cols],\n",
    "                  value_name=\"cases\",\n",
    "                  var_name=\"case type\")\n"
   ]
  },
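  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the wide-to-long reshaping concrete, here is a minimal sketch on a toy frame. The frame `wide` and its numbers are made up for illustration and are not taken from the dataset; only `pd.melt` and the column names `case type` and `cases` mirror the cell above.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical wide-form data: one row per state and date, one column per case type.\n",
    "wide = pd.DataFrame({\n",
    "    'date': ['2020-04-09', '2020-04-09'],\n",
    "    'state': ['WI', 'MN'],\n",
    "    'positive': [10, 20],\n",
    "    'death': [1, 2],\n",
    "})\n",
    "\n",
    "# Long form: one row per (date, state, case type) combination.\n",
    "long = pd.melt(wide,\n",
    "               id_vars=['date', 'state'],\n",
    "               value_vars=['positive', 'death'],\n",
    "               var_name='case type',\n",
    "               value_name='cases')\n",
    "\n",
    "print(long)\n",
    "```\n",
    "\n",
    "The two wide rows become four long rows, one for each combination of state and case type."
   ]
  },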
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see the difference between these representations by looking at the source data (`df`) for Wisconsin on April 9th and the melted data (`cleaned`) for Wisconsin on April 9th."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[(df[\"state\"] == \"WI\") & (df[\"date\"] == 20200409)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cleaned[(cleaned[\"state\"] == \"WI\") & (cleaned[\"date\"] == \"2020-04-09\")].dropna()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Plotting per-state results\n",
    "\n",
    "The next cell shows a function that operates on our `cleaned` (long-form) data frame to produce a chart of results for a specific state.  We're using Altair's `transform_filter` method to postprocess the cleaned data to select\n",
    "\n",
    "1. only observations about a given state,\n",
    "2. only observations with a positive case count (which also excludes zero and `NaN` values), and\n",
    "3. only observations of one of a set of case types."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def cases_for_state(state, show_points=False):\n",
    "    case_types = ['death', 'positive', 'hospitalizedCumulative', 'inIcuCumulative']\n",
    "    chart = alt.Chart(cleaned).\\\n",
    "                encode(alt.X(\"date:N\"), \n",
    "                       alt.Y(\"cases\", scale=alt.Scale(type=\"log\")), \n",
    "                       alt.Color(\"case type\", \n",
    "                                 sort=alt.EncodingSortField(field=\"cases\", \n",
    "                                                            order=\"descending\", \n",
    "                                                            op=\"max\")),\n",
    "                       tooltip=['date', 'state', 'case type', 'cases']).\\\n",
    "                transform_filter(alt.datum.state == state).\\\n",
    "                transform_filter(alt.datum.cases > 0).\\\n",
    "                transform_filter(alt.FieldOneOfPredicate(\"case type\", case_types))\n",
    "    \n",
    "    return (chart.mark_line() + chart.mark_point()) if show_points else chart.mark_line()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cases_for_state(\"WI\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Filtering in Pandas\n",
    "\n",
    "Of course, we could instead use Pandas to build a data frame that contains only the rows we care about for a given state, like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "case_types = ['death', 'positive', 'hospitalizedCumulative', 'inIcuCumulative']\n",
    "\n",
    "wi_cases = cleaned[(cleaned[\"state\"] == \"WI\") &\n",
    "                   (cleaned[\"case type\"].isin(case_types)) &\n",
    "                   (pd.to_numeric(cleaned[\"cases\"], errors=\"coerce\") > 0)]\n",
    "\n",
    "alt.Chart(\n",
    "    wi_cases\n",
    "  ).mark_line(\n",
    "  ).encode(\n",
    "    alt.X(\"date:N\"), \n",
    "    alt.Y(\"cases\", scale=alt.Scale(type=\"log\")), \n",
    "    alt.Color(\"case type\", \n",
    "              sort=alt.EncodingSortField(field=\"cases\", \n",
    "                                         order=\"descending\", \n",
    "                                         op=\"max\"))\n",
    "  )\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Filtering our data in Altair can be more convenient, though, and it enables interactive charting, as we'll see shortly.\n",
    "\n",
    "### Fold transformations in Altair\n",
    "\n",
    "We used the `melt` function in Pandas to go from a wide-form table (`df`), in which each observation type is a column, to a long-form table (`cleaned`), in which each observation is a row.\n",
    "\n",
    "We can also do this transformation in Altair, with the `transform_fold` method, as in the next cell.  The `fold` parameter takes a list of columns to break out into new observation types, and the `as_` parameter takes a two-element list: the name of the observation type column (whose values are the names of the columns from `fold`) and the name of the observation value column (whose values are the values of the columns from `fold`).\n",
    "\n",
    "As a bonus, we'll also convert from \"date integers\" of the form `20200410` to actual date-time objects in Altair (instead of using `DataFrame.apply`).  We'll construct these from three parts: the year is the date value divided by 10,000, rounded down; the month is the remainder after dividing by 10,000, then divided by 100 and rounded down; and the day is the remainder after dividing by 100.  (Since Vega dates use zero-indexed months, we'll also have to subtract one from the month.  Phew!)\n",
    "\n",
    "This will turn into a Vega expression that we can pass into Altair's `transform_calculate` method, and that looks like this:\n",
    "\n",
    "```\n",
    "alt.expr.datetime(\n",
    "   alt.expr.floor(alt.datum.date / 10000), # year\n",
    "   alt.expr.floor(alt.datum.date % 10000 / 100) - 1, # (zero-based) month\n",
    "   alt.datum.date % 100 # day\n",
    ")\n",
    "```\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def cases_for_state_folded(state, show_points=False):\n",
    "    case_types = ['death', 'positive', 'hospitalizedCumulative', 'inIcuCumulative']\n",
    "    cleaned_date = alt.expr.datetime(alt.expr.floor(alt.datum.date / 10000), # year\n",
    "                                     alt.expr.floor(alt.datum.date % 10000 / 100) - 1, # (zero-based) month\n",
    "                                     alt.datum.date % 100) # day\n",
    "    chart = alt.Chart(df).\\\n",
    "                encode(alt.X(\"monthdate(cleandate):O\", title=\"date\"), \n",
    "                       alt.Y(\"cases:Q\", scale=alt.Scale(type=\"log\")), \n",
    "                       alt.Color(\"case type:N\", \n",
    "                                 sort=alt.EncodingSortField(field=\"cases\", \n",
    "                                                            order=\"descending\", \n",
    "                                                            op=\"max\")),\n",
    "                       tooltip=['yearmonthdate(cleandate)', 'state', 'case type:N', 'cases:Q']).\\\n",
    "                transform_filter(alt.datum.state == state).\\\n",
    "                transform_calculate(\n",
    "                    cleandate=cleaned_date\n",
    "                ).\\\n",
    "                transform_fold(\n",
    "                    as_=[\"case type\", \"cases\"],\n",
    "                    fold=case_types\n",
    "                ).\\\n",
    "                transform_filter(alt.datum.cases > 0)\n",
    "                \n",
    "    return (chart.mark_line() + chart.mark_point()) if show_points else chart.mark_line()\n",
    "\n",
    "\n",
    "cases_for_state_folded(\"WI\")"
   ]
  },
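  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The date arithmetic used above can be sanity-checked in plain Python. This is only an illustration with the example value `20200410`: integer division (`//`) stands in for `alt.expr.floor`, and Python's `datetime.date` uses one-based months, so no subtraction is needed here.\n",
    "\n",
    "```python\n",
    "from datetime import date\n",
    "\n",
    "date_int = 20200410\n",
    "\n",
    "year = date_int // 10000           # floor(20200410 / 10000) -> 2020\n",
    "month = (date_int % 10000) // 100  # floor(410 / 100) -> 4\n",
    "day = date_int % 100               # -> 10\n",
    "\n",
    "print(date(year, month, day))      # prints 2020-04-10\n",
    "```\n",
    "\n",
    "In the Vega expression the month must additionally be reduced by one, because Vega counts months from zero."
   ]
  },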
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## An interactive per-state plot\n",
    "\n",
    "We can also use Altair's [selection support](https://altair-viz.github.io/user_guide/interactions.html) to make an interactive chart that lets us choose which state to plot cases for."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def interactive_cases_for_state():\n",
    "    case_types = ['death', 'positive', 'hospitalizedCumulative', 'inIcuCumulative']\n",
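    "    # Offer only states that have at least one positive case count.\n",
    "    has_positive = (pd.to_numeric(cleaned[\"cases\"], errors=\"coerce\") > 0) & (cleaned[\"case type\"] == \"positive\")\n",
    "    state_options = cleaned[has_positive][\"state\"].sort_values().unique()\n",
    "    input_dropdown = alt.binding_select(options=state_options)\n",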
    "    selection = alt.selection_single(fields=['state'], bind=input_dropdown, name='Choose', init={\"state\":\"AK\"})\n",
    "\n",
    "    chart = alt.Chart(cleaned).\\\n",
    "                encode(alt.X(\"date:N\"), \n",
    "                       alt.Y(\"cases\", scale=alt.Scale(type=\"log\")), \n",
    "                       alt.Color(\"case type\", \n",
    "                                 sort=alt.EncodingSortField(field=\"cases\", \n",
    "                                                            order=\"descending\", \n",
    "                                                            op=\"max\")),\n",
    "                       tooltip=['date', 'state', 'case type', 'cases']).\\\n",
    "                transform_filter(selection).\\\n",
    "                transform_filter(alt.datum.cases > 0).\\\n",
    "                transform_filter(alt.FieldOneOfPredicate(\"case type\", case_types)).\\\n",
    "                add_selection(selection)\n",
    "    \n",
    "    return chart.mark_line()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "interactive_cases_for_state()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Plotting cases by state on a map\n",
    "\n",
    "To plot case counts on a map, we'll need to integrate geographic data (the shapes of states as GeoJSON polygons) with our observations.\n",
    "\n",
    "We'll pull down state shapes from a public data source that has both state and county data, using Altair's `topo_feature` function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "states = alt.topo_feature(\"https://vega.github.io/vega-datasets/data/us-10m.json\", \"states\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To plot total case counts per state, we'll make a choropleth in Altair, which means we need to join the case counts with the state shapes.  The state shapes are keyed by [FIPS numeric state codes](https://en.wikipedia.org/wiki/Federal_Information_Processing_Standard_state_code), not by alphabetical state codes.  We have the FIPS codes in the source data as `fips`, so we'll use Altair's `transform_lookup` method to indicate that we want to take the case count, case type, state, and date from a postprocessed data frame where the `fips` field matches the `id` field in our state collection."
   ]
  },
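  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Conceptually, the lookup described above behaves much like a left join. The sketch below shows a rough Pandas analogy, not what Altair runs internally: the frames `shapes` and `counts` and the case numbers are made up, while the FIPS codes (55 for Wisconsin, 27 for Minnesota, 2 for Alaska) are real.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical stand-in for the state shapes, keyed by numeric FIPS id.\n",
    "shapes = pd.DataFrame({'id': [55, 27, 2]})\n",
    "\n",
    "# Hypothetical case counts, keyed by the same codes in a column named fips.\n",
    "counts = pd.DataFrame({\n",
    "    'fips': [55, 27],\n",
    "    'state': ['WI', 'MN'],\n",
    "    'cases': [100, 200],\n",
    "})\n",
    "\n",
    "# Keep every shape and attach the matching case data where it exists.\n",
    "joined = shapes.merge(counts, how='left', left_on='id', right_on='fips')\n",
    "\n",
    "print(joined[['id', 'state', 'cases']])\n",
    "```\n",
    "\n",
    "The shape with id 2 has no matching row in `counts`, so its `state` and `cases` values are missing after the join."
   ]
  },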
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "ctrim = cleaned[(cleaned[\"case type\"] == \"positive\") & (cleaned[\"date\"] == cleaned[\"date\"].max())].copy()\n",
    "\n",
    "alt.Chart(\n",
    "    states\n",
    "    ).mark_geoshape(\n",
    "    ).encode(\n",
    "        color='cases:Q',\n",
    "        tooltip=['state:N', 'cases:Q', 'date:N']\n",
    "    ).transform_lookup(\n",
    "        lookup='id',\n",
    "        from_=alt.LookupData(ctrim, 'fips', ['cases', 'case type', 'state', 'date'])\n",
    "    ).project(\n",
    "        type='albersUsa'\n",
    "    ).properties(\n",
    "        width=500, height=400\n",
    "    )\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}