{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Downtime Challenge | Exercise 2\n", "\n", "## 0. Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import plotly.express as px\n", "import re\n", "from datetime import datetime, date, timedelta" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sys\n", "sys.path.append(\"..\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from data.utils import get_days_index\n", "\n", "all_days = get_days_index(250)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sqlite3\n", "\n", "conn = sqlite3.connect(\"../data/dbs/Ex2.db\")\n", "c = conn.cursor()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Introduction\n", "\n", "In the last exercise, we examined some isolated data downtime incidents in the `EXOPLANETS` table. While the techniques from that exercise are helpful, in practice data downtime involves data infrastructure with more than one table. So, in this exercise, we'll look at a scenario where multiple tables interact.\n", "\n", "### `EXOPLANETS`\n", "Let's once again consider our `EXOPLANETS` table, but at a later date than before. Now, our table has additional entries, and additional _fields_. We now record the planets' orbital eccentricity and the contents of their atmosphere." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "c.execute(\"PRAGMA table_info(EXOPLANETS);\")\n", "c.fetchall()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Let's take a look at our 10 most recent additions to the table:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.read_sql_query(\"SELECT * FROM EXOPLANETS ORDER BY DATE_ADDED DESC LIMIT 10\", conn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A database entry in `EXOPLANETS` contains the following info:\n", "\n", "0. `_id`: A UUID corresponding to the planet.\n", "1. `distance`: Distance from Earth, in lightyears.\n", "2. `g`: Surface gravity as a multiple of $g$, the gravitational force constant.\n", "3. `orbital_period`: Length of a single orbital cycle in days.\n", "4. `avg_temp`: Average surface temperature in degrees Kelvin.\n", "5. `date_added`: The date our system discovered the planet and added it automatically to our databases.\n", "6. `eccentricity`: The [orbital eccentricity](https://en.wikipedia.org/wiki/Orbital_eccentricity) of the planet about its host star.\n", "7. `atmosphere`: The dominant chemical makeup of the planet's atmosphere.\n", "\n", "Note that like `distance`, `g`, `orbital_period`, and `avg_temp`, both `eccentricity` and `atmosphere` may be `NULL` for a given planet as a result of missing or erroneous data. For example, [rogue planets](https://en.wikipedia.org/wiki/Rogue_planet) have undefined orbital eccentricity, and many planets don't have atmospheres at all.\n", "\n", "Note also that data is not backfilled, meaning data entries from the beginning of the table will not have `eccentricity` and `atmosphere` information. Check out all of these `None` values at the beginning of the table:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.read_sql_query(\n", " \"\"\"SELECT\n", " date_added,\n", " eccentricity,\n", " atmosphere\n", " FROM EXOPLANETS\n", " ORDER BY DATE_ADDED ASC\n", " LIMIT 10\"\"\",\n", " conn\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `EXOPLANETS_SCHEMA`\n", "\n", "Thankfully, we have been recording historical `table_info` on the `EXOPLANETS` table and collecting the results in a table called `EXOPLANETS_SCHEMA`, updated daily." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "c.execute(\"PRAGMA table_info(EXOPLANETS_SCHEMA);\")\n", "c.fetchall()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Querying the very beginning and end of `EXOPLANETS_SCHEMA`'s data reflects that `EXOPLANETS`'s metadata has changed since January 2020:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "exoplanets_schema_df = pd.read_sql_query(\"SELECT * FROM EXOPLANETS_SCHEMA\", conn)\n", "print(\"Was: \" + exoplanets_schema_df.iloc[0][\"schema\"])\n", "print(\"Is now: \" + exoplanets_schema_df.iloc[-1][\"schema\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Exercise: Understanding **Schema Change**\n", "\n", "When exactly did `EXOPLANETS` start recording new data? The metadata in `EXOPLANETS_SCHEMA` should tell us. See if you can write a SQL query that returns the date(s) the schema changed.\n", "\n", "To start, here's a query that returns each date and schema in the table:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SQL = \"\"\"\n", "SELECT\n", " DATE,\n", " SCHEMA\n", "FROM\n", " EXOPLANETS_SCHEMA\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "pd.read_sql_query(SQL, conn).head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using this query, and the trick with `LAG()` we utilized in the last notebook, try to get a table comparing a date's schema to the schema from the day prior." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# YOUR CODE HERE\n", "SQL = \"\"\"\n", "\n", "\"\"\"\n", "# END YOUR CODE" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "pd.read_sql_query(SQL, conn).head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should see something like this:\n", "![SegmentLocal](../data/assets/ex2img1.png \"segment\")\n", "\n", "Next, let's simply add a `WITH` statement like we did before, and return all of the rows where `SCHEMA` and `PAST_SCHEMA` are different!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# YOUR CODE HERE\n", "SQL = \"\"\"\n", "\n", "\"\"\"\n", "# END YOUR CODE" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "pd.read_sql_query(SQL, conn).head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A correct implementation should show a single date, **2020-07-19**. If you got that, nice work! This date will come in handy later, so keep it in mind." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. An additional table\n", "\n", "Now, we want to involve another table in our DB. `HABITABLES` records information about the habitability of exoplanets we've discovered. This table takes data from `EXOPLANETS` and other upstream tables and transforms it to produce a `habitability` index: a real number between 0 and 1 indicating how likely the planet is to harbor life." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "c.execute(\"PRAGMA TABLE_INFO(HABITABLES);\")\n", "c.fetchall()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.read_sql_query(\"SELECT * FROM HABITABLES LIMIT 10\", conn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Exercise: Visualizing Distribution Errors\n", "Like in exercise 1, I'll write a quick query assessing a **distributional** feature of the `HABITABILITY` table -- how habitable is the average planet we detect, as a function of the day it was detected?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SQL = \"\"\"\n", "SELECT\n", " DATE_ADDED,\n", " AVG(HABITABILITY) AS AVG_HABITABILITY\n", "FROM\n", " HABITABLES\n", "GROUP BY\n", " DATE_ADDED\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "avg_habitability = pd.read_sql_query(SQL, conn)\n", "avg_habitability = avg_habitability \\\n", " .rename(columns={clmn: clmn.lower() for clmn in avg_habitability.columns})\n", "avg_habitability = avg_habitability.set_index(\"date_added\")\n", "avg_habitability = avg_habitability.reindex(all_days)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = px.bar(x=all_days, y=avg_habitability[\"avg_habitability\"])\n", "fig.update_xaxes(title=\"Date\")\n", "fig.update_yaxes(title=\"AVG(habitability)\")\n", "fig.add_vline(x=\"2020-07-19\", line_color='red')\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I plotted the date of the schema change, 2020-07-19, in red as a visual aid. Clearly, unless our instruments are malfunctioning, something is wrong! The planets we're adding to the table *after* the schema change seem much less habitable on average. Using a `SQL` query below, see if you can figure out what exactly happened.\n", "\n", "*Hint*: When averages change, it's natural to look for occurrences of unusual values. When is `habitability` NULL, 0, or outside of the range $[0, 1]$? What about other fields in the table that might be related?\n", "\n", "Let's try looking at the zero rate to see if anything is unusual.\n", "\n", "As a hint, it's easy to find special conditions, like when a metric is equal to `0`, using a SQL [`CASE` statement](https://www.w3schools.com/sql/sql_case.asp). You'll probably want to aggregate all of these `CASE`s, using `SUM()`, and be sure to `CAST(... AS FLOAT)` if you're dividing anything. Try it out!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# YOUR CODE HERE\n", "SQL = \"\"\"\n", "\n", "\"\"\"\n", "# END YOUR CODE" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "zero_rate = pd.read_sql_query(SQL, conn)\n", "zero_rate = zero_rate \\\n", " .rename(columns={clmn: clmn.lower() for clmn in zero_rate.columns})\n", "zero_rate = zero_rate.set_index(\"date_added\")\n", "zero_rate = zero_rate.reindex(all_days)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = px.bar(x=all_days, y=zero_rate[\"zero_rate\"])\n", "fig.add_vline(x=\"2020-07-19\", line_color='red')\n", "fig.update_xaxes(title=\"Date\")\n", "fig.update_yaxes(title=\"Habitability Zero Rate\")\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With a small amount of digging, we can uncover something important -- the `habitability` index is never exactly 0 *before* the schema change, but afterwards we see the rate of 0s jump up to ~50%. This has the detected effect of dipping the average value of the field.\n", "\n", "Note that in practice, you should look for both -- measuring a field's **rate of zero values**, as well as its **average value**, can both help with identifying data downtime issues. As we saw last exercise, the **rate of null values** can also be helpful.\n", "\n", "# Great work!\n", "\n", "In the next exercise, we'll look at queries that span multiple tables, another step towards building intelligent data downtime systems." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.5" } }, "nbformat": 4, "nbformat_minor": 4 }