{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Downtime Challenge | Exercise 1\n", "\n", "## 0. Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import plotly.express as px\n", "import re\n", "from datetime import datetime, date, timedelta" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sys\n", "sys.path.append(\"..\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from data.utils import get_days_index\n", "\n", "all_days = get_days_index(200)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sqlite3\n", "\n", "conn = sqlite3.connect(\"../data/dbs/Ex1.db\")\n", "c = conn.cursor()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Introduction\n", "\n", "Welcome to the Data Downtime Challenge! In this exercise, we'll learn Data Observability through practice on some sample datasets. Each subproblem will ask you to craft some `SQL` queries that help us learn about the state of our tables and identify Data Downtime issues.\n", "\n", "For these exercises, we'll be using mock astronomical data to identify habitable planets.\n", "\n", "![SegmentLocal](../data/assets/planets.gif \"segment\")\n", "\n", "The `Ex1.db` database contains a single table, `EXOPLANETS`, which has information on nearly 2000 exoplanets across the Milky Way Galaxy." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "c.execute(\"PRAGMA table_info(EXOPLANETS);\")\n", "c.fetchall()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A database entry in `EXOPLANETS` contains the following info:\n", "\n", "0. `_id`: A UUID corresponding to the planet.\n", "1. `distance`: Distance from Earth, in lightyears.\n", "2. `g`: Surface gravity as a multiple of $g$, the gravitational force constant.\n", "3. `orbital_period`: Length of a single orbital cycle in days.\n", "4. `avg_temp`: Average surface temperature in degrees Kelvin.\n", "5. `date_added`: The date our system discovered the planet and added it automatically to our databases.\n", "\n", "Note that one or more of `distance`, `g`, `orbital_period`, and `avg_temp` may be `NULL` for a given planet as a result of missing or erroneous data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.read_sql_query(\"SELECT * FROM EXOPLANETS LIMIT 10\", conn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Exercise: Visualizing Freshness" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Grouping by the `DATE_ADDED` column can give us insight into how `EXOPLANETS` updates daily. For example, we can query for the number of new IDs added per day:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SQL = \"\"\"\n", "SELECT\n", " DATE_ADDED,\n", " COUNT(*) AS ROWS_ADDED\n", "FROM\n", " EXOPLANETS\n", "GROUP BY\n", " DATE_ADDED\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rows_added = pd.read_sql_query(SQL, conn)\n", "rows_added = rows_added.rename(columns={clmn: clmn.lower() for clmn in rows_added.columns})\n", "rows_added = rows_added.set_index(\"date_added\")\n", "rows_added = rows_added.reindex(all_days)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like `EXOPLANETS` typically updates with around 100 new entries each day. Something looks off in a few places, though. We have what we'd call a **freshness** incident -- on a couple of occasions, the table doesn't update at all for 3 or more days. It has \"stale\" (3+ day old) data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = px.bar(x=all_days, y=rows_added[\"rows_added\"])\n", "fig.update_xaxes(title=\"Date\")\n", "fig.update_yaxes(title=\"Rows Added\")\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this exercise, we'll try writing some `SQL` code that returns timestamps for when freshness incidents occur. Feel free to use the query above as a starting point.\n", "\n", "An example solution is given in `solutions/exercise_1.ipynb`, if needed for comparison.\n", "\n", "To start, let's just copy the `SQL` statement from above, which gives the count of entries added per day." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# YOUR CODE HERE\n", "SQL = \"\"\"\n", "\n", "\"\"\"\n", "# END YOUR CODE" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "pd.read_sql_query(SQL, conn).head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Verify that your output looks like this:\n", "![SegmentLocal](../data/assets/ex1img1.png \"segment\")\n", "\n", "Great! Since we've grouped by the `DATE_ADDED` field, we now have one row entry for each day where data came in. As a next step, let's devise a way to compare adjacent dates in our grouped output. For example, in row 1 above, we'd like to know that the previous date (on row 0) was `2021-01-01`.\n", "\n", "A great way to compare adjacent rows in SQL is to use the [`LAG` window function](https://www.sqltutorial.org/sql-window-functions/sql-lag/). Also, you can try including our data from above using [SQL's `WITH` prefix](https://modern-sql.com/feature/with)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# YOUR CODE HERE\n", "SQL = \"\"\"\n", "\n", "\"\"\"\n", "# END YOUR CODE" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "pd.read_sql_query(SQL, conn).head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check that your output looks like this:\n", "![SegmentLocal](../data/assets/ex1img2.png \"segment\")\n", "\n", "Awesome! The ability to compare adjacent dates is crucial for detecting stale data. Our next step is this: given two adjacent dates, calculate the *difference in days* between those dates. We're answering the question, \"How many days old is the previous batch?\"\n", "\n", "Since we're in SQLite, we can cast our strings into dates with `JULIANDAY()`, and [easily find the difference between them](https://www.w3resource.com/sqlite/sqlite-julianday.php)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# YOUR CODE HERE\n", "SQL = \"\"\"\n", "\n", "\"\"\"\n", "# END YOUR CODE" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "pd.read_sql_query(SQL, conn).head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "See if you can get something looking like this:\n", "![SegmentLocal](../data/assets/ex1img3.png \"segment\")\n", "\n", "Well done! We're basically all the way there. Recall that our original task was to identify **freshness incidents** -- that is, dates where the previous data entry is more than 1 day old. After adding another `WITH` statement and a `WHERE` clause, our query should be able to do just that." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# YOUR CODE HERE\n", "SQL = \"\"\"\n", "\n", "\"\"\"\n", "# END YOUR CODE" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "detections = pd.read_sql_query(SQL, conn)\n", "detections" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If your result looks like this:\n", "![SegmentLocal](../data/assets/ex1img4.png \"segment\")\n", "\n", "then congratulations! You've built a detector for **freshness incidents**, a key part of any data observability solution. With the following code, you can visualize your detections along with the update data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for _, row in detections.iterrows():\n", " fig.add_vline(x=row['DATE_ADDED'], line_color='red')\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the next exercise, we'll build off of these simpler reports to handle scenarios with multiple tables and lineage information." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.5" } }, "nbformat": 4, "nbformat_minor": 4 }