{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Data Quality Bootcamp | Data Cleaning Demo\n", "Ryan Othniel Kearns | \n", "\n", "16 March 2023\n", "\n", "___\n", "\n", "Welcome to the Data Quality Bootcamp! In this demo, we demonstrate some best practices for data cleaning, and illustrate why they're important for data quality. We'll use a sample dataset containing information about exoplanets." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import plotly.express as px\n", "import plotly.io as pio\n", "\n", "pio.renderers.default = \"plotly_mimetype+notebook\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sys\n", "sys.path.append(\"..\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Let's connect to our database.\n", "\n", "We're making use of `sqlite3` and `pandas` in python to be able to fluidly display SQL query results in a notebook format, with easy visualizations. If you're not familiar with python, don't worry about the scaffolding that's going on -- the SQL is the \"meat\" of this presentation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sqlite3\n", "\n", "conn = sqlite3.connect(\"../data/dbs/strata.db\")\n", "cur = conn.cursor()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def setup_local_db():\n", " import random\n", " import uuid\n", "\n", " for i in range(20): cur.execute(f\"\"\"--sql\n", " INSERT INTO exoplanets\n", " VALUES (\n", " '{str(uuid.uuid4())}',\n", " {random.random() * 200},\n", " {random.random() * 15},\n", " {random.random() * 500},\n", " {random.random() * 150},\n", " '2020-03-05',\n", " {random.random()},\n", " 'co2'\n", " )\n", " ;\n", " \"\"\").fetchall()\n", " \n", " for i in range(57): cur.execute(f\"\"\"--sql\n", " INSERT INTO exoplanets\n", " VALUES (\n", " '{str(uuid.uuid4())}',\n", " {random.random() * 200},\n", " {random.random() * 15},\n", " {random.random() * 500},\n", " {random.random() * 150},\n", " '2020-03-05',\n", " {random.random()},\n", " 'o2'\n", " )\n", " ;\n", " \"\"\").fetchall()\n", " \n", " for i in range(18): cur.execute(f\"\"\"--sql\n", " INSERT INTO exoplanets\n", " VALUES (\n", " '{str(uuid.uuid4())}',\n", " {random.random() * 200},\n", " {random.random() * 15},\n", " {random.random() * 500},\n", " {random.random() * 150},\n", " '2020-03-05',\n", " {random.random()},\n", " 'n2'\n", " )\n", " ;\n", " \"\"\").fetchall()\n", " \n", " for i in range(23): cur.execute(f\"\"\"--sql\n", " INSERT INTO exoplanets\n", " VALUES (\n", " '{str(uuid.uuid4())}',\n", " {random.random() * 200},\n", " {random.random() * 15},\n", " {random.random() * 500},\n", " {random.random() * 150},\n", " '2020-03-05',\n", " {random.random()},\n", " 'h2so4'\n", " )\n", " ;\n", " \"\"\").fetchall()\n", " \n", "setup_local_db()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We can take a look at our main dataset, called `exoplanets`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql = \"\"\"--sql\n", "\n", "SELECT * FROM exoplanets LIMIT 5;\n", "\n", "\"\"\"\n", "\n", "pd.read_sql_query(sql, conn)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "What's the schema of this table?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql = \"\"\"--sql\n", "\n", "-- note that this syntax is specific to SQLite... in Snowflake for example, we would use `DESCRIBE TABLE EXOPLANETS;`\n", "PRAGMA TABLE_INFO(exoplanets);\n", "\n", "\"\"\"\n", "\n", "pd.read_sql_query(sql, conn)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Hmm, so some questions immediately pop up about this dataset. Let's go through them formulaically to ensure clean data!\n", "\n", "## 1. Handling Nulls\n", "\n", "First: The `notnull` column in our table info says that none of the columns are non-null (that is, enforced to be non-null by the database). That seems bad, especially for a column name like `_id`, which presumably identifies the planet each row represents in our system.\n", "\n", "Are there actually null values present for every one of these columns?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql = \"\"\"--sql\n", "\n", "SELECT COUNT(*) AS count_null_id\n", "FROM exoplanets\n", "WHERE _id IS NULL;\n", "\n", "\"\"\"\n", "\n", "pd.read_sql_query(sql, conn)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Phew! No null `_id`s!\n", "\n", "How about other columns?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql = \"\"\"--sql\n", "\n", "SELECT COUNT(*) AS count_null_distance\n", "FROM exoplanets\n", "WHERE distance IS NULL;\n", "\n", "\"\"\"\n", "\n", "pd.read_sql_query(sql, conn)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "So we have some nulls for the `distance` field... how many as a share of the total?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql = \"\"\"--sql\n", "\n", "SELECT\n", " CAST(SUM(\n", " CASE\n", " WHEN distance IS NULL THEN 1\n", " ELSE 0\n", " END\n", " ) AS FLOAT) / COUNT(*) AS distance_null_rate\n", "FROM exoplanets;\n", "\n", "\"\"\"\n", "\n", "pd.read_sql_query(sql, conn)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "5% is a workable number of nulls for me. Let's get a sense of how the distances are distributed so we can decide on what to do about this pesky share of null data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql = \"\"\"--sql\n", "\n", "SELECT distance\n", "FROM exoplanets\n", "WHERE distance IS NOT NULL;\n", "\n", "\"\"\"\n", "\n", "px.histogram(pd.read_sql_query(sql, conn))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Looks like `distance` can't be a negative value (that's good...) and is otherwise distributed log-normally. (For non-statisticians, this is what we'd expect from a natural phenomena that can take on only positive values.)\n", "\n", "Anyway! There are three things we can do with `null` values when data cleaning:\n", "1. **Drop** the rows with null values.\n", "2. **Interpolate** the values using the distribution of non-null data.\n", "3. 
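, "\n", "In fact, #1 is just the `IS NOT NULL` filter we used for the histogram above. As a minimal sketch, here's the count of rows that would survive the drop:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql = \"\"\"--sql\n", "\n", "-- Approach #1: simply drop the rows where distance is null\n", "SELECT COUNT(*) AS rows_kept\n", "FROM exoplanets\n", "WHERE distance IS NOT NULL;\n", "\n", "\"\"\"\n", "\n", "pd.read_sql_query(sql, conn)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [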
"Alternatively (#2), we could replace any null `distance` value with the mean distance. That seems pretty sensible, right? It lets us use 100% of our data, though the interpolation will now bias our log-normal distribution. Take a look:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql = \"\"\"--sql\n", "\n", "SELECT\n", "    CASE\n", "        WHEN distance IS NULL THEN average_distance\n", "        ELSE distance\n", "    END AS distance_interpolated\n", "FROM exoplanets\n", "CROSS JOIN (SELECT AVG(distance) AS average_distance FROM exoplanets);\n", "\n", "\"\"\"\n", "\n", "px.histogram(pd.read_sql_query(sql, conn))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Option #3 is to replace null `distance`s with some other default value. For example, 0 might be a sensible default, depending on your use case. This approach will also introduce bias into your data, though:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql = \"\"\"--sql\n", "\n", "SELECT\n", "    CASE\n", "        WHEN distance IS NULL THEN 0\n", "        ELSE distance\n", "    END AS distance_with_default\n", "FROM exoplanets;\n", "\n", "\"\"\"\n", "\n", "px.histogram(pd.read_sql_query(sql, conn))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Validating Categorical Values\n", "\n", "Above, we did some interesting data cleaning on a _numerical_ column, `distance`. But should we worry about cleaning other kinds of data too? As it turns out, yes -- the other type of data in our table is _categorical_ data: for example, the `atmosphere` column:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql = \"\"\"--sql\n", "\n", "SELECT atmosphere\n", "FROM exoplanets\n", "WHERE atmosphere IS NOT NULL\n", "LIMIT 5;\n", "\n", "\"\"\"\n", "\n", "pd.read_sql_query(sql, conn)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Looks like this column holds text representing a chemical compound -- probably the most abundant chemical in that planet's atmosphere.\n", "\n", "Notice that `atmosphere` is null a large share of the time:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql = \"\"\"--sql\n", "\n", "SELECT\n", "    CAST(SUM(\n", "        CASE\n", "            WHEN atmosphere IS NULL THEN 1\n", "            ELSE 0\n", "        END\n", "    ) AS FLOAT) / COUNT(*) AS atmosphere_null_rate\n", "FROM exoplanets;\n", "\n", "\"\"\"\n", "\n", "pd.read_sql_query(sql, conn)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Almost all of the time! What gives?\n", "\n", "Let's look at the distinct categorical values to figure out why:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql = \"\"\"--sql\n", "\n", "SELECT atmosphere, COUNT(*) AS num_planets\n", "FROM exoplanets\n", "GROUP BY 1;\n", "\n", "\"\"\"\n", "\n", "pd.read_sql_query(sql, conn)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "So, first observation: there's no option for \"No atmosphere\" among these values. It's likely that `null` in this case records that the exoplanet simply doesn't have an atmosphere.\n"
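, "\n", "We can't prove that reading from the data alone, but here's a quick hedged check -- a sketch using only columns we've already seen, which can flag problems even though only the data owner can confirm what `null` really means here. If `null` meant \"measurement failed\" rather than \"no atmosphere\", we might expect other fields, like `distance`, to be disproportionately null in those rows too:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql = \"\"\"--sql\n", "\n", "-- Compare null-atmosphere rows against the rest: are their other fields\n", "-- (e.g. distance) disproportionately null as well?\n", "SELECT\n", "    atmosphere IS NULL AS atmosphere_is_null,\n", "    COUNT(*) AS num_planets,\n", "    SUM(CASE WHEN distance IS NULL THEN 1 ELSE 0 END) AS num_null_distances\n", "FROM exoplanets\n", "GROUP BY 1;\n", "\n", "\"\"\"\n", "\n", "pd.read_sql_query(sql, conn)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [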
"If we know that to be the case (we shouldn't just assume it!), then we should replace these `null`s with a default value, \"no_atmosphere\":" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql = \"\"\"--sql\n", "\n", "WITH atmosphere_default_none AS (\n", "    SELECT\n", "        CASE\n", "            WHEN atmosphere IS NULL THEN 'no_atmosphere'\n", "            ELSE atmosphere\n", "        END AS atmosphere\n", "    FROM exoplanets\n", ")\n", "\n", "SELECT atmosphere, COUNT(*) AS num_planets\n", "FROM atmosphere_default_none\n", "GROUP BY 1;\n", "\n", "\"\"\"\n", "\n", "pd.read_sql_query(sql, conn)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Second observation: we have a casing problem with our categorical values! Some of them are lowercase and some are not. If we leave the data like this, we'll be ignoring something crucial about the values -- namely, that `O2` and `o2` are the same chemical compound.\n", "\n", "Let's fix it:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql = \"\"\"--sql\n", "\n", "WITH atmosphere_default_none AS (\n", "    SELECT\n", "        CASE\n", "            WHEN atmosphere IS NULL THEN 'NO_ATMOSPHERE'\n", "            ELSE UPPER(atmosphere)\n", "        END AS atmosphere\n", "    FROM exoplanets\n", ")\n", "\n", "SELECT atmosphere, COUNT(*) AS num_planets\n", "FROM atmosphere_default_none\n", "GROUP BY 1;\n", "\n", "\"\"\"\n", "\n", "pd.read_sql_query(sql, conn)" ] } ], "metadata": { "kernelspec": { "display_name": "env", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "82e5b94668ff4d8d85450e0fe9248e3086d7fda13f258d0873b6f11117a6aadc" } } }, "nbformat": 4, "nbformat_minor": 2 }