{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.display import YouTubeVideo\n", "\n", "YouTubeVideo(id=\"SdbKs-crm-g\", width=\"100%\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By this point in the book, you should have observed\n", "that we have written a number of _tests_ for our data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Why test?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### If you like it, put a ring on it...\n", "\n", "...and if you rely on it, test it.\n", "\n", "I am personally a proponent of writing tests for our data\n", "because, for us as data scientists,\n", "the fields of our data, and their correct values,\n", "form the \"data programming interface\" (DPI)\n", "much like function signatures form\n", "the \"application programming interface\" (API).\n", "Since we test the APIs that we rely on,\n", "we probably should test the DPIs that we rely on too." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What to test\n", "\n", "Deciding which parts of the data to test\n", "can be confusing.\n", "After all, data are seemingly generated\n", "from random processes\n", "(my Bayesian foxtail has been revealed),\n", "and it seems difficult to test random processes.\n", "\n", "That said, from my experience handling data,\n", "I can suggest a few principles."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Test invariants\n", "\n", "Firstly, we test __invariant properties__ of the data.\n", "Put in plain language, these are things we know _ought_ to be true.\n", "\n", "Using the Divvy bike dataset example,\n", "we know that every node ought to have a station name.\n", "Thus, the minimum that we can test\n", "is that the `station_name` attribute is present on every node.\n", "As an example:\n", "\n", "```python\n", "def test_divvy_nodes(G):\n", "    \"\"\"Test node metadata on Divvy dataset.\"\"\"\n", "    for n, d in G.nodes(data=True):\n", "        assert \"station_name\" in d.keys()\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Test nullity\n", "\n", "Secondly, we can test that values that ought **not** to be null\n", "are in fact not null.\n", "\n", "Using the Divvy bike dataset example again,\n", "if we _also_ know that the station name\n", "cannot be null or an empty string,\n", "then we can bake that into the test.\n", "\n", "```python\n", "def test_divvy_nodes(G):\n", "    \"\"\"Test node metadata on Divvy dataset.\"\"\"\n", "    for n, d in G.nodes(data=True):\n", "        assert \"station_name\" in d.keys()\n", "        assert bool(d[\"station_name\"])\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Test boundaries\n", "\n", "We can also test boundary values.\n", "For example, within the city of Chicago,\n", "we know that latitude and longitude values\n", "ought to be within the vicinity of\n", "`41.85003, -87.65005`.\n", "If we get data values that are, say,\n", "outside the range of `[41, 42]; [-88, -87]`,\n", "then we know that we have data issues as well.\n", "\n", "Here's an example:\n", "\n", "```python\n", "def test_divvy_nodes(G):\n", "    \"\"\"Test node metadata on Divvy dataset.\"\"\"\n", "    for n, d in G.nodes(data=True):\n", "        # Test for station names.\n", "        assert \"station_name\" in d.keys()\n", "        assert bool(d[\"station_name\"])\n", "\n", "        # Test for 
longitude/latitude\n", "        assert 41 <= d[\"latitude\"] <= 42\n", "        assert -88 <= d[\"longitude\"] <= -87\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An apology to geospatial experts: \n", "_I genuinely don't know the bounding box lat/lon coordinates of Chicago,\n", "so if you know those coordinates, please reach out\n", "so I can update the test._" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Continuous data testing\n", "\n", "The key idea with testing is to have tests that run continuously\n", "in the background\n", "without you ever needing to intervene to kick them off.\n", "It's like having a bot in the background always running checks for you\n", "so you don't have to start them yourself.\n", "\n", "To do so, you should be equipped with a few tools.\n", "I won't go into them in depth here,\n", "as I will be writing\n", "a \"continuous data testing\" essay in the near future.\n", "That said, here is the gist.\n", "\n", "Firstly, **use `pytest` to get set up with testing.**\n", "You essentially write a `test_something.py` file\n", "in which you write your test suite,\n", "and your test functions are all nothing more than simple functions.\n", "\n", "```python\n", "# test_data.py\n", "def test_divvy_nodes(G):\n", "    \"\"\"Test node metadata on Divvy dataset.\"\"\"\n", "    for n, d in G.nodes(data=True):\n", "        # Test for station names.\n", "        assert \"station_name\" in d.keys()\n", "        assert bool(d[\"station_name\"])\n", "\n", "        # Test for longitude/latitude\n", "        assert 41 <= d[\"latitude\"] <= 42\n", "        assert -88 <= d[\"longitude\"] <= -87\n", "```\n", "\n", "Because the test function takes `G` as an argument,\n", "pytest needs a fixture named `G` that loads the graph.\n", "\n", "At the command line, if you run `pytest`,\n", "it will automatically discover all functions prefixed with `test_`\n", "in all `test_*.py` and `*_test.py` files underneath the current working directory.\n", "\n", "Secondly, **set up a continuous pipelining system**\n", "to continuously run data 
tests.\n", "For example, you can set up\n", "[Jenkins](https://www.jenkins.io/),\n", "[Travis](https://travis-ci.org/),\n", "[Azure Pipelines](https://azure.microsoft.com/en-us/services/devops/pipelines/),\n", "[Prefect](https://www.prefect.io/),\n", "and more,\n", "depending on what your organization has bought into.\n", "\n", "Sometimes data tests take longer than software tests,\n", "especially if you are pulling dumps from a database,\n", "so you might want to run these tests\n", "in a separate pipeline instead." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Further reading\n", "\n", "- In my essays collection, I wrote about [testing data](https://ericmjl.github.io/essays-on-data-science/software-skills/testing/#tests-for-data).\n", "- Itamar Turner-Trauring has written about [keeping tests quick and speedy](https://pythonspeed.com/articles/slow-tests-fast-feedback/), which is crucial to keeping yourself motivated to write tests." ] } ], "metadata": { "kernelspec": { "display_name": "nams", "language": "python", "name": "nams" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 4 }