{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# \"Statistical Typing: A Runtime Type System for Data Science and Machine Learning\"\n", "> \"Data science and machine learning relies on high quality datasets for visualization, statistical inference, and modeling. Statistical typing is a runtime typing system that enables data scientists, engineers, and analysts to validate real-world data and isolate units of processing, analysis, or model-training logic to implement more robust data testing.\"\n", "\n", "- toc: false\n", "- branch: master\n", "- badges: true\n", "- comments: true\n", "- categories: [statistical typing, unit testing]\n", "- image: images/logo.png\n", "- hide: false\n", "- search_exclude: true" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# The Data Quality Problem\n", "\n", "One of the central challenges in data science (DS) and machine learning (ML) is\n", "managing and maintaining data quality. As an ML engineer and practitioner who\n", "frequently constructs, cleans, explores, and models proprietary (i.e. non-benchmark)\n", "datasets, \"bad data\" makes all the difference between accurate versus misleading\n", "data visualizations, statistical inferences, and models. In this article I want\n", "to hone in on three problems that are fairly unique to DS/ML practice when it comes to\n", "dealing with tabular data in the Python ecosystem using [pandas](https://pandas.pydata.org/https://pandas.pydata.org/),\n", "which is one of the de facto tools for data manipulation in the DS/ML toolchain." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data Integrity Errors Fail Silently" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tooling for type safety is improving in the Python ecosystem with the broad adoption\n", "of the [typing](https://docs.python.org/3/library/typing.html) module and\n", "projects like [mypy](http://mypy-lang.org/), which eases the developer\n", "experience for writing readable, reliable code.\n", "\n", "However, for most DS/ML work, this isn't quite sufficient. This is because\n", "logical data types don't always capture the statistical distributions of the\n", "variables under study, which is a key thing to do when, for example, your data\n", "distribution shifts as a result of a world-wide pandemic,\n", "[causing ML models to break](https://www.technologyreview.com/2020/05/11/1001563/covid-pandemic-broken-ai-machine-learning-amazon-retail-fraud-humans-in-the-loop/)\n", "in unexpected ways.\n", "\n", "Having systems in place that fail early (and loudly ๐Ÿ”Š) when the data distribution\n", "is not what you assumed is one of the critical pieces to building reliable\n", "production systems, and we need better tooling for this." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Testing Software is Hard" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another important tool in the developer's arsenal is testing. There are many\n", "[books](https://www.amazon.com/dp/1617296279), [articles](https://kentcdodds.com/blog/write-tests), and\n", "[discussions](https://stackoverflow.com/questions/5357601/whats-the-difference-between-unit-tests-and-integration-tests)\n", "available online about different types of testing techniques and how to put them\n", "into practice. 
I won't dive deeply into it here, but in short, testing your code\n", "makes it easier to change, tells you when you've broken something, and doubles as\n", "documentation.\n", "\n", "Even in exploratory or research contexts, it's a good idea to write tests for your\n", "code, because doing so strengthens your confidence in the robustness of the insights\n", "you take away from your analysis." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Testing Software and Statistical Distributions is Even Harder" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The challenge with testing software compounds when processing data for the\n", "purpose of statistical analysis and modeling.\n", "\n", "Consider a machine learning pipeline that creates a predictive model from\n", "survey responses. The barrier to testing the data and transformation code\n", "tends to be much higher than for the business logic that processes survey responses\n", "and stores raw values in a database, because the latter tends to be simpler\n", "and more atomic by design.\n", "\n", "By _atomic_ ⚛ I mean that each piece of data that's filled out by respondents\n", "and stored in the database can be tested in isolation, without having to\n", "analyze the aggregate statistical patterns across a larger sample of responses.\n", "On the other hand, for my statistical analysis to make sense, the overall\n", "integrity of the statistical distribution 📊 of the responses needs to be taken\n", "into account." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Technical Debt: `No Tests == Legacy Code`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because the effort that goes into exploring, cleaning, and figuring out how to\n", "test my dataset is so high, I'm discouraged from writing tests for my pipeline\n", "code. As the famous software development quip goes:\n", "\n", "\n", "> \"legacy code is code without tests\"\n", "> - Michael C. Feathers, [Working Effectively with Legacy Code](https://www.amazon.com/Working-Effectively-Legacy-Robert-Martin/dp/0131177052)\n", "\n", "\n", "You might think: \"But when I put my code in production, surely I\n", "– or one of my collaborators – will write tests then?\" The thing is,\n", "regardless of who writes those tests or when they're written, someone will have to\n", "do it at some point, so the sooner you can climb\n", "the technical debt mountain 🏔, the better!\n", "\n", "In the rest of this post I'll try to convince you that statistical typing gives\n", "us the tools we need to do just that." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# What's Statistical Typing?\n", "\n", "If you've used strong, statically-typed languages before, or the [mypy](http://mypy-lang.org/)\n", "static type checker with type-hinting in Python, you may have noticed that type definitions can\n", "often catch nasty type-related bugs 🐞 that render certain kinds of unit tests unnecessary. Other\n", "tools, like [pydantic](https://pydantic-docs.helpmanual.io/), enforce types at runtime via a\n", "data parsing model."
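, "\n", "\n", "For example, here's a minimal sketch of runtime type enforcement with\n", "[pydantic](https://pydantic-docs.helpmanual.io/), where the `Response` model and its\n", "fields are hypothetical:\n", "\n", "```python\n", "from pydantic import BaseModel\n", "\n", "\n", "class Response(BaseModel):  # hypothetical survey response record\n", "    age: int\n", "    income: float\n", "\n", "\n", "# passes logical type validation, even though the values are\n", "# statistically implausible\n", "Response(age=-25, income=-100_000.0)\n", "```\n", "\n", "Statistical typing picks up where logical types leave off: we also want to express\n", "that `age` should fall within a plausible range and follow a plausible distribution."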
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Beyond Logical Data Types" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Statistical typing extends the concept of [logical data types](https://en.wikipedia.org/wiki/Data_type)\n", "to the class of [statistical data types](https://en.wikipedia.org/wiki/Statistical_data_type)\n", "and, ultimately, [probability distributions](https://en.wikipedia.org/wiki/List_of_probability_distributions).\n", "Statistical data types builds on top of logical data types, and in fact there's considerable overlap between the two.\n", "\n", "For example, the _binary_ logical data type is also a statistical data type. The key\n", "difference is that statistical data types hold additional semantics that govern the kinds\n", "of statistical operations that we can perform on variables of a particular type and\n", "probability distributions that describe those variables." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Statistical Distributions as Schemas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What if you can specify the set of acceptable values that a variable can take,\n", "from the data type, set/range of values, or even what distribution a particular\n", "variable is drawn from? This is the goal of statistical typing: to enumerate a\n", "practical set of constraints that specify what should be considered valid data for\n", "a particular dataset.\n", "\n", "For example, we might want a _categorical_ variable to be drawn _somewhat_ uniformly\n", "from a set of values `{A, B, C}`. We can express this as a hypothesis test that\n", "causes our pipeline to fail if any one of the values occurs significantly more\n", "frequently than the others, given a pre-defined level of statistical significance.\n", "\n", "Or we may want a _real-valued_ variable to be drawn roughly from a normal\n", "distribution with mean `ยต` and variance `ฯƒ`, which can also be specified with an\n", "alpha value that we deem acceptable for a particular analysis.\n", "\n", "In essence, a statistical typing implementation involves specifying three kinds\n", "of metadata for a given set of variables:\n", "\n", "1. logical data types, e.g. `int`, `str`, `float`, etc.\n", "2. deterministic properties, e.g. `categorical` values and `real-valued` ranges\n", "3. probabilistic properties, e.g. sufficient statistics like `mean` and `standard deviation`\n", "\n", "The challenge presented by item _3_ is obvious: discovering the underlying\n", "probability distributions of real-world data is often non-trivial. However,\n", "even if we can only express or automatically infer these metadata up to point\n", "_2_, we can still get something quite powerful ๐Ÿ’ช: **property-based testing of statistical\n", "analysis code**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Schemas as Generative Data Contracts ๐Ÿ“œ" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With a statistically-typed dataframe, not only can we validate real-world data to\n", "ensure that our assumptions about them hold up, but we can also test our data\n", "transformation, analysis, and modeling code _given valid samples_ according to\n", "our schema definition. Statistical typing effectively gives DS/ML practitioners\n", "the tools to easily isolate their code from real-world data, providing a convenient\n", "way of implementing unit tests." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Statistical Typing in Practice with `pandera`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let me illustrate how these concepts would work in practice with a toy problem using\n", "[pandera](https://github.com/pandera-dev/pandera), a runtime data validation library\n", "for [pandas dataframes](https://pandas.pydata.org/) that I've been developing over\n", "the last few years.\n", "\n", "Suppose you're building a predictive model of house prices given features about\n", "different houses:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "raw_data = \"\"\"\n", "square_footage,n_bedrooms,property_type,price\n", "750,1,condo,200000\n", "900,2,condo,400000\n", "1200,2,house,500000\n", "1100,3,house,450000\n", "1000,2,condo,300000\n", "1000,2,townhouse,300000\n", "1200,2,townhouse,350000\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the raw data above you can see that we have the following columns:\n", "- feature 1: `square_footage`\n", "- feature 2: `n_bedrooms`\n", "- feature 3: `property_type`\n", "- target: `price`\n", "\n", "Our modeling pipeline will involve two steps:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "def process_data(raw_data): # step 1: prepare data for model training\n", " ...\n", " \n", "def train_model(processed_data): # step 2: fit a model on processed data\n", " ..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Defining Schemas with `pandera`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At its core, `pandera` provides a flexible and expressive API for defining\n", "dataframe schemas and seamlessly integrating data validation logic into your\n", "data analysis pipelines, all while separating the concerns of data cleaning\n", "and validation." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import pandera as pa\n", "from pandera.typing import Series\n", "\n", "\n", "PROPERTY_TYPES = [\"condo\", \"townhouse\", \"house\"]\n", "\n", "\n", "class BaseSchema(pa.SchemaModel):\n", " square_footage: Series[int] = pa.Field(in_range={\"min_value\": 0, \"max_value\": 3000})\n", " n_bedrooms: Series[int] = pa.Field(in_range={\"min_value\": 0, \"max_value\": 10})\n", " price: Series[float] = pa.Field(in_range={\"min_value\": 0, \"max_value\": 1000000})\n", "\n", " class Config:\n", " coerce = True\n", "\n", "\n", "class RawData(BaseSchema):\n", " property_type: Series[str] = pa.Field(isin=PROPERTY_TYPES)\n", "\n", "\n", "class ProcessedData(BaseSchema):\n", " property_type_condo: Series[int] = pa.Field(isin=[0, 1])\n", " property_type_house: Series[int] = pa.Field(isin=[0, 1])\n", " property_type_townhouse: Series[int] = pa.Field(isin=[0, 1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the code above, we can see that we're defining a `BaseSchema`, which shares\n", "columns that are common between the raw and processed data. We're also making\n", "sure that the columns are coerced to the expected data types during validation.\n", "\n", "`RawData` and `ProcessedData` inherit from `BaseSchema`, and just by looking\n", "at them we can see the difference that we expect between the raw and processed data:\n", "our `process_data` function should convert the `property_type` categorical\n", "variable into a set of dummy variables." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Runtime Data Testing\n", "\n", "Now we use python's type-hinting syntax to annotate the `process_data` and\n", "`train_model` functions, decorating them with `@pa.check_types` to make sure\n", "that the inputs and outputs are validated at runtime:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import pandera as pa\n", "from pandera.typing import DataFrame\n", "from sklearn.linear_model import LinearRegression\n", "\n", "\n", "@pa.check_types\n", "def process_data(raw_data: DataFrame[RawData]) -> DataFrame[ProcessedData]:\n", " return pd.get_dummies(\n", " raw_data.astype({\"property_type\": pd.CategoricalDtype(PROPERTY_TYPES)})\n", " )\n", "\n", "@pa.check_types\n", "def train_model(processed_data: DataFrame[ProcessedData]):\n", " estimator = LinearRegression()\n", " targets = processed_data[\"price\"]\n", " features = processed_data.drop(\"price\", axis=1)\n", " estimator.fit(features, targets)\n", " return estimator" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now every time we run our pipeline our data is validated as it passes through\n", "the various transformations:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "model training successful!\n" ] } ], "source": [ "from io import StringIO\n", "\n", "\n", "def run_pipeline(raw_data):\n", " processed_data = process_data(raw_data)\n", " estimator = train_model(processed_data)\n", " # evaluate model, save artifacts, etc...\n", " print(\"model training successful!\")\n", "\n", "\n", "run_pipeline(pd.read_csv(StringIO(raw_data.strip())))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So if we pass invalid data into `run_pipeline`, we should get an error:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "error in check_types decorator of function 'process_data': > failed element-wise validator 0:\n", "\n", "failure cases:\n", " index failure_case\n", "0 0 unknown\n" ] } ], "source": [ "invalid_data = \"\"\"\n", "square_footage,n_bedrooms,property_type,price\n", "750,1,unknown,200000\n", "900,2,condo,400000\n", "1200,2,house,500000\n", "\"\"\"\n", "\n", "try:\n", " run_pipeline(pd.read_csv(StringIO(invalid_data.strip())))\n", "except Exception as e:\n", " print(e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, `pandera` tells exactly what went wrong: the `property_type`\n", "column has an invalid category `unknown` at the 0th entry." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Property-based Testing\n", "\n", "But wait, there's more! 
, { "cell_type": "markdown", "metadata": {}, "source": [ "### Property-based Testing\n", "\n", "But wait, there's more! Since we've already defined our schemas, we can isolate\n", "the processing and model-training code from real-world data to test that\n", "each component in our pipeline is functioning as expected.\n", "\n", "`pandera` builds on top of the [hypothesis](https://hypothesis.readthedocs.io/)\n", "package to generate synthetic data from search strategies that try to find\n", "the simplest case that would falsify your tests:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "import hypothesis\n", "\n", "\n", "@hypothesis.given(RawData.strategy(size=3))\n", "def test_process_data(raw_data):\n", "    process_data(raw_data)\n", "\n", "\n", "@hypothesis.given(ProcessedData.strategy(size=3))\n", "def test_train_model(processed_data):\n", "    estimator = train_model(processed_data)\n", "    preds = estimator.predict(processed_data.drop(\"price\", axis=1))\n", "    assert len(preds) == processed_data.shape[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could run this as a [unittest](https://docs.python.org/3/library/unittest.html)\n", "or [pytest](https://docs.pytest.org/) suite, but for now we can just run the\n", "tests manually like so:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "✅ tests successful!\n" ] } ], "source": [ "def run_test_suite():\n", "    test_process_data()\n", "    test_train_model()\n", "    print(\"✅ tests successful!\")\n", "\n", "\n", "run_test_suite()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So if we were to incorrectly implement any of the components in our pipeline,\n", "we'd see errors early on. To demonstrate, let's make `process_data` simply return\n", "the raw data without the dummified `property_type` variables." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Falsifying example: test_process_data(\n", "    raw_data=   square_footage  n_bedrooms  price property_type\n", "    0               0           0    0.0         condo\n", "    1               0           0    0.0         condo\n", "    2               0           0    0.0         condo,\n", ")\n", "error in check_types decorator of function 'process_data': column 'property_type_condo' not in dataframe\n", "   square_footage  n_bedrooms  price property_type\n", "0               0           0    0.0         condo\n", "1               0           0    0.0         condo\n", "2               0           0    0.0         condo\n" ] } ], "source": [ "# an incorrect implementation of process_data\n", "@pa.check_types\n", "def process_data(raw_data: DataFrame[RawData]) -> DataFrame[ProcessedData]:\n", "    return raw_data\n", "\n", "\n", "try:\n", "    run_test_suite()\n", "except Exception as e:\n", "    print(e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, our test suite catches the fact that `property_type_condo` doesn't exist\n", "in our processed data output." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can get some more intuition about what's going on with the data synthesis\n", "strategies by interactively generating data using the `example` method:" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { 
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
square_footagen_bedroomspriceproperty_type
020427746370.512051townhouse
19540167.336181townhouse
246793707.778242townhouse
\n", "
" ], "text/plain": [ " square_footage n_bedrooms price property_type\n", "0 2042 7 746370.512051 townhouse\n", "1 9 5 40167.336181 townhouse\n", "2 467 9 3707.778242 townhouse" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "RawData.example(size=3)" ] }, { "cell_type": "code", "execution_count": 107, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
square_footagen_bedroomspriceproperty_type_condoproperty_type_houseproperty_type_townhouse
016258417650.777844000
13170788150.855590111
219377710676.511681101
\n", "
" ], "text/plain": [ " square_footage n_bedrooms price property_type_condo \\\n", "0 1625 8 417650.777844 0 \n", "1 317 0 788150.855590 1 \n", "2 1937 7 710676.511681 1 \n", "\n", " property_type_house property_type_townhouse \n", "0 0 0 \n", "1 1 1 \n", "2 0 1 " ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ProcessedData.example(size=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Under the hood, `pandera` is collecting all of the schema properties and\n", "converting it into a search strategy using the\n", "[pandas-supported hypothesis strategies](https://hypothesis.readthedocs.io/en/latest/numpy.html#pandas).\n", "Currently, one limitation that you can see from the `ProcessedData` example above is that the\n", "generated data doesn't quite capture the joint distribution between the `property_type_*` dummy\n", "variables, as the second row contains `1`s for all of the property types. Depending on\n", "what exactly it is you're trying to test, this may or may not matter. Ultimately, it's\n", "still up to you to determine what to test and how ๐Ÿค”." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# What's Next?\n", "\n", "There's still a lot to do in `pandera` to fully-realize the vision of\n", "statistical typing, but I think the main API ideas and features are there\n", "to get started and reap the benefits of statistical typing:\n", "\n", "1. Runtime data validation when executing pipeline during development/production.\n", "2. Property-based unit testing by isolating transformation code from real data.\n", "3. Self-documenting pipelines that explicitly define the types and statistical\n", " properties of data as it flows through your pipeline.\n", " \n", "There are a few things in the roadmap that I'm excited about:\n", "- [Decouple `pandera` and `pandas` type systems](https://github.com/pandera-dev/pandera/issues/369)\n", "- [Add support for parallelized dataframes for larger datasets](https://github.com/pandera-dev/pandera/issues/119)\n", "- [Add a more comprehensive suite of built-in statistical hypothesis tests](https://github.com/pandera-dev/pandera/issues/168)\n", "- [Implement data synthesis strategies for hypothesis tests](https://github.com/pandera-dev/pandera/issues/370)\n", "- [Support data synthesis strategies for joint distributions](https://github.com/pandera-dev/pandera/issues/371)\n", "- [Support machine-learning-specific schemas](https://github.com/pandera-dev/pandera/issues/179)\n", "\n", "If you're interested in this project, please consider helping out with\n", "[code contributions](https://github.com/pandera-dev/pandera/blob/master/.github/CONTRIBUTING.md),\n", "[submitting feature requests, bugs, documentation improvements](https://github.com/pandera-dev/pandera/issues),\n", "and [support](https://github.com/sponsors/cosmicBboy)!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.1" } }, "nbformat": 4, "nbformat_minor": 4 }