{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Validated DataFrames\n", "\n", "Disclaimer: I personally hate pandas because of its grotesque abuse of strings as variables. Strings are useless for static code inspection, and yet pandas encourages users to plaster code with strings of the type `df['some random name'] = np.float64(df['other name'])` and hope that the rest of the workflow doesn't crash. It gets worse when operators are chained brainlessly:\n", "\n", "```python\n", "peak_date = list(df.groupby('date').agg({ord_id_col_name: 'nunique'}).reset_index() \\\n", "                 .sort_values(ord_id_col_name, ascending=False)['date'])[0]\n", "```\n", "\n", "The conglomerate of `list`, an index, three chained operations and \"`\\`\" as a desperate attempt to make it fit the IDE's enforced limit of 120 characters is the pinnacle of poor readability.\n", "\n", "So let's do something about it. Let's make a validated dataframe that gets rid of the strings...\n", "\n", "First the imports, and then make a dataframe:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\"><th></th><th>col_1</th><th>col_2</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>3</td><td>a</td></tr>\n", "    <tr><th>1</th><td>2</td><td>b</td></tr>\n", "    <tr><th>2</th><td>1</td><td>c</td></tr>\n", "    <tr><th>3</th><td>0</td><td>d</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ "   col_1 col_2\n", "0      3     a\n", "1      2     b\n", "2      1     c\n", "3      0     d" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}\n", "df = pd.DataFrame.from_dict(data)\n", "df" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "A pleasant endgame for a validated dataframe would look something like this:\n", "\n", "```python\n", "import pandas as pd\n", "\n", "class DF(object):\n", "    def __init__(self, df):\n", "        if not isinstance(df, pd.DataFrame):\n", "            raise TypeError\n", "        self.df = df\n", "        self._columns = set(df.columns)\n", "\n", "    def __getattr__(self, name):\n", "        # Only reached when normal lookup fails, so read the column directly;\n", "        # calling getattr(self, name) here would recurse forever.\n", "        if name in self._columns:\n", "            return self.df[name]\n", "        elif hasattr(self.df, name):\n", "            return getattr(self.df, name)\n", "        else:\n", "            # Default behaviour\n", "            return object.__getattribute__(self, name)\n", "\n", "    @property\n", "    def col_1(self):\n", "        return self.df[\"col_1\"]\n", "\n", "    @property\n", "    def col_2(self):\n", "        return self.df[\"col_2\"]\n", "```\n", "\n", "On to writing the code generator.\n", "\n", "First I create two templates: one for the class with its `__getattr__` override, and one for the properties, which are there to inform the static code inspector."
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "_class_template = \"\"\"import pandas as pd\n", "\n", "class DF(object):\n", "    def __init__(self, df):\n", "        if not isinstance(df, pd.DataFrame):\n", "            raise TypeError\n", "        self.df = df\n", "        self._columns = set(df.columns)\n", "\n", "    def __getattr__(self, name):\n", "        # Only reached when normal lookup fails, so read the column directly;\n", "        # calling getattr(self, name) here would recurse forever.\n", "        if name in self._columns:\n", "            return self.df[name]\n", "        elif hasattr(self.df, name):\n", "            return getattr(self.df, name)\n", "        else:\n", "            # Default behaviour\n", "            return object.__getattribute__(self, name)\n", "    {properties}\n", "    \"\"\"\n", "\n", "\n", "_property_template = \"\"\"\n", "    @property\n", "    def {name}(self):\n", "        return self.df[\"{name}\"]\n", "\"\"\"" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Next I need a function to populate my templates:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def make_validated_df(df, path):\n", "    # Write a module to `path` that defines DF with one property per column of `df`.\n", "    if not isinstance(df, pd.DataFrame):\n", "        raise TypeError\n", "\n", "    mapping = {n: df[n].dtype for n in df.columns}\n", "    for name, dtype in mapping.items():\n", "        if name == \"df\":\n", "            raise ValueError(\"df is a reserved keyword.\")\n", "        # mapping is built from df itself, so this dtype check cannot fail here;\n", "        # it only becomes meaningful if mapping comes from an expected schema.\n", "        if not df[name].dtype == dtype:\n", "            raise TypeError(f\"{name} is {df[name].dtype}, not {dtype}\")\n", "\n", "    properties = []\n", "    for name in mapping:\n", "        prop = _property_template.format(name=name)\n", "        properties.append(prop)\n", "\n", "    s = _class_template.format(properties=\"\".join(properties))\n", "\n", "    with path.open(\"w\") as fo:\n", "        fo.write(s)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Note that calling this function with `Path(\"vdf.py\")` writes the importable module `vdf.py`, so now I can use it like this:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "make_validated_df(df, path=Path(\"vdf.py\"))\n", "from vdf import DF as myDF" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\"><th></th><th>col_2</th></tr>\n", "    <tr><th>col_1</th><th></th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>d</td></tr>\n", "    <tr><th>1</th><td>c</td></tr>\n", "    <tr><th>2</th><td>b</td></tr>\n", "    <tr><th>3</th><td>a</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ "      col_2\n", "col_1      \n", "0         d\n", "1         c\n", "2         b\n", "3         a" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2 = myDF(df)\n", "\n", "df2.groupby(df2.col_1).sum()  # no strings attached!" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "And it produces the same output as the regular dataframe:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "      col_2\n", "col_1      \n", "0         d\n", "1         c\n", "2         b\n", "3         a\n" ] } ], "source": [ "print(df.groupby(\"col_1\").sum())" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The best thing, however, is that IDE support is now enabled:\n", "\n", "![](./artwork/validated_dataframe_w_intellitype_support.png)" ] } ], "metadata": { "kernelspec": { "display_name": "pages310", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.8" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }