{ "cells": [ { "cell_type": "markdown", "metadata": { "toc-hr-collapsed": true }, "source": [ "# Pandas fast mutate architecture" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem: users may need to define their own functions for SQL or pandas\n", "\n", "In siuba, much of what users do involves expressions using `_`.\n", "Depending on the backend they're using, these expressions are then transformed and executed.\n", "However, sometimes no translation exists for a method.\n", "\n", "This is not so different from pandas or SQLAlchemy, where a limited number of methods are available to users.\n", "\n", "For example, in pandas...\n", "\n", "* you can do `some_data.cumsum()`\n", "* you *can't* do `some_data.cumany()`\n", "\n", "Moreover, you can use `.cummean()` on an ungrouped, but not a grouped DataFrame. And as a final cruel twist, some methods are fast when grouped, while others (e.g. `expanding().sum()`) use the slow apply route.\n", "\n", "## What's the way out?\n", "\n", "In pandas, it's not totally clear how you would define something like `.cumany()`, and let it run on grouped or ungrouped data, without **submitting a PR to pandas itself**.\n", "\n", "(Maybe by [registering an accessor](https://github.com/Zsailer/pandas_flavor#register-accessors), but this doesn't apply to grouped DataFrames.)\n", "\n", "This is the tyranny of methods. The object defining the method owns the method. To add or modify a method, you need to modify the class behind the object.\n", "\n", "Now, this isn't totally true--the class could provide a way for you to register your method (like accessors). But wouldn't it be nice if the actions we wanted to perform on data didn't have to check in with the data class itself? Why does the data class get to decide what we do with it, and why does it get privileged methods?\n", "\n", "### Enter singledispatch\n", "\n", "Rather than registering functions onto your class (i.e.
methods), singledispatch lets you register classes with your functions.\n", "\n", "In singledispatch, this works by having the class of your first argument decide which version of a function to call." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Default dispatch over: <class 'str'>\n", "Special dispatch for an integer!\n" ] } ], "source": [ "from functools import singledispatch\n", "\n", "# by default dispatches on object, which everything inherits from\n", "@singledispatch\n", "def cool_func(x):\n", "    print(\"Default dispatch over:\", type(x))\n", "\n", "@cool_func.register(int)\n", "def _cool_func_int(x):\n", "    print(\"Special dispatch for an integer!\")\n", "\n", "cool_func('x')\n", "cool_func(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "This concept is incredibly powerful for two reasons...\n", "\n", "* many people can define actions over a DataFrame, without a quorum of privileged methods.\n", "* you can use normal importing, so you don't have to worry about name conflicts\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## singledispatch in siuba\n", "\n", "siuba uses singledispatch in two places\n", "\n", "* dispatching verbs like `mutate`, whose actions depend on the backend they're operating on (e.g.
SQL vs pandas)\n", "* creating symbolic calls\n", "\n", "It's worth looking at symbolic calls in detail." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0    3\n", "1    4\n", "2    5\n", "dtype: int64" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from siuba.siu import symbolic_dispatch, _\n", "import pandas as pd\n", "\n", "@symbolic_dispatch(cls = pd.Series)\n", "def add2(x):\n", "    return x + 2\n", "\n", "add2(pd.Series([1,2,3]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One special property of `symbolic_dispatch` is that if we pass it a symbol, then it returns a symbol." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "█─'__call__'\n", "├─█─'__custom_func__'\n", "│ └─\n", "└─█─'__call__'\n", "  ├─█─.\n", "  │ ├─_\n", "  │ └─'astype'\n", "  └─" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sym = add2(_.astype(int))\n", "\n", "sym" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0    3\n", "1    4\n", "dtype: int64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sym(pd.Series(['1', '2']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that in this case these two bits of code work the same...\n", "\n", "```python\n", "ser = pd.Series(['1', '2'])\n", "sym = add2(_.astype(int))\n", "sym(ser)\n", "\n", "func = lambda _: add2(_.astype(int))\n", "func(ser)\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "siuba knows that if the function's first argument is a symbolic expression, then the function needs to return a symbolic expression."
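, "\n",
"\n",
"To make the bouncing idea concrete, here is a minimal sketch built on plain `functools.singledispatch`. It is not siuba's implementation: `Symbol`, `bouncing_dispatch` and `add2_sketch` are hypothetical names invented for this illustration, and `Symbol` is only a crude stand-in for `_`.\n",
"\n",
"```python\n",
"from functools import singledispatch\n",
"\n",
"class Symbol:\n",
"    # hypothetical stand-in for `_`: wraps a function to run later\n",
"    def __init__(self, func):\n",
"        self.func = func\n",
"\n",
"    def __call__(self, data):\n",
"        return self.func(data)\n",
"\n",
"def bouncing_dispatch(f):\n",
"    # hypothetical decorator, assumed for illustration only\n",
"    dispatcher = singledispatch(f)\n",
"\n",
"    @dispatcher.register(Symbol)\n",
"    def _bounce(sym):\n",
"        # a Symbol comes in, so a Symbol goes out: evaluate `sym` on the\n",
"        # data first, then dispatch again on the concrete result\n",
"        return Symbol(lambda data: dispatcher(sym(data)))\n",
"\n",
"    return dispatcher\n",
"\n",
"@bouncing_dispatch\n",
"def add2_sketch(x):\n",
"    return x + 2\n",
"\n",
"deferred = add2_sketch(Symbol(int))   # nothing is computed yet\n",
"print(type(deferred).__name__)\n",
"print(deferred('1'))\n",
"```\n",
"\n",
"The point of the sketch is only that the symbolic case can itself be one more registered class, so the same dispatcher handles both eager and deferred calls."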
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What should we singledispatch over?\n", "\n", "In essence, siuba needs to allow dispatching over the forms of data it can operate on, including...\n", "\n", "* regular Series\n", "* grouped Series\n", "* (maybe) sqlalchemy column mappings\n", "\n", "## Are there any risks?\n", "\n", "I'm glad you asked! There is one very big risk with singledispatch, and it's this:\n", "\n", "    singledispatch will dispatch on the \"closest\" matching parent class it has registered.\n", "\n", "This means that if it has `object` registered, then at the very least, it will dispatch on that.\n", "**This is a big problem since e.g. sqlalchemy column mappings and everything else are objects**.\n", "\n", "In order to mitigate this risk, there are two compelling options...\n", "\n", "1. Put an upper bound on dispatching classes ([related concept in type annotations](https://www.python.org/dev/peps/pep-0484/#type-variables-with-an-upper-bound))\n", "2. Require an explicit annotation on return type\n", "\n", "The downsides are that (1) requires a custom dispatch implementation, and (2) requires that people know about type annotations.\n", "\n", "That said, I'm curious to explore option (2), as this has an appealing logic: an appropriate function will be a subtype of the one we typically use."
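, "\n",
"\n",
"As a small illustration of the risk, the sketch below uses only `functools.singledispatch`. The classes `Column` and `GroupedSeries` are hypothetical placeholders, not real siuba, pandas or sqlalchemy types.\n",
"\n",
"```python\n",
"from functools import singledispatch\n",
"\n",
"class Column: pass          # hypothetical stand-in for a SQL column class\n",
"class GroupedSeries: pass   # hypothetical stand-in for a grouped Series\n",
"\n",
"@singledispatch\n",
"def cumany(x):\n",
"    return 'default (object) implementation'\n",
"\n",
"@cumany.register(GroupedSeries)\n",
"def _cumany_grouped(x):\n",
"    return 'grouped implementation'\n",
"\n",
"# dispatch() reports which implementation a class would receive\n",
"print(cumany.dispatch(GroupedSeries) is _cumany_grouped)\n",
"\n",
"# Column was never registered, yet it quietly resolves to the default\n",
"print(cumany.dispatch(Column) is cumany.registry[object])\n",
"print(cumany(Column()))\n",
"```\n",
"\n",
"Nothing warns the caller that `Column` fell through to the `object` implementation, which is why some extra check on the dispatched function seems necessary."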
] }, { "cell_type": "markdown", "metadata": { "toc-hr-collapsed": false }, "source": [ "## Requiring an annotation over return type\n", "\n", "In order to fully contextualize the process, consider the stage where something may need to be pulled from the dispatcher: call shaping via CallTreeLocal.\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from siuba.siu import CallTreeLocal, strip_symbolic\n", "\n", "def as_string(x):\n", "    return x.astype(str)\n", "\n", "ctl = CallTreeLocal(local = {'as_string': as_string})\n", "\n", "call = ctl.enter(strip_symbolic(_.as_string()))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'func': '__call__',\n", " 'args': (, _),\n", " 'kwargs': {}}" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Call object holding function as first argument\n", "call.__dict__" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "function" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# proof it's just the function\n", "type(call.args[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now this setup is well and good--but how is a user going to put *their* function on CallTreeLocal?\n", "\n", "Register it? Nah. What they need is a clear interface.\n", "\n", "We're already \"bouncing\" symbolic dispatch functions when they get a symbolic expression. We can use this mechanism to make CallTreeLocal more \"democratic\"." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that when we \"bounce\" `add2`, it reports the function as a `__custom_func__`."
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "█─'__call__'\n", "├─█─'__custom_func__'\n", "│ └─\n", "└─_" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@symbolic_dispatch(cls = pd.Series)\n", "def add2(x):\n", "    return x + 2\n", "\n", "add2(_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is because it's a special call, called a `FuncArg` (name subject to change). We can modify CallTreeLocal to perform custom behavior when it enters / exits `__custom_func__`." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "class SpecialClass: pass\n", "\n", "@add2.register(SpecialClass)\n", "def _add2_special(x):\n", "    print(\"Wooweee!\")\n", "\n", "class CallTree2(CallTreeLocal):\n", "    # note: self.dispatch_cls already used in init for this very purpose\n", "\n", "    def enter___custom_func__(self, node):\n", "        # the function itself is the first arg\n", "        dispatcher = node.args[0]\n", "        # hardcoding for now...\n", "        return dispatcher.dispatch(self.dispatch_cls)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "ctl2 = CallTree2({}, dispatch_cls = SpecialClass)\n", "\n", "func = ctl2.enter(strip_symbolic(add2(_)))" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(_)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "func" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "siuba.siu.Call" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(func)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, there's one major problem--CallTree2 may still dispatch the default function!"
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "@symbolic_dispatch\n", "def add3(x):\n", "    print(\"Calling add3 default\")\n", "\n", "call3 = ctl2.enter(strip_symbolic(add3(_)))" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Calling add3 default\n" ] } ], "source": [ "call3(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**THIS MEANS THAT EVERY SINGLEDISPATCH FUNCTION WILL AT LEAST USE ITS DEFAULT**\n", "\n", "Imagine that someone defined the default, but then it gets fired for SQL, and for pandas, etc...\n", "\n", "What a headache." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Keeping only when there's a compatible return type\n", "\n", "We can check the result annotation of the function we'd dispatch, to know whether it will work. In this case, we assume it won't work if the result is not a subclass of the one our SQL tools expect: ClauseElement.\n", "We can shut down the process early if we know the function won't return what we need.\n", "\n", "This is because a function is a subtype of another function if its input is contravariant (e.g. a parent), and **its output is covariant (e.g. a subclass)**."
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# used to get type info\n", "import inspect\n", "\n", "# the most basic of SQL classes\n", "from sqlalchemy.sql.elements import ClauseElement\n", "\n", "RESULT_CLS = ClauseElement\n", "\n", "class CallTree3(CallTreeLocal):\n", "    # note: self.dispatch_cls already used in init for this very purpose\n", "\n", "    def enter___custom_func__(self, node):\n", "        # the function itself is the first arg\n", "        dispatcher = node.args[0]\n", "        # hardcoding for now...\n", "        f = dispatcher.dispatch(self.dispatch_cls)\n", "        sig = inspect.signature(f)\n", "        ret_type = sig.return_annotation\n", "\n", "        # guard with isinstance, since issubclass fails on non-class annotations\n", "        if isinstance(ret_type, type) and issubclass(ret_type, RESULT_CLS):\n", "            return f\n", "\n", "        raise TypeError(\"Return type, %s, not subclass of %s\" %(ret_type, RESULT_CLS))" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "from sqlalchemy import sql\n", "sel = sql.select([sql.column('id'), sql.column('x'), sql.column('y')])\n", "\n", "# this is what siuba sql expressions operate on\n", "col_class = sel.columns.__class__" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "clt3 = CallTree3({}, dispatch_cls = col_class)\n", "\n", "@symbolic_dispatch\n", "def f_bad(x):\n", "    return x + 1\n", "\n", "@symbolic_dispatch\n", "def f_good(x: ClauseElement) -> ClauseElement:\n", "    return x.contains('woah')\n", "\n" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Return type, <class 'inspect._empty'>, not subclass of <class 'sqlalchemy.sql.elements.ClauseElement'>\n" ] } ], "source": [ "# here is the error for the first, without that pesky stack trace\n", "try:\n", "    clt3.enter(strip_symbolic(f_bad(_)))\n", "except TypeError as err:\n", "    print(err)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(_)" ] }, "execution_count": 19, "metadata": {}, "output_type":
"execute_result" } ], "source": [ "# here is the good one going through\n", "clt3.enter(strip_symbolic(f_good(_)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How do I get this in my life today?\n", "\n", "Well, runtime evaluation of result types isn't the most fleshed out process in python. And there are some edge cases.\n", "\n", "For example, what should we do if the return type is a [Union](https://www.python.org/dev/peps/pep-0484/#union-types)? [Any](https://www.python.org/dev/peps/pep-0484/#the-any-type)?\n", "\n", "There is also a bug with the Union implementation before 3.7, where if it receives 3 classes, and 1 is the parent of the others, it just returns the parent..." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "__main__.A" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from typing import Union\n", "\n", "class A: pass\n", "\n", "class B(A): pass\n", "\n", "class C(B): pass\n", "\n", "Union[A,B,C]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To be honest--I think we can be optimistic for now that anyone using a Union as their return type knows what they're doing with siuba. I think the main behaviors we want to support are...\n", "\n", "1. Can create singledispatch, with potentially a default function\n", "2. Don't shoot yourself in the foot when the default is fired for SQL and pandas\n", "\n", "And even a crude result type check will ensure that. In some ways the existence of a result type is almost all the proof we need." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## To decide\n", "\n", "* What should siuba do when dispatch function doesn't qualify? Fall back to local?\n", "* Related: should local only look up methods? (makes sense to me)\n", "* If so, how do we implement SQL dialects? Have ImmutableColumnCollection >= SqlColumns >= PostgresqlColumns, etc.." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Could siuba allow static type checking?\n", "\n", "I think so. It would take a bit of work. Mostly PRs to the typing package to...\n", "\n", "* Implement [higher-kinded types](https://www.stephanboyer.com/post/115/higher-rank-and-higher-kinded-types)\n", "* Support static checking of singledispatch (or stubbing with @overload)\n", "* Wait for pandas type annotations, or stub, so we can check the pipe, which uses `__rshift__` 😅" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 4 }