{ "cells": [ { "cell_type": "code", "execution_count": 16, "metadata": { "nbsphinx": "hidden" }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "pd.set_option(\"display.max_rows\", 5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Backends\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Quick examples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### pandas (fast grouped) _" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

(grouped data frame)

\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cylmpghpavg_mpghp_per_mpgdemeaned
0621.011019.7428575.238095-12.285714
1621.011019.7428575.238095-12.285714
.....................
30815.033515.10000022.333333125.785714
31421.410926.6636365.09345826.363636
\n", "

32 rows × 6 columns

\n", "
" ], "text/plain": [ "" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# pandas fast grouped implementation ----\n", "from siuba.data import cars\n", "from siuba import _\n", "from siuba.experimental.pd_groups import fast_mutate, fast_filter, fast_summarize\n", "\n", "fast_mutate(\n", " cars.groupby('cyl'),\n", " avg_mpg = _.mpg.mean(), # aggregation\n", " hp_per_mpg = _.hp / _.mpg, # elementwise \n", " demeaned = _.hp - _.hp.mean(), # elementwise + agg\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### SQL _" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
# Source: lazy query\n",
       "# DB Conn: Engine(sqlite:///:memory:)\n",
       "# Preview:\n",
       "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cylavg_mpg
0426.663636
1619.742857
2815.100000
\n", "

# .. may have more rows

" ], "text/plain": [ "# Source: lazy query\n", "# DB Conn: Engine(sqlite:///:memory:)\n", "# Preview:\n", " cyl avg_mpg\n", "0 4 26.663636\n", "1 6 19.742857\n", "2 8 15.100000\n", "# .. may have more rows" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from siuba import _, mutate, group_by, summarize, show_query\n", "from siuba.sql import LazyTbl\n", "from sqlalchemy import create_engine\n", "\n", "# create sqlite db, add pandas DataFrame to it\n", "engine = create_engine(\"sqlite:///:memory:\")\n", "cars.to_sql(\"cars\", engine, if_exists=\"replace\")\n", "\n", "# define query\n", "q = (LazyTbl(engine, \"cars\")\n", " >> group_by(_.cyl)\n", " >> summarize(avg_mpg=_.mpg.mean())\n", ")\n", "\n", "q" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SELECT cars.cyl, avg(cars.mpg) AS avg_mpg \n", "FROM cars GROUP BY cars.cyl\n" ] } ], "source": [ "res = show_query(q)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Supported methods\n", "\n", "The table below shows the pandas methods supported by different backends. Note that the regular, ungrouped backend supports all methods, and the fast grouped implementation supports most methods a person could use without having to call the (slow) `DataFrame.apply` method.\n", "\n", "> 🚧This table is displayed a bit funky, but will be cleaned up!" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/html" }, "source": [ "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## pandas (ungrouped)\n", "\n", "In general, ungrouped pandas DataFrames do not require any translation.\n", "On this kind of data, verbs like `mutate` are just alternative implementations of methods like `DataFrame.assign`." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
gxy
0a12
1a23
2b34
\n", "
" ], "text/plain": [ " g x y\n", "0 a 1 2\n", "1 a 2 3\n", "2 b 3 4" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from siuba import _, mutate\n", "\n", "df = pd.DataFrame({\n", " 'g': ['a', 'a', 'b'], \n", " 'x': [1,2,3],\n", " })\n", "\n", "df.assign(y = lambda _: _.x + 1)\n", "\n", "mutate(df, y = _.x + 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Siuba verbs also work on grouped DataFrames, but are not always fast. They are the potentially slow, reference implementation." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

(grouped data frame)

\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
gxyz
0a12-0.5
1a230.5
2b340.0
\n", "
" ], "text/plain": [ "" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mutate(\n", " df.groupby('g'),\n", " y = _.x + 1,\n", " z = _.x - _.x.mean()\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## pandas (fast grouped)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that you could easily enable these fast methods by default, by aliasing them at import.\n", "\n", "```python\n", "from siuba.experimental.pd_groups import fast_mutate as mutate\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Architecture (1)\n", "\n", "Currently, the fast grouped implementation puts all the logic in the verbs. That is, `fast_mutate` dispatches for DataFrameGroupBy a function that handles all the necessary translation of lazy expressions.\n", "\n", "See TODO link this ADR for more details." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SQL\n", "\n", "### Architecture (2)\n" ] }, { "cell_type": "raw", "metadata": { "jupyter": { "source_hidden": true }, "raw_mimetype": "text/html" }, "source": [ "
Call
Call
shape_call
shape_call
track_call_windows
track_call_wind...
(Call, List[sqla.OverClause])
(Call, List[sqla.OverClause])
SqlTblsource: sqla.Enginefuncs: Dicttbl: sql.Tableops: List[sqla.Select]group_by: Tupleorder_by: Tuple/last_op: Select-- passed to CallTreeLocalrm_attrcall_sub_attrdispatch_clsresult_clsappend_op(Select, ...) : SqlTblcopy(**kwargs): SqlTblshape_call(Call, ...) : Calltrack_call_windows(Call):  get_ordered_col_names()WindowReplacercolumns: Mapping[sqla.Column]group_byorder_bywindow_ctewindows: List+ method(type): typeCallTreeLocallocal: dictcall_sub_attr: setchain_sub_attr: Booldispatch_cls: Typeresult_cls: Typecreate_local_call(str, ...) : CallCallListenergeneric_visit: typeenter(Call): Callexit   (Call): Callgeneric_enter(Call): Callgeneric_exit   (Call): Callenter_if_call(T): T-- enter, exit methods formatenter_<Call.func>(Call)-- egenter___get_item__(Call)
group_by(...)
group_by(...)
arrange(...)
arrange(...)
mutate/filter/summarize(...)
mutate/filter/summarize(....
copy(group_by = Tuple[str])
copy(group_by =...
copy(order_by = Tuple[sqla.Clause])
copy(order_by =...
append_op(...)
append_op(...)
SQL Architecture
SQL Architecture
Note

sqla = sqlachemy
Note...
Viewer does not support full SVG 1.1
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The SQL implementation consists largely of the following:\n", " \n", "* **LazyTbl** - a class that holds a sqlalchemy connection, table name, and list of select statements.\n", "* **Verbs** that dispatch on LazyTbl - eg. mutate takes a LazyTbl, and returns a LazyTbl that has a new select statement corresponding to that mutate.\n", "* **CallListeners** for (1) translating lazy expressions to SQL specific functions, and (2) adding grouping information to OVER clauses.\n", "\n", "See TODO link this ADR for more details." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 4 }