{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "> ๐Ÿ“ข: **This document was used during early development of siuba. See the [siuba intro doc](https://siuba.readthedocs.io/en/latest/api_table_core/03_select.html).**\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from siuba.siu import _, explain" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "_.somecol.min()\n" ] }, { "data": { "text/plain": [ "5" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f = _['a'] + _['b']\n", "\n", "(_ + _)(1)\n", "d = {'a': 1, 'b': 2}\n", "\n", "explain(_.somecol.min())\n", "\n", "(_['a'] + _['b'])(d)\n", "\n", "f = _['a'] + 4\n", "f(d)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Rules\n", "\n", "1. When `_()` represents a call, rather than executing one, it is called symbolic call\n", "2. _ always performs symbolic calls, except immediately after...\n", " * binary operations. e.g. `_ + _`\n", " * a symbolic call. e.g. `_()`\n", " * an index operation. e.g. `_['a']`\n", "3. You can explicitly tell _ to do a normal call, using `~~`. e.g. `~~_.func`\n", "\n", "**Rational:**\n", "It is much less common for people to make a call after a binary operation.\n", "\n", "For example,\n", "\n", "* uncommon: `(_.a + _.b)()`\n", "* common: `(_.a + _.b).sum()`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Examples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Common cases" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "data = ['a','b','c']" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['aa', 'bb', 'cc']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Binary operation\n", "list(map(_ * 2, data))" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['A', 'B', 'C']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Method call\n", "list(map(_.upper(), data))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Index\n", "get_ax = _['a']['x']\n", "\n", "get_ax({'a': {'x': 1}, 'b': 2})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Escaping" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0, 1]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Escaping\n", "from collections import namedtuple\n", "Point = namedtuple('Point', ['x', 'y'])\n", "\n", "points = [Point(x = 0, y = 1), Point(x = 1, y = 2)]\n", "\n", "# doesn't work, since _.x() is a symbolic call, like _.upper()\n", "#list(map(_.x, points))\n", "\n", "# works via escaping\n", "list(map(~~_.x, points))" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1, 3]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# needs no escaping, since binary op!\n", "\n", "list(map(_.x + _.y, points))" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0, 0]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# contrived complex example of escaping\n", "# access .imag attribute of x + y\n", "\n", "list(map(~~(_.x + _.y).imag, points))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Review of siu expressions..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ready to call:\n", "\n", "* _.a + _.b\n", "* (_.a + _.b) / 2\n", "* _.sum()\n", "* -_.sum()\n", "* _.a.sum() + _.b.sum()\n", "* (_.a + _.b).sum()\n", "* _[\"a\"].sum()\n", "* _[\"a\"] + _[\"b\"]\n", "* ~~_.a\n", "* ~~-_.a\n", "\n", "Not ready to call:\n", "\n", "* _\n", "* _.a\n", "* -_.a\n", "* (_.a + _.b).sum" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Benefits" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Transparent\n", "\n", "Lambdas lock your code away.\n", "You know that when you call it, it will do some work, but you don't know what that is.\n", "Siu expressions can state what they want to do." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "((_.a + (_.b / 2) + _.c**_.d) >> _) & _\n" ] } ], "source": [ "f = _.a + _.b / 2 + _.c**_.d >> _ & _\n", "\n", "explain(f)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, siu expressions are represented via a call tree..." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "โ–ˆโ”€/\n", "โ”œโ”€โ–ˆโ”€+\n", "โ”‚ โ”œโ”€โ–ˆโ”€.\n", "โ”‚ โ”‚ โ”œโ”€_\n", "โ”‚ โ”‚ โ””โ”€'a'\n", "โ”‚ โ””โ”€โ–ˆโ”€.\n", "โ”‚ โ”œโ”€_\n", "โ”‚ โ””โ”€'b'\n", "โ””โ”€2" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(_.a + _.b) / 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While still rough, we can do analyses on siu expressions" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'a', 'b', 'c'}" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "symbol = _.a[_.b + 1] + _['c']\n", "\n", "# hacky way to go from symbol to call for now\n", "call = symbol.source\n", "\n", "call.op_vars()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pandas, sql, and more\n", "\n", "Down the road, we can use siu's transparency in execution engines.\n", "\n", "People can say **what** they want to do, and we can optimize **how** to do it (e.g. in pandas, sql, etc..)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## metahooks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One kind of crazy thing I did was create a metahook, that automatically turns an imported function into one that creates siu expressions... (feature is currently unused!)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(1,_('a') + _('b'))\n" ] }, { "data": { "text/plain": [ "4" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import siuba.meta_hook\n", "from siuba.meta_hook.operator import add, sub\n", "from siuba.meta_hook.pandas import DataFrame\n", "\n", "f = add(1, _['a'] + _['b'])\n", "explain(f)\n", "\n", "f({'a': 1, 'b': 2})" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "โ–ˆโ”€'__call__'\n", "โ”œโ”€\n", "โ””โ”€{'a': [1, 2, 3]}" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DataFrame({'a': [1,2,3]})" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "โ–ˆโ”€+\n", "โ”œโ”€โ–ˆโ”€.\n", "โ”‚ โ”œโ”€_\n", "โ”‚ โ””โ”€'a'\n", "โ””โ”€โ–ˆโ”€.\n", " โ”œโ”€_\n", " โ””โ”€'b'" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "_.a + _.b" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "โ–ˆโ”€+\n", "โ”œโ”€โ–ˆโ”€'__call__'\n", "โ”‚ โ””โ”€โ–ˆโ”€.\n", "โ”‚ โ”œโ”€_\n", "โ”‚ โ””โ”€'a'\n", "โ””โ”€โ–ˆโ”€.\n", " โ”œโ”€_\n", " โ””โ”€'b'" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "_.a() + _.b" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Is siu fast?\n", "\n", "It depends how many times you call it.\n", "For many applications you only need to call an expression once (e.g. in pandas).\n", "If you call it many times, like in the example below, then it will be slower than using a lambda.\n", "\n", "However, for libraries that expect siu expressions, knowing what they want to do means that we can actually speed up operations.\n", "\n", "Below I just show the downside, that out of the box they're slower than lambdas :/" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "def lmap(*args, **kwargs): return list(map(*args, **kwargs))\n", "l = [dict(a = 1) for ii in range(10*6)]" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "212 ยตs ยฑ 50.2 ยตs per loop (mean ยฑ std. dev. of 7 runs, 1000 loops each)\n" ] } ], "source": [ "%%timeit\n", "# NBVAL_IGNORE_OUTPUT\n", "\n", "x = lmap(_['a'], l)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "7.29 ยตs ยฑ 199 ns per loop (mean ยฑ std. dev. of 7 runs, 100000 loops each)\n" ] } ], "source": [ "%%timeit\n", "# NBVAL_IGNORE_OUTPUT\n", "\n", "x = lmap(lambda x: x['a'], l)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Limitations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When siu expressions contain list literals, they can't know about any expressions inside those lists. E.g." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['a', _, _, _]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f = _ + [_, _, _]\n", "\n", "f(['a'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This can easily be worked around, though!" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['a', ['a'], ['a'], ['a']]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from siuba.meta_hook import lazy_func\n", "\n", "@lazy_func\n", "def List(*args):\n", " return list(args)\n", "\n", "f = _ + List(_, _, _)\n", "\n", "f(['a'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Similar Projects\n", "\n", "The below projects use similar symbolic objects, but driven by lambdas (so the **what** can't be inspected)\n", "\n", "* [fn.py](https://github.com/kachayev/fn.py)\n", "* [phi](https://github.com/cgarciae/phi)\n", "* [pandasply](https://github.com/coursera/pandas-ply)\n", "\n", "The projects below are symbolic computation engines, but lack a simple, generic, symbolic object\n", "\n", "* [ibis](https://github.com/ibis-project/ibis)\n", "* [blaze](https://github.com/blaze/blaze)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": { "height": "calc(100% - 180px)", "left": "10px", "top": "150px", "width": "165px" }, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 4 }