{ "cells": [ { "cell_type": "markdown", "metadata": { "toc-hr-collapsed": true }, "source": [ "YAML specification for Series method support across backends\n", "============================================================\n", "\n", "(drafted around 21 March 2020)\n", "\n", "Siuba aims to support as pandas Series methods across a range of backends.\n", "This means that users should be able to use the same method call they would on a pandas Series, and get back\n", "roughly the same result.\n", "\n", "Backends include:\n", "\n", "* pandas - siuba makes grouped operations more flexible and faster, while returning the same results as using the `.apply` method.\n", "* SQL (sqlite and postgresql) - for the exact same code used on pandas, users should be able to generate a SQL query.\n", "* spark (in progress) - as with SQL, users should be able to execute in spark.\n", "\n", "Note that since pandas (e.g. with the `.apply` method) is the **reference implementation**. There are three big benefits of specifying the reference as data (eg in yaml): testing conformance, documentating, tracking changes over releases.\n", "\n", "**In this document, I'll first review the value of a spec, go through siuba's current spec, and then the script for transitioning to yaml**.\n", "\n", "## More on importance\n", "**Testing conformance** is important because we need to trust the different backends can be swapped in. This means, that for every series method they support, there should be at least one test that they...\n", "\n", "* return the exact same result\n", "* in cases where the results differ (e.g. pandas returns an int, SQL returns a float), it should be marked explicitly somewhere.\n", "* in cases where they don't support a verb (e.g. some SQL aggregations can't be used in mutate).\n", "\n", "**Documenting** is important because there are over 300 Series methods. A compact representation of support across backends will let people work quickly.\n", "\n", "**Tracking changes over releases** is important because as time goes on, we'll likely need to react to methods being deprecated in pandas.\n", "\n", "## Why didn't I start with the YAML specification?\n", "\n", "Because there are 300+ Series methods, I wanted to prioritize a wide format, with few enclosing things `()/{}/[]`.\n", "I was concerned that a long document would require a lot of scanning, and would be hard to jump in to.\n", "\n", "I was also doing a lot of research / exploration on spreadsheet (actually, on airtable!).\n", "\n", "Now that things are much further along, I'm ready to pay down the technical debt, while preserving two valuable modes of interacting with the spec:\n", "\n", "* yaml <-> spreadsheet - so I can have the wide representation on the spreadsheet\n", "* yaml -> implementation - so it's no longer python code (better for docs), more explicit, and changes can be clearly diff'd in PRs." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## New YAML specification\n", "\n", "### Schema\n", "\n", "* example\n", "* category\n", "* backends:\n", " - postgresql: \\\n", "* importance: (currently held in airtable)\n", "* action:\n", " - status: {supported, todo, maydo, wontdo}\n", " - kind: {elwise,agg,window}\n", " - input_type: (default: null)\n", " - result_type: (default: null)\n", " - flags: [no_mutate, ...]\n", "\n", "### Case: Pandas versions\n", "\n", "Siuba supports multiple pandas versions, so the spec will contain methods that exist in one version but not another. 
"\n", "### Case: Pandas versions\n", "\n", "Siuba supports multiple pandas versions, so the spec will contain methods that exist in one version but not another. Similar considerations apply to deprecated methods.\n", "\n", "### Case: Backend exceptions\n", "\n", "Backends may...\n", "\n", "* **be unable to implement a method**. For example, postgresql does not have nanosecond resolution, so it cannot implement a nanosecond method.\n", "* **return a different type**. For example, postgresql date operations often return floats, while pandas' versions return ints.\n", "\n", "To this end, the spec allows the `backends` field to override settings configured in `action`.\n", "\n", "### Requirements\n", "\n", "The spec should be used to do the following, without pulling in other data sources...\n", "\n", "* generate a support table\n", "* generate series method unit tests\n", "* generate the fast pandas grouped methods" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Managing contributions\n", "\n", "Because all activity is now summarized in the spec, we should be able to...\n", "\n", "* point contributors to todo entries\n", "* ask them to audit maydo or wontdo entries (e.g. in case I mistakenly thought something was not possible in postgres)\n", "\n", "Adding a contribution will likely follow these steps...\n", "\n", "* open the base dialect file (`siuba.sql.transform.py`)\n", "* open the postgresql dialect file (`siuba.sql.dialect.postgresql.py`)\n", "* for a todo method, add the implementation to the first file if it is generic, otherwise to the second\n", "* modify `siuba.spec.series.yml` to change \"todo\" to \"supported\"" ] }, { "cell_type": "markdown", "metadata": { "toc-hr-collapsed": true }, "source": [ "# Migration script" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below I read the existing spec (written using siu expressions) and wrangle it into the new yaml format. This is needed because flags for backend-specific exceptions were tacked on haphazardly as I went.\n", "\n", "For example:\n", "\n", "* unlike most aggregation methods, postgres's implementation of `_.nunique()` can't be used in a mutate.\n", "* for date functions, pandas often returns an integer where postgres returns a float.\n", "\n", "Rather than override postgres's behavior in the second case, I'd prefer to declare it. By declaring it, we can always change how we handle it later.\n", "\n", "The conversion code below is messy, but it gets the job done.\n",
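"\n", "For instance, the `nunique` entry should come out the other end with the postgres exception declared, roughly like so (a sketch of the target format, matching the dict shown in the airtable section below; note that yaml.dump sorts keys alphabetically):\n", "\n", "```yaml\n", "nunique:\n", "  action:\n", "    kind: agg\n", "    status: supported\n", "  backends:\n", "    postgresql:\n", "      flags:\n", "      - no_mutate\n", "  category: computations\n", "  example: _.nunique()\n", "```"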
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "from siuba.spec import series" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# NOTE: this is very messy--but also a one way trip to the new YAML format hopefully forever\n", "\n", "PLANS = {\"Todo\", \"Maydo\", \"Wontdo\"}\n", "POSTGRESQL_STATUS = {\"xfail\": \"todo\", \"not_impl\": \"wontdo\", None: None}\n", "\n", "def get_postgresql_status(result):\n", " status = POSTGRESQL_STATUS[result.get(\"postgresql\")]\n", " \n", " if status is None:# and result[\"type\"] not in PLANS:\n", " if \"sql_type\" in result:\n", " return {\"postgresql\": {\"result_type\": \"float\"}}\n", " if \"no_mutate\" in result:\n", " return {\"postgresql\": {\"flags\": [\"no_mutate\"]}}\n", "\n", " return {}\n", " \n", " return {\"postgresql\": {\"status\": status}}\n", "\n", "def get_pandas_status(result):\n", " return \"supported\" if result[\"type\"] not in PLANS else result[\"type\"].lower()\n", "\n", "def get_type_info2(call, method, category):\n", " if call.func != \"__rshift__\":\n", " raise ValueError(\"Expected first expressions was >>\")\n", " \n", " out = {}\n", " expr, result = call.args\n", " \n", " #accessors = ['str', 'dt', 'cat', 'sparse']\n", " #accessor = ([ameth for ameth in accessors if ameth in expr.op_vars()] + [None])[0]\n", " result_dict = result.to_dict()\n", " \n", " # format action ----\n", " action = {\n", " \"status\": get_pandas_status(result_dict),\n", " **result_dict\n", " }\n", " if action[\"type\"] not in PLANS:\n", " action[\"kind\"] = action[\"type\"].lower()\n", " if \"postgresql\" in action:\n", " del action[\"postgresql\"]\n", " if \"no_mutate\" in action:\n", " del action[\"no_mutate\"]\n", " if \"sql_type\" in action:\n", " del action[\"sql_type\"]\n", " del action[\"type\"]\n", " if \"op\" in action:\n", " action[\"input_type\"] = \"bool\"\n", " del action[\"op\"]\n", " \n", " \n", " # backends ---\n", " backends = get_postgresql_status(result_dict)\n", " \n", " return dict(\n", " example = str(expr),\n", " category = category,\n", " #expr_frame = replace_meta_args(expr, _.x, _.y, _.z),\n", " #accessor = accessor[0],\n", " backends = backends,\n", " action = action\n", " )\n", "\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "out = {}\n", "for category, d in series.funcs_stripped.items():\n", " for name, call in d.items():\n", " out[name] = get_type_info2(call, name, category)\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reading into pandas" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
methodexamplecategoryaction.statusaction.kindaction.input_typebackends.postgresql.statusbackends.postgresql.flagsbackends.postgresql.result_type
0__invert__~__special_methodssupportedelwiseboolNaNNaNNaN
1__and___ & __special_methodssupportedelwiseboolNaNNaNNaN
2__or___ | __special_methodssupportedelwiseboolNaNNaNNaN
3__xor___ ^ __special_methodssupportedelwisebooltodoNaNNaN
4__neg__-__special_methodssupportedelwiseNaNNaNNaNNaN
..............................
365to_dense_.to_dense()iowontdoNaNNaNNaNNaNNaN
366to_string_.to_string()iotodoNaNNaNNaNNaNNaN
367to_markdown_.to_markdown()iotodoNaNNaNNaNNaNNaN
368to_clipboard_.to_clipboard()iowontdoNaNNaNNaNNaNNaN
369to_latex_.to_latex()iowontdoNaNNaNNaNNaNNaN
\n", "

370 rows × 9 columns

\n", "
" ], "text/plain": [ " method example category action.status \\\n", "0 __invert__ ~_ _special_methods supported \n", "1 __and__ _ & _ _special_methods supported \n", "2 __or__ _ | _ _special_methods supported \n", "3 __xor__ _ ^ _ _special_methods supported \n", "4 __neg__ -_ _special_methods supported \n", ".. ... ... ... ... \n", "365 to_dense _.to_dense() io wontdo \n", "366 to_string _.to_string() io todo \n", "367 to_markdown _.to_markdown() io todo \n", "368 to_clipboard _.to_clipboard() io wontdo \n", "369 to_latex _.to_latex() io wontdo \n", "\n", " action.kind action.input_type backends.postgresql.status \\\n", "0 elwise bool NaN \n", "1 elwise bool NaN \n", "2 elwise bool NaN \n", "3 elwise bool todo \n", "4 elwise NaN NaN \n", ".. ... ... ... \n", "365 NaN NaN NaN \n", "366 NaN NaN NaN \n", "367 NaN NaN NaN \n", "368 NaN NaN NaN \n", "369 NaN NaN NaN \n", "\n", " backends.postgresql.flags backends.postgresql.result_type \n", "0 NaN NaN \n", "1 NaN NaN \n", "2 NaN NaN \n", "3 NaN NaN \n", "4 NaN NaN \n", ".. ... ... \n", "365 NaN NaN \n", "366 NaN NaN \n", "367 NaN NaN \n", "368 NaN NaN \n", "369 NaN NaN \n", "\n", "[370 rows x 9 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.json_normalize([{'method': k, **v} for k, v in out.items()])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dumping to yaml\n", "\n", "Key questions:\n", "\n", "* should we sort alphabetically? **I think yes.** It won't group categories, but will very predictable. The YAML spec is for storing, the airtable (or DF) for browsing.\n", "* loading the yaml takes half a second--is that reasonable for delaying the import? No, I don't think so. But it's only loaded during testing, and for the experimental fast pandas grouped ops. Before removing their experimental status, we should use the spec to generate the fast grouped ops code (or maybe save a subset of the spec for them, which shouldn't be edited)." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "T:\n", " action:\n", " status: wontdo\n", " backends: {}\n", " category: attributes\n", " example: _.T\n", "__add__:\n", " action:\n", " kind: elwise\n", " status: supported\n", " backends: {}\n", " category: _special_methods\n", " example: _ + _\n", "__and__:\n", " action:\n", " input_type: bool\n", " kind: elwise\n", " status: supported\n", " backends: {}\n", " category: _special_methods\n", " example: _ & _\n", "\n" ] } ], "source": [ "import yaml\n", "print(yaml.dump(out)[:344])\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# uncomment to dump\n", "#yaml.dump(out, open(\"../../siuba/spec/series.yml\", \"w\"))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'/Users/machow/Dropbox/Repo/siuba/siuba/spec/series.yml'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pkg_resources\n", "\n", "# will go here (on my filesystem)\n", "pkg_resources.resource_filename(\"siuba.spec\", \"series.yml\")" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 460 ms, sys: 12.6 ms, total: 472 ms\n", "Wall time: 600 ms\n" ] } ], "source": [ "%%time\n", "spec = yaml.load(open(\"../../siuba/spec/series.yml\"), Loader = yaml.SafeLoader)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Join with airtable\n", "\n", "As a reminder, an entry from the yaml spec so far is shown below..." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'example': '_.nunique()',\n", " 'category': 'computations',\n", " 'backends': {'postgresql': {'flags': ['no_mutate']}},\n", " 'action': {'status': 'supported', 'kind': 'agg'}}" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw_spec = out\n", "raw_spec[\"nunique\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is useful, but I had tracked other information on the airtable, like..\n", "\n", "* **priority**: how important is this to implement?\n", "* **version deprecated**: when was this deprecated?\n", "* **result length**: eg does it return a single value, a value for each group, or something else?\n", "\n", "For now, I'll pull out priority, and will likely just keep the other info in the airtable. I would rather start with less rather than more (and wrap up faster in the process)." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
methoddata
0__invert__{'example': '~_', 'category': '_special_method...
1__and__{'example': '_ & _', 'category': '_special_met...
2__or__{'example': '_ | _', 'category': '_special_met...
3__xor__{'example': '_ ^ _', 'category': '_special_met...
4__neg__{'example': '-_', 'category': '_special_method...
\n", "
" ], "text/plain": [ " method data\n", "0 __invert__ {'example': '~_', 'category': '_special_method...\n", "1 __and__ {'example': '_ & _', 'category': '_special_met...\n", "2 __or__ {'example': '_ | _', 'category': '_special_met...\n", "3 __xor__ {'example': '_ ^ _', 'category': '_special_met...\n", "4 __neg__ {'example': '-_', 'category': '_special_method..." ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "# read yaml spec into a dataframe, so we can join w/ airtable\n", "data = pd.DataFrame([{'method': k, 'data': v} for k, v in raw_spec.items()])\n", "data.head()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "from airtable import Airtable\n", "import os\n", "\n", "# note, airtable API key is in my environment\n", "airtable = Airtable('appErTNqCFXn6stSH', 'methods')\n", "\n", "air_entries = airtable.get_all()\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idcreatedTimecategorymethod_namesupport_categoryop_typemin_data_arityNameversion_addedresult_lengthnoteversion_deprecatedmax_data_arity
0rec09hioM01yPbBQB2020-01-20T18:31:22.000ZComputations / descriptive statsmindoneaggregation1.0minNaNNaNNaNNaNNaN
1rec0eOvXaPwn0KqKs2020-01-20T18:31:22.000ZTimedelta methodscat.as_orderedpriority-lowNaNNaNcat.as_orderedNaNNaNNaNNaNNaN
2rec0myu0wfZxDHVCK2020-01-20T18:31:22.000ZTimedelta methodsstr.findalldoneelementwiseNaNstr.findallNaNNaNNaNNaNNaN
3rec0qdwOY8z1A9AsO2020-01-20T18:31:22.000ZReshaping, sortingargmindeprecatedNaNNaNargminv0.21.0NaNNaNNaNNaN
4rec0vjO0Mni6FXWa72020-01-20T18:31:22.000ZTimedelta methodsstr.isdecimaldoneelementwiseNaNstr.isdecimalNaNNaNNaNNaNNaN
\n", "
" ], "text/plain": [ " id createdTime \\\n", "0 rec09hioM01yPbBQB 2020-01-20T18:31:22.000Z \n", "1 rec0eOvXaPwn0KqKs 2020-01-20T18:31:22.000Z \n", "2 rec0myu0wfZxDHVCK 2020-01-20T18:31:22.000Z \n", "3 rec0qdwOY8z1A9AsO 2020-01-20T18:31:22.000Z \n", "4 rec0vjO0Mni6FXWa7 2020-01-20T18:31:22.000Z \n", "\n", " category method_name support_category \\\n", "0 Computations / descriptive stats min done \n", "1 Timedelta methods cat.as_ordered priority-low \n", "2 Timedelta methods str.findall done \n", "3 Reshaping, sorting argmin deprecated \n", "4 Timedelta methods str.isdecimal done \n", "\n", " op_type min_data_arity Name version_added result_length \\\n", "0 aggregation 1.0 min NaN NaN \n", "1 NaN NaN cat.as_ordered NaN NaN \n", "2 elementwise NaN str.findall NaN NaN \n", "3 NaN NaN argmin v0.21.0 NaN \n", "4 elementwise NaN str.isdecimal NaN NaN \n", "\n", " note version_deprecated max_data_arity \n", "0 NaN NaN NaN \n", "1 NaN NaN NaN \n", "2 NaN NaN NaN \n", "3 NaN NaN NaN \n", "4 NaN NaN NaN " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "air_df = pd.json_normalize(air_entries)\n", "air_df.columns = air_df.columns.str.replace(\"fields.\", \"\")\n", "air_df.head()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('rpow',\n", " {'example': '_.rpow(_)',\n", " 'category': 'binary',\n", " 'backends': {'postgresql': {'status': 'todo'}},\n", " 'action': {'status': 'supported', 'kind': 'elwise'}}),\n", " ('combine',\n", " {'example': \"_.combine(_,'max')\",\n", " 'category': 'binary',\n", " 'backends': {},\n", " 'action': {'status': 'todo'},\n", " 'priority': 1}),\n", " ('combine_first',\n", " {'example': \"_.combine_first(_,'max')\",\n", " 'category': 'binary',\n", " 'backends': {},\n", " 'action': {'status': 'todo'},\n", " 'priority': 1})]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Pull out priority info\n", "from siuba import *\n", "prioritized = (\n", " data\n", " >> full_join(_, air_df, {\"method\": \"method_name\"})\n", " >> filter(~_.data.isna())\n", " >> select(-_.method_name, -_.createdTime, -_.id, -_.expr_frame, -_.expr_series)\n", ") \n", "\n", "new_yaml = (prioritized\n", " >> mutate(\n", " priority = _.support_category.map({\n", " 'priority-high': 3, 'priority-medium': 2, 'priority-low': 1, 'priority-zero': 0\n", " }),\n", " data = _.apply(\n", " lambda d: {\n", " **d[\"data\"],\n", " **({'priority': int(d[\"priority\"])} if not pd.isna(d[\"priority\"]) else {})},\n", " axis = 1\n", " )\n", " )\n", " >> pipe(_.set_index(\"method\").data.to_dict())\n", ")\n", "\n", "list(new_yaml.items())[109:112]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "# uncomment to save yaml spec\n", "#yaml.dump(new_yaml, open(\"../../siuba/spec/series.yml\", \"w\"))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 4 }