{ "cells": [ { "cell_type": "markdown", "metadata": { "toc-hr-collapsed": true }, "source": [ "YAML specification for Series method support across backends\n", "============================================================\n", "\n", "(drafted around 21 March 2020)\n", "\n", "Siuba aims to support as pandas Series methods across a range of backends.\n", "This means that users should be able to use the same method call they would on a pandas Series, and get back\n", "roughly the same result.\n", "\n", "Backends include:\n", "\n", "* pandas - siuba makes grouped operations more flexible and faster, while returning the same results as using the `.apply` method.\n", "* SQL (sqlite and postgresql) - for the exact same code used on pandas, users should be able to generate a SQL query.\n", "* spark (in progress) - as with SQL, users should be able to execute in spark.\n", "\n", "Note that since pandas (e.g. with the `.apply` method) is the **reference implementation**. There are three big benefits of specifying the reference as data (eg in yaml): testing conformance, documentating, tracking changes over releases.\n", "\n", "**In this document, I'll first review the value of a spec, go through siuba's current spec, and then the script for transitioning to yaml**.\n", "\n", "## More on importance\n", "**Testing conformance** is important because we need to trust the different backends can be swapped in. This means, that for every series method they support, there should be at least one test that they...\n", "\n", "* return the exact same result\n", "* in cases where the results differ (e.g. pandas returns an int, SQL returns a float), it should be marked explicitly somewhere.\n", "* in cases where they don't support a verb (e.g. some SQL aggregations can't be used in mutate).\n", "\n", "**Documenting** is important because there are over 300 Series methods. A compact representation of support across backends will let people work quickly.\n", "\n", "**Tracking changes over releases** is important because as time goes on, we'll likely need to react to methods being deprecated in pandas.\n", "\n", "## Why didn't I start with the YAML specification?\n", "\n", "Because there are 300+ Series methods, I wanted to prioritize a wide format, with few enclosing things `()/{}/[]`.\n", "I was concerned that a long document would require a lot of scanning, and would be hard to jump in to.\n", "\n", "I was also doing a lot of research / exploration on spreadsheet (actually, on airtable!).\n", "\n", "Now that things are much further along, I'm ready to pay down the technical debt, while preserving two valuable modes of interacting with the spec:\n", "\n", "* yaml <-> spreadsheet - so I can have the wide representation on the spreadsheet\n", "* yaml -> implementation - so it's no longer python code (better for docs), more explicit, and changes can be clearly diff'd in PRs." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## New YAML specification\n", "\n", "### Schema\n", "\n", "* example\n", "* category\n", "* backends:\n", " - postgresql: \\\n", "* importance: (currently held in airtable)\n", "* action:\n", " - status: {supported, todo, maydo, wontdo}\n", " - kind: {elwise,agg,window}\n", " - input_type: (default: null)\n", " - result_type: (default: null)\n", " - flags: [no_mutate, ...]\n", "\n", "### Case: Pandas versions\n", "\n", "Siuba supports multiple pandas versions, so the spec will contain methods that exist in one version but not another. 
"\n", "### Case: Pandas versions\n", "\n", "Siuba supports multiple pandas versions, so the spec will contain methods that exist in one version but not another. Similar considerations apply to deprecated methods.\n", "\n", "### Case: Backend exceptions\n", "\n", "Backends may...\n", "\n", "* **be unable to implement a method**. For example, postgresql does not have nanosecond resolution, so it cannot implement a nanosecond method.\n", "* **return a different type**. For example, postgresql date operations often return floats, while pandas' versions return ints.\n", "\n", "To this end, the spec allows the `backends` field to override settings configured in `action`.\n", "\n", "### Requirements\n", "\n", "The spec should be used to do the following, without pulling in other data sources...\n", "\n", "* generate a support table\n", "* generate series method unit tests\n", "* generate the fast pandas grouped methods" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Managing contributions\n", "\n", "Because all activity is now summarized in the spec, we should be able to...\n", "\n", "* point contributors to todo entries\n", "* ask them to audit maydo or wontdo entries (e.g. in case I mistakenly thought something was not possible in postgres)\n", "\n", "Adding a contribution will likely follow these steps...\n", "\n", "* open the base dialect file (`siuba.sql.transform.py`)\n", "* open the postgresql dialect file (`siuba.sql.dialect.postgresql.py`)\n", "* for a todo method, add the implementation to the first file if it is generic, otherwise to the second\n", "* modify `siuba.spec.series.yml` to change \"todo\" to \"supported\"" ] }, { "cell_type": "markdown", "metadata": { "toc-hr-collapsed": true }, "source": [ "# Migration script" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below I read the existing spec (written using siu expressions) and wrangle it into the new yaml format. This is needed because flags for backend-specific exceptions were tacked on haphazardly as I went.\n", "\n", "For example:\n", "\n", "* unlike most aggregation methods, postgres's implementation of `_.nunique()` can't be used in a mutate.\n", "* for date functions, pandas often returns an integer where postgres returns a float.\n", "\n", "Rather than override postgres's behavior in the second case, I'd prefer to declare it. By declaring it, we can always change how we handle it later.\n", "\n", "The conversion code below is messy, but it gets the job done.\n",
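"\n", "For instance, the `nunique` entry should come out the other end with the postgres exception declared, roughly like so (a sketch of the target format, matching the dict shown in the airtable section below; note that yaml.dump sorts keys alphabetically):\n", "\n", "```yaml\n", "nunique:\n", "  action:\n", "    kind: agg\n", "    status: supported\n", "  backends:\n", "    postgresql:\n", "      flags:\n", "      - no_mutate\n", "  category: computations\n", "  example: _.nunique()\n", "```"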
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "from siuba.spec import series" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# NOTE: this is very messy--but also a one way trip to the new YAML format hopefully forever\n", "\n", "PLANS = {\"Todo\", \"Maydo\", \"Wontdo\"}\n", "POSTGRESQL_STATUS = {\"xfail\": \"todo\", \"not_impl\": \"wontdo\", None: None}\n", "\n", "def get_postgresql_status(result):\n", " status = POSTGRESQL_STATUS[result.get(\"postgresql\")]\n", " \n", " if status is None:# and result[\"type\"] not in PLANS:\n", " if \"sql_type\" in result:\n", " return {\"postgresql\": {\"result_type\": \"float\"}}\n", " if \"no_mutate\" in result:\n", " return {\"postgresql\": {\"flags\": [\"no_mutate\"]}}\n", "\n", " return {}\n", " \n", " return {\"postgresql\": {\"status\": status}}\n", "\n", "def get_pandas_status(result):\n", " return \"supported\" if result[\"type\"] not in PLANS else result[\"type\"].lower()\n", "\n", "def get_type_info2(call, method, category):\n", " if call.func != \"__rshift__\":\n", " raise ValueError(\"Expected first expressions was >>\")\n", " \n", " out = {}\n", " expr, result = call.args\n", " \n", " #accessors = ['str', 'dt', 'cat', 'sparse']\n", " #accessor = ([ameth for ameth in accessors if ameth in expr.op_vars()] + [None])[0]\n", " result_dict = result.to_dict()\n", " \n", " # format action ----\n", " action = {\n", " \"status\": get_pandas_status(result_dict),\n", " **result_dict\n", " }\n", " if action[\"type\"] not in PLANS:\n", " action[\"kind\"] = action[\"type\"].lower()\n", " if \"postgresql\" in action:\n", " del action[\"postgresql\"]\n", " if \"no_mutate\" in action:\n", " del action[\"no_mutate\"]\n", " if \"sql_type\" in action:\n", " del action[\"sql_type\"]\n", " del action[\"type\"]\n", " if \"op\" in action:\n", " action[\"input_type\"] = \"bool\"\n", " del action[\"op\"]\n", " \n", " \n", " # backends ---\n", " backends = get_postgresql_status(result_dict)\n", " \n", " return dict(\n", " example = str(expr),\n", " category = category,\n", " #expr_frame = replace_meta_args(expr, _.x, _.y, _.z),\n", " #accessor = accessor[0],\n", " backends = backends,\n", " action = action\n", " )\n", "\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "out = {}\n", "for category, d in series.funcs_stripped.items():\n", " for name, call in d.items():\n", " out[name] = get_type_info2(call, name, category)\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reading into pandas" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
methodexamplecategoryaction.statusaction.kindaction.input_typebackends.postgresql.statusbackends.postgresql.flagsbackends.postgresql.result_type
0__invert__~__special_methodssupportedelwiseboolNaNNaNNaN
1__and___ & __special_methodssupportedelwiseboolNaNNaNNaN
2__or___ | __special_methodssupportedelwiseboolNaNNaNNaN
3__xor___ ^ __special_methodssupportedelwisebooltodoNaNNaN
4__neg__-__special_methodssupportedelwiseNaNNaNNaNNaN
..............................
365to_dense_.to_dense()iowontdoNaNNaNNaNNaNNaN
366to_string_.to_string()iotodoNaNNaNNaNNaNNaN
367to_markdown_.to_markdown()iotodoNaNNaNNaNNaNNaN
368to_clipboard_.to_clipboard()iowontdoNaNNaNNaNNaNNaN
369to_latex_.to_latex()iowontdoNaNNaNNaNNaNNaN
\n", "

370 rows × 9 columns

\n", "
" ], "text/plain": [ " method example category action.status \\\n", "0 __invert__ ~_ _special_methods supported \n", "1 __and__ _ & _ _special_methods supported \n", "2 __or__ _ | _ _special_methods supported \n", "3 __xor__ _ ^ _ _special_methods supported \n", "4 __neg__ -_ _special_methods supported \n", ".. ... ... ... ... \n", "365 to_dense _.to_dense() io wontdo \n", "366 to_string _.to_string() io todo \n", "367 to_markdown _.to_markdown() io todo \n", "368 to_clipboard _.to_clipboard() io wontdo \n", "369 to_latex _.to_latex() io wontdo \n", "\n", " action.kind action.input_type backends.postgresql.status \\\n", "0 elwise bool NaN \n", "1 elwise bool NaN \n", "2 elwise bool NaN \n", "3 elwise bool todo \n", "4 elwise NaN NaN \n", ".. ... ... ... \n", "365 NaN NaN NaN \n", "366 NaN NaN NaN \n", "367 NaN NaN NaN \n", "368 NaN NaN NaN \n", "369 NaN NaN NaN \n", "\n", " backends.postgresql.flags backends.postgresql.result_type \n", "0 NaN NaN \n", "1 NaN NaN \n", "2 NaN NaN \n", "3 NaN NaN \n", "4 NaN NaN \n", ".. ... ... \n", "365 NaN NaN \n", "366 NaN NaN \n", "367 NaN NaN \n", "368 NaN NaN \n", "369 NaN NaN \n", "\n", "[370 rows x 9 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.json_normalize([{'method': k, **v} for k, v in out.items()])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dumping to yaml\n", "\n", "Key questions:\n", "\n", "* should we sort alphabetically? **I think yes.** It won't group categories, but will very predictable. The YAML spec is for storing, the airtable (or DF) for browsing.\n", "* loading the yaml takes half a second--is that reasonable for delaying the import? No, I don't think so. But it's only loaded during testing, and for the experimental fast pandas grouped ops. Before removing their experimental status, we should use the spec to generate the fast grouped ops code (or maybe save a subset of the spec for them, which shouldn't be edited)." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "T:\n", " action:\n", " status: wontdo\n", " backends: {}\n", " category: attributes\n", " example: _.T\n", "__add__:\n", " action:\n", " kind: elwise\n", " status: supported\n", " backends: {}\n", " category: _special_methods\n", " example: _ + _\n", "__and__:\n", " action:\n", " input_type: bool\n", " kind: elwise\n", " status: supported\n", " backends: {}\n", " category: _special_methods\n", " example: _ & _\n", "\n" ] } ], "source": [ "import yaml\n", "print(yaml.dump(out)[:344])\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# uncomment to dump\n", "#yaml.dump(out, open(\"../../siuba/spec/series.yml\", \"w\"))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'/Users/machow/Dropbox/Repo/siuba/siuba/spec/series.yml'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pkg_resources\n", "\n", "# will go here (on my filesystem)\n", "pkg_resources.resource_filename(\"siuba.spec\", \"series.yml\")" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 460 ms, sys: 12.6 ms, total: 472 ms\n", "Wall time: 600 ms\n" ] } ], "source": [ "%%time\n", "spec = yaml.load(open(\"../../siuba/spec/series.yml\"), Loader = yaml.SafeLoader)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Join with airtable\n", "\n", "As a reminder, an entry from the yaml spec so far is shown below..." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'example': '_.nunique()',\n", " 'category': 'computations',\n", " 'backends': {'postgresql': {'flags': ['no_mutate']}},\n", " 'action': {'status': 'supported', 'kind': 'agg'}}" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw_spec = out\n", "raw_spec[\"nunique\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is useful, but I had tracked other information on the airtable, like..\n", "\n", "* **priority**: how important is this to implement?\n", "* **version deprecated**: when was this deprecated?\n", "* **result length**: eg does it return a single value, a value for each group, or something else?\n", "\n", "For now, I'll pull out priority, and will likely just keep the other info in the airtable. I would rather start with less rather than more (and wrap up faster in the process)." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
methoddata
0__invert__{'example': '~_', 'category': '_special_method...
1__and__{'example': '_ & _', 'category': '_special_met...
2__or__{'example': '_ | _', 'category': '_special_met...
3__xor__{'example': '_ ^ _', 'category': '_special_met...
4__neg__{'example': '-_', 'category': '_special_method...
\n", "
" ], "text/plain": [ " method data\n", "0 __invert__ {'example': '~_', 'category': '_special_method...\n", "1 __and__ {'example': '_ & _', 'category': '_special_met...\n", "2 __or__ {'example': '_ | _', 'category': '_special_met...\n", "3 __xor__ {'example': '_ ^ _', 'category': '_special_met...\n", "4 __neg__ {'example': '-_', 'category': '_special_method..." ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "# read yaml spec into a dataframe, so we can join w/ airtable\n", "data = pd.DataFrame([{'method': k, 'data': v} for k, v in raw_spec.items()])\n", "data.head()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "from airtable import Airtable\n", "import os\n", "\n", "# note, airtable API key is in my environment\n", "airtable = Airtable('appErTNqCFXn6stSH', 'methods')\n", "\n", "air_entries = airtable.get_all()\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idcreatedTimecategorymethod_namesupport_categoryop_typemin_data_arityNameversion_addedresult_lengthnoteversion_deprecatedmax_data_arity
0rec09hioM01yPbBQB2020-01-20T18:31:22.000ZComputations / descriptive statsmindoneaggregation1.0minNaNNaNNaNNaNNaN
1rec0eOvXaPwn0KqKs2020-01-20T18:31:22.000ZTimedelta methodscat.as_orderedpriority-lowNaNNaNcat.as_orderedNaNNaNNaNNaNNaN
2rec0myu0wfZxDHVCK2020-01-20T18:31:22.000ZTimedelta methodsstr.findalldoneelementwiseNaNstr.findallNaNNaNNaNNaNNaN
3rec0qdwOY8z1A9AsO2020-01-20T18:31:22.000ZReshaping, sortingargmindeprecatedNaNNaNargminv0.21.0NaNNaNNaNNaN
4rec0vjO0Mni6FXWa72020-01-20T18:31:22.000ZTimedelta methodsstr.isdecimaldoneelementwiseNaNstr.isdecimalNaNNaNNaNNaNNaN
\n", "
" ], "text/plain": [ " id createdTime \\\n", "0 rec09hioM01yPbBQB 2020-01-20T18:31:22.000Z \n", "1 rec0eOvXaPwn0KqKs 2020-01-20T18:31:22.000Z \n", "2 rec0myu0wfZxDHVCK 2020-01-20T18:31:22.000Z \n", "3 rec0qdwOY8z1A9AsO 2020-01-20T18:31:22.000Z \n", "4 rec0vjO0Mni6FXWa7 2020-01-20T18:31:22.000Z \n", "\n", " category method_name support_category \\\n", "0 Computations / descriptive stats min done \n", "1 Timedelta methods cat.as_ordered priority-low \n", "2 Timedelta methods str.findall done \n", "3 Reshaping, sorting argmin deprecated \n", "4 Timedelta methods str.isdecimal done \n", "\n", " op_type min_data_arity Name version_added result_length \\\n", "0 aggregation 1.0 min NaN NaN \n", "1 NaN NaN cat.as_ordered NaN NaN \n", "2 elementwise NaN str.findall NaN NaN \n", "3 NaN NaN argmin v0.21.0 NaN \n", "4 elementwise NaN str.isdecimal NaN NaN \n", "\n", " note version_deprecated max_data_arity \n", "0 NaN NaN NaN \n", "1 NaN NaN NaN \n", "2 NaN NaN NaN \n", "3 NaN NaN NaN \n", "4 NaN NaN NaN " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "air_df = pd.json_normalize(air_entries)\n", "air_df.columns = air_df.columns.str.replace(\"fields.\", \"\")\n", "air_df.head()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('rpow',\n", " {'example': '_.rpow(_)',\n", " 'category': 'binary',\n", " 'backends': {'postgresql': {'status': 'todo'}},\n", " 'action': {'status': 'supported', 'kind': 'elwise'}}),\n", " ('combine',\n", " {'example': \"_.combine(_,'max')\",\n", " 'category': 'binary',\n", " 'backends': {},\n", " 'action': {'status': 'todo'},\n", " 'priority': 1}),\n", " ('combine_first',\n", " {'example': \"_.combine_first(_,'max')\",\n", " 'category': 'binary',\n", " 'backends': {},\n", " 'action': {'status': 'todo'},\n", " 'priority': 1})]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Pull out priority info\n", "from siuba import *\n", "prioritized = (\n", " data\n", " >> full_join(_, air_df, {\"method\": \"method_name\"})\n", " >> filter(~_.data.isna())\n", " >> select(-_.method_name, -_.createdTime, -_.id, -_.expr_frame, -_.expr_series)\n", ") \n", "\n", "new_yaml = (prioritized\n", " >> mutate(\n", " priority = _.support_category.map({\n", " 'priority-high': 3, 'priority-medium': 2, 'priority-low': 1, 'priority-zero': 0\n", " }),\n", " data = _.apply(\n", " lambda d: {\n", " **d[\"data\"],\n", " **({'priority': int(d[\"priority\"])} if not pd.isna(d[\"priority\"]) else {})},\n", " axis = 1\n", " )\n", " )\n", " >> pipe(_.set_index(\"method\").data.to_dict())\n", ")\n", "\n", "list(new_yaml.items())[109:112]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "# uncomment to save yaml spec\n", "#yaml.dump(new_yaml, open(\"../../siuba/spec/series.yml\", \"w\"))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 4 }