{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Lale: Library for Semi-Automated Data Science\n",
    "\n",
    "Martin Hirzel, Kiran Kate, Avi Shinnar, Guillaume Baudart, and Pari Ram\n",
    "\n",
    "5 November 2019\n",
    "\n",
    "Examples, documentation, code: https://github.com/ibm/lale\n",
    "\n",
    "<img src=\"../docs/img/lale_logo.jpg\" alt=\"logo\" width=\"140px\" align=\"left\"/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook is the basis for a\n",
    "[talk](https://pydata.org/nyc2019/schedule/presentation/29/type-driven-automated-learning-with-lale/)\n",
    "about [Lale](https://github.com/ibm/lale).  Lale is an open-source\n",
    "Python library for semi-automated data science. Lale is compatible\n",
    "with scikit-learn, adding a simple interface to existing\n",
    "machine-learning automation tools. Lale lets you search over possible\n",
    "pipelines in just a few lines of code while remaining in control of\n",
    "your work."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Value Proposition\n",
    "\n",
    "When writing machine-learning pipelines, you have a lot of decisions\n",
    "to make, such as picking transformers, estimators, and\n",
    "hyperparameters. Since some of these decisions are tricky, you will\n",
    "likely find yourself searching over many possible pipelines.\n",
    "Machine-learning automation tools help with this search.\n",
    "Unfortunately, each of these tools has its own API, and the search\n",
    "spaces are not necessarily consistent or even correct. We have\n",
    "discovered that types (such as enum, float, or dictionary) can both\n",
    "check the correctness of, and help automatically search over,\n",
    "hyperparameters and pipeline configurations.\n",
    "\n",
    "To address this issue, we have released Lale, an open-source\n",
    "Python library for semi-automated data science. Lale is compatible\n",
    "with scikit-learn, adding a simple interface to existing\n",
    "machine-learning automation tools.\n",
    "Lale is designed to augment, but not replace, the data scientist.\n",
    "\n",
    "The **target user** of Lale is the working data scientist.\n",
    "The **scope** of Lale includes machine learning (both deep learning\n",
    "and non-DL) and data preparation. The **value** of Lale encompasses:\n",
    "\n",
    "<img src=\"img/2019-1105-three-values.png\" style=\"width:350px\" align=\"left\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Categorical + Continuous Dataset\n",
    "\n",
    "To demonstrate automated machine learning (AutoML), we first need\n",
    "a dataset. We will use a tabular dataset that has two kinds of\n",
    "features: categorical features (columns that can contain one of a\n",
    "small set of strings) and continuous features (numerical columns).  In\n",
    "particular, we use the `credit-g` dataset from OpenML. After fetching\n",
    "the data, we display a few rows to get a better understanding of its\n",
    "labels and features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: liac-arff>=2.4.0 in /usr/local/lib/python3.7/site-packages (2.4.0)\n"
     ]
    }
   ],
   "source": [
    "!pip install 'liac-arff>=2.4.0'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>class</th>\n",
       "      <th>checking_status</th>\n",
       "      <th>duration</th>\n",
       "      <th>credit_history</th>\n",
       "      <th>purpose</th>\n",
       "      <th>credit_amount</th>\n",
       "      <th>savings_status</th>\n",
       "      <th>employment</th>\n",
       "      <th>installment_commitment</th>\n",
       "      <th>personal_status</th>\n",
       "      <th>other_parties</th>\n",
       "      <th>residence_since</th>\n",
       "      <th>property_magnitude</th>\n",
       "      <th>age</th>\n",
       "      <th>other_payment_plans</th>\n",
       "      <th>housing</th>\n",
       "      <th>existing_credits</th>\n",
       "      <th>job</th>\n",
       "      <th>num_dependents</th>\n",
       "      <th>own_telephone</th>\n",
       "      <th>foreign_worker</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>835</th>\n",
       "      <td>bad</td>\n",
       "      <td>&lt;0</td>\n",
       "      <td>12.0</td>\n",
       "      <td>no credits/all paid</td>\n",
       "      <td>new car</td>\n",
       "      <td>1082.0</td>\n",
       "      <td>&lt;100</td>\n",
       "      <td>1&lt;=X&lt;4</td>\n",
       "      <td>4.0</td>\n",
       "      <td>male single</td>\n",
       "      <td>none</td>\n",
       "      <td>4.0</td>\n",
       "      <td>car</td>\n",
       "      <td>48.0</td>\n",
       "      <td>bank</td>\n",
       "      <td>own</td>\n",
       "      <td>2.0</td>\n",
       "      <td>skilled</td>\n",
       "      <td>1.0</td>\n",
       "      <td>none</td>\n",
       "      <td>yes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>192</th>\n",
       "      <td>bad</td>\n",
       "      <td>0&lt;=X&lt;200</td>\n",
       "      <td>27.0</td>\n",
       "      <td>existing paid</td>\n",
       "      <td>business</td>\n",
       "      <td>3915.0</td>\n",
       "      <td>&lt;100</td>\n",
       "      <td>1&lt;=X&lt;4</td>\n",
       "      <td>4.0</td>\n",
       "      <td>male single</td>\n",
       "      <td>none</td>\n",
       "      <td>2.0</td>\n",
       "      <td>car</td>\n",
       "      <td>36.0</td>\n",
       "      <td>none</td>\n",
       "      <td>own</td>\n",
       "      <td>1.0</td>\n",
       "      <td>skilled</td>\n",
       "      <td>2.0</td>\n",
       "      <td>yes</td>\n",
       "      <td>yes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>629</th>\n",
       "      <td>good</td>\n",
       "      <td>no checking</td>\n",
       "      <td>9.0</td>\n",
       "      <td>existing paid</td>\n",
       "      <td>education</td>\n",
       "      <td>3832.0</td>\n",
       "      <td>no known savings</td>\n",
       "      <td>&gt;=7</td>\n",
       "      <td>1.0</td>\n",
       "      <td>male single</td>\n",
       "      <td>none</td>\n",
       "      <td>4.0</td>\n",
       "      <td>real estate</td>\n",
       "      <td>64.0</td>\n",
       "      <td>none</td>\n",
       "      <td>own</td>\n",
       "      <td>1.0</td>\n",
       "      <td>unskilled resident</td>\n",
       "      <td>1.0</td>\n",
       "      <td>none</td>\n",
       "      <td>yes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>559</th>\n",
       "      <td>bad</td>\n",
       "      <td>0&lt;=X&lt;200</td>\n",
       "      <td>18.0</td>\n",
       "      <td>critical/other existing credit</td>\n",
       "      <td>furniture/equipment</td>\n",
       "      <td>1928.0</td>\n",
       "      <td>&lt;100</td>\n",
       "      <td>&lt;1</td>\n",
       "      <td>2.0</td>\n",
       "      <td>male single</td>\n",
       "      <td>none</td>\n",
       "      <td>2.0</td>\n",
       "      <td>real estate</td>\n",
       "      <td>31.0</td>\n",
       "      <td>none</td>\n",
       "      <td>own</td>\n",
       "      <td>2.0</td>\n",
       "      <td>unskilled resident</td>\n",
       "      <td>1.0</td>\n",
       "      <td>none</td>\n",
       "      <td>yes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>684</th>\n",
       "      <td>good</td>\n",
       "      <td>0&lt;=X&lt;200</td>\n",
       "      <td>36.0</td>\n",
       "      <td>delayed previously</td>\n",
       "      <td>business</td>\n",
       "      <td>9857.0</td>\n",
       "      <td>100&lt;=X&lt;500</td>\n",
       "      <td>4&lt;=X&lt;7</td>\n",
       "      <td>1.0</td>\n",
       "      <td>male single</td>\n",
       "      <td>none</td>\n",
       "      <td>3.0</td>\n",
       "      <td>life insurance</td>\n",
       "      <td>31.0</td>\n",
       "      <td>none</td>\n",
       "      <td>own</td>\n",
       "      <td>2.0</td>\n",
       "      <td>unskilled resident</td>\n",
       "      <td>2.0</td>\n",
       "      <td>yes</td>\n",
       "      <td>yes</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    class checking_status  duration                  credit_history  \\\n",
       "835   bad              <0      12.0             no credits/all paid   \n",
       "192   bad        0<=X<200      27.0                   existing paid   \n",
       "629  good     no checking       9.0                   existing paid   \n",
       "559   bad        0<=X<200      18.0  critical/other existing credit   \n",
       "684  good        0<=X<200      36.0              delayed previously   \n",
       "\n",
       "                 purpose  credit_amount    savings_status employment  \\\n",
       "835              new car         1082.0              <100     1<=X<4   \n",
       "192             business         3915.0              <100     1<=X<4   \n",
       "629            education         3832.0  no known savings        >=7   \n",
       "559  furniture/equipment         1928.0              <100         <1   \n",
       "684             business         9857.0        100<=X<500     4<=X<7   \n",
       "\n",
       "     installment_commitment personal_status other_parties  residence_since  \\\n",
       "835                     4.0     male single          none              4.0   \n",
       "192                     4.0     male single          none              2.0   \n",
       "629                     1.0     male single          none              4.0   \n",
       "559                     2.0     male single          none              2.0   \n",
       "684                     1.0     male single          none              3.0   \n",
       "\n",
       "    property_magnitude   age other_payment_plans housing  existing_credits  \\\n",
       "835                car  48.0                bank     own               2.0   \n",
       "192                car  36.0                none     own               1.0   \n",
       "629        real estate  64.0                none     own               1.0   \n",
       "559        real estate  31.0                none     own               2.0   \n",
       "684     life insurance  31.0                none     own               2.0   \n",
       "\n",
       "                    job  num_dependents own_telephone foreign_worker  \n",
       "835             skilled             1.0          none            yes  \n",
       "192             skilled             2.0           yes            yes  \n",
       "629  unskilled resident             1.0          none            yes  \n",
       "559  unskilled resident             1.0          none            yes  \n",
       "684  unskilled resident             2.0           yes            yes  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import lale.datasets.openml\n",
    "import pandas as pd\n",
    "(train_X, train_y), (test_X, test_y) = lale.datasets.openml.fetch(\n",
    "    'credit-g', 'classification', preprocess=False)\n",
    "# print last five rows of labels in train_y and features in train_X\n",
    "pd.options.display.max_columns = None\n",
    "pd.concat([train_y.tail(5), train_X.tail(5)], axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The labels `y` are either `good` or `bad`, which means that this is a binary\n",
    "classification task. The remaining 20 columns are features. Some of the\n",
    "features are categorical, such as `checking_status` or\n",
    "`credit_history`. Other features are continuous, such as `duration` or\n",
    "`credit_amount`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Manual Pipeline\n",
    "\n",
    "Lale is designed to support both manual and automated data science,\n",
    "with a consistent API that works for both of these cases as well as\n",
    "for the spectrum of semi-automated cases in between. A\n",
    "machine-learning *pipeline* is a computational graph, where each\n",
    "node is an *operator* (which transforms data or makes predictions)\n",
    "and each edge is an intermediary *dataset* (outputs from previous\n",
    "operators are piped as inputs to the next operators). We first import\n",
    "some operators from scikit-learn and ask Lale to wrap them for us."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import Normalizer as Norm\n",
    "from sklearn.preprocessing import OneHotEncoder as OneHot\n",
    "from sklearn.linear_model import LogisticRegression as LR\n",
    "import lale\n",
    "lale.wrap_imported_operators()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next cell imports a couple of utility operators from Lale's\n",
    "standard library."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "from lale.lib.lale import Project\n",
    "from lale.lib.lale import ConcatFeatures as Concat"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have all the operators we need, we arrange them into a\n",
    "pipeline. The `Project` operator works like the corresponding\n",
    "relational algebra primitive, picking a subset of the columns of the\n",
    "dataset. A sub-pipeline of the form `op1 >> op2` *pipes* the output\n",
    "from `op1` into `op2`. A sub-pipeline of the form `op1 & op2` causes\n",
    "both `op1` *and* `op2` to execute on the same data. The overall pipeline\n",
    "preprocesses numbers with a normalizer and strings with a one-hot\n",
    "encoder, then concatenates the corresponding columns and pipes the\n",
    "result into an `LR` classifier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "image/svg+xml": [
       "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
       "<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
       " \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
       "<!-- Generated by graphviz version 2.40.1 (20161225.0304)\n",
       " -->\n",
       "<!-- Title: cluster:(root) Pages: 1 -->\n",
       "<svg width=\"332pt\" height=\"87pt\"\n",
       " viewBox=\"0.00 0.00 332.00 87.38\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
       "<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 83.3848)\">\n",
       "<title>cluster:(root)</title>\n",
       "<g id=\"a_graph0\"><a xlink:title=\"(root) = ...\">\n",
       "<polygon fill=\"#ffffff\" stroke=\"transparent\" points=\"-4,4 -4,-83.3848 328,-83.3848 328,4 -4,4\"/>\n",
       "</a>\n",
       "</g>\n",
       "<!-- project_0 -->\n",
       "<g id=\"node1\" class=\"node\">\n",
       "<title>project_0</title>\n",
       "<g id=\"a_node1\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.lale.project.html\" xlink:title=\"project_0 = Project(columns={&#39;type&#39;: &#39;number&#39;})\">\n",
       "<ellipse fill=\"#b0e2ff\" stroke=\"#000000\" cx=\"27\" cy=\"-61.3848\" rx=\"27\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"27\" y=\"-58.0848\" font-family=\"Times,serif\" font-size=\"11.00\" fill=\"#000000\">Project</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- norm -->\n",
       "<g id=\"node2\" class=\"node\">\n",
       "<title>norm</title>\n",
       "<g id=\"a_node2\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.sklearn.normalizer.html\" xlink:title=\"norm = Norm()\">\n",
       "<ellipse fill=\"#b0e2ff\" stroke=\"#000000\" cx=\"117\" cy=\"-61.3848\" rx=\"27\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"117\" y=\"-58.0848\" font-family=\"Times,serif\" font-size=\"11.00\" fill=\"#000000\">Norm</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- project_0&#45;&gt;norm -->\n",
       "<g id=\"edge1\" class=\"edge\">\n",
       "<title>project_0&#45;&gt;norm</title>\n",
       "<path fill=\"none\" stroke=\"#000000\" d=\"M54.003,-61.3848C62.0277,-61.3848 70.9665,-61.3848 79.5309,-61.3848\"/>\n",
       "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"79.7051,-64.8849 89.705,-61.3848 79.705,-57.8849 79.7051,-64.8849\"/>\n",
       "</g>\n",
       "<!-- concat -->\n",
       "<g id=\"node5\" class=\"node\">\n",
       "<title>concat</title>\n",
       "<g id=\"a_node5\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.rasl.concat_features.html\" xlink:title=\"concat = Concat\">\n",
       "<ellipse fill=\"#ffffff\" stroke=\"#000000\" cx=\"207\" cy=\"-39.3848\" rx=\"27\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"207\" y=\"-36.0848\" font-family=\"Times,serif\" font-size=\"11.00\" fill=\"#000000\">Concat</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- norm&#45;&gt;concat -->\n",
       "<g id=\"edge3\" class=\"edge\">\n",
       "<title>norm&#45;&gt;concat</title>\n",
       "<path fill=\"none\" stroke=\"#000000\" d=\"M142.5497,-55.1393C151.5751,-52.9331 161.8933,-50.4109 171.5886,-48.0409\"/>\n",
       "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"172.5036,-51.4204 181.3865,-45.6459 170.8414,-44.6206 172.5036,-51.4204\"/>\n",
       "</g>\n",
       "<!-- project_1 -->\n",
       "<g id=\"node3\" class=\"node\">\n",
       "<title>project_1</title>\n",
       "<g id=\"a_node3\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.lale.project.html\" xlink:title=\"project_1 = Project(columns={&#39;type&#39;: &#39;string&#39;})\">\n",
       "<ellipse fill=\"#b0e2ff\" stroke=\"#000000\" cx=\"27\" cy=\"-18.3848\" rx=\"27\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"27\" y=\"-15.0848\" font-family=\"Times,serif\" font-size=\"11.00\" fill=\"#000000\">Project</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- one_hot -->\n",
       "<g id=\"node4\" class=\"node\">\n",
       "<title>one_hot</title>\n",
       "<g id=\"a_node4\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.sklearn.one_hot_encoder.html\" xlink:title=\"one_hot = OneHot()\">\n",
       "<ellipse fill=\"#b0e2ff\" stroke=\"#000000\" cx=\"117\" cy=\"-18.3848\" rx=\"27\" ry=\"18.2703\"/>\n",
       "<text text-anchor=\"middle\" x=\"117\" y=\"-20.5848\" font-family=\"Times,serif\" font-size=\"11.00\" fill=\"#000000\">One&#45;</text>\n",
       "<text text-anchor=\"middle\" x=\"117\" y=\"-9.5848\" font-family=\"Times,serif\" font-size=\"11.00\" fill=\"#000000\">Hot</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- project_1&#45;&gt;one_hot -->\n",
       "<g id=\"edge2\" class=\"edge\">\n",
       "<title>project_1&#45;&gt;one_hot</title>\n",
       "<path fill=\"none\" stroke=\"#000000\" d=\"M54.003,-18.3848C62.0277,-18.3848 70.9665,-18.3848 79.5309,-18.3848\"/>\n",
       "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"79.7051,-21.8849 89.705,-18.3848 79.705,-14.8849 79.7051,-21.8849\"/>\n",
       "</g>\n",
       "<!-- one_hot&#45;&gt;concat -->\n",
       "<g id=\"edge4\" class=\"edge\">\n",
       "<title>one_hot&#45;&gt;concat</title>\n",
       "<path fill=\"none\" stroke=\"#000000\" d=\"M142.5497,-24.3464C151.5751,-26.4523 161.8933,-28.8599 171.5886,-31.1221\"/>\n",
       "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"170.8527,-34.5444 181.3865,-33.4083 172.4434,-27.7275 170.8527,-34.5444\"/>\n",
       "</g>\n",
       "<!-- lr -->\n",
       "<g id=\"node6\" class=\"node\">\n",
       "<title>lr</title>\n",
       "<g id=\"a_node6\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.sklearn.logistic_regression.html\" xlink:title=\"lr = LR(C=0.001)\">\n",
       "<ellipse fill=\"#b0e2ff\" stroke=\"#000000\" cx=\"297\" cy=\"-39.3848\" rx=\"27\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"297\" y=\"-36.0848\" font-family=\"Times,serif\" font-size=\"11.00\" fill=\"#000000\">LR</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- concat&#45;&gt;lr -->\n",
       "<g id=\"edge5\" class=\"edge\">\n",
       "<title>concat&#45;&gt;lr</title>\n",
       "<path fill=\"none\" stroke=\"#000000\" d=\"M234.003,-39.3848C242.0277,-39.3848 250.9665,-39.3848 259.5309,-39.3848\"/>\n",
       "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"259.7051,-42.8849 269.705,-39.3848 259.705,-35.8849 259.7051,-42.8849\"/>\n",
       "</g>\n",
       "</g>\n",
       "</svg>\n"
      ],
      "text/plain": [
       "<graphviz.dot.Digraph at 0x134beec50>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "manual_trainable = (\n",
    "       (  Project(columns={'type': 'number'}) >> Norm()\n",
    "        & Project(columns={'type': 'string'}) >> OneHot())\n",
    "    >> Concat\n",
    "    >> LR(LR.enum.penalty.l2, C=0.001))\n",
    "manual_trainable.visualize()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this example, we have manually chosen all operators and configured\n",
    "their hyperparameters. For the `LR` operator, we configured the\n",
    "penalty hyperparameter with `l2` and the regularization constant `C`\n",
    "with `0.001`. Depending on what tool you use to view this notebook,\n",
    "you can explore these hyperparameters by hovering over\n",
    "the visualization above and observing the tooltips that pop up.\n",
    "Furthermore, each node in the visualization is a hyperlink that takes\n",
    "you to the documentation of the corresponding operator.  Calling `fit`\n",
    "on the trainable pipeline returns a trained pipeline, and calling\n",
    "`predict` on the trained pipeline returns predictions. We can use\n",
    "off-the-shelf scikit-learn metrics to evaluate the result. In this\n",
    "case, the accuracy is poor: about 71% in the recorded run."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "accuracy 70.9%\n"
     ]
    }
   ],
   "source": [
    "import sklearn.metrics\n",
    "manual_trained = manual_trainable.fit(train_X, train_y)\n",
    "manual_y = manual_trained.predict(test_X)\n",
    "print(f'accuracy {sklearn.metrics.accuracy_score(test_y, manual_y):.1%}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pipeline Combinators\n",
    "\n",
    "As the previous example demonstrates, Lale provides combinators `>>`\n",
    "and `&` for arranging operators into a pipeline. These combinators are\n",
    "actually syntactic sugar for functions `make_pipeline` and\n",
    "`make_union`, which were inspired by the corresponding functions in\n",
    "scikit-learn. To support AutoML, Lale adds a third combinator `|`\n",
    "that specifies an algorithmic choice. Scikit-learn does not have a\n",
    "corresponding function. Various AutoML tools support algorithmic\n",
    "choice in some form or other, using their own tool-specific syntax.\n",
    "Lale's `|` combinator is syntactic sugar for Lale's `make_choice`\n",
    "function.\n",
    "\n",
    "| Lale feature            | Name | Description  | Scikit-learn feature                |\n",
    "| ----------------------- | ---- | ------------ | ----------------------------------- |\n",
    "| `>>` or `make_pipeline` | pipe | feed to next | `make_pipeline`                     |\n",
    "| `&` or `make_union`     | and  | run both     | `make_union` or `ColumnTransformer` |\n",
    "| &#x7c; or `make_choice` | or   | choose one   | N/A (specific to given AutoML tool) |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Automated Pipeline\n",
    "\n",
    "Next, we will use automation to search for a better pipeline for the\n",
    "`credit-g` dataset. Specifically, we will perform combined\n",
    "hyperparameter optimization and algorithm selection (CASH). To do\n",
    "this, we first import a couple more operators to serve as choices in\n",
    "the algorithm selection."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "from xgboost import XGBClassifier as XGBoost\n",
    "from sklearn.svm import LinearSVC\n",
    "lale.wrap_imported_operators()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we arrange the pipeline that specifies the search space for\n",
    "CASH. As promised, the look-and-feel for the automated case resembles\n",
    "that for the manual case from earlier. The main difference is the use\n",
    "of the `|` combinator to provide algorithmic choice\n",
    "(`LR | XGBoost | LinearSVC`). However, there is another difference: several of the\n",
    "operators are not configured with hyperparameters. For instance, this\n",
    "pipeline writes `LR` instead of `LR(LR.enum.penalty.l2, C=0.001)`.\n",
    "Omitting the arguments means that as the data scientist, we do not\n",
    "bind the hyperparameters by hand in this pipeline. Instead, we leave\n",
    "these bindings free for CASH to search over. Similarly, the code\n",
    "contains `Norm` instead of `Norm()` and `OneHot` instead of `OneHot()`.\n",
    "This illustrates that automated hyperparameter tuning can be used not\n",
    "only on the final classifier but also inside of nested preprocessing\n",
    "sub-pipelines."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/svg+xml": [
       "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
       "<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
       " \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
       "<!-- Generated by graphviz version 2.40.1 (20161225.0304)\n",
       " -->\n",
       "<!-- Title: cluster:(root) Pages: 1 -->\n",
       "<svg width=\"351pt\" height=\"185pt\"\n",
       " viewBox=\"0.00 0.00 350.78 185.00\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
       "<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 181)\">\n",
       "<title>cluster:(root)</title>\n",
       "<g id=\"a_graph0\"><a xlink:title=\"(root) = ...\">\n",
       "<polygon fill=\"#ffffff\" stroke=\"transparent\" points=\"-4,4 -4,-181 346.7826,-181 346.7826,4 -4,4\"/>\n",
       "</a>\n",
       "</g>\n",
       "<g id=\"clust1\" class=\"cluster\">\n",
       "<title>cluster:choice</title>\n",
       "<g id=\"a_clust1\"><a xlink:title=\"choice = lr | xg_boost | linear_svc\">\n",
       "<polygon fill=\"#7ec0ee\" stroke=\"#000000\" points=\"262,-8 262,-169 334.7826,-169 334.7826,-8 262,-8\"/>\n",
       "<text text-anchor=\"middle\" x=\"298.3913\" y=\"-153.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">Choice</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- project_0 -->\n",
       "<g id=\"node1\" class=\"node\">\n",
       "<title>project_0</title>\n",
       "<g id=\"a_node1\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.lale.project.html\" xlink:title=\"project_0 = Project(columns={&#39;type&#39;: &#39;number&#39;})\">\n",
       "<ellipse fill=\"#b0e2ff\" stroke=\"#000000\" cx=\"27\" cy=\"-143\" rx=\"27\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"27\" y=\"-139.7\" font-family=\"Times,serif\" font-size=\"11.00\" fill=\"#000000\">Project</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- norm -->\n",
       "<g id=\"node2\" class=\"node\">\n",
       "<title>norm</title>\n",
       "<g id=\"a_node2\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.sklearn.normalizer.html\" xlink:title=\"norm = Norm\">\n",
       "<ellipse fill=\"#7ec0ee\" stroke=\"#000000\" cx=\"117\" cy=\"-143\" rx=\"27\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"117\" y=\"-139.7\" font-family=\"Times,serif\" font-size=\"11.00\" fill=\"#000000\">Norm</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- project_0&#45;&gt;norm -->\n",
       "<g id=\"edge1\" class=\"edge\">\n",
       "<title>project_0&#45;&gt;norm</title>\n",
       "<path fill=\"none\" stroke=\"#000000\" d=\"M54.003,-143C62.0277,-143 70.9665,-143 79.5309,-143\"/>\n",
       "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"79.7051,-146.5001 89.705,-143 79.705,-139.5001 79.7051,-146.5001\"/>\n",
       "</g>\n",
       "<!-- concat -->\n",
       "<g id=\"node5\" class=\"node\">\n",
       "<title>concat</title>\n",
       "<g id=\"a_node5\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.rasl.concat_features.html\" xlink:title=\"concat = Concat\">\n",
       "<ellipse fill=\"#ffffff\" stroke=\"#000000\" cx=\"207\" cy=\"-121\" rx=\"27\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"207\" y=\"-117.7\" font-family=\"Times,serif\" font-size=\"11.00\" fill=\"#000000\">Concat</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- norm&#45;&gt;concat -->\n",
       "<g id=\"edge3\" class=\"edge\">\n",
       "<title>norm&#45;&gt;concat</title>\n",
       "<path fill=\"none\" stroke=\"#000000\" d=\"M142.5497,-136.7545C151.5751,-134.5483 161.8933,-132.0261 171.5886,-129.6561\"/>\n",
       "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"172.5036,-133.0356 181.3865,-127.2611 170.8414,-126.2358 172.5036,-133.0356\"/>\n",
       "</g>\n",
       "<!-- project_1 -->\n",
       "<g id=\"node3\" class=\"node\">\n",
       "<title>project_1</title>\n",
       "<g id=\"a_node3\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.lale.project.html\" xlink:title=\"project_1 = Project(columns={&#39;type&#39;: &#39;string&#39;})\">\n",
       "<ellipse fill=\"#b0e2ff\" stroke=\"#000000\" cx=\"27\" cy=\"-100\" rx=\"27\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"27\" y=\"-96.7\" font-family=\"Times,serif\" font-size=\"11.00\" fill=\"#000000\">Project</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- one_hot -->\n",
       "<g id=\"node4\" class=\"node\">\n",
       "<title>one_hot</title>\n",
       "<g id=\"a_node4\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.sklearn.one_hot_encoder.html\" xlink:title=\"one_hot = OneHot\">\n",
       "<ellipse fill=\"#7ec0ee\" stroke=\"#000000\" cx=\"117\" cy=\"-100\" rx=\"27\" ry=\"18.2703\"/>\n",
       "<text text-anchor=\"middle\" x=\"117\" y=\"-102.2\" font-family=\"Times,serif\" font-size=\"11.00\" fill=\"#000000\">One&#45;</text>\n",
       "<text text-anchor=\"middle\" x=\"117\" y=\"-91.2\" font-family=\"Times,serif\" font-size=\"11.00\" fill=\"#000000\">Hot</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- project_1&#45;&gt;one_hot -->\n",
       "<g id=\"edge2\" class=\"edge\">\n",
       "<title>project_1&#45;&gt;one_hot</title>\n",
       "<path fill=\"none\" stroke=\"#000000\" d=\"M54.003,-100C62.0277,-100 70.9665,-100 79.5309,-100\"/>\n",
       "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"79.7051,-103.5001 89.705,-100 79.705,-96.5001 79.7051,-103.5001\"/>\n",
       "</g>\n",
       "<!-- one_hot&#45;&gt;concat -->\n",
       "<g id=\"edge4\" class=\"edge\">\n",
       "<title>one_hot&#45;&gt;concat</title>\n",
       "<path fill=\"none\" stroke=\"#000000\" d=\"M142.5497,-105.9616C151.5751,-108.0675 161.8933,-110.4751 171.5886,-112.7373\"/>\n",
       "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"170.8527,-116.1596 181.3865,-115.0235 172.4434,-109.3427 170.8527,-116.1596\"/>\n",
       "</g>\n",
       "<!-- lr -->\n",
       "<g id=\"node6\" class=\"node\">\n",
       "<title>lr</title>\n",
       "<g id=\"a_node6\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.sklearn.logistic_regression.html\" xlink:title=\"lr = LR\">\n",
       "<ellipse fill=\"#7ec0ee\" stroke=\"#000000\" cx=\"298.3913\" cy=\"-121\" rx=\"27\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"298.3913\" y=\"-117.7\" font-family=\"Times,serif\" font-size=\"11.00\" fill=\"#000000\">LR</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- concat&#45;&gt;lr -->\n",
       "<g id=\"edge5\" class=\"edge\">\n",
       "<title>concat&#45;&gt;lr</title>\n",
       "<path fill=\"none\" stroke=\"#000000\" d=\"M234.4204,-121C242.8099,-121 252.1833,-121 261.1126,-121\"/>\n",
       "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"252.0004,-124.5011 262,-121 251.9996,-117.5011 252.0004,-124.5011\"/>\n",
       "</g>\n",
       "<!-- xg_boost -->\n",
       "<g id=\"node7\" class=\"node\">\n",
       "<title>xg_boost</title>\n",
       "<g id=\"a_node7\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.xgboost.xgb_classifier.html\" xlink:title=\"xg_boost = XGBoost\">\n",
       "<ellipse fill=\"#7ec0ee\" stroke=\"#000000\" cx=\"298.3913\" cy=\"-78\" rx=\"27\" ry=\"18.2703\"/>\n",
       "<text text-anchor=\"middle\" x=\"298.3913\" y=\"-80.2\" font-family=\"Times,serif\" font-size=\"11.00\" fill=\"#000000\">XG&#45;</text>\n",
       "<text text-anchor=\"middle\" x=\"298.3913\" y=\"-69.2\" font-family=\"Times,serif\" font-size=\"11.00\" fill=\"#000000\">Boost</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- linear_svc -->\n",
       "<g id=\"node8\" class=\"node\">\n",
       "<title>linear_svc</title>\n",
       "<g id=\"a_node8\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.sklearn.linear_svc.html\" xlink:title=\"linear_svc = LinearSVC\">\n",
       "<ellipse fill=\"#7ec0ee\" stroke=\"#000000\" cx=\"298.3913\" cy=\"-34\" rx=\"28.283\" ry=\"18.2703\"/>\n",
       "<text text-anchor=\"middle\" x=\"298.3913\" y=\"-36.2\" font-family=\"Times,serif\" font-size=\"11.00\" fill=\"#000000\">Linear&#45;</text>\n",
       "<text text-anchor=\"middle\" x=\"298.3913\" y=\"-25.2\" font-family=\"Times,serif\" font-size=\"11.00\" fill=\"#000000\">SVC</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "</g>\n",
       "</svg>\n"
      ],
      "text/plain": [
       "<graphviz.dot.Digraph at 0x134b88fd0>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "auto_planned = (\n",
    "       (  Project(columns={'type': 'number'}) >> Norm\n",
    "        & Project(columns={'type': 'string'}) >> OneHot)\n",
    "    >> Concat\n",
    "    >> (LR | XGBoost | LinearSVC))\n",
    "auto_planned.visualize()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The colors in the visualization indicate the particular mix of manual\n",
    "vs. automated bindings and will be explained in more detail below.\n",
    "For now, we look at how to actually invoke an AutoML tool from Lale.\n",
    "Lale provides bindings for multiple such tools; here we use the\n",
    "popular hyperopt open-source library. Specifically,\n",
    "`Hyperopt` takes the `auto_planned` pipeline from above as\n",
    "an argument, along with optional specifications for the number of\n",
    "cross-validation folds and trials to run. Calling `fit` yields a\n",
    "trained pipeline. The code that uses that trained pipeline for\n",
    "prediction and evaluation is the same as in the manual use case we saw\n",
    "before."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[15:01:49] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "[15:01:51] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "[15:01:52] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "[15:02:06] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "[15:02:08] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "[15:02:11] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "[15:02:12] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "[15:02:14] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "[15:02:16] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "100%|██████████| 10/10 [01:01<00:00,  6.12s/trial, best loss: -0.7537102284860132]\n",
      "accuracy 71.2%\n"
     ]
    }
   ],
   "source": [
    "from lale.lib.lale.hyperopt import Hyperopt\n",
    "auto_optimizer = Hyperopt(estimator=auto_planned, cv=3, max_evals=10)\n",
    "auto_trained = auto_optimizer.fit(train_X, train_y)\n",
    "auto_y = auto_trained.predict(test_X)\n",
    "print(f'accuracy {sklearn.metrics.accuracy_score(test_y, auto_y):.1%}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The predictive performance is better, and in fact, it is in the same\n",
    "ball-park as the state-of-the-art performance reported for this\n",
    "dataset in OpenML. Since we left various choices to automation, at\n",
    "this point, we probably want to inspect what the automation chose\n",
    "for us. One way to do that is by visualizing the pipeline as before.\n",
    "This reveals that hyperopt chose `XGBoost` instead of the scikit-learn\n",
    "classifiers. Also, by hovering over the nodes in the visualization, we\n",
    "can explore how hyperopt configured the hyperparameters. Note that all\n",
    "nodes are visualized in white to indicate they are fully trained."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/svg+xml": [
       "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
       "<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
       " \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
       "<!-- Generated by graphviz version 2.40.1 (20161225.0304)\n",
       " -->\n",
       "<!-- Title: cluster:(root) Pages: 1 -->\n",
       "<svg width=\"332pt\" height=\"87pt\"\n",
       " viewBox=\"0.00 0.00 332.00 87.38\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
       "<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 83.3848)\">\n",
       "<title>cluster:(root)</title>\n",
       "<g id=\"a_graph0\"><a xlink:title=\"(root) = ...\">\n",
       "<polygon fill=\"#ffffff\" stroke=\"transparent\" points=\"-4,4 -4,-83.3848 328,-83.3848 328,4 -4,4\"/>\n",
       "</a>\n",
       "</g>\n",
       "<!-- project_0 -->\n",
       "<g id=\"node1\" class=\"node\">\n",
       "<title>project_0</title>\n",
       "<g id=\"a_node1\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.lale.project.html\" xlink:title=\"project_0 = Project(columns={&#39;type&#39;: &#39;number&#39;})\">\n",
       "<ellipse fill=\"#ffffff\" stroke=\"#000000\" cx=\"27\" cy=\"-61.3848\" rx=\"27\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"27\" y=\"-58.0848\" font-family=\"Times,serif\" font-size=\"11.00\" fill=\"#000000\">Project</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- norm -->\n",
       "<g id=\"node2\" class=\"node\">\n",
       "<title>norm</title>\n",
       "<g id=\"a_node2\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.sklearn.normalizer.html\" xlink:title=\"norm = Norm(norm=&#39;max&#39;)\">\n",
       "<ellipse fill=\"#ffffff\" stroke=\"#000000\" cx=\"117\" cy=\"-61.3848\" rx=\"27\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"117\" y=\"-58.0848\" font-family=\"Times,serif\" font-size=\"11.00\" fill=\"#000000\">Norm</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- project_0&#45;&gt;norm -->\n",
       "<g id=\"edge1\" class=\"edge\">\n",
       "<title>project_0&#45;&gt;norm</title>\n",
       "<path fill=\"none\" stroke=\"#000000\" d=\"M54.003,-61.3848C62.0277,-61.3848 70.9665,-61.3848 79.5309,-61.3848\"/>\n",
       "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"79.7051,-64.8849 89.705,-61.3848 79.705,-57.8849 79.7051,-64.8849\"/>\n",
       "</g>\n",
       "<!-- concat -->\n",
       "<g id=\"node5\" class=\"node\">\n",
       "<title>concat</title>\n",
       "<g id=\"a_node5\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.rasl.concat_features.html\" xlink:title=\"concat = Concat()\">\n",
       "<ellipse fill=\"#ffffff\" stroke=\"#000000\" cx=\"207\" cy=\"-39.3848\" rx=\"27\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"207\" y=\"-36.0848\" font-family=\"Times,serif\" font-size=\"11.00\" fill=\"#000000\">Concat</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- norm&#45;&gt;concat -->\n",
       "<g id=\"edge3\" class=\"edge\">\n",
       "<title>norm&#45;&gt;concat</title>\n",
       "<path fill=\"none\" stroke=\"#000000\" d=\"M142.5497,-55.1393C151.5751,-52.9331 161.8933,-50.4109 171.5886,-48.0409\"/>\n",
       "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"172.5036,-51.4204 181.3865,-45.6459 170.8414,-44.6206 172.5036,-51.4204\"/>\n",
       "</g>\n",
       "<!-- project_1 -->\n",
       "<g id=\"node3\" class=\"node\">\n",
       "<title>project_1</title>\n",
       "<g id=\"a_node3\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.lale.project.html\" xlink:title=\"project_1 = Project(columns={&#39;type&#39;: &#39;string&#39;})\">\n",
       "<ellipse fill=\"#ffffff\" stroke=\"#000000\" cx=\"27\" cy=\"-18.3848\" rx=\"27\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"27\" y=\"-15.0848\" font-family=\"Times,serif\" font-size=\"11.00\" fill=\"#000000\">Project</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- one_hot -->\n",
       "<g id=\"node4\" class=\"node\">\n",
       "<title>one_hot</title>\n",
       "<g id=\"a_node4\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.sklearn.one_hot_encoder.html\" xlink:title=\"one_hot = OneHot()\">\n",
       "<ellipse fill=\"#ffffff\" stroke=\"#000000\" cx=\"117\" cy=\"-18.3848\" rx=\"27\" ry=\"18.2703\"/>\n",
       "<text text-anchor=\"middle\" x=\"117\" y=\"-20.5848\" font-family=\"Times,serif\" font-size=\"11.00\" fill=\"#000000\">One&#45;</text>\n",
       "<text text-anchor=\"middle\" x=\"117\" y=\"-9.5848\" font-family=\"Times,serif\" font-size=\"11.00\" fill=\"#000000\">Hot</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- project_1&#45;&gt;one_hot -->\n",
       "<g id=\"edge2\" class=\"edge\">\n",
       "<title>project_1&#45;&gt;one_hot</title>\n",
       "<path fill=\"none\" stroke=\"#000000\" d=\"M54.003,-18.3848C62.0277,-18.3848 70.9665,-18.3848 79.5309,-18.3848\"/>\n",
       "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"79.7051,-21.8849 89.705,-18.3848 79.705,-14.8849 79.7051,-21.8849\"/>\n",
       "</g>\n",
       "<!-- one_hot&#45;&gt;concat -->\n",
       "<g id=\"edge4\" class=\"edge\">\n",
       "<title>one_hot&#45;&gt;concat</title>\n",
       "<path fill=\"none\" stroke=\"#000000\" d=\"M142.5497,-24.3464C151.5751,-26.4523 161.8933,-28.8599 171.5886,-31.1221\"/>\n",
       "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"170.8527,-34.5444 181.3865,-33.4083 172.4434,-27.7275 170.8527,-34.5444\"/>\n",
       "</g>\n",
       "<!-- lr -->\n",
       "<g id=\"node6\" class=\"node\">\n",
       "<title>lr</title>\n",
       "<g id=\"a_node6\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.sklearn.logistic_regression.html\" xlink:title=\"lr = LR(fit_intercept=False, intercept_scaling=0.38106476479749274, max_iter=422, multi_class=&#39;multinomial&#39;, solver=&#39;saga&#39;, tol=0.004851580781685862)\">\n",
       "<ellipse fill=\"#ffffff\" stroke=\"#000000\" cx=\"297\" cy=\"-39.3848\" rx=\"27\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"297\" y=\"-36.0848\" font-family=\"Times,serif\" font-size=\"11.00\" fill=\"#000000\">LR</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- concat&#45;&gt;lr -->\n",
       "<g id=\"edge5\" class=\"edge\">\n",
       "<title>concat&#45;&gt;lr</title>\n",
       "<path fill=\"none\" stroke=\"#000000\" d=\"M234.003,-39.3848C242.0277,-39.3848 250.9665,-39.3848 259.5309,-39.3848\"/>\n",
       "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"259.7051,-42.8849 269.705,-39.3848 259.705,-35.8849 259.7051,-42.8849\"/>\n",
       "</g>\n",
       "</g>\n",
       "</svg>\n"
      ],
      "text/plain": [
       "<graphviz.dot.Digraph at 0x134c045c0>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "auto_trained.get_pipeline().visualize()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another way to inspect the results of automation is to pretty-print\n",
    "the trained pipeline back as Python source code. This lets us look at\n",
    "the output of automation in the same syntax we used to specify\n",
    "the input to automation in the first place."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "```python\n",
       "project_0 = Project(columns={\"type\": \"number\"})\n",
       "norm = Norm(norm=\"max\")\n",
       "project_1 = Project(columns={\"type\": \"string\"})\n",
       "lr = LR(\n",
       "    fit_intercept=False,\n",
       "    intercept_scaling=0.38106476479749274,\n",
       "    max_iter=422,\n",
       "    multi_class=\"multinomial\",\n",
       "    solver=\"saga\",\n",
       "    tol=0.004851580781685862,\n",
       ")\n",
       "pipeline = ((project_0 >> norm) & (project_1 >> OneHot())) >> Concat() >> lr\n",
       "```"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "auto_trained.get_pipeline().pretty_print(ipython_display=True, show_imports=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Bindings as Lifecycle\n",
    "\n",
    "The previous example already alluded to the notion of bindings and\n",
    "that those bindings are reflected in node visualization colors.\n",
    "Bindings here relate to the mathematical notion of free or bound\n",
    "variables in formulas. *Bindings as lifecycle* is one of the fundamental\n",
    "novel concepts around which we designed Lale. Lale distinguishes two\n",
    "kinds of operators: *individual operators* are data-science primitives\n",
    "such as data preprocessors or predictive models, whereas *pipelines*\n",
    "are compositions of operators. Each operator, whether individual or\n",
    "composite, can be in one of four lifecycle states: meta-model,\n",
    "planned, trainable, and trained. The lifecycle state of an operator is\n",
    "defined by which bindings it has. Specifically:\n",
    "\n",
    "- *Meta-model*: Individual operators at the meta-model state have\n",
    "  schemas and priors for their hyperparameters. Pipelines at the\n",
    "  meta-model state have steps (which are other operators) and a\n",
    "  grammar.\n",
    "\n",
    "- *Planned*: To get a planned pipeline, we need to *arrange* it by\n",
    "  binding a specific graph topology. This topology is consistent with\n",
    "  but more concrete than the steps and grammar from the meta-model\n",
    "  state.\n",
    "\n",
    "- *Trainable*: To get a trainable individual operator, we need to\n",
    "  initialize the concrete bindings for its hyperparameters.\n",
    "  Scikit-learn does this using `__init__`; Lale emulates that syntax\n",
    "  using `__call__`. To get a trainable pipeline, we need to bind\n",
    "  concrete operator choices given by the `|` combinator.\n",
    "\n",
    "- *Trained*: To get a trained individual operator, we need to *fit* it\n",
    "  to the data, thus binding its learnable coefficients. A trained\n",
    "  pipeline is like a trainable pipeline where all steps are trained.\n",
    "  More generally, the lifecycle state of a pipeline is limited by the\n",
    "  least upper bound of the states of its steps.\n",
    "\n",
    "<img src=\"img/2019-1105-bindings.png\" style=\"width:450px\" align=\"left\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The interested reader can explore Lale's concept of bindings as\n",
    "lifecycle in more detail in our\n",
    "[paper](https://arxiv.org/pdf/1906.03957.pdf).\n",
    "\n",
    "The above diagram is actually a Venn diagram. Each lifecycle state is\n",
    "a subset of its predecessor. For instance, a trained operator has all\n",
    "the bindings of a planned operator and in addition also binds learned\n",
    "coefficients. Since a trained operator has all bindings required for\n",
    "being a trainable operator, it can be used where a trainable operator\n",
    "is expected. Thus, the set of trained operators is a subset of the set\n",
    "of trainable operators. This relationship between the lifecycle states is\n",
    "essential for offering a flexible semi-automated data science\n",
    "experience."
   ]
  },
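  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The subset relationship described above can be made concrete with a\n",
    "small model. The sketch below is purely illustrative and is not how\n",
    "Lale implements lifecycle states: the `State` enum, `can_be_used_as`,\n",
    "and `pipeline_state` are hypothetical names. It assumes the states are\n",
    "ordered by how many bindings they have, treats an operator as usable\n",
    "wherever a less-bound state is expected, and gives a pipeline the\n",
    "least-bound state among its steps.\n",
    "\n",
    "```python\n",
    "from enum import IntEnum\n",
    "\n",
    "\n",
    "class State(IntEnum):\n",
    "    # Ordered from fewest bindings to most bindings.\n",
    "    META_MODEL = 0\n",
    "    PLANNED = 1\n",
    "    TRAINABLE = 2\n",
    "    TRAINED = 3\n",
    "\n",
    "\n",
    "def can_be_used_as(actual, expected):\n",
    "    # An operator with more bindings also satisfies any less-bound state.\n",
    "    return actual >= expected\n",
    "\n",
    "\n",
    "def pipeline_state(step_states):\n",
    "    # A pipeline is only as far along as its least-bound step.\n",
    "    return min(step_states)\n",
    "\n",
    "\n",
    "steps = [State.TRAINED, State.TRAINABLE, State.TRAINED]\n",
    "print(can_be_used_as(State.TRAINED, State.TRAINABLE))  # True\n",
    "print(can_be_used_as(State.PLANNED, State.TRAINABLE))  # False\n",
    "print(pipeline_state(steps).name)  # TRAINABLE\n",
    "```"
   ]
  },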
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Semi-Automated Data Science\n",
    "\n",
    "Why would data scientists prefer semi-automation over full automation?\n",
    "The main reason is to excert control over what kind of pipelines they\n",
    "find. Below are several scenarios motivating semi-automated data\n",
    "science.\n",
    "\n",
    "| Manual control over automation      | Examples |\n",
    "| ----------------------------------- | -------- |\n",
    "| Restrict available operator choices | Interpretable, or based on licenses, or based on GPU requirements, ... |\n",
    "| Tweak graph topology                | Custom preprocessing, or multi-modal data, or fairness mitigation, ... |\n",
    "| Tweak hyperparametrer schemas       | Adjust range for continuous, or restrict choices for categorical, ... |\n",
    "| Expand available operator choices   | Wrap existing library, or write your own operators, ... |\n",
    "\n",
    "In fact, we envision that this will often happen via trial-and-error,\n",
    "where you as the data scientist specify a first planned pipeline, let\n",
    "AutoML do its search, then inspect the results, edit your code, and\n",
    "try again. You already saw some of the Lale features for inspecting\n",
    "the results of automation above. Lale further supports this\n",
    "workflow by letting you specify *partial bindings* of some\n",
    "hyperparameters while allowing other to remain free, and by letting\n",
    "you *freeze* an operator at the trainable or trained state.\n",
    "\n",
    "<img src=\"img/2019-1105-semi-automated.png\" style=\"width:700px\" align=\"left\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 20 Newsgroups Dataset\n",
    "\n",
    "For the next examples in this notebook, we will need a different\n",
    "dataset. The following code fetches the 20 newsgroups data, using a\n",
    "function that comes with scikit-learn. Then, it prints a\n",
    "representative sample of the labels and features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>y</th>\n",
       "      <th>X</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>rec.autos</td>\n",
       "      <td>From: lerxst@wam.umd.edu (where's my thing)\\nS...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>comp.sys.mac.hardware</td>\n",
       "      <td>From: guykuo@carson.u.washington.edu (Guy Kuo)...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>comp.sys.mac.hardware</td>\n",
       "      <td>From: twillis@ec.ecn.purdue.edu (Thomas E Will...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>comp.graphics</td>\n",
       "      <td>From: jgreen@amber (Joe Green)\\nSubject: Re: W...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>sci.space</td>\n",
       "      <td>From: jcm@head-cfa.harvard.edu (Jonathan McDow...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                       y                                                  X\n",
       "0              rec.autos  From: lerxst@wam.umd.edu (where's my thing)\\nS...\n",
       "1  comp.sys.mac.hardware  From: guykuo@carson.u.washington.edu (Guy Kuo)...\n",
       "2  comp.sys.mac.hardware  From: twillis@ec.ecn.purdue.edu (Thomas E Will...\n",
       "3          comp.graphics  From: jgreen@amber (Joe Green)\\nSubject: Re: W...\n",
       "4              sci.space  From: jcm@head-cfa.harvard.edu (Jonathan McDow..."
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import sklearn.datasets\n",
    "news = sklearn.datasets.fetch_20newsgroups()\n",
    "news_X, news_y = news.data, news.target\n",
    "pd.DataFrame({'y': [news.target_names[i] for i in news_y], 'X': news_X}).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The labels `y` are strings taken from a set of 20 different newsgroup\n",
    "names. That means that this is a 20-way classification dataset. The\n",
    "features `X` consist of a single string with a message posted to one\n",
    "of the newsgroups. The task is to use the message to predict which\n",
    "group it was posted on."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Constraints in Scikit-learn\n",
    "\n",
    "As implied by the title of this notebook and the corresponding talk,\n",
    "Lale is built around the concept of types. In programming languages, a\n",
    "*type* specifies a set of valid values for a variable (e.g., for a\n",
    "hyperparameter).  While types are in the foreground of some\n",
    "programming languages, Python keeps types more in the background.\n",
    "Python is dynamically typed, and libraries such as\n",
    "scikit-learn rely more on exceptions than on types for their error\n",
    "checking. To demonstrate this, we first import some scikit-learn\n",
    "modules."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sklearn.feature_extraction.text\n",
    "import sklearn.pipeline\n",
    "import sklearn.linear_model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we create a scikit-learn pipeline, using fully qualified names\n",
    "to ensure and clarify that we are not using any Lale facilities. The\n",
    "pipeline consists of just two operators, a TFIDF transformer for\n",
    "extracting features from the message text followed by a\n",
    "logistic regression for classifying the message by newsgroup."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "no error detected yet\n"
     ]
    }
   ],
   "source": [
    "sklearn_misconfigured = sklearn.pipeline.make_pipeline(\n",
    "    sklearn.feature_extraction.text.TfidfVectorizer(),\n",
    "    sklearn.linear_model.LogisticRegression(solver='sag', penalty='l1'))\n",
    "print('no error detected yet')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The above code actually contains a mistake in the hyperparameters for\n",
    "`LogisticRegression`. Unfortunately, scikit-learn does not detect this\n",
    "mistake when the hyperparameters are configured in the code. Instead,\n",
    "it delays its error checking until we attempt to fit the pipeline to\n",
    "data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 3.45 s, sys: 92.9 ms, total: 3.54 s\n",
      "Wall time: 3.59 s\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Solver sag supports only 'l2' or 'none' penalties, got l1 penalty.\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "import sys\n",
    "try:\n",
    "    sklearn_misconfigured.fit(news_X, news_y)\n",
    "except ValueError as e:\n",
    "    print(e, file=sys.stderr)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The above output demonstrates that scikit-learn takes a few seconds to\n",
    "report the error. This is because it first trains the TFIDF\n",
    "transformer on the data, and only when that is done, it attempts to\n",
    "train the `LogisticRegression`, triggering the exception. While a few\n",
    "seconds are no big deal, the 20 newsgroup dataset is small, and for larger\n",
    "datasets, the delay is larger. Furthermore, while in this case, the\n",
    "erroneous code is not far away from the code that triggers the error\n",
    "report, for larger code bases, that distance is also further. The\n",
    "error message is nice and clear about the erroneous hyperpareters, but\n",
    "does not mention which operator was misconfigured (here,\n",
    "`LogisticRegression`)."
   ]
  },
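  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One way to address both shortcomings is to check hyperparameter\n",
    "combinations when the operator is configured, before any training\n",
    "happens. The sketch below is a hypothetical helper, not part of\n",
    "scikit-learn or Lale: `check_lr_hyperparams` and the `SOLVER_PENALTIES`\n",
    "table are assumptions made for illustration, and the table covers only\n",
    "the two solvers it lists. It fails immediately and names the operator\n",
    "in its message.\n",
    "\n",
    "```python\n",
    "# Hypothetical table of allowed penalties per solver (illustration only).\n",
    "SOLVER_PENALTIES = {\n",
    "    'sag': {'l2', 'none'},\n",
    "    'liblinear': {'l1', 'l2'},\n",
    "}\n",
    "\n",
    "\n",
    "def check_lr_hyperparams(solver, penalty):\n",
    "    # Fail at configuration time and name the misconfigured operator.\n",
    "    allowed = SOLVER_PENALTIES[solver]\n",
    "    if penalty not in allowed:\n",
    "        raise ValueError(\n",
    "            f'LogisticRegression: solver {solver!r} supports only '\n",
    "            f'{sorted(allowed)} penalties, got {penalty!r}'\n",
    "        )\n",
    "\n",
    "\n",
    "try:\n",
    "    check_lr_hyperparams(solver='sag', penalty='l1')\n",
    "except ValueError as e:\n",
    "    message = str(e)\n",
    "    print(message)\n",
    "```\n",
    "\n",
    "Because the check runs before `fit`, no time is spent training\n",
    "upstream operators such as the TFIDF transformer."
   ]
  },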
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Constraints in AutoML\n",
    "\n",
    "The above example illustrates an exception caused by manually\n",
    "misconfigured hyperparameters. Tools for automated machine learning\n",
    "usually run many trials with different hyperparameter configurations.\n",
    "Just as in the manual case, in general, some of these automated trials\n",
    "may raise exceptions.\n",
    "\n",
    "**Solution 1:** Unconstrained search space\n",
    "\n",
    "- {solver: \\[linear, sag, lbfgs\\], penalty: \\[l1, l2\\]}\n",
    "- catch exception (after some time)\n",
    "- return made-up loss `np.float.max`\n",
    "\n",
    "This first solution has the benefit of being simple, but the drawback\n",
    "that it may waste computational resources on training upstream\n",
    "operators before detecting erroneous downstream operators. It also\n",
    "leads to a larger-than-necessary search space. While we can use a high\n",
    "loss to steer the AutoML tool away from poorly performing points in\n",
    "the search space, in our experiments, we have encountered cases where\n",
    "this adversely affects convergence.\n",
    "\n",
    "**Solution 2:** Constrained search space\n",
    "\n",
    "- {solver: \\[linear, sag, lbfgs\\], penalty: \\[l1, l2]\\} **and** (**if** solver: [sag, lbfgs] **then** penalty: [l2])}\n",
    "- no exceptions (no time wasted)\n",
    "- no made-up loss\n",
    "\n",
    "The second solution is the one we advocate in Lale. The Lale library\n",
    "contains hyperparameter schemas for a large number of operators. These\n",
    "schemas are type specifications that encode not just the valid values\n",
    "for each hyperparameter in isolation, but also constraints cutting\n",
    "across multiple hyperparameters such as `solver` and `penalty` in the\n",
    "example."
   ]
  },
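  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal sketch of the difference between the two solutions, the\n",
    "following plain-Python snippet enumerates the toy search space from the\n",
    "bullets above, first without and then with the cross-cutting\n",
    "constraint. The helper name `satisfies_constraint` is hypothetical and\n",
    "not part of Lale or of any AutoML tool; the first solver is spelled\n",
    "`liblinear`, as in scikit-learn.\n",
    "\n",
    "```python\n",
    "from itertools import product\n",
    "\n",
    "solvers = ['liblinear', 'sag', 'lbfgs']\n",
    "penalties = ['l1', 'l2']\n",
    "\n",
    "def satisfies_constraint(solver, penalty):\n",
    "    # Hypothetical helper: if solver is sag or lbfgs, then penalty must be l2.\n",
    "    return solver not in ('sag', 'lbfgs') or penalty == 'l2'\n",
    "\n",
    "# Solution 1: every combination is a candidate, including invalid ones.\n",
    "unconstrained = list(product(solvers, penalties))\n",
    "\n",
    "# Solution 2: invalid combinations never enter the search space.\n",
    "constrained = [(s, p) for s, p in unconstrained if satisfies_constraint(s, p)]\n",
    "\n",
    "print(len(unconstrained), len(constrained))\n",
    "```\n",
    "\n",
    "In this toy space, the constraint removes exactly those combinations\n",
    "that would otherwise fail only after upstream operators had been trained."
   ]
  },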
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Constraints in Lale\n",
    "\n",
    "To demonstrate how Lale handles constraints, we first import Lale's\n",
    "wrapper for TFIDF. We need not import LogisticRegression here since we\n",
    "already imported it earlier in the notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "from lale.lib.sklearn import TfidfVectorizer as Tfidf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lale uses JSON Schema to express types. JSON Schema is a widely\n",
    "supported and adopted standard proposal for specifying the valid\n",
    "structure of JSON documents. By using JSON Schema for hyperparameters,\n",
    "we avoided the need to implement any custom schema validation code.\n",
    "Instead, we simply put the concrete hyperparameters into a JSON\n",
    "document and then use an off-the-shelf validator to check that against\n",
    "the schema."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 23 ms, sys: 2.6 ms, total: 25.6 ms\n",
      "Wall time: 25.5 ms\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Invalid configuration for LR(solver='sag', penalty='l1') due to constraint the newton-cg, sag, and lbfgs solvers support only l2 or no penalties.\n",
      "Schema of constraint 1: {\n",
      "    \"description\": \"The newton-cg, sag, and lbfgs solvers support only l2 or no penalties.\",\n",
      "    \"anyOf\": [\n",
      "        {\n",
      "            \"type\": \"object\",\n",
      "            \"properties\": {\n",
      "                \"solver\": {\"not\": {\"enum\": [\"newton-cg\", \"sag\", \"lbfgs\"]}}\n",
      "            },\n",
      "        },\n",
      "        {\n",
      "            \"type\": \"object\",\n",
      "            \"properties\": {\"penalty\": {\"enum\": [\"l2\", \"none\"]}},\n",
      "        },\n",
      "    ],\n",
      "}\n",
      "Value: {'solver': 'sag', 'penalty': 'l1', 'dual': False, 'C': 1.0, 'tol': 0.0001, 'fit_intercept': True, 'intercept_scaling': 1.0, 'class_weight': None, 'random_state': None, 'max_iter': 100, 'multi_class': 'auto', 'verbose': 0, 'warm_start': False, 'n_jobs': None, 'l1_ratio': None}\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "import jsonschema\n",
    "try:\n",
    "    lale_misconfigured = Tfidf >> LR(LR.enum.solver.sag, LR.enum.penalty.l1)\n",
    "except jsonschema.ValidationError as e:\n",
    "    print(e.message, file=sys.stderr)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The output demonstrates that the error check happened more than 100\n",
    "times faster than in the previous example. This is because the code\n",
    "did not need to train TFIDF. The error check also happened closer to\n",
    "the root cause of the error, so the code dinstance is smaller.\n",
    "Furthermore, the error message indicates\n",
    "not just the wrong hyperparameters but also which operator was\n",
    "misconfigured (here, `LR`)."
   ]
  },
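  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The same kind of check can be reproduced outside Lale. The following\n",
    "sketch, which assumes the `jsonschema` package is installed, validates\n",
    "two hyperparameter documents against a simplified, hand-written schema\n",
    "modeled on the constraint printed above. The names `lr_schema` and\n",
    "`is_valid_config` are illustrative assumptions; this is not the schema\n",
    "that Lale ships.\n",
    "\n",
    "```python\n",
    "import jsonschema\n",
    "\n",
    "lr_schema = {\n",
    "    'allOf': [\n",
    "        {   # valid values for each hyperparameter in isolation\n",
    "            'type': 'object',\n",
    "            'properties': {\n",
    "                'solver': {'enum': ['liblinear', 'newton-cg', 'sag', 'lbfgs']},\n",
    "                'penalty': {'enum': ['l1', 'l2', 'none']},\n",
    "            },\n",
    "        },\n",
    "        {   # constraint across hyperparameters, written as a disjunction\n",
    "            'anyOf': [\n",
    "                {'type': 'object',\n",
    "                 'properties': {\n",
    "                     'solver': {'not': {'enum': ['newton-cg', 'sag', 'lbfgs']}}}},\n",
    "                {'type': 'object',\n",
    "                 'properties': {'penalty': {'enum': ['l2', 'none']}}},\n",
    "            ],\n",
    "        },\n",
    "    ],\n",
    "}\n",
    "\n",
    "def is_valid_config(hyperparams):\n",
    "    # Hypothetical helper: True if the document validates against lr_schema.\n",
    "    try:\n",
    "        jsonschema.validate(hyperparams, lr_schema)\n",
    "        return True\n",
    "    except jsonschema.ValidationError:\n",
    "        return False\n",
    "\n",
    "print(is_valid_config({'solver': 'sag', 'penalty': 'l1'}))\n",
    "print(is_valid_config({'solver': 'sag', 'penalty': 'l2'}))\n",
    "```\n",
    "\n",
    "Because the check only inspects a small dictionary, it needs no\n",
    "training data and can run before any operator is fitted."
   ]
  },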
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Schemas as Documentation\n",
    "\n",
    "The above code example demonstrates that Lale uses JSON Schema for\n",
    "error checking. But we can do better than just using schemas for one\n",
    "purpose. Since all Lale operators carry hyperparameter schemas,\n",
    "we can also use those same schemas for interactive documentation.  The\n",
    "following code illustrates that by inspecting the schema of a\n",
    "continuous hyperparameter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'description': 'Number of trees to fit.',\n",
       " 'type': 'integer',\n",
       " 'default': 100,\n",
       " 'minimumForOptimizer': 50,\n",
       " 'maximumForOptimizer': 1000}"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "XGBoost.hyperparam_schema('n_estimators')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Of course, the same interactive documentation approach also works for\n",
    "categorical hyperparameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'description': 'Specify which booster to use.',\n",
       " 'enum': ['gbtree', 'gblinear', 'dart', None],\n",
       " 'default': None}"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "XGBoost.hyperparam_schema('booster')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the same schemas for two purposes (validation and documentation)\n",
    "makes them stronger. As a user, you can be confident that the\n",
    "documentation you read is in sync with the error checks in the code.\n",
    "Furthermore, while schemas check hyperparameters, the reverse is also\n",
    "true: when tests reveal mistakes in the schemas themselves, they can\n",
    "be corrected in a single location, thus improving both validation and\n",
    "documentation.  However, we can do even better than using the same\n",
    "schemas for two purposes: we can also use them for a third purpose\n",
    "(automated search)."
   ]
  },
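  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To illustrate how one schema can serve both purposes, here is a small\n",
    "self-contained sketch that works on a plain dictionary copied from the\n",
    "`n_estimators` output above. The function names `describe` and\n",
    "`check_integer` are hypothetical; Lale itself relies on an\n",
    "off-the-shelf JSON Schema validator rather than hand-written checks.\n",
    "\n",
    "```python\n",
    "n_estimators_schema = {\n",
    "    'description': 'Number of trees to fit.',\n",
    "    'type': 'integer',\n",
    "    'default': 100,\n",
    "    'minimumForOptimizer': 50,\n",
    "    'maximumForOptimizer': 1000,\n",
    "}\n",
    "\n",
    "def describe(name, schema):\n",
    "    # Documentation: render a one-line summary from the schema.\n",
    "    return '{}: {} (default {})'.format(\n",
    "        name, schema['description'], schema['default'])\n",
    "\n",
    "def check_integer(value, schema):\n",
    "    # Validation: the same schema decides whether a value is acceptable.\n",
    "    # The ...ForOptimizer bounds only guide search, so they are not checked.\n",
    "    # bool is excluded because it is a subclass of int in Python.\n",
    "    return (schema['type'] == 'integer'\n",
    "            and isinstance(value, int) and not isinstance(value, bool))\n",
    "\n",
    "print(describe('n_estimators', n_estimators_schema))\n",
    "print(check_integer(200, n_estimators_schema))\n",
    "print(check_integer(0.5, n_estimators_schema))\n",
    "```\n",
    "\n",
    "Both functions read the same dictionary, so a correction to the schema\n",
    "is picked up by the documentation and the check at once."
   ]
  },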
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Types as Search Spaces\n",
    "\n",
    "The following picture illustrates how Lale uses types for AutoML.\n",
    "The data scientist is shown on the left. They interact with a pipeline\n",
    "at either the planned or trainable lifecycle state, or at some mixed\n",
    "state as discussed above under semi-automated AutoML.  Every\n",
    "individual operator in Lale carries types in the form of a schema.\n",
    "Users rarely have to write these schemas by hand, since the Lale\n",
    "library already includes schemas for, at last count, 133 individual\n",
    "operators.  Given a planned pipeline, when the data scientist kicks\n",
    "off a search, Lale automatically generates a search space appropriate\n",
    "for the given combined algorithm selection and hyperparameter\n",
    "optimization tool, e.g., hyperopt.  For each trial, the\n",
    "CASH tool then acquires a point in the search space, e.g., using an\n",
    "acquisition function in Bayesian optimization. Then, Lale\n",
    "automatically decodes that search point into a trainable Lale\n",
    "pipeline. By construction, this trainable pipeline validates against\n",
    "the schemas. We can train and score the pipeline, usually with\n",
    "cross-validation, to return a loss value to the CASH tool. If this\n",
    "is the best pipeline found so far, the CASH tool will remember it\n",
    "as the incumbent. When the search terminates, the CASH tool returns\n",
    "the final incumbent, minimizing the loss. \n",
    "*Types as search spaces* is another fundamental\n",
    "novel concept around which we designed Lale, described in more detail\n",
    "in our [paper](https://arxiv.org/pdf/1906.03957.pdf).\n",
    "\n",
    "<img src=\"img/2019-1105-search-spaces.png\" style=\"width:700px\" align=\"left\">"
   ]
  },
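  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The search loop described above can be sketched in a few lines of\n",
    "plain Python. This sketch deliberately swaps in uniform random search\n",
    "for the Bayesian optimization that a tool such as hyperopt performs,\n",
    "and it uses a made-up `toy_loss` in place of training and\n",
    "cross-validating a pipeline. All names here (`space_schemas`, `sample`,\n",
    "`toy_loss`, `random_search`) are hypothetical and not part of Lale.\n",
    "\n",
    "```python\n",
    "import random\n",
    "\n",
    "# Schemas in the style of the hyperparam_schema outputs shown earlier.\n",
    "space_schemas = {\n",
    "    'n_estimators': {'type': 'integer',\n",
    "                     'minimumForOptimizer': 50,\n",
    "                     'maximumForOptimizer': 1000},\n",
    "    'booster': {'enum': ['gbtree', 'gblinear', 'dart', None]},\n",
    "}\n",
    "\n",
    "def sample(schema, rng):\n",
    "    # Draw one value that conforms to the schema by construction.\n",
    "    if 'enum' in schema:\n",
    "        return rng.choice(schema['enum'])\n",
    "    if schema.get('type') == 'integer':\n",
    "        return rng.randint(schema['minimumForOptimizer'],\n",
    "                           schema['maximumForOptimizer'])\n",
    "    raise ValueError('unsupported schema: {}'.format(schema))\n",
    "\n",
    "def toy_loss(config):\n",
    "    # Hypothetical stand-in for training and cross-validating a pipeline.\n",
    "    penalty = 0.0 if config['booster'] == 'gbtree' else 0.1\n",
    "    return abs(config['n_estimators'] - 300) / 1000 + penalty\n",
    "\n",
    "def random_search(schemas, loss_fn, max_evals, seed=0):\n",
    "    rng = random.Random(seed)\n",
    "    incumbent, best_loss = None, float('inf')\n",
    "    for _ in range(max_evals):\n",
    "        config = {name: sample(s, rng) for name, s in schemas.items()}\n",
    "        loss = loss_fn(config)\n",
    "        if loss < best_loss:  # remember the best configuration so far\n",
    "            incumbent, best_loss = config, loss\n",
    "    return incumbent, best_loss\n",
    "\n",
    "incumbent, best_loss = random_search(space_schemas, toy_loss, max_evals=20)\n",
    "print(incumbent, best_loss)\n",
    "```\n",
    "\n",
    "Every sampled configuration is drawn from the schemas, so no trial is\n",
    "spent on a point that the schemas would reject."
   ]
  },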
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Customizing Schemas\n",
    "\n",
    "While you can use Lale operators with their schemas as-is, you can\n",
    "also customize the schemas to excert more control over the automation.\n",
    "For instance, you might want to reduce the number of trees in an\n",
    "XGBoost forest to reduce memory consumption or to improve\n",
    "explainability. Or you might want to hand-pick one of the boosters\n",
    "to reduce the search space and thus hopefully speed up the search."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "import lale.schemas as schemas\n",
    "Grove = XGBoost.customize_schema(\n",
    "    n_estimators=schemas.Int(minimum=2, maximum=6),\n",
    "    booster=schemas.Enum(['gbtree'], default='gbtree'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As this example demonstrates, Lale provides a simple Python API for\n",
    "writing schemas, which it then converts to JSON Schema internally.\n",
    "The result of customization is a new copy of the operator that can be\n",
    "used in the same way as any other operator in Lale. In particular,\n",
    "it can be part of a pipeline as before."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "grove_planned = ( Project(columns={'type': 'number'}) >> Norm\n",
    "                & Project(columns={'type': 'string'}) >> OneHot\n",
    "                ) >> Concat >> Grove"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Given this new planned pipeline, we use hyperopt as before to search for a\n",
    "good trained pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[15:02:56] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "[15:02:57] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "[15:02:58] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "[15:02:59] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "[15:02:59] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "[15:03:00] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "[15:03:01] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "[15:03:02] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "[15:03:03] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "[15:03:03] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "[15:03:04] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "[15:03:05] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "[15:03:06] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "[15:03:07] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "[15:03:07] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "[15:03:08] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "[15:03:09] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "[15:03:10] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "[15:03:11] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "[15:03:12] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "[15:03:12] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "[15:03:13] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "[15:03:14] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "[15:03:15] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "[15:03:16] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "[15:03:16] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "[15:03:17] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "[15:03:18] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "[15:03:19] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[15:03:20] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "100%|██████████| 10/10 [00:24<00:00,  2.49s/trial, best loss: -0.7313420884048686]\n",
      "[15:03:20] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n",
      "accuracy 73.0%\n"
     ]
    }
   ],
   "source": [
    "grove_optimizer = Hyperopt(estimator=grove_planned, cv=3, max_evals=10, verbose=True)\n",
    "grove_trained = grove_optimizer.fit(train_X, train_y)\n",
    "grove_y = grove_trained.predict(test_X)\n",
    "print(f'accuracy {sklearn.metrics.accuracy_score(test_y, grove_y):.1%}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The result shows that the predictive performance is not quite as good\n",
    "as before. As a data scientist, you can weigh that against other\n",
    "needs, and possibly experiment more. Of course, you can also display\n",
    "the result of automation, e.g., by pretty-printing it back as code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "```python\n",
       "project_0 = Project(columns={\"type\": \"number\"})\n",
       "norm = Norm(norm=\"l1\")\n",
       "project_1 = Project(columns={\"type\": \"string\"})\n",
       "grove = Grove(\n",
       "    gamma=0.8726303533099419,\n",
       "    learning_rate=0.7606845221791637,\n",
       "    max_depth=2,\n",
       "    min_child_weight=11,\n",
       "    n_estimators=5,\n",
       "    reg_alpha=0.5980470775121279,\n",
       "    reg_lambda=0.2546844052569046,\n",
       "    subsample=0.8142720284737895,\n",
       ")\n",
       "pipeline = (\n",
       "    ((project_0 >> norm) & (project_1 >> OneHot())) >> Concat() >> grove\n",
       ")\n",
       "```"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "grove_trained.get_pipeline().pretty_print(ipython_display=True, show_imports=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Scikit-learn Compatible Interoperability\n",
    "\n",
    "Scikit-learn has a big following, which is well-earned: it is one of the\n",
    "most complete and most usable machine-learning libraries available.\n",
    "In developing Lale, we spent a lot of effort on maintaining\n",
    "scikit-learn compatibility. Earlier on, you already saw that Lale\n",
    "pipelines work with off-the-shelf scikit-learn functions such as\n",
    "metrics, and with other libraries such as XGBoost. To demonstrate\n",
    "interoperability with additional libraries, we ran experiments with\n",
    "pipelines from various different data modalities. For the movies\n",
    "review text dataset, the best pipeline used a PyTorch implementation\n",
    "of the BERT embedding.  For the car tabular dataset, the best pipeline\n",
    "used a Java implementation of the J48 decision tree. For the CIFAR-10\n",
    "images dataset, the best pipeline used a PyTorch implementation of a\n",
    "ResNet50 neural network. And finally, for the epilepsy time-series\n",
    "dataset, the best pipeline used a window transformer and voting\n",
    "operator pair that was written for the task and then made available in\n",
    "the Lale library. That demonstrates that Lale can use operators from\n",
    "other libraries beyond scikit-learn and even from other languages than\n",
    "Python. It also demonstrates that for some tasks, this interoperability\n",
    "is necessary for better predictive performance.\n",
    "\n",
    "<img src=\"img/2019-1105-interop.png\" style=\"width:550px\" align=\"left\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "Please check out the Lale github repository for examples,\n",
    "documentation, and code: https://github.com/ibm/lale\n",
    "\n",
    "This notebook started by listing three values that Lale provides:\n",
    "automation, interoperability, and usability. Then, this notebook\n",
    "introduced three fundamental concepts around which Lale is designed:\n",
    "bindings as lifecycle, types as search spaces, and scikit-learn\n",
    "compatible interoperability. As the following picture illustrates, the\n",
    "three concepts sit at the intersection of the three values.\n",
    "\n",
    "<img src=\"img/2019-1105-summary.png\" style=\"width:350px\" align=\"left\">"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}