{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lale's AutoPipeline Operator\n", "\n", "The `lale.lib.lale.AutoPipeline` operator automatically creates a pipeline for a dataset.\n", "It is designed for simplicity, requiring minimal configuration to get started.\n", "You can use it for an initial experiment with new data. See also the\n", "[API documentation](https://lale.readthedocs.io/en/latest/modules/lale.lib.lale.auto_pipeline.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataset\n", "\n", "This demonstration uses the [credit-g](https://www.openml.org/d/31) dataset from OpenML.\n", "The dataset has both categorical features, represented as strings, and\n", "numeric features. For illustration purposes, we also add some missing values,\n", "represented as `NaN`." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "train_X.shape (670, 20)\n" ] } ], "source": [ "import lale.datasets.openml\n", "import lale.helpers\n", "(orig_train_X, train_y), (test_X, test_y) = \\\n", "    lale.datasets.openml.fetch('credit-g', 'classification', preprocess=False)\n", "train_X = lale.helpers.add_missing_values(orig_train_X, seed=42)\n", "print(f'train_X.shape {train_X.shape}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Printing the last few samples of the training data reveals\n", "`credit_amount=NaN` for sample number 763." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
classchecking_statusdurationcredit_historypurposecredit_amountsavings_statusemploymentinstallment_commitmentpersonal_statusother_partiesresidence_sinceproperty_magnitudeageother_payment_planshousingexisting_creditsjobnum_dependentsown_telephoneforeign_worker
763badno checking21.0critical/other existing creditnew carNaNno known savings>=74.0male singlenone4.0no known property30.0nonefor free1.0high qualif/self emp/mgmt1.0yesyes
835bad<012.0no credits/all paidnew car1082.0<1001<=X<44.0male singlenone4.0car48.0bankown2.0skilled1.0noneyes
192bad0<=X<20027.0existing paidbusiness3915.0<1001<=X<44.0male singlenone2.0car36.0noneown1.0skilled2.0yesyes
629goodno checking9.0existing paideducation3832.0no known savings>=71.0male singlenone4.0real estate64.0noneown1.0unskilled resident1.0noneyes
559bad0<=X<20018.0critical/other existing creditfurniture/equipment1928.0<100<12.0male singlenone2.0real estate31.0noneown2.0unskilled resident1.0noneyes
684good0<=X<20036.0delayed previouslybusiness9857.0100<=X<5004<=X<71.0male singlenone3.0life insurance31.0noneown2.0unskilled resident2.0yesyes
\n", "
" ], "text/plain": [ " class checking_status duration credit_history \\\n", "763 bad no checking 21.0 critical/other existing credit \n", "835 bad <0 12.0 no credits/all paid \n", "192 bad 0<=X<200 27.0 existing paid \n", "629 good no checking 9.0 existing paid \n", "559 bad 0<=X<200 18.0 critical/other existing credit \n", "684 good 0<=X<200 36.0 delayed previously \n", "\n", " purpose credit_amount savings_status employment \\\n", "763 new car NaN no known savings >=7 \n", "835 new car 1082.0 <100 1<=X<4 \n", "192 business 3915.0 <100 1<=X<4 \n", "629 education 3832.0 no known savings >=7 \n", "559 furniture/equipment 1928.0 <100 <1 \n", "684 business 9857.0 100<=X<500 4<=X<7 \n", "\n", " installment_commitment personal_status other_parties residence_since \\\n", "763 4.0 male single none 4.0 \n", "835 4.0 male single none 4.0 \n", "192 4.0 male single none 2.0 \n", "629 1.0 male single none 4.0 \n", "559 2.0 male single none 2.0 \n", "684 1.0 male single none 3.0 \n", "\n", " property_magnitude age other_payment_plans housing existing_credits \\\n", "763 no known property 30.0 none for free 1.0 \n", "835 car 48.0 bank own 2.0 \n", "192 car 36.0 none own 1.0 \n", "629 real estate 64.0 none own 1.0 \n", "559 real estate 31.0 none own 2.0 \n", "684 life insurance 31.0 none own 2.0 \n", "\n", " job num_dependents own_telephone foreign_worker \n", "763 high qualif/self emp/mgmt 1.0 yes yes \n", "835 skilled 1.0 none yes \n", "192 skilled 2.0 yes yes \n", "629 unskilled resident 1.0 none yes \n", "559 unskilled resident 1.0 none yes \n", "684 unskilled resident 2.0 yes yes " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "pd.options.display.max_columns = None\n", "pd.concat([train_y.tail(6), train_X.tail(6)], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sklearn Interface\n", "\n", "We designed the AutoPipeline operator to follow sklearn's init-fit-predict\n", "conventions 
to make it easy to use for anyone familiar with sklearn.\n", "\n", "- During initialization, you can configure the behavior of the operator.\n", "  Here, we set `prediction_type='classification'`; AutoPipeline also supports\n", "  `'regression'`. The `max_opt_time` argument is the timeout, in seconds, for\n", "  finding a pipeline for the dataset.\n", "- The call to `fit` initiates training, which tries out various pipelines\n", "  on the dataset.\n", "- Finally, after training, `predict` makes predictions using the best\n", "  pipeline found." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from lale.lib.lale import AutoPipeline\n", "trainable = AutoPipeline(prediction_type='classification', max_opt_time=90)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 29.9 s, sys: 1.59 s, total: 31.5 s\n", "Wall time: 1min 39s\n" ] } ], "source": [ "%%time\n", "trained = trainable.fit(train_X, train_y)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "predicted = trained.predict(test_X)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result of `predict` can then either be used directly (e.g., printed)\n", "or passed to other sklearn functions (e.g., for scoring)." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "last 6 predictions: ['bad' 'bad' 'good' 'bad' 'good' 'bad']\n", "accuracy 73.9%\n" ] } ], "source": [ "from sklearn.metrics import accuracy_score\n", "print(f'last 6 predictions: {predicted[-6:]}')\n", "print(f'accuracy {accuracy_score(test_y, predicted):.1%}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inspecting Results\n", "\n", "After training, you can look at a leaderboard of all the pipelines tried\n", "during the search by calling the `summary` method."
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
losstimelog_lossstatustid
name
gbt_all-0.7059702.4268310.721055okNaN
p2-0.7044782.6494560.561390ok2.0
p1-0.6985071.5914140.570718ok1.0
baseline-0.6955220.077895NaNokNaN
p4-0.6955222.758111NaNok4.0
p0-0.6865671.7580130.616135ok0.0
p3-0.6835821.7777910.765601ok3.0
gbt_num-0.6417911.1300190.768635okNaN
p5NaNNaNNaNnew5.0
\n", "
" ], "text/plain": [ " loss time log_loss status tid\n", "name \n", "gbt_all -0.705970 2.426831 0.721055 ok NaN\n", "p2 -0.704478 2.649456 0.561390 ok 2.0\n", "p1 -0.698507 1.591414 0.570718 ok 1.0\n", "baseline -0.695522 0.077895 NaN ok NaN\n", "p4 -0.695522 2.758111 NaN ok 4.0\n", "p0 -0.686567 1.758013 0.616135 ok 0.0\n", "p3 -0.683582 1.777791 0.765601 ok 3.0\n", "gbt_num -0.641791 1.130019 0.768635 ok NaN\n", "p5 NaN NaN NaN new 5.0" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "trained.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `get_pipeline` method lets you retrieve any pipeline by name.\n", "By default, when no name is specified, it returns the best pipeline\n", "found, in other words, the pipeline with the lowest loss in the\n", "leaderboard.\n", "You can call `predict` on that pipeline in typical sklearn fashion.\n", "Furthermore, you can inspect that pipeline by calling `visualize`." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "cluster:(root)\n", "\n", "\n", "\n", "\n", "\n", "project_0\n", "\n", "\n", "Project\n", "\n", "\n", "\n", "\n", "simple_imputer_0\n", "\n", "\n", "Simple-\n", "Imputer\n", "\n", "\n", "\n", "\n", "project_0->simple_imputer_0\n", "\n", "\n", "\n", "\n", "concat_features\n", "\n", "\n", "Concat-\n", "Features\n", "\n", "\n", "\n", "\n", "simple_imputer_0->concat_features\n", "\n", "\n", "\n", "\n", "project_1\n", "\n", "\n", "Project\n", "\n", "\n", "\n", "\n", "simple_imputer_1\n", "\n", "\n", "Simple-\n", "Imputer\n", "\n", "\n", "\n", "\n", "project_1->simple_imputer_1\n", "\n", "\n", "\n", "\n", "one_hot_encoder\n", "\n", "\n", "One-\n", "Hot-\n", "Encoder\n", "\n", "\n", "\n", "\n", "simple_imputer_1->one_hot_encoder\n", "\n", "\n", "\n", "\n", "one_hot_encoder->concat_features\n", "\n", "\n", "\n", "\n", "xgb_classifier\n", "\n", "\n", "XGB-\n", 
"Classifier\n", "\n", "\n", "\n", "\n", "concat_features->xgb_classifier\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "best_found = trained.get_pipeline()\n", "best_found.visualize()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The visualization reveals the operators from which the pipeline is\n", "composed. Since our example dataset contains both numeric features and\n", "categorical features, the pipeline contains two preprocessing paths,\n", "using `Project` to keep only the relevant columns.\n", "When you hover the mouse pointer over an operator in the visualization,\n", "a tooltip shows how it is configured. Furthermore, if you click on an\n", "operator in the visualization, you get to a documentation page\n", "for that operator.\n", "You can also pretty-print any pipeline found during the search back as\n", "Python code to better understand how its operators were configured." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "```python\n", "from lale.lib.lale import Project\n", "from sklearn.impute import SimpleImputer\n", "from sklearn.preprocessing import OneHotEncoder\n", "from lale.lib.lale import ConcatFeatures\n", "from xgboost import XGBClassifier\n", "import lale\n", "\n", "lale.wrap_imported_operators()\n", "project_0 = Project(\n", " columns={\"type\": \"number\"},\n", " drop_columns=lale.lib.lale.categorical(max_values=5),\n", ")\n", "project_1 = Project(columns=lale.lib.lale.categorical(max_values=5))\n", "simple_imputer_1 = SimpleImputer(strategy=\"most_frequent\")\n", "one_hot_encoder = OneHotEncoder(handle_unknown=\"ignore\")\n", "pipeline = (\n", " (\n", " (project_0 >> SimpleImputer())\n", " & (project_1 >> simple_imputer_1 >> one_hot_encoder)\n", " )\n", " >> ConcatFeatures\n", " >> XGBClassifier()\n", ")\n", "```" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } 
], "source": [ "best_found.pretty_print(ipython_display=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Next Steps\n", "\n", "You can try out the AutoPipeline operator on your own data.\n", "Alternatively, if you want more control, you can also plan your own pipelines,\n", "then use Lale to do automated algorithm selection and hyperparameter tuning\n", "on them. Check out the other [example\n", "notebooks](https://github.com/IBM/lale/tree/master/examples) for Lale,\n", "and in particular, the [introductory\n", "guide](https://nbviewer.jupyter.org/github/IBM/lale/blob/master/examples/docs_guide_for_sklearn_users.ipynb)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }