{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# AI Automation for AI Fairness\n",
    "\n",
    "When AI models contribute to high-impact decisions such as whether or\n",
    "not someone gets a loan, we want them to be fair.\n",
    "Unfortunately, in current practice, AI models are often optimized\n",
    "primarily for accuracy, with little consideration for fairness.  This\n",
    "notebook gives a hands-on example for how AI Automation can help build AI\n",
    "models that are both accurate and fair.\n",
    "This notebook is written for data scientists who have some familiarity\n",
    "with Python. No prior knowledge of AI Automation or AI Fairness is\n",
    "required, we will introduce the relevant concepts as we get to them.\n",
    "\n",
    "Bias in data leads to bias in models. AI models are increasingly\n",
    "consulted for consequential decisions about people, in domains\n",
    "including credit loans, hiring and retention, penal justice, medical,\n",
    "and more. Often, the model is trained from past decisions made by\n",
    "humans. If the decisions used for training were discriminatory, then\n",
    "your trained model will be too, unless you are careful. Being careful\n",
    "about bias is something you should do as a data scientist.\n",
    "Fortunately, you do not have to grapple with this issue alone.  You\n",
    "can consult others about ethics. You can also ask yourself how your AI\n",
    "model may affect your (or your institution's) reputation. And\n",
    "ultimately, you must follow applicable laws and regulations.\n",
    "\n",
    "_AI Fairness_ can be measured via several metrics, and you need to\n",
    "select the appropriate metrics based on the circumstances.  For\n",
    "illustration purposes, this notebook uses one particular fairness\n",
    "metric called _disparate impact_. Disparate impact is defined as the\n",
    "ratio of the rate of favorable outcome for the unprivileged group to\n",
    "that of the privileged group. To make this definition more concrete,\n",
    "consider the case where a favorable outcome means getting a loan, the\n",
    "unprivileged group is women, and the privileged group is men.  Then if\n",
    "your AI model were to let women get a loan in 30% of the cases and men\n",
    "in 60% of the cases, the disparate impact would be 30% / 60% = 0.5,\n",
    "indicating a gender bias towards men.  The ideal value for disparate\n",
    "impact is 1, and you could define fairness for this metric as a band\n",
    "around 1, e.g., from 0.8 to 1.25.\n",
    "\n",
    "To get the best performance out of your AI model, you must experiment\n",
    "with its configuration. This means searching a high-dimensional space\n",
    "where some options are categorical, some are continuous, and some are\n",
    "even conditional. No configuration is optimal for all domains let\n",
    "alone all metrics, and searching them all by hand is impossible. In\n",
    "fact, in a high-dimensional space, even exhaustively enumerating all\n",
    "the valid combinations soon becomes impractical.  Fortunately, you can\n",
    "use tools to automate the search, thus making you more productive at\n",
    "finding good models quickly. These productivity and quality\n",
    "improvements become compounded when you have to do the search over.\n",
    "\n",
    "_AI Automation_ is a technology that assists data scientists in\n",
    "building AI models by automating some of the tedious steps. One AI\n",
    "automation technique is _algorithm selection_ , which automatically\n",
    "chooses among alternative algorithms for a particular task. Another AI\n",
    "automation technique is _hyperparameter tuning_ , which automatically\n",
    "configures the arguments of AI algorithms. You can use AI automation\n",
    "to optimize for a variety of metrics.  This notebook shows you how to use AI\n",
    "automation to optimize for both accuracy and for fairness as measured\n",
    "by disparate impact.\n",
    "\n",
    "This [Jupyter](https://jupyter.org/)\n",
    "notebook uses the following open-source Python libraries. \n",
    "[AIF360](https://aif360.mybluemix.net/) \n",
    "is a collection of fairness metrics and bias mitigation algorithms.\n",
    "The [pandas](https://pandas.pydata.org/) and\n",
    "[scikit-learn](https://scikit-learn.org/) libraries support\n",
    "data analysis and machine learning with data structures and a\n",
    "comprehensive collection of AI algorithms.\n",
    "The [hyperopt](http://hyperopt.github.io/hyperopt/) library\n",
    "implements both algorithm selection and hyperparameter tuning for\n",
    "AI automation.\n",
    "And [Lale](https://github.com/IBM/lale) is a library for\n",
    "semi-automated data science; this notebook uses Lale as the backbone\n",
    "for putting the other libraries together.\n",
    "\n",
    "Our starting point is a dataset and a task. For illustration\n",
    "purposes, we picked [credit-g](https://www.openml.org/d/31), also\n",
    "known as the German Credit dataset. Each row describes a person\n",
    "using several features that may help evaluate them as a potential\n",
    "loan applicant. The task is to classify people into either\n",
    "good or bad credit risks. We load the version of the dataset from\n",
    "OpenML along with some fairness metadata."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Possible set intersection at position 3\n"
     ]
    }
   ],
   "source": [
    "from lale.lib.aif360 import fetch_creditg_df\n",
    "all_X, all_y, fairness_info = fetch_creditg_df()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what the dataset looks like, we can use off-the-shelf\n",
    "functionality from pandas for inspecting a few\n",
    "rows.  The creditg dataset has a single label column, `class`, to be\n",
    "predicted as the outcome, which can be `good` or `bad`. Some of the\n",
    "feature columns are numbers, others are categoricals."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>class</th>\n",
       "      <th>checking_status</th>\n",
       "      <th>duration</th>\n",
       "      <th>credit_history</th>\n",
       "      <th>purpose</th>\n",
       "      <th>credit_amount</th>\n",
       "      <th>savings_status</th>\n",
       "      <th>employment</th>\n",
       "      <th>installment_commitment</th>\n",
       "      <th>personal_status</th>\n",
       "      <th>other_parties</th>\n",
       "      <th>residence_since</th>\n",
       "      <th>property_magnitude</th>\n",
       "      <th>age</th>\n",
       "      <th>other_payment_plans</th>\n",
       "      <th>housing</th>\n",
       "      <th>existing_credits</th>\n",
       "      <th>job</th>\n",
       "      <th>num_dependents</th>\n",
       "      <th>own_telephone</th>\n",
       "      <th>foreign_worker</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>good</td>\n",
       "      <td>&lt;0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>critical/other existing credit</td>\n",
       "      <td>radio/tv</td>\n",
       "      <td>1169.0</td>\n",
       "      <td>no known savings</td>\n",
       "      <td>&gt;=7</td>\n",
       "      <td>4.0</td>\n",
       "      <td>male single</td>\n",
       "      <td>none</td>\n",
       "      <td>4.0</td>\n",
       "      <td>real estate</td>\n",
       "      <td>67.0</td>\n",
       "      <td>none</td>\n",
       "      <td>own</td>\n",
       "      <td>2.0</td>\n",
       "      <td>skilled</td>\n",
       "      <td>1.0</td>\n",
       "      <td>yes</td>\n",
       "      <td>yes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>bad</td>\n",
       "      <td>0&lt;=X&lt;200</td>\n",
       "      <td>48.0</td>\n",
       "      <td>existing paid</td>\n",
       "      <td>radio/tv</td>\n",
       "      <td>5951.0</td>\n",
       "      <td>&lt;100</td>\n",
       "      <td>1&lt;=X&lt;4</td>\n",
       "      <td>2.0</td>\n",
       "      <td>female div/dep/mar</td>\n",
       "      <td>none</td>\n",
       "      <td>2.0</td>\n",
       "      <td>real estate</td>\n",
       "      <td>22.0</td>\n",
       "      <td>none</td>\n",
       "      <td>own</td>\n",
       "      <td>1.0</td>\n",
       "      <td>skilled</td>\n",
       "      <td>1.0</td>\n",
       "      <td>none</td>\n",
       "      <td>yes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>good</td>\n",
       "      <td>no checking</td>\n",
       "      <td>12.0</td>\n",
       "      <td>critical/other existing credit</td>\n",
       "      <td>education</td>\n",
       "      <td>2096.0</td>\n",
       "      <td>&lt;100</td>\n",
       "      <td>4&lt;=X&lt;7</td>\n",
       "      <td>2.0</td>\n",
       "      <td>male single</td>\n",
       "      <td>none</td>\n",
       "      <td>3.0</td>\n",
       "      <td>real estate</td>\n",
       "      <td>49.0</td>\n",
       "      <td>none</td>\n",
       "      <td>own</td>\n",
       "      <td>1.0</td>\n",
       "      <td>unskilled resident</td>\n",
       "      <td>2.0</td>\n",
       "      <td>none</td>\n",
       "      <td>yes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>good</td>\n",
       "      <td>&lt;0</td>\n",
       "      <td>42.0</td>\n",
       "      <td>existing paid</td>\n",
       "      <td>furniture/equipment</td>\n",
       "      <td>7882.0</td>\n",
       "      <td>&lt;100</td>\n",
       "      <td>4&lt;=X&lt;7</td>\n",
       "      <td>2.0</td>\n",
       "      <td>male single</td>\n",
       "      <td>guarantor</td>\n",
       "      <td>4.0</td>\n",
       "      <td>life insurance</td>\n",
       "      <td>45.0</td>\n",
       "      <td>none</td>\n",
       "      <td>for free</td>\n",
       "      <td>1.0</td>\n",
       "      <td>skilled</td>\n",
       "      <td>2.0</td>\n",
       "      <td>none</td>\n",
       "      <td>yes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>bad</td>\n",
       "      <td>&lt;0</td>\n",
       "      <td>24.0</td>\n",
       "      <td>delayed previously</td>\n",
       "      <td>new car</td>\n",
       "      <td>4870.0</td>\n",
       "      <td>&lt;100</td>\n",
       "      <td>1&lt;=X&lt;4</td>\n",
       "      <td>3.0</td>\n",
       "      <td>male single</td>\n",
       "      <td>none</td>\n",
       "      <td>4.0</td>\n",
       "      <td>no known property</td>\n",
       "      <td>53.0</td>\n",
       "      <td>none</td>\n",
       "      <td>for free</td>\n",
       "      <td>2.0</td>\n",
       "      <td>skilled</td>\n",
       "      <td>2.0</td>\n",
       "      <td>none</td>\n",
       "      <td>yes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>995</th>\n",
       "      <td>good</td>\n",
       "      <td>no checking</td>\n",
       "      <td>12.0</td>\n",
       "      <td>existing paid</td>\n",
       "      <td>furniture/equipment</td>\n",
       "      <td>1736.0</td>\n",
       "      <td>&lt;100</td>\n",
       "      <td>4&lt;=X&lt;7</td>\n",
       "      <td>3.0</td>\n",
       "      <td>female div/dep/mar</td>\n",
       "      <td>none</td>\n",
       "      <td>4.0</td>\n",
       "      <td>real estate</td>\n",
       "      <td>31.0</td>\n",
       "      <td>none</td>\n",
       "      <td>own</td>\n",
       "      <td>1.0</td>\n",
       "      <td>unskilled resident</td>\n",
       "      <td>1.0</td>\n",
       "      <td>none</td>\n",
       "      <td>yes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>996</th>\n",
       "      <td>good</td>\n",
       "      <td>&lt;0</td>\n",
       "      <td>30.0</td>\n",
       "      <td>existing paid</td>\n",
       "      <td>used car</td>\n",
       "      <td>3857.0</td>\n",
       "      <td>&lt;100</td>\n",
       "      <td>1&lt;=X&lt;4</td>\n",
       "      <td>4.0</td>\n",
       "      <td>male div/sep</td>\n",
       "      <td>none</td>\n",
       "      <td>4.0</td>\n",
       "      <td>life insurance</td>\n",
       "      <td>40.0</td>\n",
       "      <td>none</td>\n",
       "      <td>own</td>\n",
       "      <td>1.0</td>\n",
       "      <td>high qualif/self emp/mgmt</td>\n",
       "      <td>1.0</td>\n",
       "      <td>yes</td>\n",
       "      <td>yes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>997</th>\n",
       "      <td>good</td>\n",
       "      <td>no checking</td>\n",
       "      <td>12.0</td>\n",
       "      <td>existing paid</td>\n",
       "      <td>radio/tv</td>\n",
       "      <td>804.0</td>\n",
       "      <td>&lt;100</td>\n",
       "      <td>&gt;=7</td>\n",
       "      <td>4.0</td>\n",
       "      <td>male single</td>\n",
       "      <td>none</td>\n",
       "      <td>4.0</td>\n",
       "      <td>car</td>\n",
       "      <td>38.0</td>\n",
       "      <td>none</td>\n",
       "      <td>own</td>\n",
       "      <td>1.0</td>\n",
       "      <td>skilled</td>\n",
       "      <td>1.0</td>\n",
       "      <td>none</td>\n",
       "      <td>yes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>998</th>\n",
       "      <td>bad</td>\n",
       "      <td>&lt;0</td>\n",
       "      <td>45.0</td>\n",
       "      <td>existing paid</td>\n",
       "      <td>radio/tv</td>\n",
       "      <td>1845.0</td>\n",
       "      <td>&lt;100</td>\n",
       "      <td>1&lt;=X&lt;4</td>\n",
       "      <td>4.0</td>\n",
       "      <td>male single</td>\n",
       "      <td>none</td>\n",
       "      <td>4.0</td>\n",
       "      <td>no known property</td>\n",
       "      <td>23.0</td>\n",
       "      <td>none</td>\n",
       "      <td>for free</td>\n",
       "      <td>1.0</td>\n",
       "      <td>skilled</td>\n",
       "      <td>1.0</td>\n",
       "      <td>yes</td>\n",
       "      <td>yes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>999</th>\n",
       "      <td>good</td>\n",
       "      <td>0&lt;=X&lt;200</td>\n",
       "      <td>45.0</td>\n",
       "      <td>critical/other existing credit</td>\n",
       "      <td>used car</td>\n",
       "      <td>4576.0</td>\n",
       "      <td>100&lt;=X&lt;500</td>\n",
       "      <td>unemployed</td>\n",
       "      <td>3.0</td>\n",
       "      <td>male single</td>\n",
       "      <td>none</td>\n",
       "      <td>4.0</td>\n",
       "      <td>car</td>\n",
       "      <td>27.0</td>\n",
       "      <td>none</td>\n",
       "      <td>own</td>\n",
       "      <td>1.0</td>\n",
       "      <td>skilled</td>\n",
       "      <td>1.0</td>\n",
       "      <td>none</td>\n",
       "      <td>yes</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1000 rows × 21 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "    class checking_status  duration                  credit_history  \\\n",
       "0    good              <0       6.0  critical/other existing credit   \n",
       "1     bad        0<=X<200      48.0                   existing paid   \n",
       "2    good     no checking      12.0  critical/other existing credit   \n",
       "3    good              <0      42.0                   existing paid   \n",
       "4     bad              <0      24.0              delayed previously   \n",
       "..    ...             ...       ...                             ...   \n",
       "995  good     no checking      12.0                   existing paid   \n",
       "996  good              <0      30.0                   existing paid   \n",
       "997  good     no checking      12.0                   existing paid   \n",
       "998   bad              <0      45.0                   existing paid   \n",
       "999  good        0<=X<200      45.0  critical/other existing credit   \n",
       "\n",
       "                 purpose  credit_amount    savings_status  employment  \\\n",
       "0               radio/tv         1169.0  no known savings         >=7   \n",
       "1               radio/tv         5951.0              <100      1<=X<4   \n",
       "2              education         2096.0              <100      4<=X<7   \n",
       "3    furniture/equipment         7882.0              <100      4<=X<7   \n",
       "4                new car         4870.0              <100      1<=X<4   \n",
       "..                   ...            ...               ...         ...   \n",
       "995  furniture/equipment         1736.0              <100      4<=X<7   \n",
       "996             used car         3857.0              <100      1<=X<4   \n",
       "997             radio/tv          804.0              <100         >=7   \n",
       "998             radio/tv         1845.0              <100      1<=X<4   \n",
       "999             used car         4576.0        100<=X<500  unemployed   \n",
       "\n",
       "     installment_commitment     personal_status other_parties  \\\n",
       "0                       4.0         male single          none   \n",
       "1                       2.0  female div/dep/mar          none   \n",
       "2                       2.0         male single          none   \n",
       "3                       2.0         male single     guarantor   \n",
       "4                       3.0         male single          none   \n",
       "..                      ...                 ...           ...   \n",
       "995                     3.0  female div/dep/mar          none   \n",
       "996                     4.0        male div/sep          none   \n",
       "997                     4.0         male single          none   \n",
       "998                     4.0         male single          none   \n",
       "999                     3.0         male single          none   \n",
       "\n",
       "     residence_since property_magnitude   age other_payment_plans   housing  \\\n",
       "0                4.0        real estate  67.0                none       own   \n",
       "1                2.0        real estate  22.0                none       own   \n",
       "2                3.0        real estate  49.0                none       own   \n",
       "3                4.0     life insurance  45.0                none  for free   \n",
       "4                4.0  no known property  53.0                none  for free   \n",
       "..               ...                ...   ...                 ...       ...   \n",
       "995              4.0        real estate  31.0                none       own   \n",
       "996              4.0     life insurance  40.0                none       own   \n",
       "997              4.0                car  38.0                none       own   \n",
       "998              4.0  no known property  23.0                none  for free   \n",
       "999              4.0                car  27.0                none       own   \n",
       "\n",
       "     existing_credits                        job  num_dependents  \\\n",
       "0                 2.0                    skilled             1.0   \n",
       "1                 1.0                    skilled             1.0   \n",
       "2                 1.0         unskilled resident             2.0   \n",
       "3                 1.0                    skilled             2.0   \n",
       "4                 2.0                    skilled             2.0   \n",
       "..                ...                        ...             ...   \n",
       "995               1.0         unskilled resident             1.0   \n",
       "996               1.0  high qualif/self emp/mgmt             1.0   \n",
       "997               1.0                    skilled             1.0   \n",
       "998               1.0                    skilled             1.0   \n",
       "999               1.0                    skilled             1.0   \n",
       "\n",
       "    own_telephone foreign_worker  \n",
       "0             yes            yes  \n",
       "1            none            yes  \n",
       "2            none            yes  \n",
       "3            none            yes  \n",
       "4            none            yes  \n",
       "..            ...            ...  \n",
       "995          none            yes  \n",
       "996           yes            yes  \n",
       "997          none            yes  \n",
       "998           yes            yes  \n",
       "999          none            yes  \n",
       "\n",
       "[1000 rows x 21 columns]"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "pd.options.display.max_columns = None\n",
    "pd.concat([all_y, all_X], axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `fairness_info` is a JSON object that specifies metadata you\n",
    "need for measuring and mitigating fairness. The `favorable_labels`\n",
    "attribute indicates that when the `class` column contains the value\n",
    "`good`, that is considered a positive outcome.\n",
    "A _protected attribute_ is a feature that partitions the population\n",
    "into groups whose outcome should have parity.\n",
    "Values in the `personal_status` column that indicate that the indidual\n",
    "is `male` are considered privileged, and so are values in the\n",
    "`age` column that indicate that the individual is between 26 and 1000\n",
    "years old."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "```python\n",
       "{\n",
       "    \"favorable_labels\": [\"good\"],\n",
       "    \"protected_attributes\": [\n",
       "        {\n",
       "            \"feature\": \"personal_status\",\n",
       "            \"reference_group\": [\n",
       "                \"male div/sep\", \"male mar/wid\", \"male single\",\n",
       "            ],\n",
       "        },\n",
       "        {\"feature\": \"age\", \"reference_group\": [[26, 1000]]},\n",
       "    ],\n",
       "}\n",
       "```"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import lale.pretty_print\n",
    "lale.pretty_print.ipython_display(fairness_info)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A best practice for any machine-learning experiments is to split\n",
    "the data into a training set and a hold-out set. Doing so helps \n",
    "detect and prevent over-fitting. The fairness information induces\n",
    "groups in the dataset by outcomes and by privileged groups. We\n",
    "want the distribution of these groups to be similar for the training\n",
    "set and the holdout set. Therefore, we split the data in a\n",
    "stratified way."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "from lale.lib.aif360 import fair_stratified_train_test_split\n",
    "train_X, test_X, train_y, test_y = fair_stratified_train_test_split(\n",
    "    all_X, all_y, **fairness_info, test_size=0.33, random_state=42)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's use the `disparate_impact` metric to measure how biased the\n",
    "training data and the test data are. At 0.75 and 0.73, they are far\n",
    "from the ideal value of 1.0."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "disparate impact of training data 0.75, test data 0.73\n"
     ]
    }
   ],
   "source": [
    "from lale.lib.aif360 import disparate_impact\n",
    "disparate_impact_scorer = disparate_impact(**fairness_info)\n",
    "print(\"disparate impact of training data {:.2f}, test data {:.2f}\".format(\n",
    "    disparate_impact_scorer.score_data(X=train_X, y_pred=train_y),\n",
    "    disparate_impact_scorer.score_data(X=test_X, y_pred=test_y)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before we look at how to train a classifier that is optimized for both\n",
    "accuracy and disparate impact, we will set a baseline, by training a\n",
    "pipeline that is only optimized for accuracy. For this purpose, we\n",
    "import a few algorithms from scikit-learn and Lale:\n",
    "`Project` picks a subset of the feature columns,\n",
    "`OneHotEncoder` turns categoricals into numbers,\n",
    "`ConcatFeatures` combines sets of feature columns,\n",
    "and the three interpretable classifiers `LR`, `Tree`, and `KNN`\n",
    "make predictions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "from lale.lib.lale import Project\n",
    "from sklearn.preprocessing import OneHotEncoder\n",
    "from lale.lib.lale import ConcatFeatures\n",
    "from sklearn.linear_model import LogisticRegression as LR\n",
    "from sklearn.tree import DecisionTreeClassifier as Tree\n",
    "from sklearn.neighbors import KNeighborsClassifier as KNN"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To use AI Automation, we need to define a _search space_ ,\n",
    "which is a set of possible machine learning pipelines and\n",
    "their associated hyperparameters. The following code\n",
    "uses Lale to define a search space."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/svg+xml": [
       "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
       "<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
       " \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
       "<!-- Generated by graphviz version 2.38.0 (20140413.2041)\n",
       " -->\n",
       "<!-- Title: cluster:(root) Pages: 1 -->\n",
       "<svg width=\"370pt\" height=\"185pt\"\n",
       " viewBox=\"0.00 0.00 370.11 185.00\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
       "<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 181)\">\n",
       "<title>cluster:(root)</title>\n",
       "<g id=\"a_graph0\"><a xlink:title=\"(root) = ...\">\n",
       "<polygon fill=\"white\" stroke=\"none\" points=\"-4,4 -4,-181 366.108,-181 366.108,4 -4,4\"/>\n",
       "</a>\n",
       "</g>\n",
       "<g id=\"clust1\" class=\"cluster\"><title>cluster:choice</title>\n",
       "<g id=\"a_clust1\"><a xlink:title=\"choice = lr | tree | knn\">\n",
       "<polygon fill=\"#7ec0ee\" stroke=\"black\" points=\"284.108,-8 284.108,-169 354.108,-169 354.108,-8 284.108,-8\"/>\n",
       "<text text-anchor=\"middle\" x=\"319.108\" y=\"-153.8\" font-family=\"Times,serif\" font-size=\"14.00\">Choice</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- project_0 -->\n",
       "<g id=\"node1\" class=\"node\"><title>project_0</title>\n",
       "<g id=\"a_node1\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.lale.project.html\" target=\"_blank\" rel=\"noopener noreferrer\" xlink:title=\"project_0 = Project(columns={&#39;type&#39;: &#39;string&#39;})\">\n",
       "<ellipse fill=\"#b0e2ff\" stroke=\"black\" cx=\"27\" cy=\"-147\" rx=\"27\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"27\" y=\"-144.2\" font-family=\"Times,serif\" font-size=\"11.00\">Project</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- one_hot_encoder -->\n",
       "<g id=\"node2\" class=\"node\"><title>one_hot_encoder</title>\n",
       "<g id=\"a_node2\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.sklearn.one_hot_encoder.html\" target=\"_blank\" rel=\"noopener noreferrer\" xlink:title=\"one_hot_encoder = OneHotEncoder(handle_unknown=&#39;ignore&#39;)\">\n",
       "<ellipse fill=\"#b0e2ff\" stroke=\"black\" cx=\"121.82\" cy=\"-147\" rx=\"31.6406\" ry=\"28.0702\"/>\n",
       "<text text-anchor=\"middle\" x=\"121.82\" y=\"-156.2\" font-family=\"Times,serif\" font-size=\"11.00\">One&#45;</text>\n",
       "<text text-anchor=\"middle\" x=\"121.82\" y=\"-144.2\" font-family=\"Times,serif\" font-size=\"11.00\">Hot&#45;</text>\n",
       "<text text-anchor=\"middle\" x=\"121.82\" y=\"-132.2\" font-family=\"Times,serif\" font-size=\"11.00\">Encoder</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- project_0&#45;&gt;one_hot_encoder -->\n",
       "<g id=\"edge1\" class=\"edge\"><title>project_0&#45;&gt;one_hot_encoder</title>\n",
       "<path fill=\"none\" stroke=\"black\" d=\"M54.3502,-147C62.1996,-147 70.997,-147 79.5579,-147\"/>\n",
       "<polygon fill=\"black\" stroke=\"black\" points=\"79.8033,-150.5 89.8033,-147 79.8032,-143.5 79.8033,-150.5\"/>\n",
       "</g>\n",
       "<!-- concat_features -->\n",
       "<g id=\"node4\" class=\"node\"><title>concat_features</title>\n",
       "<g id=\"a_node4\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.rasl.concat_features.html\" target=\"_blank\" rel=\"noopener noreferrer\" xlink:title=\"concat_features = ConcatFeatures\">\n",
       "<ellipse fill=\"white\" stroke=\"black\" cx=\"222.874\" cy=\"-120\" rx=\"33.4697\" ry=\"19.6\"/>\n",
       "<text text-anchor=\"middle\" x=\"222.874\" y=\"-123.2\" font-family=\"Times,serif\" font-size=\"11.00\">Concat&#45;</text>\n",
       "<text text-anchor=\"middle\" x=\"222.874\" y=\"-111.2\" font-family=\"Times,serif\" font-size=\"11.00\">Features</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- one_hot_encoder&#45;&gt;concat_features -->\n",
       "<g id=\"edge2\" class=\"edge\"><title>one_hot_encoder&#45;&gt;concat_features</title>\n",
       "<path fill=\"none\" stroke=\"black\" d=\"M152.539,-138.899C161.942,-136.336 172.485,-133.462 182.457,-130.744\"/>\n",
       "<polygon fill=\"black\" stroke=\"black\" points=\"183.448,-134.102 192.175,-128.095 181.607,-127.348 183.448,-134.102\"/>\n",
       "</g>\n",
       "<!-- project_1 -->\n",
       "<g id=\"node3\" class=\"node\"><title>project_1</title>\n",
       "<g id=\"a_node3\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.lale.project.html\" target=\"_blank\" rel=\"noopener noreferrer\" xlink:title=\"project_1 = Project(columns={&#39;type&#39;: &#39;number&#39;})\">\n",
       "<ellipse fill=\"#b0e2ff\" stroke=\"black\" cx=\"121.82\" cy=\"-94\" rx=\"27\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"121.82\" y=\"-91.2\" font-family=\"Times,serif\" font-size=\"11.00\">Project</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- project_1&#45;&gt;concat_features -->\n",
       "<g id=\"edge3\" class=\"edge\"><title>project_1&#45;&gt;concat_features</title>\n",
       "<path fill=\"none\" stroke=\"black\" d=\"M147.305,-100.427C157.868,-103.2 170.493,-106.513 182.326,-109.619\"/>\n",
       "<polygon fill=\"black\" stroke=\"black\" points=\"181.455,-113.009 192.016,-112.163 183.232,-106.239 181.455,-113.009\"/>\n",
       "</g>\n",
       "<!-- lr -->\n",
       "<g id=\"node5\" class=\"node\"><title>lr</title>\n",
       "<g id=\"a_node5\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.sklearn.logistic_regression.html\" target=\"_blank\" rel=\"noopener noreferrer\" xlink:title=\"lr = LR\">\n",
       "<ellipse fill=\"#7ec0ee\" stroke=\"black\" cx=\"319.108\" cy=\"-120\" rx=\"27\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"319.108\" y=\"-117.2\" font-family=\"Times,serif\" font-size=\"11.00\">LR</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- concat_features&#45;&gt;lr -->\n",
       "<g id=\"edge4\" class=\"edge\"><title>concat_features&#45;&gt;lr</title>\n",
       "<path fill=\"none\" stroke=\"black\" d=\"M256.304,-120C264.464,-120 273.278,-120 281.622,-120\"/>\n",
       "<polygon fill=\"black\" stroke=\"black\" points=\"274.108,-123.5 284.108,-120 274.108,-116.5 274.108,-123.5\"/>\n",
       "</g>\n",
       "<!-- tree -->\n",
       "<g id=\"node6\" class=\"node\"><title>tree</title>\n",
       "<g id=\"a_node6\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.sklearn.decision_tree_classifier.html\" target=\"_blank\" rel=\"noopener noreferrer\" xlink:title=\"tree = Tree\">\n",
       "<ellipse fill=\"#7ec0ee\" stroke=\"black\" cx=\"319.108\" cy=\"-77\" rx=\"27\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"319.108\" y=\"-74.2\" font-family=\"Times,serif\" font-size=\"11.00\">Tree</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- knn -->\n",
       "<g id=\"node7\" class=\"node\"><title>knn</title>\n",
       "<g id=\"a_node7\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.sklearn.k_neighbors_classifier.html\" target=\"_blank\" rel=\"noopener noreferrer\" xlink:title=\"knn = KNN\">\n",
       "<ellipse fill=\"#7ec0ee\" stroke=\"black\" cx=\"319.108\" cy=\"-34\" rx=\"27\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"319.108\" y=\"-31.2\" font-family=\"Times,serif\" font-size=\"11.00\">KNN</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "</g>\n",
       "</svg>\n"
      ],
      "text/plain": [
       "<graphviz.dot.Digraph at 0x7fa968e30be0>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import lale\n",
    "lale.wrap_imported_operators()\n",
    "prep_to_numbers = (\n",
    "    (Project(columns={\"type\": \"string\"}) >> OneHotEncoder(handle_unknown=\"ignore\"))\n",
    "    & Project(columns={\"type\": \"number\"})\n",
    "    ) >> ConcatFeatures\n",
    "planned_orig = prep_to_numbers >> (LR | Tree | KNN)\n",
    "planned_orig.visualize()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The call to `wrap_imported_operators` augments the algorithms\n",
    "that were imported from scikit-learn with metadata about\n",
    "their hyperparameters.\n",
    "The Lale combinator `>>` pipes the output from one operator to\n",
    "the next one, creating a dataflow edge in the pipeline.\n",
    "The Lale combinator `&` enables multiple sub-pipelines to run\n",
    "on the same data.\n",
    "Here, `prep_to_numbers` projects string columns and one-hot encodes them;\n",
    "projects numeric columns and leaves them unmodified; and\n",
    "finally concatenates both sets of columns back together.\n",
    "The Lale combinator `|` indicates\n",
    "algorithmic choice: `(LR | Tree | KNN)` indicates that\n",
    "it is up to the AI Automation to decide which of the three different\n",
    "classifiers to use. Note that the classifiers are\n",
    "not configured\n",
    "with concrete hyperparameters, since those will be left for the\n",
    "AI automation to choose instead.\n",
    "The search space is encapsulated in the object `planned_orig`.\n",
    "\n",
    "We will use hyperopt to select the algorithms and to tune their\n",
    "hyperparameters. Lale provides a `Hyperopt` operator that\n",
    "turns a search space such as the one specified above into an\n",
    "optimization problem for the hyperopt tool. After 10 trials, we get back\n",
    "the model that performed best for the default optimization\n",
    "objective, which is accuracy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "100%|██████████| 10/10 [00:32<00:00,  3.25s/trial, best loss: -0.7492859812086269]\n"
     ]
    },
    {
     "data": {
      "image/svg+xml": [
       "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
       "<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
       " \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
       "<!-- Generated by graphviz version 2.38.0 (20140413.2041)\n",
       " -->\n",
       "<!-- Title: cluster:(root) Pages: 1 -->\n",
       "<svg width=\"354pt\" height=\"107pt\"\n",
       " viewBox=\"0.00 0.00 354.11 107.28\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
       "<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 103.284)\">\n",
       "<title>cluster:(root)</title>\n",
       "<g id=\"a_graph0\"><a xlink:title=\"(root) = ...\">\n",
       "<polygon fill=\"white\" stroke=\"none\" points=\"-4,4 -4,-103.284 350.108,-103.284 350.108,4 -4,4\"/>\n",
       "</a>\n",
       "</g>\n",
       "<!-- project_0 -->\n",
       "<g id=\"node1\" class=\"node\"><title>project_0</title>\n",
       "<g id=\"a_node1\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.lale.project.html\" target=\"_blank\" rel=\"noopener noreferrer\" xlink:title=\"project_0 = Project(columns={&#39;type&#39;: &#39;string&#39;})\">\n",
       "<ellipse fill=\"white\" stroke=\"black\" cx=\"27\" cy=\"-71\" rx=\"27\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"27\" y=\"-68.2\" font-family=\"Times,serif\" font-size=\"11.00\">Project</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- one_hot_encoder -->\n",
       "<g id=\"node2\" class=\"node\"><title>one_hot_encoder</title>\n",
       "<g id=\"a_node2\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.sklearn.one_hot_encoder.html\" target=\"_blank\" rel=\"noopener noreferrer\" xlink:title=\"one_hot_encoder = OneHotEncoder(handle_unknown=&#39;ignore&#39;)\">\n",
       "<ellipse fill=\"white\" stroke=\"black\" cx=\"121.82\" cy=\"-71\" rx=\"31.6406\" ry=\"28.0702\"/>\n",
       "<text text-anchor=\"middle\" x=\"121.82\" y=\"-80.2\" font-family=\"Times,serif\" font-size=\"11.00\">One&#45;</text>\n",
       "<text text-anchor=\"middle\" x=\"121.82\" y=\"-68.2\" font-family=\"Times,serif\" font-size=\"11.00\">Hot&#45;</text>\n",
       "<text text-anchor=\"middle\" x=\"121.82\" y=\"-56.2\" font-family=\"Times,serif\" font-size=\"11.00\">Encoder</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- project_0&#45;&gt;one_hot_encoder -->\n",
       "<g id=\"edge1\" class=\"edge\"><title>project_0&#45;&gt;one_hot_encoder</title>\n",
       "<path fill=\"none\" stroke=\"black\" d=\"M54.3502,-71C62.1996,-71 70.997,-71 79.5579,-71\"/>\n",
       "<polygon fill=\"black\" stroke=\"black\" points=\"79.8033,-74.5001 89.8033,-71 79.8032,-67.5001 79.8033,-74.5001\"/>\n",
       "</g>\n",
       "<!-- concat_features -->\n",
       "<g id=\"node4\" class=\"node\"><title>concat_features</title>\n",
       "<g id=\"a_node4\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.rasl.concat_features.html\" target=\"_blank\" rel=\"noopener noreferrer\" xlink:title=\"concat_features = ConcatFeatures()\">\n",
       "<ellipse fill=\"white\" stroke=\"black\" cx=\"222.874\" cy=\"-44\" rx=\"33.4697\" ry=\"19.6\"/>\n",
       "<text text-anchor=\"middle\" x=\"222.874\" y=\"-47.2\" font-family=\"Times,serif\" font-size=\"11.00\">Concat&#45;</text>\n",
       "<text text-anchor=\"middle\" x=\"222.874\" y=\"-35.2\" font-family=\"Times,serif\" font-size=\"11.00\">Features</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- one_hot_encoder&#45;&gt;concat_features -->\n",
       "<g id=\"edge2\" class=\"edge\"><title>one_hot_encoder&#45;&gt;concat_features</title>\n",
       "<path fill=\"none\" stroke=\"black\" d=\"M152.539,-62.8991C161.942,-60.3361 172.485,-57.4623 182.457,-54.7441\"/>\n",
       "<polygon fill=\"black\" stroke=\"black\" points=\"183.448,-58.1019 192.175,-52.0952 181.607,-51.3483 183.448,-58.1019\"/>\n",
       "</g>\n",
       "<!-- project_1 -->\n",
       "<g id=\"node3\" class=\"node\"><title>project_1</title>\n",
       "<g id=\"a_node3\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.lale.project.html\" target=\"_blank\" rel=\"noopener noreferrer\" xlink:title=\"project_1 = Project(columns={&#39;type&#39;: &#39;number&#39;})\">\n",
       "<ellipse fill=\"white\" stroke=\"black\" cx=\"121.82\" cy=\"-18\" rx=\"27\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"121.82\" y=\"-15.2\" font-family=\"Times,serif\" font-size=\"11.00\">Project</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- project_1&#45;&gt;concat_features -->\n",
       "<g id=\"edge3\" class=\"edge\"><title>project_1&#45;&gt;concat_features</title>\n",
       "<path fill=\"none\" stroke=\"black\" d=\"M147.305,-24.427C157.868,-27.1996 170.493,-30.5135 182.326,-33.6193\"/>\n",
       "<polygon fill=\"black\" stroke=\"black\" points=\"181.455,-37.0092 192.016,-36.1628 183.232,-30.2386 181.455,-37.0092\"/>\n",
       "</g>\n",
       "<!-- lr -->\n",
       "<g id=\"node5\" class=\"node\"><title>lr</title>\n",
       "<g id=\"a_node5\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.sklearn.logistic_regression.html\" target=\"_blank\" rel=\"noopener noreferrer\" xlink:title=\"lr = LR(fit_intercept=False, intercept_scaling=0.3240599822843736, max_iter=839, solver=&#39;newton&#45;cg&#39;, tol=0.009200093064280898)\">\n",
       "<ellipse fill=\"white\" stroke=\"black\" cx=\"319.108\" cy=\"-44\" rx=\"27\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"319.108\" y=\"-41.2\" font-family=\"Times,serif\" font-size=\"11.00\">LR</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- concat_features&#45;&gt;lr -->\n",
       "<g id=\"edge4\" class=\"edge\"><title>concat_features&#45;&gt;lr</title>\n",
       "<path fill=\"none\" stroke=\"black\" d=\"M256.304,-44C264.464,-44 273.278,-44 281.622,-44\"/>\n",
       "<polygon fill=\"black\" stroke=\"black\" points=\"281.856,-47.5001 291.856,-44 281.856,-40.5001 281.856,-47.5001\"/>\n",
       "</g>\n",
       "</g>\n",
       "</svg>\n"
      ],
      "text/plain": [
       "<graphviz.dot.Digraph at 0x7faa0fc0b5c0>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from lale.lib.lale import Hyperopt\n",
    "best_estimator = planned_orig.auto_configure(\n",
    "    train_X, train_y, optimizer=Hyperopt, cv=3, max_evals=10)\n",
    "best_estimator.visualize()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As shown by the visualization, the search found a pipeline\n",
    "with an LR classifier.\n",
    "Inspecting the hyperparameters reveals which values\n",
    "worked best for the 10 trials on the dataset at hand."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "```python\n",
       "project_0 = Project(columns={\"type\": \"string\"})\n",
       "one_hot_encoder = OneHotEncoder(handle_unknown=\"ignore\")\n",
       "project_1 = Project(columns={\"type\": \"number\"})\n",
       "lr = LR(\n",
       "    fit_intercept=False,\n",
       "    intercept_scaling=0.3240599822843736,\n",
       "    max_iter=839,\n",
       "    solver=\"newton-cg\",\n",
       "    tol=0.009200093064280898,\n",
       ")\n",
       "pipeline = (\n",
       "    ((project_0 >> one_hot_encoder) & project_1) >> ConcatFeatures() >> lr\n",
       ")\n",
       "```"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "best_estimator.pretty_print(ipython_display=True, show_imports=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use the accuracy score metric from scikit-learn to measure\n",
    "how well the pipeline accomplishes the objective for which it\n",
    "was trained."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "accuracy 72.7%\n"
     ]
    }
   ],
   "source": [
    "import sklearn.metrics\n",
    "accuracy_scorer = sklearn.metrics.make_scorer(sklearn.metrics.accuracy_score)\n",
    "print(f'accuracy {accuracy_scorer(best_estimator, test_X, test_y):.1%}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However, we would like our model to be not just accurate but also fair.\n",
    "We can use the same `disparate_impact_scorer` from before to evaluate\n",
    "the fairness of `best_estimator`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "disparate impact 0.76\n"
     ]
    }
   ],
   "source": [
    "print(f'disparate impact {disparate_impact_scorer(best_estimator, test_X, test_y):.2f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The model is biased, which is no surprise, since it was trained\n",
    "from biased data. We would prefer a\n",
    "model that is much more fair. The AIF360 toolkit provides several\n",
    "algorithms for mitigating fairness problems. One of them is\n",
    "`DisparateImpactRemover`, which modifies the features that are\n",
    "not the protected attribute in such a way that it is hard to\n",
    "predict the protected attribute from them. We use a Lale version\n",
    "of `DisparateImpactRemover` that wraps the corresponding AIF360\n",
    "algorithm for AI Automation. This algorithm has a hyperparameter\n",
    "`repair_level` that we will tune with hyperparameter optimization."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "```python\n",
       "{\n",
       "    \"description\": \"Repair amount from 0 = none to 1 = full.\",\n",
       "    \"type\": \"number\",\n",
       "    \"minimum\": 0,\n",
       "    \"maximum\": 1,\n",
       "    \"default\": 1,\n",
       "}\n",
       "```"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from lale.lib.aif360 import DisparateImpactRemover\n",
    "lale.pretty_print.ipython_display(\n",
    "    DisparateImpactRemover.hyperparam_schema('repair_level'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We compose the bias mitigation algorithm in a pipeline with\n",
    "a choice of classifiers as before.\n",
    "In the visualization, light blue indicates trainable operators\n",
    "and dark blue indicates that automation must make a choice before\n",
    "the operators can be trained. Compared to the earlier pipeline,\n",
    "we pass the data preparation sub-pipeline as an argument to `DisparateImpactRemover`,\n",
    "since that fairness mitigator needs numerical data to work on."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/svg+xml": [
       "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
       "<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
       " \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
       "<!-- Generated by graphviz version 2.38.0 (20140413.2041)\n",
       " -->\n",
       "<!-- Title: cluster:(root) Pages: 1 -->\n",
       "<svg width=\"394pt\" height=\"229pt\"\n",
       " viewBox=\"0.00 0.00 394.11 229.00\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
       "<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 225)\">\n",
       "<title>cluster:(root)</title>\n",
       "<g id=\"a_graph0\"><a xlink:title=\"(root) = ...\">\n",
       "<polygon fill=\"white\" stroke=\"none\" points=\"-4,4 -4,-225 390.108,-225 390.108,4 -4,4\"/>\n",
       "</a>\n",
       "</g>\n",
       "<g id=\"clust1\" class=\"cluster\"><title>cluster:disparate_impact_remover</title>\n",
       "<g id=\"a_clust1\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.aif360.disparate_impact_remover.html#lale.lib.aif360.disparate_impact_remover.DisparateImpactRemover\" target=\"_blank\" rel=\"noopener noreferrer\" xlink:title=\"disparate_impact_remover = DisparateImpactRemover(favorable_labels=[&#39;good&#39;], protected_attributes=[{&#39;reference_group&#39;: [&#39;male div/sep&#39;, &#39;male mar/wid&#39;, &#39;male single&#39;], &#39;feature&#39;: &#39;personal_status&#39;}, {&#39;feature&#39;: &#39;age&#39;, &#39;reference_group&#39;: [[26, 1000]]}], preparation=pipeline_0)\">\n",
       "<polygon fill=\"#b0e2ff\" stroke=\"black\" points=\"8,-59 8,-213 296.108,-213 296.108,-59 8,-59\"/>\n",
       "<text text-anchor=\"middle\" x=\"152.054\" y=\"-197.8\" font-family=\"Times,serif\" font-size=\"14.00\">DisparateImpactRemover</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<g id=\"clust2\" class=\"cluster\"><title>cluster:pipeline_0</title>\n",
       "<g id=\"a_clust2\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.aif360.disparate_impact_remover.html#lale.lib.aif360.disparate_impact_remover.DisparateImpactRemover\" target=\"_blank\" rel=\"noopener noreferrer\" xlink:title=\"pipeline_0 = ...\">\n",
       "<path fill=\"#b0e2ff\" stroke=\"black\" d=\"M28,-67C28,-67 276.108,-67 276.108,-67 282.108,-67 288.108,-73 288.108,-79 288.108,-79 288.108,-170 288.108,-170 288.108,-176 282.108,-182 276.108,-182 276.108,-182 28,-182 28,-182 22,-182 16,-176 16,-170 16,-170 16,-79 16,-79 16,-73 22,-67 28,-67\"/>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<g id=\"clust3\" class=\"cluster\"><title>cluster:choice</title>\n",
       "<g id=\"a_clust3\"><a xlink:title=\"choice = lr | tree | knn\">\n",
       "<polygon fill=\"#7ec0ee\" stroke=\"black\" points=\"308.108,-8 308.108,-169 378.108,-169 378.108,-8 308.108,-8\"/>\n",
       "<text text-anchor=\"middle\" x=\"343.108\" y=\"-153.8\" font-family=\"Times,serif\" font-size=\"14.00\">Choice</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- project -->\n",
       "<g id=\"node1\" class=\"node\"><title>project</title>\n",
       "<g id=\"a_node1\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.lale.project.html\" target=\"_blank\" rel=\"noopener noreferrer\" xlink:title=\"project = Project(columns={&#39;type&#39;: &#39;string&#39;})\">\n",
       "<ellipse fill=\"#b0e2ff\" stroke=\"black\" cx=\"51\" cy=\"-146\" rx=\"27\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"51\" y=\"-143.2\" font-family=\"Times,serif\" font-size=\"11.00\">Project</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- one_hot_encoder -->\n",
       "<g id=\"node2\" class=\"node\"><title>one_hot_encoder</title>\n",
       "<g id=\"a_node2\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.sklearn.one_hot_encoder.html\" target=\"_blank\" rel=\"noopener noreferrer\" xlink:title=\"one_hot_encoder = OneHotEncoder(handle_unknown=&#39;ignore&#39;)\">\n",
       "<ellipse fill=\"#b0e2ff\" stroke=\"black\" cx=\"145.82\" cy=\"-146\" rx=\"31.6406\" ry=\"28.0702\"/>\n",
       "<text text-anchor=\"middle\" x=\"145.82\" y=\"-155.2\" font-family=\"Times,serif\" font-size=\"11.00\">One&#45;</text>\n",
       "<text text-anchor=\"middle\" x=\"145.82\" y=\"-143.2\" font-family=\"Times,serif\" font-size=\"11.00\">Hot&#45;</text>\n",
       "<text text-anchor=\"middle\" x=\"145.82\" y=\"-131.2\" font-family=\"Times,serif\" font-size=\"11.00\">Encoder</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- project&#45;&gt;one_hot_encoder -->\n",
       "<g id=\"edge1\" class=\"edge\"><title>project&#45;&gt;one_hot_encoder</title>\n",
       "<path fill=\"none\" stroke=\"black\" d=\"M78.3502,-146C86.1996,-146 94.997,-146 103.558,-146\"/>\n",
       "<polygon fill=\"black\" stroke=\"black\" points=\"103.803,-149.5 113.803,-146 103.803,-142.5 103.803,-149.5\"/>\n",
       "</g>\n",
       "<!-- concat_features -->\n",
       "<g id=\"node4\" class=\"node\"><title>concat_features</title>\n",
       "<g id=\"a_node4\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.rasl.concat_features.html\" target=\"_blank\" rel=\"noopener noreferrer\" xlink:title=\"concat_features = ConcatFeatures\">\n",
       "<ellipse fill=\"white\" stroke=\"black\" cx=\"246.874\" cy=\"-120\" rx=\"33.4697\" ry=\"19.6\"/>\n",
       "<text text-anchor=\"middle\" x=\"246.874\" y=\"-123.2\" font-family=\"Times,serif\" font-size=\"11.00\">Concat&#45;</text>\n",
       "<text text-anchor=\"middle\" x=\"246.874\" y=\"-111.2\" font-family=\"Times,serif\" font-size=\"11.00\">Features</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- one_hot_encoder&#45;&gt;concat_features -->\n",
       "<g id=\"edge2\" class=\"edge\"><title>one_hot_encoder&#45;&gt;concat_features</title>\n",
       "<path fill=\"none\" stroke=\"black\" d=\"M176.539,-138.199C185.942,-135.731 196.485,-132.964 206.457,-130.346\"/>\n",
       "<polygon fill=\"black\" stroke=\"black\" points=\"207.391,-133.72 216.175,-127.795 205.614,-126.949 207.391,-133.72\"/>\n",
       "</g>\n",
       "<!-- project_0 -->\n",
       "<g id=\"node3\" class=\"node\"><title>project_0</title>\n",
       "<g id=\"a_node3\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.lale.project.html\" target=\"_blank\" rel=\"noopener noreferrer\" xlink:title=\"project_0 = Project(columns={&#39;type&#39;: &#39;number&#39;})\">\n",
       "<ellipse fill=\"#b0e2ff\" stroke=\"black\" cx=\"145.82\" cy=\"-93\" rx=\"27\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"145.82\" y=\"-90.2\" font-family=\"Times,serif\" font-size=\"11.00\">Project</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- project_0&#45;&gt;concat_features -->\n",
       "<g id=\"edge3\" class=\"edge\"><title>project_0&#45;&gt;concat_features</title>\n",
       "<path fill=\"none\" stroke=\"black\" d=\"M171.051,-99.6049C181.701,-102.508 194.483,-105.992 206.445,-109.253\"/>\n",
       "<polygon fill=\"black\" stroke=\"black\" points=\"205.666,-112.668 216.235,-111.921 207.507,-105.914 205.666,-112.668\"/>\n",
       "</g>\n",
       "<!-- lr -->\n",
       "<g id=\"node5\" class=\"node\"><title>lr</title>\n",
       "<g id=\"a_node5\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.sklearn.logistic_regression.html\" target=\"_blank\" rel=\"noopener noreferrer\" xlink:title=\"lr = LR\">\n",
       "<ellipse fill=\"#7ec0ee\" stroke=\"black\" cx=\"343.108\" cy=\"-120\" rx=\"27\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"343.108\" y=\"-117.2\" font-family=\"Times,serif\" font-size=\"11.00\">LR</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- concat_features&#45;&gt;lr -->\n",
       "<g id=\"edge4\" class=\"edge\"><title>concat_features&#45;&gt;lr</title>\n",
       "<path fill=\"none\" stroke=\"black\" d=\"M296.096,-120C299.298,-120 302.493,-120 305.622,-120\"/>\n",
       "<polygon fill=\"black\" stroke=\"black\" points=\"298.108,-123.5 308.108,-120 298.108,-116.5 298.108,-123.5\"/>\n",
       "</g>\n",
       "<!-- tree -->\n",
       "<g id=\"node6\" class=\"node\"><title>tree</title>\n",
       "<g id=\"a_node6\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.sklearn.decision_tree_classifier.html\" target=\"_blank\" rel=\"noopener noreferrer\" xlink:title=\"tree = Tree\">\n",
       "<ellipse fill=\"#7ec0ee\" stroke=\"black\" cx=\"343.108\" cy=\"-77\" rx=\"27\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"343.108\" y=\"-74.2\" font-family=\"Times,serif\" font-size=\"11.00\">Tree</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- knn -->\n",
       "<g id=\"node7\" class=\"node\"><title>knn</title>\n",
       "<g id=\"a_node7\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.sklearn.k_neighbors_classifier.html\" target=\"_blank\" rel=\"noopener noreferrer\" xlink:title=\"knn = KNN\">\n",
       "<ellipse fill=\"#7ec0ee\" stroke=\"black\" cx=\"343.108\" cy=\"-34\" rx=\"27\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"343.108\" y=\"-31.2\" font-family=\"Times,serif\" font-size=\"11.00\">KNN</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "</g>\n",
       "</svg>\n"
      ],
      "text/plain": [
       "<graphviz.dot.Digraph at 0x7faa0fb8b198>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "di_remover = DisparateImpactRemover(\n",
    "    **fairness_info, preparation=prep_to_numbers)\n",
    "planned_fairer = di_remover >> (LR | Tree | KNN)\n",
    "planned_fairer.visualize()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Besides changing the planned pipeline to use a fairness mitigation\n",
    "operator, we should also change the optimization objective. We need\n",
    "a scoring function that blends accuracy with disparate impact.\n",
    "While you could define this scorer yourself, Lale also provides a\n",
    "pre-defined version."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "from lale.lib.aif360 import accuracy_and_disparate_impact\n",
    "combined_scorer = accuracy_and_disparate_impact(**fairness_info)"
   ]
  },
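  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For intuition, a hand-written blend might look like the sketch below.\n",
    "It is only an illustration under stated assumptions: the helper names\n",
    "`disparate_impact` and `blended_score` are hypothetical, and the equal-weight\n",
    "average of accuracy and symmetric disparate impact is an assumed formula,\n",
    "not necessarily the one Lale's `accuracy_and_disparate_impact` uses.\n",
    "\n",
    "```python\n",
    "def disparate_impact(y_pred, is_privileged, favorable='good'):\n",
    "    '''Favorable-outcome rate of the unprivileged group divided by\n",
    "    that of the privileged group (both groups must be non-empty).'''\n",
    "    priv = [y for y, p in zip(y_pred, is_privileged) if p]\n",
    "    unpriv = [y for y, p in zip(y_pred, is_privileged) if not p]\n",
    "    rate_priv = sum(y == favorable for y in priv) / len(priv)\n",
    "    rate_unpriv = sum(y == favorable for y in unpriv) / len(unpriv)\n",
    "    return rate_unpriv / rate_priv\n",
    "\n",
    "\n",
    "def blended_score(y_true, y_pred, is_privileged, favorable='good'):\n",
    "    '''Assumed blend: mean of accuracy and symmetric disparate impact.'''\n",
    "    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)\n",
    "    di = disparate_impact(y_pred, is_privileged, favorable)\n",
    "    # Penalize bias in either direction equally (assumes di > 0).\n",
    "    symmetric_di = min(di, 1 / di)\n",
    "    return 0.5 * (accuracy + symmetric_di)\n",
    "\n",
    "\n",
    "# Tiny made-up example: the first four rows are the privileged group.\n",
    "y_true = ['good', 'good', 'good', 'bad', 'good', 'bad', 'bad', 'bad']\n",
    "y_pred = ['good', 'good', 'good', 'good', 'good', 'good', 'bad', 'bad']\n",
    "is_privileged = [True, True, True, True, False, False, False, False]\n",
    "print(disparate_impact(y_pred, is_privileged))\n",
    "print(blended_score(y_true, y_pred, is_privileged))\n",
    "```\n",
    "\n",
    "In this made-up data, every privileged row but only half of the unprivileged\n",
    "rows get the favorable prediction, so a fairly accurate classifier still\n",
    "receives a noticeably lower blended score."
   ]
  },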
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Fairness metrics can be more unstable than accuracy, because they depend\n",
    "not just on the distribution of labels, but also on the distribution of\n",
    "privileged and unprivileged groups as defined by the protected attributes.\n",
    "In AI Automation, k-fold cross validation helps reduce overfitting.\n",
    "To get more stable results, we will stratify these k folds by both labels\n",
    "and groups."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "from lale.lib.aif360 import FairStratifiedKFold\n",
    "fair_cv = FairStratifiedKFold(**fairness_info, n_splits=3)"
   ]
  },
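  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One way to picture stratifying by both labels and groups is to treat each\n",
    "(label, group) pair as its own stratum and to deal the rows of each stratum\n",
    "evenly across the folds. The sketch below is a simplified illustration with a\n",
    "hypothetical helper `stratified_fold_ids`; it does not shuffle and is not\n",
    "the implementation of `FairStratifiedKFold`.\n",
    "\n",
    "```python\n",
    "from collections import defaultdict\n",
    "\n",
    "\n",
    "def stratified_fold_ids(labels, groups, n_splits=3):\n",
    "    '''Return one fold id per row so that each (label, group)\n",
    "    stratum is spread as evenly as possible over the folds.'''\n",
    "    rows_by_stratum = defaultdict(list)\n",
    "    for row, key in enumerate(zip(labels, groups)):\n",
    "        rows_by_stratum[key].append(row)\n",
    "    fold_ids = [None] * len(labels)\n",
    "    for rows in rows_by_stratum.values():\n",
    "        for position, row in enumerate(rows):\n",
    "            fold_ids[row] = position % n_splits  # deal rows round-robin\n",
    "    return fold_ids\n",
    "\n",
    "\n",
    "# Made-up data: 12 rows, alternating labels, two groups of six rows each.\n",
    "labels = ['good', 'bad'] * 6\n",
    "groups = ['privileged'] * 6 + ['unprivileged'] * 6\n",
    "fold_ids = stratified_fold_ids(labels, groups, n_splits=3)\n",
    "print(fold_ids)\n",
    "```\n",
    "\n",
    "Each of the three folds then holds one row from every (label, group)\n",
    "combination, so a fairness metric computed on any fold sees all groups."
   ]
  },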
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we have all the pieces in place to use AI Automation\n",
    "on our `planned_fairer` pipeline for both accuracy and\n",
    "disparate impact."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "100%|██████████| 10/10 [01:23<00:00,  8.39s/trial, best loss: 0.15000066730728168]\n"
     ]
    }
   ],
   "source": [
    "trained_fairer = planned_fairer.auto_configure(\n",
    "    train_X, train_y, optimizer=Hyperopt, cv=fair_cv,\n",
    "    max_evals=10, scoring=combined_scorer, best_score=1.0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As with any trained model, we can evaluate and visualize the result."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "accuracy 71.2%\n",
      "disparate impact 1.00\n"
     ]
    },
    {
     "data": {
      "image/svg+xml": [
       "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
       "<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
       " \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
       "<!-- Generated by graphviz version 2.38.0 (20140413.2041)\n",
       " -->\n",
       "<!-- Title: cluster:(root) Pages: 1 -->\n",
       "<svg width=\"378pt\" height=\"178pt\"\n",
       " viewBox=\"0.00 0.00 378.11 178.00\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
       "<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 174)\">\n",
       "<title>cluster:(root)</title>\n",
       "<g id=\"a_graph0\"><a xlink:title=\"(root) = ...\">\n",
       "<polygon fill=\"white\" stroke=\"none\" points=\"-4,4 -4,-174 374.108,-174 374.108,4 -4,4\"/>\n",
       "</a>\n",
       "</g>\n",
       "<g id=\"clust1\" class=\"cluster\"><title>cluster:disparate_impact_remover</title>\n",
       "<g id=\"a_clust1\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.aif360.disparate_impact_remover.html#lale.lib.aif360.disparate_impact_remover.DisparateImpactRemover\" target=\"_blank\" rel=\"noopener noreferrer\" xlink:title=\"disparate_impact_remover = DisparateImpactRemover(favorable_labels=[&#39;good&#39;], protected_attributes=[{&#39;reference_group&#39;: [&#39;male div/sep&#39;, &#39;male mar/wid&#39;, &#39;male single&#39;], &#39;name&#39;: &#39;lale.lib.aif360.disparate_impact_remover.DisparateImpactRemover&#39;, &#39;feature&#39;: &#39;personal_status&#39;}, {&#39;feature&#39;: &#39;age&#39;, &#39;referenc...)\">\n",
       "<polygon fill=\"white\" stroke=\"black\" points=\"8,-8 8,-162 296.108,-162 296.108,-8 8,-8\"/>\n",
       "<text text-anchor=\"middle\" x=\"152.054\" y=\"-146.8\" font-family=\"Times,serif\" font-size=\"14.00\">DisparateImpactRemover</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<g id=\"clust2\" class=\"cluster\"><title>cluster:pipeline_0</title>\n",
       "<g id=\"a_clust2\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.aif360.disparate_impact_remover.html#lale.lib.aif360.disparate_impact_remover.DisparateImpactRemover\" target=\"_blank\" rel=\"noopener noreferrer\" xlink:title=\"pipeline_0 = ...\">\n",
       "<path fill=\"#b0e2ff\" stroke=\"black\" d=\"M28,-16C28,-16 276.108,-16 276.108,-16 282.108,-16 288.108,-22 288.108,-28 288.108,-28 288.108,-119 288.108,-119 288.108,-125 282.108,-131 276.108,-131 276.108,-131 28,-131 28,-131 22,-131 16,-125 16,-119 16,-119 16,-28 16,-28 16,-22 22,-16 28,-16\"/>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- project -->\n",
       "<g id=\"node1\" class=\"node\"><title>project</title>\n",
       "<g id=\"a_node1\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.lale.project.html\" target=\"_blank\" rel=\"noopener noreferrer\" xlink:title=\"project = Project(columns={&#39;type&#39;: &#39;string&#39;})\">\n",
       "<ellipse fill=\"#b0e2ff\" stroke=\"black\" cx=\"51\" cy=\"-95\" rx=\"27\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"51\" y=\"-92.2\" font-family=\"Times,serif\" font-size=\"11.00\">Project</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- one_hot_encoder -->\n",
       "<g id=\"node2\" class=\"node\"><title>one_hot_encoder</title>\n",
       "<g id=\"a_node2\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.sklearn.one_hot_encoder.html\" target=\"_blank\" rel=\"noopener noreferrer\" xlink:title=\"one_hot_encoder = OneHotEncoder(handle_unknown=&#39;ignore&#39;)\">\n",
       "<ellipse fill=\"#b0e2ff\" stroke=\"black\" cx=\"145.82\" cy=\"-95\" rx=\"31.6406\" ry=\"28.0702\"/>\n",
       "<text text-anchor=\"middle\" x=\"145.82\" y=\"-104.2\" font-family=\"Times,serif\" font-size=\"11.00\">One&#45;</text>\n",
       "<text text-anchor=\"middle\" x=\"145.82\" y=\"-92.2\" font-family=\"Times,serif\" font-size=\"11.00\">Hot&#45;</text>\n",
       "<text text-anchor=\"middle\" x=\"145.82\" y=\"-80.2\" font-family=\"Times,serif\" font-size=\"11.00\">Encoder</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- project&#45;&gt;one_hot_encoder -->\n",
       "<g id=\"edge1\" class=\"edge\"><title>project&#45;&gt;one_hot_encoder</title>\n",
       "<path fill=\"none\" stroke=\"black\" d=\"M78.3502,-95C86.1996,-95 94.997,-95 103.558,-95\"/>\n",
       "<polygon fill=\"black\" stroke=\"black\" points=\"103.803,-98.5001 113.803,-95 103.803,-91.5001 103.803,-98.5001\"/>\n",
       "</g>\n",
       "<!-- concat_features -->\n",
       "<g id=\"node4\" class=\"node\"><title>concat_features</title>\n",
       "<g id=\"a_node4\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.rasl.concat_features.html\" target=\"_blank\" rel=\"noopener noreferrer\" xlink:title=\"concat_features = ConcatFeatures()\">\n",
       "<ellipse fill=\"white\" stroke=\"black\" cx=\"246.874\" cy=\"-69\" rx=\"33.4697\" ry=\"19.6\"/>\n",
       "<text text-anchor=\"middle\" x=\"246.874\" y=\"-72.2\" font-family=\"Times,serif\" font-size=\"11.00\">Concat&#45;</text>\n",
       "<text text-anchor=\"middle\" x=\"246.874\" y=\"-60.2\" font-family=\"Times,serif\" font-size=\"11.00\">Features</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- one_hot_encoder&#45;&gt;concat_features -->\n",
       "<g id=\"edge2\" class=\"edge\"><title>one_hot_encoder&#45;&gt;concat_features</title>\n",
       "<path fill=\"none\" stroke=\"black\" d=\"M176.539,-87.1991C185.942,-84.7311 196.485,-81.9637 206.457,-79.3462\"/>\n",
       "<polygon fill=\"black\" stroke=\"black\" points=\"207.391,-82.7196 216.175,-76.7954 205.614,-75.9489 207.391,-82.7196\"/>\n",
       "</g>\n",
       "<!-- project_0 -->\n",
       "<g id=\"node3\" class=\"node\"><title>project_0</title>\n",
       "<g id=\"a_node3\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.lale.project.html\" target=\"_blank\" rel=\"noopener noreferrer\" xlink:title=\"project_0 = Project(columns={&#39;type&#39;: &#39;number&#39;})\">\n",
       "<ellipse fill=\"#b0e2ff\" stroke=\"black\" cx=\"145.82\" cy=\"-42\" rx=\"27\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"145.82\" y=\"-39.2\" font-family=\"Times,serif\" font-size=\"11.00\">Project</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- project_0&#45;&gt;concat_features -->\n",
       "<g id=\"edge3\" class=\"edge\"><title>project_0&#45;&gt;concat_features</title>\n",
       "<path fill=\"none\" stroke=\"black\" d=\"M171.051,-48.6049C181.701,-51.5078 194.483,-54.992 206.445,-58.2526\"/>\n",
       "<polygon fill=\"black\" stroke=\"black\" points=\"205.666,-61.668 216.235,-60.9211 207.507,-54.9144 205.666,-61.668\"/>\n",
       "</g>\n",
       "<!-- knn -->\n",
       "<g id=\"node5\" class=\"node\"><title>knn</title>\n",
       "<g id=\"a_node5\"><a xlink:href=\"https://lale.readthedocs.io/en/latest/modules/lale.lib.sklearn.k_neighbors_classifier.html\" target=\"_blank\" rel=\"noopener noreferrer\" xlink:title=\"knn = KNN(algorithm=&#39;ball_tree&#39;, metric=&#39;manhattan&#39;, n_neighbors=93)\">\n",
       "<ellipse fill=\"white\" stroke=\"black\" cx=\"343.108\" cy=\"-69\" rx=\"27\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"343.108\" y=\"-66.2\" font-family=\"Times,serif\" font-size=\"11.00\">KNN</text>\n",
       "</a>\n",
       "</g>\n",
       "</g>\n",
       "<!-- concat_features&#45;&gt;knn -->\n",
       "<g id=\"edge4\" class=\"edge\"><title>concat_features&#45;&gt;knn</title>\n",
       "<path fill=\"none\" stroke=\"black\" d=\"M296.096,-69C299.298,-69 302.493,-69 305.622,-69\"/>\n",
       "<polygon fill=\"black\" stroke=\"black\" points=\"305.856,-72.5001 315.856,-69 305.856,-65.5001 305.856,-72.5001\"/>\n",
       "</g>\n",
       "</g>\n",
       "</svg>\n"
      ],
      "text/plain": [
       "<graphviz.dot.Digraph at 0x7faa0f9d8198>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "print(f'accuracy {accuracy_scorer(trained_fairer, test_X, test_y):.1%}')\n",
    "print(f'disparate impact {disparate_impact_scorer(trained_fairer, test_X, test_y):.2f}')\n",
    "trained_fairer.visualize()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As the result demonstrates, the best model found by AI Automation\n",
    "has similar accuracy and better disparate impact than the one we saw\n",
    "before. Also, it has tuned the repair level and\n",
    "has picked and tuned a classifier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "```python\n",
       "project = Project(columns={\"type\": \"string\"})\n",
       "one_hot_encoder = OneHotEncoder(handle_unknown=\"ignore\")\n",
       "project_0 = Project(columns={\"type\": \"number\"})\n",
       "disparate_impact_remover = DisparateImpactRemover(\n",
       "    favorable_labels=[\"good\"],\n",
       "    protected_attributes=[\n",
       "        {\n",
       "            \"reference_group\": [\n",
       "                \"male div/sep\", \"male mar/wid\", \"male single\",\n",
       "            ],\n",
       "            \"name\": \"lale.lib.aif360.disparate_impact_remover.DisparateImpactRemover\",\n",
       "            \"feature\": \"personal_status\",\n",
       "        },\n",
       "        {\n",
       "            \"feature\": \"age\",\n",
       "            \"reference_group\": [[26, 1000]],\n",
       "            \"name\": \"lale.lib.aif360.disparate_impact_remover.DisparateImpactRemover\",\n",
       "        },\n",
       "    ],\n",
       "    preparation=((project >> one_hot_encoder) & project_0)\n",
       "    >> ConcatFeatures(),\n",
       "    repair_level=0.6701479482689345,\n",
       ")\n",
       "knn = KNN(algorithm=\"ball_tree\", metric=\"manhattan\", n_neighbors=93)\n",
       "pipeline = disparate_impact_remover >> knn\n",
       "```"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "trained_fairer.pretty_print(ipython_display=True, show_imports=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These results may vary by dataset and search space.\n",
    "\n",
    "In summary, this blog post showed you how to use AI Automation\n",
    "from Lale, while incorporating a fairness mitigation technique\n",
    "into the pipeline and a fairness metric into the objective.\n",
    "Of course, this blog post only scratches the surface of what can\n",
    "be done with AI Automation and AI Fairness. We encourage you to\n",
    "check out the open-source projects Lale and AIF360 and use them\n",
    "to build your own fair and accurate models!\n",
    "\n",
    "- Lale: https://github.com/IBM/lale\n",
    "- AIF360: https://aif360.mybluemix.net/\n",
    "- API documentation: [lale.lib.aif360](https://lale.readthedocs.io/en/latest/modules/lale.lib.aif360.html#module-lale.lib.aif360)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}