{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Bias scan using Multi-Dimensional Subset Scan (MDSS)\n",
    "\n",
    "\"Identifying Significant Predictive Bias in Classifiers\" https://arxiv.org/abs/1611.08292\n",
    "\n",
    "The goal of bias scan is to identify a subgroup(s) that has significantly more predictive bias than would be expected from an unbiased classifier. There are $\\prod_{m=1}^{M}\\left(2^{|X_{m}|}-1\\right)$ unique subgroups from a dataset with $M$ features, with each feature having $|X_{m}|$ discretized values, where a subgroup is any $M$-dimension\n",
    "Cartesian set product, between subsets of feature-values from each feature --- excluding the empty set. Bias scan mitigates this computational hurdle by approximately identifing the most statistically biased subgroup in linear time (rather than exponential).\n",
    "\n",
    "\n",
    "We define the statistical measure of predictive bias function, $score_{bias}(S)$ as a likelihood ratio score and a function of a given subgroup $S$. The null hypothesis is that the given prediction's odds are correct for all subgroups in\n",
    "\n",
    "$\\mathcal{D}$: $H_{0}:odds(y_{i})=\\frac{\\hat{p}_{i}}{1-\\hat{p}_{i}}\\ \\forall i\\in\\mathcal{D}$.\n",
    "\n",
    "The alternative hypothesis assumes some constant multiplicative bias in the odds for some given subgroup $S$:\n",
    "\n",
    "\n",
    "$H_{1}:\\ odds(y_{i})=q\\frac{\\hat{p}_{i}}{1-\\hat{p}_{i}},\\ \\text{where}\\ q>1\\ \\forall i\\in S\\ \\mbox{and}\\ q=1\\ \\forall i\\notin S.$\n",
    "\n",
    "In the classification setting, each observation's likelihood is Bernoulli distributed and assumed independent. This results in the following scoring function for a subgroup $S$\n",
    "\n",
    "\\begin{align*}\n",
    "score_{bias}(S)= & \\max_{q}\\log\\prod_{i\\in S}\\frac{Bernoulli(\\frac{q\\hat{p}_{i}}{1-\\hat{p}_{i}+q\\hat{p}_{i}})}{Bernoulli(\\hat{p}_{i})}\\\\\n",
    "= & \\max_{q}\\log(q)\\sum_{i\\in S}y_{i}-\\sum_{i\\in S}\\log(1-\\hat{p}_{i}+q\\hat{p}_{i}).\n",
    "\\end{align*}\n",
    "Our bias scan is thus represented as: $S^{*}=FSS(\\mathcal{D},\\mathcal{E},F_{score})=MDSS(\\mathcal{D},\\hat{p},score_{bias})$.\n",
    "\n",
    "where $S^{*}$ is the detected most anomalous subgroup, $FSS$ is one of several subset scan algorithms for different problem settings, $\\mathcal{D}$ is a dataset with outcomes $Y$ and discretized features $\\mathcal{X}$, $\\mathcal{E}$ are a set of expectations or 'normal' values for $Y$, and $F_{score}$ is an expectation-based scoring statistic that measures the amount of anomalousness between subgroup observations and their expectations.\n",
    "\n",
    "Predictive bias emphasizes comparable predictions for a subgroup and its observations and Bias scan provides a more general method that can detect and characterize such bias, or poor classifier fit, in the larger space of all possible subgroups, without a priori specification."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Usage\n",
    "\n",
    "MDScan currently supports three scoring functions. These scoring functions usage are described below:\n",
    "- *BerkJones*: Non-parametric scoring function. To be used for all of the four types of outcomes supported - binary, continuous, nominal, ordinal.\n",
    "- *Bernoulli*: Parametric scoring function. To used for two of the four types of outcomes supported - binary and nominal.\n",
    "- *Guassian*: Parametric scoring function. To used for one of the four types of outcomes supported - continuous.\n",
    "- *Poisson*: Parametric scoring function. To be used for three of the four types of outcomes supported - binary, continuous, and ordinal.\n",
    "\n",
    "Note, non-parametric scoring functions can only be used for datasets where the expectations are constant or none.\n",
    "\n",
    "The type of outcomes must be provided using the mode keyword argument. The definition for the four types of outcomes supported are provided below:\n",
    "- Binary: Yes/no outcomes. Outcomes must 0 or 1.\n",
    "- Continuous: Continuous outcomes. Outcomes could be any real number.\n",
    "- Nominal: Multiclass outcomes with no rank or order between them. Outcomes must be a finite set of integers with dimensionality <= 10.\n",
    "- Ordinal: Multiclass outcomes that are ranked in a specific order. Outcomes must be positive integers.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from aif360.detectors.mdss_detector import bias_scan\n",
    "from aif360.algorithms.preprocessing.optim_preproc_helpers.data_preproc_functions import load_preproc_data_compas\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll demonstrate finding the most anomalous subset with bias scan using the compas dataset. We can specify subgroups to be scored or scan for the most anomalous subgroup. Bias scan allows us to decide if we aim to identify bias as `higher` than expected probabilities or `lower` than expected probabilities."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Compas Dataset\n",
    "This is a binary classification use case where the favorable label is 0 and the scoring function is the default bernoulli."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.random.seed(0)\n",
    "\n",
    "dataset_orig = load_preproc_data_compas()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dataset has the categorical features one-hot encoded so we'll modify the dataset to convert them back \n",
    "to the categorical featues because scanning one-hot encoded features may find subgroups that are not meaningful eg. a subgroup with 2 race values. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset_orig_df = pd.DataFrame(dataset_orig.features, columns=dataset_orig.feature_names)\n",
    "\n",
    "age_cat = np.argmax(dataset_orig_df[['age_cat=Less than 25', 'age_cat=25 to 45', \n",
    "                                     'age_cat=Greater than 45']].values, axis=1).reshape(-1, 1)\n",
    "priors_count = np.argmax(dataset_orig_df[['priors_count=0', 'priors_count=1 to 3', \n",
    "                                          'priors_count=More than 3']].values, axis=1).reshape(-1, 1)\n",
    "c_charge_degree = np.argmax(dataset_orig_df[['c_charge_degree=F', 'c_charge_degree=M']].values, axis=1).reshape(-1, 1)\n",
    "\n",
    "features = np.concatenate((dataset_orig_df[['sex', 'race']].values, age_cat, priors_count, \\\n",
    "                           c_charge_degree, dataset_orig.labels), axis=1)\n",
    "feature_names = ['sex', 'race', 'age_cat', 'priors_count', 'c_charge_degree']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame(features, columns=feature_names + ['two_year_recid'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sex</th>\n",
       "      <th>race</th>\n",
       "      <th>age_cat</th>\n",
       "      <th>priors_count</th>\n",
       "      <th>c_charge_degree</th>\n",
       "      <th>two_year_recid</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   sex  race  age_cat  priors_count  c_charge_degree  two_year_recid\n",
       "0  0.0   0.0      1.0           0.0              0.0             1.0\n",
       "1  0.0   0.0      0.0           2.0              0.0             1.0\n",
       "2  0.0   1.0      1.0           2.0              0.0             1.0\n",
       "3  1.0   1.0      1.0           0.0              1.0             0.0\n",
       "4  0.0   1.0      1.0           0.0              0.0             0.0"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### training\n",
    "We'll train a simple classifier to predict the probability of the outcome"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LogisticRegression()"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "X = df.drop('two_year_recid', axis = 1)\n",
    "y = df['two_year_recid']\n",
    "clf = LogisticRegression(solver='lbfgs', C=1.0, penalty='l2')\n",
    "clf.fit(X, y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the probability scores we use are the probabilities of the favorable label, which is 0 in this case."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "probs = pd.Series(clf.predict_proba(X)[:,0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### bias scan\n",
    "We can scan for a privileged and unprivileged subset using bias scan"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "privileged_subset = bias_scan(data=X,observations=y,expectations=probs,favorable_value=0, overpredicted=True)\n",
    "unprivileged_subset = bias_scan(data=X,observations=y,expectations=probs,favorable_value=0,overpredicted=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "({'age_cat': [1.0], 'priors_count': [0.0, 1.0, 2.0], 'sex': [1.0], 'race': [1.0], 'c_charge_degree': [0.0]}, 7.9086)\n",
      "({'race': [0.0], 'age_cat': [1.0, 2.0], 'priors_count': [1.0], 'c_charge_degree': [0.0, 1.0]}, 7.0227)\n"
     ]
    }
   ],
   "source": [
    "print(privileged_subset)\n",
    "print(unprivileged_subset)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "dff = X.copy()\n",
    "dff['observed'] = y \n",
    "dff['probabilities'] = 1 - probs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "to_choose = dff[privileged_subset[0].keys()].isin(privileged_subset[0]).all(axis=1)\n",
    "temp_df = dff.loc[to_choose]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Our detected priviledged group has a size of 147, we observe 0.5374149659863946 as the average risk of recidivism, but our model predicts 0.3827815971689547'"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "\"Our detected priviledged group has a size of {}, we observe {} as the average risk of recidivism, but our model predicts {}\"\\\n",
    ".format(len(temp_df), temp_df['observed'].mean(), temp_df['probabilities'].mean())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "to_choose = dff[unprivileged_subset[0].keys()].isin(unprivileged_subset[0]).all(axis=1)\n",
    "temp_df = dff.loc[to_choose]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Our detected priviledged group has a size of 732, we observe 0.3770491803278688 as the average risk of recidivism, but our model predicts 0.44470388217799317'"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "\"Our detected priviledged group has a size of {}, we observe {} as the average risk of recidivism, but our model predicts {}\"\\\n",
    ".format(len(temp_df), temp_df['observed'].mean(), temp_df['probabilities'].mean())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Adult Dataset\n",
    "This is a binary classification use case where the favorable label is 1 and the scoring function is the berk jones."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>workclass</th>\n",
       "      <th>education</th>\n",
       "      <th>marital_status</th>\n",
       "      <th>occupation</th>\n",
       "      <th>relationship</th>\n",
       "      <th>race</th>\n",
       "      <th>sex</th>\n",
       "      <th>native_country</th>\n",
       "      <th>age_bin</th>\n",
       "      <th>education_num_bin</th>\n",
       "      <th>hours_per_week_bin</th>\n",
       "      <th>capital_gain_bin</th>\n",
       "      <th>capital_loss_bin</th>\n",
       "      <th>observed</th>\n",
       "      <th>expectation</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Private</td>\n",
       "      <td>11th</td>\n",
       "      <td>Never-married</td>\n",
       "      <td>Machine-op-inspct</td>\n",
       "      <td>Own-child</td>\n",
       "      <td>Black</td>\n",
       "      <td>Male</td>\n",
       "      <td>United-States</td>\n",
       "      <td>17-27</td>\n",
       "      <td>1-8</td>\n",
       "      <td>40-44</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.236226</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Private</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Farming-fishing</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>United-States</td>\n",
       "      <td>37-47</td>\n",
       "      <td>9</td>\n",
       "      <td>45-99</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.236226</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Local-gov</td>\n",
       "      <td>Assoc-acdm</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Protective-serv</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>United-States</td>\n",
       "      <td>28-36</td>\n",
       "      <td>12-16</td>\n",
       "      <td>40-44</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0.236226</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Private</td>\n",
       "      <td>Some-college</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Machine-op-inspct</td>\n",
       "      <td>Husband</td>\n",
       "      <td>Black</td>\n",
       "      <td>Male</td>\n",
       "      <td>United-States</td>\n",
       "      <td>37-47</td>\n",
       "      <td>10-11</td>\n",
       "      <td>40-44</td>\n",
       "      <td>7298-7978</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0.236226</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>?</td>\n",
       "      <td>Some-college</td>\n",
       "      <td>Never-married</td>\n",
       "      <td>?</td>\n",
       "      <td>Own-child</td>\n",
       "      <td>White</td>\n",
       "      <td>Female</td>\n",
       "      <td>United-States</td>\n",
       "      <td>17-27</td>\n",
       "      <td>10-11</td>\n",
       "      <td>1-39</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.236226</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    workclass      education       marital_status          occupation  \\\n",
       "0     Private           11th        Never-married   Machine-op-inspct   \n",
       "1     Private        HS-grad   Married-civ-spouse     Farming-fishing   \n",
       "2   Local-gov     Assoc-acdm   Married-civ-spouse     Protective-serv   \n",
       "3     Private   Some-college   Married-civ-spouse   Machine-op-inspct   \n",
       "4           ?   Some-college        Never-married                   ?   \n",
       "\n",
       "  relationship    race      sex  native_country age_bin education_num_bin  \\\n",
       "0    Own-child   Black     Male   United-States   17-27               1-8   \n",
       "1      Husband   White     Male   United-States   37-47                 9   \n",
       "2      Husband   White     Male   United-States   28-36             12-16   \n",
       "3      Husband   Black     Male   United-States   37-47             10-11   \n",
       "4    Own-child   White   Female   United-States   17-27             10-11   \n",
       "\n",
       "  hours_per_week_bin capital_gain_bin capital_loss_bin  observed  expectation  \n",
       "0              40-44                0                0         0     0.236226  \n",
       "1              45-99                0                0         0     0.236226  \n",
       "2              40-44                0                0         1     0.236226  \n",
       "3              40-44        7298-7978                0         1     0.236226  \n",
       "4               1-39                0                0         0     0.236226  "
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = pd.read_csv('https://gist.githubusercontent.com/Viktour19/b690679802c431646d36f7e2dd117b9e/raw/d8f17bf25664bd2d9fa010750b9e451c4155dd61/adult_autostrat.csv')\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that for the adult dataset, the positive label is 1 and thus the expectations provided is the probability of the earning >50k i.e label 1 and the favorable label is 1 which is the default for binary classification tasks. Since we would be using scoring function BerkJones, we also need to pass in an alpha value. Alpha can be interpreted as what proportion of the data you expect to have the favorable value"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = data.drop(['observed','expectation'], axis = 1)\n",
    "probs = data['expectation']\n",
    "y = data['observed']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "privileged_subset = bias_scan(data=X, observations=y, scoring='BerkJones', expectations=probs, overpredicted=True,penalty=50, alpha = .24)\n",
    "unprivileged_subset = bias_scan(data=X,observations=y, scoring='BerkJones', expectations=probs, overpredicted=False,penalty=50, alpha = .24)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "({'relationship': [' Not-in-family', ' Other-relative', ' Own-child', ' Unmarried'], 'capital_gain_bin': ['0']}, 932.4812)\n",
      "({'education_num_bin': ['12-16'], 'marital_status': [' Married-civ-spouse']}, 1041.1901)\n"
     ]
    }
   ],
   "source": [
    "print(privileged_subset)\n",
    "print(unprivileged_subset)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "dff = X.copy()\n",
    "dff['observed'] = y \n",
    "dff['probabilities'] = probs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Our detected privileged group has a size of 8532, we observe 0.0472 as the average probability of earning >50k, but our model predicts 0.2362'"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "to_choose = dff[privileged_subset[0].keys()].isin(privileged_subset[0]).all(axis=1)\n",
    "temp_df = dff.loc[to_choose]\n",
    "\n",
    "\"Our detected privileged group has a size of {}, we observe {} as the average probability of earning >50k, but our model predicts {}\"\\\n",
    ".format(len(temp_df), np.round(temp_df['observed'].mean(),4), np.round(temp_df['probabilities'].mean(),4))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Our detected unprivileged group has a size of 2430, we observe 0.6996 as the average probability of earning >50k, but our model predicts 0.2362'"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "to_choose = dff[unprivileged_subset[0].keys()].isin(unprivileged_subset[0]).all(axis=1)\n",
    "temp_df = dff.loc[to_choose]\n",
    "\n",
    "\"Our detected unprivileged group has a size of {}, we observe {} as the average probability of earning >50k, but our model predicts {}\"\\\n",
    ".format(len(temp_df), np.round(temp_df['observed'].mean(),4), np.round(temp_df['probabilities'].mean(),4))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Insurance Costs\n",
    "This is a regression use case where the favorable value is 0 and the scoring function is Gaussian."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(1338, 7)"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = pd.read_csv('https://raw.githubusercontent.com/Adebayo-Oshingbesan/data/main/insurance.csv')\n",
    "data.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "for col in ['bmi','age']:\n",
    "        data[col] = pd.qcut(data[col], 10, duplicates='drop')\n",
    "        data[col] = data[col].apply(lambda x: str(round(x.left, 2)) + ' - ' + str(round(x.right,2)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "features = data.drop('charges', axis = 1)\n",
    "X = features.copy()\n",
    "\n",
    "for feature in X.columns:\n",
    "    X[feature] = X[feature].astype('category').cat.codes\n",
    "\n",
    "y = data['charges']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LinearRegression\n",
    "reg = LinearRegression()\n",
    "reg.fit(X, y)\n",
    "y_pred = pd.Series(reg.predict(X))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "privileged_subset = bias_scan(data=features, observations=y, expectations=y_pred, scoring = 'Gaussian', \n",
    "                            overpredicted=True, penalty=1e10, mode ='continuous', favorable_value='low')\n",
    "\n",
    "unprivileged_subset = bias_scan(data=features, observations=y, expectations=y_pred, scoring = 'Gaussian', \n",
    "                            overpredicted=False, penalty=1e10, mode ='continuous', favorable_value='low')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "({'bmi': ['15.96 - 22.99', '22.99 - 25.33', '25.33 - 27.36'], 'smoker': ['no']}, 2384.5786)\n",
      "({'bmi': ['15.96 - 22.99', '22.99 - 25.33', '25.33 - 27.36', '27.36 - 28.8'], 'smoker': ['yes']}, 3927.8765)\n"
     ]
    }
   ],
   "source": [
    "print(privileged_subset)\n",
    "print(unprivileged_subset)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Our detected privileged group has a size of 321, we observe 7844.840295856697 as the mean insurance costs, but our model predicts 5420.49326277455'"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "to_choose = data[privileged_subset[0].keys()].isin(privileged_subset[0]).all(axis=1)\n",
    "temp_df = data.loc[to_choose].copy()\n",
    "temp_y = y_pred.loc[to_choose].copy()\n",
    "\n",
    "\"Our detected privileged group has a size of {}, we observe {} as the mean insurance costs, but our model predicts {}\"\\\n",
    ".format(len(temp_df), temp_df['charges'].mean(), temp_y.mean())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Our detected privileged group has a size of 115, we observe 21148.37389617392 as the mean insurance costs, but our model predicts 29694.035319112852'"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "to_choose = data[unprivileged_subset[0].keys()].isin(unprivileged_subset[0]).all(axis=1)\n",
    "temp_df = data.loc[to_choose].copy()\n",
    "temp_y = y_pred.loc[to_choose].copy()\n",
    "\n",
    "\"Our detected privileged group has a size of {}, we observe {} as the mean insurance costs, but our model predicts {}\"\\\n",
    ".format(len(temp_df), temp_df['charges'].mean(), temp_y.mean())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Hospitalization Time\n",
    "This is an ordinal, multiclass classification use case where the favorable value is 1 and the scoring function is Poisson."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(29980, 22)"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = pd.read_csv('https://raw.githubusercontent.com/Adebayo-Oshingbesan/data/main/hospital.csv')\n",
    "data = data[data['Length of Stay'] != '120 +'].fillna('Unknown')\n",
    "data.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = data.drop(['Length of Stay'], axis = 1)\n",
    "y = pd.to_numeric(data['Length of Stay'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "privileged_subset = bias_scan(data=X, observations=y, scoring = 'Poisson', favorable_value = 'low', overpredicted=True, penalty=50, mode ='ordinal')\n",
    "unprivileged_subset = bias_scan(data=X, observations=y, scoring = 'Poisson', favorable_value = 'low', overpredicted=False, penalty=50, mode ='ordinal')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "({'APR Severity of Illness Description': ['Extreme']}, 11180.5386)\n",
      "({'Patient Disposition': ['Home or Self Care', 'Left Against Medical Advice', 'Short-term Hospital'], 'APR Severity of Illness Description': ['Minor', 'Moderate'], 'APR MDC Code': [1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 21]}, 9950.881)\n"
     ]
    }
   ],
   "source": [
    "print(privileged_subset)\n",
    "print(unprivileged_subset)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "dff = X.copy()\n",
    "dff['observed'] = y \n",
    "dff['predicted'] = y.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Our detected privileged group has a size of 1900, we observe 15.2216 as the average number of days spent in the hospital, but our model predicts 5.4231'"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "to_choose = dff[privileged_subset[0].keys()].isin(privileged_subset[0]).all(axis=1)\n",
    "temp_df = dff.loc[to_choose]\n",
    "\n",
    "\"Our detected privileged group has a size of {}, we observe {} as the average number of days spent in the hospital, but our model predicts {}\"\\\n",
    ".format(len(temp_df), np.round(temp_df['observed'].mean(),4), np.round(temp_df['predicted'].mean(),4))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Our detected unprivileged group has a size of 14620, we observe 2.8301 as the average number of days spent in the hospital, but our model predicts 5.4231'"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "to_choose = dff[unprivileged_subset[0].keys()].isin(unprivileged_subset[0]).all(axis=1)\n",
    "temp_df = dff.loc[to_choose]\n",
    "\n",
    "\"Our detected unprivileged group has a size of {}, we observe {} as the average number of days spent in the hospital, but our model predicts {}\"\\\n",
    ".format(len(temp_df), np.round(temp_df['observed'].mean(),4), np.round(temp_df['predicted'].mean(),4))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Temperature Dataset\n",
    "This is a regression use case where the favorable value is the higher temperatures and the scoring function is Berk Jones."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Summary</th>\n",
       "      <th>PrecipType</th>\n",
       "      <th>Humidity</th>\n",
       "      <th>WindSpeed</th>\n",
       "      <th>Visibility</th>\n",
       "      <th>Pressure</th>\n",
       "      <th>DailySummary</th>\n",
       "      <th>Temperature</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Partly Cloudy</td>\n",
       "      <td>rain</td>\n",
       "      <td>0.89</td>\n",
       "      <td>14.1197</td>\n",
       "      <td>15.8263</td>\n",
       "      <td>1015.13</td>\n",
       "      <td>Partly cloudy throughout the day.</td>\n",
       "      <td>9.472222</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Partly Cloudy</td>\n",
       "      <td>rain</td>\n",
       "      <td>0.86</td>\n",
       "      <td>14.2646</td>\n",
       "      <td>15.8263</td>\n",
       "      <td>1015.63</td>\n",
       "      <td>Partly cloudy throughout the day.</td>\n",
       "      <td>9.355556</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Mostly Cloudy</td>\n",
       "      <td>rain</td>\n",
       "      <td>0.89</td>\n",
       "      <td>3.9284</td>\n",
       "      <td>14.9569</td>\n",
       "      <td>1015.94</td>\n",
       "      <td>Partly cloudy throughout the day.</td>\n",
       "      <td>9.377778</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Partly Cloudy</td>\n",
       "      <td>rain</td>\n",
       "      <td>0.83</td>\n",
       "      <td>14.1036</td>\n",
       "      <td>15.8263</td>\n",
       "      <td>1016.41</td>\n",
       "      <td>Partly cloudy throughout the day.</td>\n",
       "      <td>8.288889</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Mostly Cloudy</td>\n",
       "      <td>rain</td>\n",
       "      <td>0.83</td>\n",
       "      <td>11.0446</td>\n",
       "      <td>15.8263</td>\n",
       "      <td>1016.51</td>\n",
       "      <td>Partly cloudy throughout the day.</td>\n",
       "      <td>8.755556</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         Summary PrecipType  Humidity  WindSpeed  Visibility  Pressure  \\\n",
       "0  Partly Cloudy       rain      0.89    14.1197     15.8263   1015.13   \n",
       "1  Partly Cloudy       rain      0.86    14.2646     15.8263   1015.63   \n",
       "2  Mostly Cloudy       rain      0.89     3.9284     14.9569   1015.94   \n",
       "3  Partly Cloudy       rain      0.83    14.1036     15.8263   1016.41   \n",
       "4  Mostly Cloudy       rain      0.83    11.0446     15.8263   1016.51   \n",
       "\n",
       "                        DailySummary  Temperature  \n",
       "0  Partly cloudy throughout the day.     9.472222  \n",
       "1  Partly cloudy throughout the day.     9.355556  \n",
       "2  Partly cloudy throughout the day.     9.377778  \n",
       "3  Partly cloudy throughout the day.     8.288889  \n",
       "4  Partly cloudy throughout the day.     8.755556  "
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = pd.read_csv('https://raw.githubusercontent.com/Adebayo-Oshingbesan/data/main/weatherHistory.csv')\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Binning the continuous features since bias scan support only categorical features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [],
   "source": [
    "for col in ['Humidity','WindSpeed','Visibility','Pressure']:\n",
    "        data[col] = pd.qcut(data[col], 10, duplicates='drop')\n",
    "        data[col] = data[col].apply(lambda x: str(round(x.left, 2)) + ' - ' + str(round(x.right,2)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "features = data.drop('Temperature', axis = 1)\n",
    "y = data['Temperature']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [],
   "source": [
    "privileged_subset = bias_scan(data=features, observations=y, favorable_value = 'high',\n",
    "                            scoring = 'BerkJones', overpredicted=True, penalty=50, mode ='continuous', alpha = .4)\n",
    "\n",
    "unprivileged_subset = bias_scan(data=features, observations=y, favorable_value = 'high',\n",
    "                    scoring = 'BerkJones', overpredicted=False, penalty=50, mode ='continuous', alpha = .4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "({'Pressure': ['-0.0 - 1007.07', '1018.17 - 1020.0', '1020.0 - 1022.42', '1022.42 - 1026.61', '1026.61 - 1046.38'], 'Humidity': ['0.72 - 0.78', '0.78 - 0.83', '0.83 - 0.87', '0.87 - 0.92', '0.92 - 0.95', '0.95 - 1.0']}, 6907.8227)\n",
      "({'Visibility': ['9.9 - 9.98', '9.98 - 10.05', '10.05 - 11.04', '11.04 - 11.45', '11.45 - 15.15', '15.15 - 15.83', '15.83 - 16.1'], 'PrecipType': ['rain'], 'Pressure': ['-0.0 - 1007.07', '1007.07 - 1010.68', '1010.68 - 1012.95', '1012.95 - 1014.8', '1014.8 - 1016.45', '1016.45 - 1018.17', '1018.17 - 1020.0', '1020.0 - 1022.42']}, 19962.4291)\n"
     ]
    }
   ],
   "source": [
    "print(privileged_subset)\n",
    "print(unprivileged_subset)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Our detected privileged group has a size of 31607, we observe 5.155584909121915 as the mean temperature, but our model predicts 11.932678437519867'"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "to_choose = data[privileged_subset[0].keys()].isin(privileged_subset[0]).all(axis=1)\n",
    "temp_df = data.loc[to_choose].copy()\n",
    "\n",
    "\"Our detected privileged group has a size of {}, we observe {} as the mean temperature, but our model predicts {}\"\\\n",
    ".format(len(temp_df), temp_df['Temperature'].mean(), y.mean())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Our detected unprivileged group has a size of 55642, we observe 16.773802762911167 as the mean temperature, but our model predicts 11.932678437519867'"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "to_choose = data[unprivileged_subset[0].keys()].isin(unprivileged_subset[0]).all(axis=1)\n",
    "temp_df = data.loc[to_choose].copy()\n",
    "\n",
    "\"Our detected unprivileged group has a size of {}, we observe {} as the mean temperature, but our model predicts {}\"\\\n",
    ".format(len(temp_df), temp_df['Temperature'].mean(), y.mean())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Iris Dataset\n",
    "This is an nominal, multiclass classification use case where the favorable value is a flower specie and the scoring function is Bernoulli."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>SepalLengthCm</th>\n",
       "      <th>SepalWidthCm</th>\n",
       "      <th>PetalLengthCm</th>\n",
       "      <th>PetalWidthCm</th>\n",
       "      <th>Species</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5.1</td>\n",
       "      <td>3.5</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>Iris-setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4.9</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>Iris-setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.7</td>\n",
       "      <td>3.2</td>\n",
       "      <td>1.3</td>\n",
       "      <td>0.2</td>\n",
       "      <td>Iris-setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4.6</td>\n",
       "      <td>3.1</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "      <td>Iris-setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5.0</td>\n",
       "      <td>3.6</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>Iris-setosa</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species\n",
       "0            5.1           3.5            1.4           0.2  Iris-setosa\n",
       "1            4.9           3.0            1.4           0.2  Iris-setosa\n",
       "2            4.7           3.2            1.3           0.2  Iris-setosa\n",
       "3            4.6           3.1            1.5           0.2  Iris-setosa\n",
       "4            5.0           3.6            1.4           0.2  Iris-setosa"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "iris_data = pd.read_csv('https://raw.githubusercontent.com/Adebayo-Oshingbesan/data/main/Iris.csv').drop('Id', axis = 1)\n",
    "iris_data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [],
   "source": [
    "for col in iris_data.columns:\n",
    "    if col != 'Species':\n",
    "        iris_data[col] = pd.qcut(iris_data[col], 10, duplicates='drop')\n",
    "        iris_data[col] = iris_data[col].apply(lambda x: str(round(x.left, 2)) + ' - ' + str(round(x.right,2)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " Training simple model on data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = iris_data.drop('Species', axis = 1)\n",
    "for col in X.columns:\n",
    "    X[col] = X[col].cat.codes\n",
    "\n",
    "y = iris_data['Species']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "clf_2 = LogisticRegression(C=1e-3)\n",
    "clf_2.fit(X, y)\n",
    "iris_data['Prediction'] = clf_2.predict(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [],
   "source": [
    "features = iris_data.drop(['Species','Prediction'], axis = 1)\n",
    "expectations = pd.DataFrame(clf_2.predict_proba(X), columns=clf_2.classes_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Bias scan"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [],
   "source": [
    "privileged_subset = bias_scan(data=features, observations=y, expectations=expectations, scoring = 'Bernoulli', \n",
    "                                favorable_value = 'Iris-virginica', overpredicted=True, penalty=.05, mode ='nominal')\n",
    "unprivileged_subset = bias_scan(data=features, observations=y, expectations=expectations, scoring = 'Bernoulli', \n",
    "                                favorable_value = 'Iris-virginica', overpredicted=False, penalty=.005, mode ='nominal')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "({'PetalLengthCm': ['1.0 - 1.4', '1.4 - 1.5', '1.5 - 1.7', '1.7 - 3.9', '3.9 - 4.35', '4.35 - 4.64'], 'PetalWidthCm': ['0.1 - 0.2', '0.2 - 0.4', '0.4 - 1.16', '1.16 - 1.3', '1.3 - 1.5']}, 20.0508)\n",
      "({'SepalLengthCm': ['4.8 - 5.0', '5.6 - 5.8', '6.1 - 6.3', '6.3 - 6.52', '6.52 - 6.9', '6.9 - 7.9'], 'PetalWidthCm': ['1.5 - 1.8', '1.8 - 1.9', '1.9 - 2.2', '2.2 - 2.5'], 'PetalLengthCm': ['4.35 - 4.64', '5.0 - 5.32', '5.32 - 5.8', '5.8 - 6.9']}, 22.101)\n"
     ]
    }
   ],
   "source": [
    "print(privileged_subset)\n",
    "print(unprivileged_subset)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Our detected privileged group has a size of 88, we observe 0 as the count of Iris-virginica, but our model predicts 50'"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "to_choose = iris_data[privileged_subset[0].keys()].isin(privileged_subset[0]).all(axis=1)\n",
    "temp_df = iris_data.loc[to_choose].copy()\n",
    "\n",
    "\"Our detected privileged group has a size of {}, we observe {} as the count of Iris-virginica, but our model predicts {}\"\\\n",
    ".format(len(temp_df), (temp_df['Species'] == 'Iris-virginica').sum(), (temp_df['Prediction'] == 'Iris-setosa').sum())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Our detected unprivileged group has a size of 39, we observe 39 as the count of Iris-virginica, but our model predicts 38'"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "to_choose = iris_data[unprivileged_subset[0].keys()].isin(unprivileged_subset[0]).all(axis=1)\n",
    "temp_df = iris_data.loc[to_choose].copy()\n",
    "\n",
    "\"Our detected unprivileged group has a size of {}, we observe {} as the count of Iris-virginica, but our model predicts {}\"\\\n",
    ".format(len(temp_df), (temp_df['Species'] == 'Iris-virginica').sum(), (temp_df['Prediction'] == 'Iris-virginica').sum())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Assuming we want to scan for the second most privileged group, we can remove the records that belongs to the most privileged_subset and then rescan."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [],
   "source": [
    "to_choose = iris_data[unprivileged_subset[0].keys()].isin(unprivileged_subset[0]).all(axis=1)\n",
    "X_filtered = iris_data[~to_choose]\n",
    "y_filtered = y[~to_choose]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [],
   "source": [
    "privileged_subset = bias_scan(data=X_filtered.drop(['Species','Prediction'], axis = 1), observations=y_filtered, \n",
    "                            favorable_value = 'Iris-virginica', scoring = 'Bernoulli', overpredicted=True, penalty=1e-6, mode = 'nominal')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "({'PetalLengthCm': ['1.0 - 1.4', '1.4 - 1.5', '1.5 - 1.7', '1.7 - 3.9', '3.9 - 4.35', '4.35 - 4.64']}, 36.0207)\n"
     ]
    }
   ],
   "source": [
    "print(privileged_subset)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Our detected privileged group has a size of 89, we observe 0 as the count of Iris-virginica, but our model predicts 4'"
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "to_choose = X_filtered[privileged_subset[0].keys()].isin(privileged_subset[0]).all(axis=1)\n",
    "temp_df = X_filtered.loc[to_choose]\n",
    "\n",
    "\"Our detected privileged group has a size of {}, we observe {} as the count of Iris-virginica, but our model predicts {}\"\\\n",
    ".format(len(temp_df), (temp_df['Species'] == 'Iris-virginica').sum(), (temp_df['Prediction'] == 'Iris-virginica').sum())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In summary, this notebook explains how to use the new mdss bias scan interface in aif360.detectors to scan for bias, even for tasks beyond binary classification, using the concepts of over-predictions and under-predictions."
   ]
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "a7b8e4082fc046e7b321ebd13577b0b02bbec122b09da65f91f262e840b142f2"
  },
  "kernelspec": {
   "display_name": "aif360",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}