{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Example of usage model from sklift.models in sklearn.pipeline\n",
    "\n",
    "<br>\n",
    "<center>\n",
    "    <a href=\"https://colab.research.google.com/github/maks-sh/scikit-uplift/blob/master/notebooks/pipeline_usage_EN.ipynb\">\n",
    "        <img src=\"https://colab.research.google.com/assets/colab-badge.svg\">\n",
    "    </a>\n",
    "    <br>\n",
    "    <b><a href=\"https://github.com/maks-sh/scikit-uplift/\">SCIKIT-UPLIFT REPO</a> | </b>\n",
    "    <b><a href=\"https://scikit-uplift.readthedocs.io/en/latest/\">SCIKIT-UPLIFT DOCS</a></b>\n",
    "    <br>\n",
    "    <b><a href=\"https://nbviewer.jupyter.org/github/maks-sh/scikit-uplift/blob/master/notebooks/pipeline_usage_RU.ipynb\">RUSSIAN VERSION</a></b>\n",
    "\n",
    "</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-04-26T12:44:35.435852Z",
     "start_time": "2020-04-26T12:44:35.239050Z"
    }
   },
   "source": [
    "This is a simple example on how to use [sklift.models](https://scikit-uplift.readthedocs.io/en/latest/api/models.html) with [sklearn.pipeline](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.pipeline).\n",
    "\n",
    "The data is taken from [MineThatData E-Mail Analytics And Data Mining Challenge dataset by Kevin Hillstrom](https://blog.minethatdata.com/2008/03/minethatdata-e-mail-analytics-and-data.html).\n",
    "\n",
    "This dataset contains 64,000 customers who last purchased within twelve months. The customers were involved in an e-mail test:\n",
    "* 1/3 were randomly chosen to receive an e-mail campaign featuring Mens merchandise.\n",
    "* 1/3 were randomly chosen to receive an e-mail campaign featuring Womens merchandise.\n",
    "* 1/3 were randomly chosen to not receive an e-mail campaign.\n",
    "\n",
    "During a period of two weeks following the e-mail campaign, results were tracked. The task is to tell the world if the Mens or Womens e-mail campaign was successful.\n",
    "\n",
    "The full description of the dataset can be found at the [link](https://blog.minethatdata.com/2008/03/minethatdata-e-mail-analytics-and-data.html).\n",
    "\n",
    "Firstly, install the necessary libraries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-05-02T16:59:19.201984Z",
     "start_time": "2020-05-02T16:59:19.196511Z"
    }
   },
   "outputs": [],
   "source": [
    "!pip install scikit-uplift==0.1.2 xgboost==1.0.2 category_encoders==2.1.0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Secondly, load the data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-05-02T16:59:34.848414Z",
     "start_time": "2020-05-02T16:59:19.206449Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('./content/Hilstorm.csv', <http.client.HTTPMessage at 0x117de0438>)"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import urllib.request\n",
    "import pandas as pd\n",
    "\n",
    "\n",
    "csv_path = '/content/Hilstorm.csv'\n",
    "url = 'http://www.minethatdata.com/Kevin_Hillstrom_MineThatData_E-MailAnalytics_DataMiningChallenge_2008.03.20.csv'\n",
    "urllib.request.urlretrieve(url, csv_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For simplicity of the example, we will leave only two user segments:\n",
    "* those who were sent an e-mail advertising campaign with women's products;\n",
    "* those who were not sent out the ad campaign.\n",
    "\n",
    "We will use the `visit` variable as the target variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-05-02T16:59:35.421058Z",
     "start_time": "2020-05-02T16:59:34.851896Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Shape of the dataset before processing: (64000, 12)\n",
      "Shape of the dataset after processing: (42693, 10)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>recency</th>\n",
       "      <th>history_segment</th>\n",
       "      <th>history</th>\n",
       "      <th>mens</th>\n",
       "      <th>womens</th>\n",
       "      <th>zip_code</th>\n",
       "      <th>newbie</th>\n",
       "      <th>channel</th>\n",
       "      <th>visit</th>\n",
       "      <th>treatment</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>10</td>\n",
       "      <td>2) $100 - $200</td>\n",
       "      <td>142.44</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>Surburban</td>\n",
       "      <td>0</td>\n",
       "      <td>Phone</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>6</td>\n",
       "      <td>3) $200 - $350</td>\n",
       "      <td>329.08</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Rural</td>\n",
       "      <td>1</td>\n",
       "      <td>Web</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>7</td>\n",
       "      <td>2) $100 - $200</td>\n",
       "      <td>180.65</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>Surburban</td>\n",
       "      <td>1</td>\n",
       "      <td>Web</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2</td>\n",
       "      <td>1) $0 - $100</td>\n",
       "      <td>45.34</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>Urban</td>\n",
       "      <td>0</td>\n",
       "      <td>Web</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>6</td>\n",
       "      <td>2) $100 - $200</td>\n",
       "      <td>134.83</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>Surburban</td>\n",
       "      <td>0</td>\n",
       "      <td>Phone</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   recency history_segment  history  mens  womens   zip_code  newbie channel  \\\n",
       "0       10  2) $100 - $200   142.44     1       0  Surburban       0   Phone   \n",
       "1        6  3) $200 - $350   329.08     1       1      Rural       1     Web   \n",
       "2        7  2) $100 - $200   180.65     0       1  Surburban       1     Web   \n",
       "4        2    1) $0 - $100    45.34     1       0      Urban       0     Web   \n",
       "5        6  2) $100 - $200   134.83     0       1  Surburban       0   Phone   \n",
       "\n",
       "   visit  treatment  \n",
       "0      0          1  \n",
       "1      0          0  \n",
       "2      0          1  \n",
       "4      0          1  \n",
       "5      1          1  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "\n",
    "%matplotlib inline\n",
    "\n",
    "dataset = pd.read_csv(csv_path)\n",
    "print(f'Shape of the dataset before processing: {dataset.shape}')\n",
    "dataset = dataset[dataset['segment']!='Mens E-Mail']\n",
    "dataset.loc[:, 'treatment'] = dataset['segment'].map({\n",
    "    'Womens E-Mail': 1,\n",
    "    'No E-Mail': 0\n",
    "})\n",
    "\n",
    "dataset = dataset.drop(['segment', 'conversion', 'spend'], axis=1)\n",
    "print(f'Shape of the dataset after processing: {dataset.shape}')\n",
    "dataset.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Divide all the data into a training and validation sample:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-05-02T16:59:36.153705Z",
     "start_time": "2020-05-02T16:59:35.425904Z"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "\n",
    "Xyt_tr, Xyt_val = train_test_split(dataset, test_size=0.5, random_state=42)\n",
    "\n",
    "X_tr = Xyt_tr.drop(['visit', 'treatment'], axis=1)\n",
    "y_tr = Xyt_tr['visit']\n",
    "treat_tr = Xyt_tr['treatment']\n",
    "\n",
    "X_val = Xyt_val.drop(['visit', 'treatment'], axis=1)\n",
    "y_val = Xyt_val['visit']\n",
    "treat_val = Xyt_val['treatment']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Select categorical features:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-05-02T16:59:36.173267Z",
     "start_time": "2020-05-02T16:59:36.155955Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['history_segment', 'zip_code', 'channel']\n"
     ]
    }
   ],
   "source": [
    "cat_cols = X_tr.select_dtypes(include='object').columns.tolist()\n",
    "print(cat_cols)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create the necessary objects and combining them into a pipieline:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-05-02T16:59:36.269784Z",
     "start_time": "2020-05-02T16:59:36.179904Z"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.pipeline import Pipeline\n",
    "from category_encoders import CatBoostEncoder\n",
    "from sklift.models import ClassTransformation\n",
    "from xgboost import XGBClassifier\n",
    "\n",
    "\n",
    "encoder = CatBoostEncoder(cols=cat_cols)\n",
    "estimator = XGBClassifier(max_depth=2, random_state=42)\n",
    "ct = ClassTransformation(estimator=estimator)\n",
    "\n",
    "my_pipeline = Pipeline([\n",
    "    ('encoder', encoder),\n",
    "    ('model', ct)\n",
    "])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<b>Train pipeline as usual, but adding the treatment column in the step model as a parameter `model__treatment`.</b>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-05-02T16:59:37.436032Z",
     "start_time": "2020-05-02T16:59:36.275110Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/pipeline.py:354: UserWarning: It is recommended to use this approach on treatment balanced data. Current sample size is unbalanced.\n",
      "  self._final_estimator.fit(Xt, y, **fit_params)\n"
     ]
    }
   ],
   "source": [
    "my_pipeline = my_pipeline.fit(\n",
    "    X=X_tr,\n",
    "    y=y_tr,\n",
    "    model__treatment=treat_tr\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-04-26T18:07:44.970856Z",
     "start_time": "2020-04-26T18:07:44.964624Z"
    }
   },
   "source": [
    "Predict the uplift and calculate the uplift@30%"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-05-02T16:59:37.583995Z",
     "start_time": "2020-05-02T16:59:37.438748Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "uplift@30%: 0.0661\n"
     ]
    }
   ],
   "source": [
    "from sklift.metrics import uplift_at_k\n",
    "\n",
    "\n",
    "uplift_predictions = my_pipeline.predict(X_val)\n",
    "\n",
    "uplift_30 = uplift_at_k(y_val, uplift_predictions, treat_val, strategy='overall')\n",
    "print(f'uplift@30%: {uplift_30:.4f}')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}