{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Пример использование подходов из sklift.models в sklearn.pipeline\n", "\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", " SCIKIT-UPLIFT REPO | \n", " SCIKIT-UPLIFT DOCS\n", "
\n", " ENGLISH VERSION\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "В данном ноутбуке рассмотрим простой пример применения одного из подходов прогнозирования uplift в [sklearn.pipeline](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.pipeline).\n", "\n", "Данные для примера взяты из [MineThatData E-Mail Analytics And Data Mining Challenge dataset by Kevin Hillstrom](https://blog.minethatdata.com/2008/03/minethatdata-e-mail-analytics-and-data.html). Этот набор данных содержит 64 000 клиентов, которые в последний раз совершали покупки в течение двенадцати месяцев. Среди клиентов была проведена рекламная кампания с помощью email рассылки:\n", "\n", "* 1/3 клиентов были выбраны случайным образом для получения электронного письма, рекламирующего мужскую продукцию;\n", "* 1/3 клиентов были выбраны случайным образом для получения электронного письма, рекламирующего женскую продукцию;\n", "* С оставшейся 1/3 коммуникацию не проводили.\n", "\n", "Для каждого клиента из выборки замерили факт перехода по ссылке в письме, факт совершения покупки и сумму трат за две недели, следущими после получения письма.\n", "\n", "Полное описание датасета можной найти по [ссылке](https://blog.minethatdata.com/2008/03/minethatdata-e-mail-analytics-and-data.html).\n", "\n", "Установим необходимые библиотеки:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2020-05-02T17:00:49.841653Z", "start_time": "2020-05-02T17:00:49.835556Z" } }, "outputs": [], "source": [ "!pip install scikit-uplift==0.1.2 xgboost==1.0.2 category_encoders==2.1.0" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2020-04-26T14:28:36.188277Z", "start_time": "2020-04-26T14:28:36.106561Z" } }, "source": [ "Загрузим данные:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2020-05-02T17:01:04.035180Z", "start_time": "2020-05-02T17:00:49.846452Z" } }, "outputs": [ { "data": { "text/plain": [ "('./content/Hilstorm.csv', )" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import urllib.request\n", "import pandas as pd\n", "\n", "\n", "csv_path = '/content/Hilstorm.csv'\n", "url = 'http://www.minethatdata.com/Kevin_Hillstrom_MineThatData_E-MailAnalytics_DataMiningChallenge_2008.03.20.csv'\n", "urllib.request.urlretrieve(url, csv_path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Для простоты примера оставим только два сегмента пользователей:\n", "* тем, кому рассылалась по электронной почте рекламная кампания с участием женских товаров;\n", "* тем, кому не рассылалась рекламная кампания.\n", "\n", "В качестве целевой переменной будем использовать переменную `visit`." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2020-05-02T17:01:04.476684Z", "start_time": "2020-05-02T17:01:04.038420Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Размер датасета до обработки: (64000, 12)\n", "Размер датасета после обработки: (42693, 10)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
recencyhistory_segmenthistorymenswomenszip_codenewbiechannelvisittreatment
0102) $100 - $200142.4410Surburban0Phone01
163) $200 - $350329.0811Rural1Web00
272) $100 - $200180.6501Surburban1Web01
421) $0 - $10045.3410Urban0Web01
562) $100 - $200134.8301Surburban0Phone11
\n", "
" ], "text/plain": [ " recency history_segment history mens womens zip_code newbie channel \\\n", "0 10 2) $100 - $200 142.44 1 0 Surburban 0 Phone \n", "1 6 3) $200 - $350 329.08 1 1 Rural 1 Web \n", "2 7 2) $100 - $200 180.65 0 1 Surburban 1 Web \n", "4 2 1) $0 - $100 45.34 1 0 Urban 0 Web \n", "5 6 2) $100 - $200 134.83 0 1 Surburban 0 Phone \n", "\n", " visit treatment \n", "0 0 1 \n", "1 0 0 \n", "2 0 1 \n", "4 0 1 \n", "5 1 1 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd; pd.set_option('display.max_columns', None)\n", "\n", "\n", "%matplotlib inline\n", "\n", "dataset = pd.read_csv(csv_path)\n", "print(f'Размер датасета до обработки: {dataset.shape}')\n", "dataset = dataset[dataset['segment']!='Mens E-Mail']\n", "dataset.loc[:, 'treatment'] = dataset['segment'].map({\n", " 'Womens E-Mail': 1,\n", " 'No E-Mail': 0\n", "})\n", "\n", "dataset = dataset.drop(['segment', 'conversion', 'spend'], axis=1)\n", "print(f'Размер датасета после обработки: {dataset.shape}')\n", "dataset.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Разобъем все данные на обучающую и валидационную выборку:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2020-05-02T17:01:05.037943Z", "start_time": "2020-05-02T17:01:04.480545Z" } }, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "\n", "Xyt_tr, Xyt_val = train_test_split(dataset, test_size=0.5, random_state=42)\n", "\n", "X_tr = Xyt_tr.drop(['visit', 'treatment'], axis=1)\n", "y_tr = Xyt_tr['visit']\n", "treat_tr = Xyt_tr['treatment']\n", "\n", "X_val = Xyt_val.drop(['visit', 'treatment'], axis=1)\n", "y_val = Xyt_val['visit']\n", "treat_val = Xyt_val['treatment']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select categorical features:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2020-05-02T17:01:05.052632Z", "start_time": "2020-05-02T17:01:05.040798Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['history_segment', 'zip_code', 'channel']\n" ] } ], "source": [ "cat_cols = X_tr.select_dtypes(include='object').columns.tolist()\n", "print(cat_cols)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Создадим нужные объекты и объединим их в pipieline." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2020-05-02T17:01:05.144900Z", "start_time": "2020-05-02T17:01:05.059079Z" } }, "outputs": [], "source": [ "from sklearn.pipeline import Pipeline\n", "from category_encoders import CatBoostEncoder\n", "from sklift.models import ClassTransformation\n", "from xgboost import XGBClassifier\n", "\n", "\n", "encoder = CatBoostEncoder(cols=cat_cols)\n", "estimator = XGBClassifier(max_depth=2, random_state=42)\n", "ct = ClassTransformation(estimator=estimator)\n", "\n", "my_pipeline = Pipeline([\n", " ('encoder', encoder),\n", " ('model', ct)\n", "])" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2020-04-26T18:02:52.236917Z", "start_time": "2020-04-26T18:02:52.110138Z" } }, "source": [ "Обучать pipeline будем как обычно, но колонку treatment добавим как параметр шага model: `model__treatment`." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2020-05-02T17:01:06.298101Z", "start_time": "2020-05-02T17:01:05.149361Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/pipeline.py:354: UserWarning: It is recommended to use this approach on treatment balanced data. Current sample size is unbalanced.\n", "  self._final_estimator.fit(Xt, y, **fit_params)\n" ] } ], "source": [ "my_pipeline = my_pipeline.fit(\n", "    X=X_tr,\n", "    y=y_tr,\n", "    model__treatment=treat_tr\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Predict the uplift and compute uplift@30%:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2020-05-02T17:01:06.418688Z", "start_time": "2020-05-02T17:01:06.300295Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "uplift@30%: 0.0661\n" ] } ], "source": [ "from sklift.metrics import uplift_at_k\n", "\n", "\n", "uplift_predictions = my_pipeline.predict(X_val)\n", "\n", "uplift_30 = uplift_at_k(y_val, uplift_predictions, treat_val, strategy='overall')\n", "print(f'uplift@30%: {uplift_30:.4f}')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 2 }