{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# H2O Uplift Distributed Random Forest \n", "\n", "### Author: Veronika Maurerova veronika.maurerova@h2o.ai\n", "\n", "## Modeling Uplift \n", "\n", "Distributed Uplift Random Forest (Uplift DRF) is a classification tool for modeling uplift - the incremental impact of a treatment. It is useful, for example, in marketing or in medicine. This machine learning approach is inspired by the A/B testing method. \n", "\n", "To model uplift, the data must be collected in a specific way: before the experiment, the subjects are usually divided into two groups: \n", "\n", "- **treatment group**: receives some kind of treatment (for example, customers get some type of discount) \n", "- **control group**: is kept separate from the treatment (customers in this group get no discount). \n", "\n", "Then the data are prepared and the analyst gathers information about the response - for example, whether customers bought a product, patients recovered from the disease, or similar. \n", "\n", "## Uplift approaches \n", "\n", "There are several approaches to model uplift: \n", "\n", "- Meta-learner algorithms\n", "- Instrumental variables algorithms\n", "- Neural-networks-based algorithms\n", "- Tree-based algorithms \n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tree Based Uplift Algorithm\n", "\n", "A tree-based algorithm uses the treatment/control group assignment and the response directly when deciding how to split a node in every tree. The uplift score serves as the split criterion, similar to the Gini coefficient in a standard decision tree. \n", "\n", "**Uplift metric to decide best split**\n", "\n", "The goal is to maximize the differences between the class distributions in the treatment and control sets, so the splitting criteria are based on distribution divergences. 
The distribution divergence is calculated based on the ``uplift_metric`` parameter. In H2O-3, three ``uplift_metric`` types are supported:\n", "\n", "- **Kullback-Leibler divergence** (``uplift_metric=\"KL\"``) - uses logarithms to calculate the divergence; asymmetric and widely used, but it can become infinite if the treatment or control group distribution contains zero values. \n", "\n", "$KL(P, Q) = \\sum_{i=0}^{N} p_i \\log{\\frac{p_i}{q_i}}$\n", "\n", "- **Squared Euclidean distance** (``uplift_metric=\"euclidean\"``) - symmetric and stable; it never becomes infinite. \n", "\n", "$E(P, Q) = \\sum_{i=0}^{N} (p_i-q_i)^2$\n", "\n", "\n", "- **Chi-squared divergence** (``uplift_metric=\"chi_squared\"``) - Euclidean divergence normalized by the control group distribution. Asymmetric; it can also become infinite if the control group distribution contains zero values. \n", "\n", "$X^2(P, Q) = \\sum_{i=0}^{N} \\frac{(p_i-q_i)^2}{q_i}$\n", "\n", "where:\n", "\n", "- $P$ is the treatment group distribution\n", "\n", "- $Q$ is the control group distribution\n", "\n", "In a tree node, the resulting value for a split is the sum: $metric(P, Q) + metric(1-P, 1-Q)$. 
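
The three divergences and the node value $metric(P, Q) + metric(1-P, 1-Q)$ can be sketched in plain Python for a binary response. This is only an illustration of the formulas above, not the H2O-3 implementation: the function names are hypothetical, `p_treatment` and `p_control` are assumed to be the response rates in the two groups, and the handling of zero values follows the usual conventions rather than anything confirmed for H2O-3.

```python
import math


def kl_term(p, q):
    # One term of the Kullback-Leibler sum: p * log(p / q).
    if p == 0:
        return 0.0  # assumed convention: 0 * log(0 / q) = 0
    if q == 0:
        return math.inf  # unbounded when q is zero and p is not
    return p * math.log(p / q)


def euclidean_term(p, q):
    # One term of the squared Euclidean distance: (p - q)^2.
    return (p - q) ** 2


def chi_squared_term(p, q):
    # One term of the chi-squared divergence: (p - q)^2 / q.
    if p == q:
        return 0.0  # assumed convention, also covers the 0 / 0 case
    if q == 0:
        return math.inf
    return (p - q) ** 2 / q


# Hypothetical lookup keyed by the uplift_metric names used above.
TERMS = {'KL': kl_term, 'euclidean': euclidean_term, 'chi_squared': chi_squared_term}


def node_divergence(p_treatment, p_control, uplift_metric='euclidean'):
    # metric(P, Q) + metric(1 - P, 1 - Q) for a binary response.
    term = TERMS[uplift_metric]
    return term(p_treatment, p_control) + term(1 - p_treatment, 1 - p_control)


# Example: 10% response rate under treatment, 5% in control.
for name in TERMS:
    print(name, node_divergence(0.10, 0.05, name))
```

Swapping the two arguments changes the Kullback-Leibler and chi-squared values but not the squared Euclidean value, which is the asymmetry noted in the list above.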
\n", "\n", "For the split gain value, the result within the node is normalized using a Gini coefficient (Euclidean or ChiSquared) or entropy (KL) for each distribution before and after the split.\n", "\n", "\n", "**Uplift score in each leaf is calculated as:**\n", "\n", "- $TP = (TY1 + 1) / (T + 2)$\n", "- $CP = (CY1 + 1) / (C + 2)$\n", "- $uplift\\_score = TP - CP$\n", "\n", "where:\n", "- $T$ is the number of observations in the leaf from the treatment group (rows in the leaf with ``treatment_column`` label == 1) \n", "- $C$ is the number of observations in the leaf from the control group (rows in the leaf with ``treatment_column`` label == 0)\n", "- $TY1$ is the number of observations in the leaf from the treatment group that respond to the offer (rows in the leaf with ``treatment_column`` label == 1 and ``response_column`` label == 1)\n", "- $CY1$ is the number of observations in the leaf from the control group that respond to the offer (rows in the leaf with ``treatment_column`` label == 0 and ``response_column`` label == 1)\n", "\n", "**Note**: A higher uplift score means that a larger share of the treatment group responds to the offer than of the control group, so the offered treatment has a positive effect. The uplift score can be negative if a larger share of the control group responds without the treatment.\n", "\n", "
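
The leaf formulas above can be checked with a few lines of Python. This is a minimal sketch of the arithmetic only; the function name is hypothetical and the code is not taken from the H2O-3 code base.

```python
def leaf_uplift_score(t, c, ty1, cy1):
    # t, c: number of treatment / control rows in the leaf
    # ty1, cy1: number of responders among those rows
    tp = (ty1 + 1) / (t + 2)
    cp = (cy1 + 1) / (c + 2)
    return tp - cp


# 100 treated rows with 20 responders, 100 control rows with 10 responders
print(leaf_uplift_score(t=100, c=100, ty1=20, cy1=10))

# the control group responds more often, so the score is negative
print(leaf_uplift_score(t=100, c=100, ty1=5, cy1=10))

# an empty leaf gives 1/2 - 1/2
print(leaf_uplift_score(t=0, c=0, ty1=0, cy1=0))
```

The added 1 and 2 keep both rates defined even when a leaf contains no rows from one of the groups, and they pull rates estimated from very few rows toward 1/2.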
\n", "
\n", "\n", "![Difference between SDT and Uplift DT](https://blog.h2o.ai/wp-content/uploads/2022/01/tree.png)\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## H2O Implementation (Major release 3.36)\n", "\n", "The H2O-3 implementation of Uplift DRF is based on DRF because the principle of training is similar to DRF. It is a tree-based uplift algorithm. Uplift DRF generates a forest of classification uplift trees, rather than a single classification tree. Each of these trees is a weak learner built on a subset of rows and columns. More trees will reduce the variance. To make a final prediction, the forest averages the predictions of all of its trees. \n", "\n", "Currently, H2O-3 supports only binomial trees, together with the uplift curve metric, the Area Under Uplift Curve (AUUC) metric, normalized AUUC, and the Qini value. We are also working on adding regression trees and more metrics, for example, the Qini coefficient. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Start H2O-3" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "versionFromGradle='3.37.0',projectVersion='3.37.0.99999',branch='master',lastCommitHash='a1c95a407aec53a6cbc551484bd02d7d80b3bcb6',gitDescribe='jenkins-master-5950-dirty',compiledOn='2022-09-13 10:48:53',compiledBy='kurkami'\n" ] } ], "source": [ "import h2o\n", "from h2o.estimators.uplift_random_forest import H2OUpliftRandomForestEstimator\n", "\n", "import matplotlib as mpl\n", "import matplotlib.pyplot as plt\n", "import matplotlib.style as style\n", "\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load data\n", "\n", "To demonstrate how Uplift DRF works, the Criteo dataset is used. 
\n", "\n", "**Source:**\n", "\n", "Eustache Diemert, Artem Betlei, Christophe Renaudin, and Massih-Reza Amini, \"A Large Scale Benchmark for Uplift Modeling\", Proceedings of the AdKDD and TargetAd Workshop, KDD, London, United Kingdom, August 20, 2018, ACM, https://ailab.criteo.com/criteo-uplift-prediction-dataset/.\n", "\n", "\n", "\n", "**Description:**\n", "\n", "- The dataset was created by the Criteo AI Lab.\n", "- It consists of 13M rows, each one representing a user with 12 features, a treatment indicator and 2 binary labels (visits and conversions).\n", "- Positive labels mean the user visited/converted on the advertiser website during the test period (2 weeks).\n", "- The global treatment ratio is 84.6%.\n", "\n", "**Detailed description of the columns:**\n", "\n", "- **f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11**: feature values (dense, float)\n", "- **treatment**: treatment group (1 = treated, 0 = control)\n", "- **conversion**: whether a conversion occurred for this user (binary, label)\n", "- **visit**: whether a visit occurred for this user (binary, label)\n", "- **exposure**: treatment effect, whether the user has been effectively exposed (binary)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead><tr><th></th><th>f0</th><th>f1</th><th>f2</th><th>f3</th><th>f4</th><th>f5</th><th>f6</th><th>f7</th><th>f8</th><th>f9</th><th>f10</th><th>f11</th><th>treatment</th><th>conversion</th><th>visit</th><th>exposure</th></tr></thead>\n", "<tbody>\n",
"<tr><th>0</th><td>12.616365</td><td>10.059654</td><td>8.976429</td><td>4.679882</td><td>10.280525</td><td>4.115453</td><td>0.294443</td><td>4.833815</td><td>3.955396</td><td>13.190056</td><td>5.300375</td><td>-0.168679</td><td>1</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th>1</th><td>12.616365</td><td>10.059654</td><td>9.002689</td><td>4.679882</td><td>10.280525</td><td>4.115453</td><td>0.294443</td><td>4.833815</td><td>3.955396</td><td>13.190056</td><td>5.300375</td><td>-0.168679</td><td>1</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th>2</th><td>12.616365</td><td>10.059654</td><td>8.964775</td><td>4.679882</td><td>10.280525</td><td>4.115453</td><td>0.294443</td><td>4.833815</td><td>3.955396</td><td>13.190056</td><td>5.300375</td><td>-0.168679</td><td>1</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th>3</th><td>12.616365</td><td>10.059654</td><td>9.002801</td><td>4.679882</td><td>10.280525</td><td>4.115453</td><td>0.294443</td><td>4.833815</td><td>3.955396</td><td>13.190056</td><td>5.300375</td><td>-0.168679</td><td>1</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th>4</th><td>12.616365</td><td>10.059654</td><td>9.037999</td><td>4.679882</td><td>10.280525</td><td>4.115453</td><td>0.294443</td><td>4.833815</td><td>3.955396</td><td>13.190056</td><td>5.300375</td><td>-0.168679</td><td>1</td><td>0</td><td>0</td><td>0</td></tr>\n",
"</tbody>\n", "</table>" ], "text/plain": [ " f0 f1 f2 f3 f4 f5 f6 \\\n", "0 12.616365 10.059654 8.976429 4.679882 10.280525 4.115453 0.294443 \n", "1 12.616365 10.059654 9.002689 4.679882 10.280525 4.115453 0.294443 \n", "2 12.616365 10.059654 8.964775 4.679882 10.280525 4.115453 0.294443 \n", "3 12.616365 10.059654 9.002801 4.679882 10.280525 4.115453 0.294443 \n", "4 12.616365 10.059654 9.037999 4.679882 10.280525 4.115453 0.294443 \n", "\n", " f7 f8 f9 f10 f11 treatment conversion \\\n", "0 4.833815 3.955396 13.190056 5.300375 -0.168679 1 0 \n", "1 4.833815 3.955396 13.190056 5.300375 -0.168679 1 0 \n", "2 4.833815 3.955396 13.190056 5.300375 -0.168679 1 0 \n", "3 4.833815 3.955396 13.190056 5.300375 -0.168679 1 0 \n", "4 4.833815 3.955396 13.190056 5.300375 -0.168679 1 0 \n", "\n", " visit exposure \n", "0 0 0 \n", "1 0 0 \n", "2 0 0 \n", "3 0 0 \n", "4 0 0 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "control_name = \"control\"\n", "treatment_column = \"treatment\"\n", "response_column = \"visit\"\n", "feature_cols = [\"f\"+str(x) for x in range(0,12)]\n", "\n", "df = pd.read_csv(\"/home/0xdiag/bigdata/server/criteo/criteo-uplift-v2.1.csv\")\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prepare data\n", "\n", "Inspiration from: https://www.kaggle.com/code/hughhuyton/criteo-uplift-modelling/notebook\n", "\n", "To model uplift, the treatment and control group data must have a similar distribution. In the real world, the control group is usually smaller than the treatment group. This is also the case for the Criteo dataset, so we have to rebalance the data so that both groups have a similar size."
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total number of samples: 13979592\n", "The dataset is largely imbalanced: \n", "1 0.85\n", "0 0.15\n", "Name: treatment, dtype: float64\n", "Percentage of users that visit: 4.7%\n", "Percentage of users that convert: 0.29%\n", "Percentage of visitors that convert: 6.21%\n" ] } ], "source": [ "print('Total number of samples: {}'.format(len(df)))\n", "print('The dataset is largely imbalanced: ')\n", "print(df['treatment'].value_counts(normalize = True))\n", "print('Percentage of users that visit: {}%'.format(100*round(df['visit'].mean(),4)))\n", "print('Percentage of users that convert: {}%'.format(100*round(df['conversion'].mean(),4)))\n", "print('Percentage of visitors that convert: {}%'.format(100*round(df[df[\"visit\"]==1][\"conversion\"].mean(),4)))" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Class 0: 2096937\n", "Class 1: 11882655\n", "Proportion: 6 : 1\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Print proportion of a binary column\n", "# https://www.kaggle.com/code/hughhuyton/criteo-uplift-modelling/notebook\n", "def print_proportion(df, column):\n", " fig = plt.figure(figsize = (10,6))\n", " target_count = df[column].value_counts()\n", " print('Class 0:', target_count[0])\n", " print('Class 1:', target_count[1])\n", " print('Proportion:', int(round(target_count[1] / target_count[0])), ': 1')\n", " target_count.plot(kind='bar', title='Treatment Class Distribution', color=['#2077B4', '#FF7F0E'], fontsize = 15)\n", " plt.xticks(rotation=0) \n", " \n", "print_proportion(df, treatment_column)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(11183673, 16)\n", "(2795919, 16)\n" ] } ], "source": [ "from sklearn.model_selection import train_test_split\n", "train_df, test_df = train_test_split(df, test_size=0.2, random_state=42, stratify=df['treatment'])\n", "print(train_df.shape)\n", "print(test_df.shape)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "del(df)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Class 0: 1677550\n", "Class 1: 9506123\n", "Proportion: 6 : 1\n" ] }, { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAk0AAAF6CAYAAAAEbWzQAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8qNh9FAAAACXBIWXMAAAsTAAALEwEAmpwYAAAVsklEQVR4nO3debCld13n8c83C0uEBEgamCCk2QKCDo7EQhhFEIY9YKEshpkpUCfDVLkNW4CZGoOCsogU68RWIKVsArInUTbZBKzpsCmBMCzBkADphHQWEgKS7/zxPBfuXLpzf52+p+9J9+tV1dX3nuec5/nec1M37/49zz2nujsAAFyzgzZ7AACA6wLRBAAwQDQBAAwQTQAAA0QTAMAA0QQAMEA0AUuhqrZWVVfVIZs9y4qqelxVvXsD9/fZqrrP/PHJVfWaDdz3M6vqLzZqf8CPEk2wYFV1+ao/V1fVlas+f9wGHufUqnr2Ru1vD489FDxVdWxVvamqLqyqS6rqM1X1pKo6eF/NumqWU6vqu1V12fznn6vqj6vqiJX7dPdru/sBg/ta97nv7rt29wf2cvRU1X2q6mtr9v1H3f2be7tvYPdEEyxYd99o5U+Sf0ly/KrbXrtyv2VaYVmEqrp9kn9Mcm6Sn+ruI5I8KslxSW68SWM9v7tvnGRLkick+bkk/1BVP7aRB9nfv7dwoBBNsElWVguq6qSq+kaSV1fVQVX19Kr6UlVdVFVvrKqbrXrMm6rqG/MqzYeq6q7z7ScmeVySp80rWO+cbz+nqp46r+h8u6peWVW3qKoz5tWV91bVTVft/+eq6qNVtbOqPr1yKmne9oGq+sOq+of5se+uqqPmzR+a/945H/+eu/iSn5Xko939pO7+epJ099ndfUJ379zF8/OEqvrcfKwvV9V/XbXtqKp61zznt6rqw1V10LztpKo6b37c2VV1v/W+F939ne7+P0kenuTITAGVqnp8VX1k/riq6kVVdUFVXVpV/1RVP7nOc39SVX0myber6pD5tvuvOvQNquqv51k/UVV3W/U1dlXdYdXnp1bVs+egOyPJ0atWLI+uNaf7qurhNZ0O3Dl/735i1bZzquop838Xl8wz3GC95wkOdKIJNtctk9wsyTFJTkzy20l+OckvJjk6ycVJXr7q/mckuWOSmyf5RJLXJkl3b5s/fv68gnX8qsf8SpL/kOTYJMfP+3hmptWVg5L8TpJU1a2SnJbk2fNMT0nyN1W1ZdW+TsgUFDdPcr35Pkly7/nvm8zH/9guvtb7J3nz2NOSJLkgycOSHD4f80VV9TPzticn+dr8Ndxi/nq6qu6U5LeS/Oy8gvTAJOeMHrC7L0vyniS/sIvND8j0dR6b5Igkj05y0TrP/a8leWim5+Vfd7HPRyR5U6bn+3VJ3lZVh64z47eTPDjJ+atWLM9ffZ+qOjbJ65P8Xqbn6PQk76yq662626OTPCjJbZP82ySPv6bjAvsomqrqVfO/zv558P6Prqqz5n8lvW7R88EmujrJ73f3Vd19ZZInJvkf3f217r4qyclJfnXl9E53v6q7L1u17W616hqc3Xhpd3+zu89L8uEk/9jdn+zu7yR5a5J/N9/vPyY5vbtP7+6ru/s9SbYneciqfb26u78wz/rGJD+9B1/rkUm+Pnrn7j6tu7/Ukw8meXd+GDPfS/JvkhzT3d/r7g/39Eaa309y/SR3qapDu/uc7v7SHsyYJOdnipi1vpfpNOKdk1R3f25lxewavKS7z52fr105s7vf3N3fS/KnSW6Q6RTh3npMktO6+z3zvv8kyQ2T3GvNbOd397eSvDN79r2EA9K+Wmk6NdO/aNZVVXdM8owk/76775rpX0qwv9oxx8uKY5K8dT6lsjPJ5zKFwC2q6uCqeu586u7S/HAF5ahcs2+u+vjKXXx+o1XHftTKsefj/3ymOFnxjVUfX7HqsSMuWrOva1RVD66qj8+n33ZmireVr/UFSb6Y5N3zqbunJ0l3fzHTz4y
Tk1xQVW+oqqP3YMYkuVWSb629sbvfn+RlmVb+LqiqbVV1+Dr7Ond0e3dfnWn1bE/n3ZWjk3x1zb7PzfS1rdib7yUckPZJNHX3h7Lmh1BV3b6q/raqzpyvR7jzvOm/JHl5d188P/aCfTEjbJJe8/m5SR7c3TdZ9ecG8yrRCZlO59w/0+mhrfNjajf72lPnJvmrNcf+se5+7rX4OnblvZlOFa6rqq6f5G8yrZDcortvkukUUyXTabTufnJ33y7TdUhPWrl2qbtf190/nykCO8nzRo45H/dGmZ7fD+9qe3e/pLvvnuQumU7TPXVl0252ud7zcutVxz4oyY9nWulKppA5bNV9b7kH+z0/09e/su+aj3XeOo8DrsFmXtO0Lclvzz+AnpLkFfPtxyY5dr7Y9ONVNbRCBfuJU5I8p6qOSZKq2lJVj5i33TjJVZlWbA5L8kdrHvvNJLfbi2O/JsnxVfXAeVXrBjVdrP7jA4/dkelU4zUd//eT3KuqXlBVt0ySqrpDVb2mqm6y5r7Xy3SabUeSf62qB2e6pijz4x42P7aSXJJpNe7qqrpTVf3SHF3fybSSdvV6w1fV9avq7kneluk6slfv4j4/W1X3mK85+va8/5V9X9vn/u5V9cj59OvvZfr+fnze9qkkJ8zfiwdlus5txTeTHHkNp2bfmOShVXW/ed4nz/v+6LWYEZhtSjTN/5q7V5I3VdWnkvxZfrhsf0imC13vk+kiyj/fxQ9U2F+9OMk7Mp12uizT/0DvMW/7y0ynXM5LclZ++D/XFa/MdC3Pzqp6254euLvPzbSS9cxMsXJuppWUdX9OdPcVSZ6T6df1d1bVj1yXM19bdM9MK2SfrapLMq0mbU9y2Zr7XpbpAvU3ZoqYEzI9LyvumGnl6vIkH0vyiu7++0yh9dwkF2Y6/XTzTKf7d+dp8/N8Uabn98wk95ovtl7r8CR/Ps/z1fkxL5i3Xdvn/u2Zrj+6OMl/SvLI+RqkJPndTBfu78z023k/2G93fz7Thd5fno/5/53S6+6zM12j9tJMz8XxmV7q4rt7MBuwRk3XTu6DA1VtTfKu7v7J+TqAs7v7R65vqKpTMl2o+ur58/clefr868AAAJtiU1aauvvSJF+pqkclP3j9k5XXJ3lbplWm1PQaMMcm+fImjAkA8AP76iUHXp9pCf1ONb2Y329kWm7+jar6dJLPZjotkCR/l+Siqjoryd8neWp3X7Qv5gQA2J19dnoOAOC6zCuCAwAMEE0AAAMW/s7bRx11VG/dunXRhwEA2Gtnnnnmhd29ZVfbFh5NW7duzfbt2xd9GACAvVZVX93dNqfnAAAGiCYAgAGiCQBggGgCABggmgAABogmAIABogkAYIBoAgAYIJoAAAaIJgCAAaIJAGCAaAIAGCCaAAAGHLLZAxyobvv00zZ7BK4jvvLch272CADEShMAwBDRBAAwQDQBAAwQTQAAA0QTAMAA0QQAMEA0AQAMEE0AAANEEwDAANEEADBANAEADBBNAAADRBMAwADRBAAwQDQBAAwQTQAAA0QTAMAA0QQAMEA0AQAMEE0AAANEEwDAANEEADBANAEADBBNAAADRBMAwADRBAAwQDQBAAwQTQAAA0QTAMAA0QQAMEA0AQAMEE0AAANEEwDAANEEADBANAEADBBNAAADRBMAwADRBAAwQDQBAAwQTQAAA0QTAMAA0QQAMEA0AQAMGIqmqnpsVX2iqi6vqvOq6i+r6uhFDwcAsCzWjaaqeniS1yf5aJJHJDkpyb2TnFZVVqoAgAPCIQP3OSHJJ7r7t1ZuqKpLk7w9yZ2SfG5BswEALI2RlaJDk1yy5rad89+1odMAACypkWh6VZJfqKr/XFWHV9WxSZ6d5P3dfdZixwMAWA7rRlN3n5bk8Um2ZVpxOjvJwUl+ZXePqaoTq2p7VW3fsWPHBo0KALB5Ri4Ev2+SU5K8OMl9kzw2yc2SvLWqDt7VY7p7W3cf193HbdmyZSPnBQDYFCMXgr8
wyTu6+6SVG6rqU0k+n+m36d6ymNEAAJbHyDVNd07yqdU3dPfZSa5McvsFzAQAsHRGoumrSX5m9Q1V9RNJbpjknAXMBACwdEZOz52S5EVVdX6SM5LcIsn/yhRMpy9uNACA5TESTS9J8t0k/y3JEzO9RtNHkjyju7+9uNEAAJbHutHU3Z3kf89/AAAOSN47DgBggGgCABggmgAABogmAIABogkAYIBoAgAYIJoAAAaIJgCAAaIJAGCAaAIAGCCaAAAGiCYAgAGiCQBggGgCABggmgAABogmAIABogkAYIBoAgAYIJoAAAaIJgCAAaIJAGCAaAIAGCCaAAAGiCYAgAGiCQBggGgCABggmgAABogmAIABogkAYIBoAgAYIJoAAAaIJgCAAaIJAGCAaAIAGCCaAAAGiCYAgAGiCQBggGgCABggmgAABogmAIABogkAYIBoAgAYIJoAAAaIJgCAAaIJAGCAaAIAGCCaAAAGiCYAgAGiCQBggGgCABggmgAABogmAIABQ9FUVYdU1dOr6v9W1VVV9bWqetGihwMAWBaHDN7v1CS/lORZST6f5NZJ7rKgmQAAls660VRVD0rymCR36+6zFj8SAMDyGTk99+tJ3i+YAIAD2Ug03SPJF6rqZVV1aVVdUVVvqaqjFz0cAMCyGImmWyZ5fJKfTvLYJE9Icvckb62qWthkAABLZORC8Jr/PKK7L0qSqvp6kg9mujj8fT/ygKoTk5yYJLe5zW02bFgAgM0ystJ0cZJ/Wgmm2UeSfDe7+Q267t7W3cd193FbtmzZgDEBADbXSDR9LtNK01qV5OqNHQcAYDmNRNO7kvxUVR216rZ7Jzk0yacXMhUAwJIZiaZtSS5K8s6qOr6qTkjyV0ne290fWeh0AABLYt1o6u5LM13wfXGSNyR5eaaLvx+92NEAAJbH0NuodPcXkzxkwbMAACytoTfsBQA40IkmAIABogkAYIBoAgAYIJoAAAaIJgCAAaIJAGCAaAIAGCCaAAAGiCYAgAGiCQBggGgCABggmgAABogmAIABogkAYIBoAgAYIJoAAAaIJgCAAaIJAGCAaAIAGCCaAAAGiCYAgAGiCQBggGgCABggmgAABogmAIABogkAYIBoAgAYIJoAAAaIJgCAAaIJAGCAaAIAGCCaAAAGiCYAgAGiCQBggGgCABggmgAABogmAIABogkAYIBoAgAYIJoAAAaIJgCAAaIJAGCAaAIAGCCaAAAGiCYAgAGiCQBggGgCABggmgAABogmAIABogkAYIBoAgAYIJoAAAbscTRV1a2q6vKq6qq60SKGAgBYNtdmpekFSS7f6EEAAJbZHkVTVd07yYOS/MlixgEAWE6HjN6xqg5O8tIkf5Bk56IGAgBYRnuy0vTEJNdP8vIFzQIAsLSGoqmqjkzyh0me1N3fG7j/iVW1vaq279ixY29nBADYdKMrTc9J8vHuPn3kzt29rbuP6+7jtmzZcu2nAwBYEute01RVd03y60nuXVU3mW8+bP77iKr6fndfuaD5AACWwsiF4HdMcmiSj+1i29eSvDLJb27kUAAAy2Ykmj6S5L5rbntQkpOSPCTJlzd6KACAZbNuNHX3hUk+sPq2qto6f/jh7vZClwDAfs97zwEADLhW0dTdp3Z3WWUCAA4UVpoAAAaIJgCAAaIJAGCAaAIAGCCaAAAGiCYAgAGiCQBggGgCABggmgAABogmAIABogkAYIBoAgAYIJoAAAaIJgCAAaIJAGCAaAIAGCCaAAAGiCYAgAGiCQBggGgCABggmgAABogmAIABogkAYIBoAgAYIJoAAAaIJgCAAaIJAGCAaAIAGCCaAAAGiCYAgAGiCQBggGgCABggmgAABogmAIABogkAYIBoAgAYIJoAAAaIJgCAAaIJAGCAaAIAGCCaAAAGiCYAgAGiCQBggGgCABggmgAABogmAIABogkAYIBoAgAYIJoAAAaIJgCAAaIJAGCAaAIAGLBuNFXVo6rqHVV1XlV
dXlVnVtWv7YvhAACWxSED93lSkq8k+e9JLkzykCSvq6qjuvulixwOAGBZjETT8d194arP319VR2eKKdEEABwQ1j09tyaYVnwyydEbPw4AwHK6theC3zPJFzZyEACAZbbH0VRV90vyy0leeA33ObGqtlfV9h07duzFeAAAy2GPoqmqtiZ5XZK3d/epu7tfd2/r7uO6+7gtW7bs3YQAAEtgOJqq6mZJzkjy1SSPW9hEAABLaCiaquqwJO9Kcr0kD+vuKxY6FQDAkln3JQeq6pAkb0pyxyT36u4LFj4VAMCSGXmdpldkekHL301yZFUduWrbJ7v7qoVMBgCwREai6QHz3y/exbbbJjlnw6YBAFhS60ZTd2/dB3MAACy1a/vilgAABxTRBAAwQDQBAAwYuRAcgOuKk4/Y7Am4rjj5ks2e4DrHShMAwADRBAAwQDQBAAwQTQAAA0QTAMAA0QQAMEA0AQAMEE0AAANEEwDAANEEADBANAEADBBNAAADRBMAwADRBAAwQDQBAAwQTQAAA0QTAMAA0QQAMEA0AQAMEE0AAANEEwDAANEEADBANAEADBBNAAADRBMAwADRBAAwQDQBAAwQTQAAA0QTAMAA0QQAMEA0AQAMEE0AAANEEwDAANEEADBANAEADBBNAAADRBMAwADRBAAwQDQBAAwQTQAAA0QTAMAA0QQAMEA0AQAMEE0AAANEEwDAANEEADBANAEADBBNAAADhqKpqu5SVe+rqiuq6vyq+oOqOnjRwwEALItD1rtDVd00yXuTnJXkEUlun+SFmYLrfy50OgCAJbFuNCV5YpIbJnlkd1+a5D1VdXiSk6vq+fNtAAD7tZHTcw9O8ndr4ugNmULqFxcyFQDAkhmJpjsn+fzqG7r7X5JcMW8DANjvjUTTTZPs3MXtF8/bAAD2eyPXNO2xqjoxyYnzp5dX1dmLOA77naOSXLjZQyybet5mTwDXeX627MqzarMnWFbH7G7DSDRdnOSIXdx+03nbj+jubUm2DY0Gs6ra3t3HbfYcwP7FzxY2ysjpuc9nzbVLVXXrJIdlzbVOAAD7q5FoOiPJA6vqxqtue0ySK5N8cCFTAQAsmZFoOiXJVUneUlX3n69XOjnJn3qNJjaYU7rAIvjZwoao7l7/TlV3SfKyJPfM9Jt0f5Hk5O7+/kKnAwBYEkPRBABwoBt6w15YlKq6Q1X9WVV9pqq+X1Uf2OyZgP2DN5tnoy3kdZpgD9w1yUOSfDzJoZs8C7Cf8GbzLILTc2yqqjqou6+eP35zkqO6+z6bOxVwXVdVz0jytCTHrPzSUlU9LdMvMt3SLzJxbTg9x6ZaCSaADebN5tlwogmA/ZE3m2fDiSYA9kfebJ4NJ5oAAAaIJgD2R3v8ZvOwHtEEwP7Im82z4UQTAPsjbzbPhvPilmyqqjos04tbJsmtkhxeVb86f356d1+xOZMB13GnJPmdTG82/7wkt4s3m2cveXFLNlVVbU3yld1svm13n7PvpgH2J95sno0mmgAABrimCQBggGgCABggmgAABogmAIABogkAYIBoAgAYIJoAAAaIJgCAAaIJAGDA/wMVGdfHu8yOMAAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "print_proportion(train_df, treatment_column)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# Random Undersampling (finding the majority class and undersampling it)\n", "# https://www.kaggle.com/code/hughhuyton/criteo-uplift-modelling/notebook\n", "def random_under(df, feature):\n", " \n", " target = df[feature].value_counts()\n", " \n", " if target.values[0]" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "print_proportion(train_df, treatment_column)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# method to transfor data for LGWUM method, will be explained later\n", "def target_class_lgwum(df, treatment, target, column_name):\n", " \n", " #CN:\n", " df[column_name] = 0 \n", " #CR:\n", " df.loc[(df[treatment] == 0) & (df[target] != 0), column_name] = 1 \n", " #TN:\n", " df.loc[(df[treatment] != 0) & (df[target] == 0), column_name] = 2 \n", " #TR:\n", " df.loc[(df[treatment] != 0) & (df[target] != 0), column_name] = 3 \n", " return df\n", "\n", "response_column_lgwum = \"lqwum_response\"\n", "train_df = target_class_lgwum(train_df, treatment_column, response_column, response_column_lgwum)\n", "test_df = target_class_lgwum(test_df, treatment_column, response_column, response_column_lgwum)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Start H2O" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Checking whether there is an H2O instance running at http://localhost:54321 ..... 
not found.\n", "Attempting to start a local H2O server...\n", " Java Version: openjdk version \"1.8.0_342\"; OpenJDK Runtime Environment (build 1.8.0_342-8u342-b07-0ubuntu1~22.04-b07); OpenJDK 64-Bit Server VM (build 25.342-b07, mixed mode)\n", " Starting server from /home/kurkami/git/h2o/h2o-3/build/h2o.jar\n", " Ice root: /tmp/tmp8edjah_q\n", " JVM stdout: /tmp/tmp8edjah_q/h2o_kurkami_started_from_python.out\n", " JVM stderr: /tmp/tmp8edjah_q/h2o_kurkami_started_from_python.err\n", " Server is running at http://127.0.0.1:54321\n", "Connecting to H2O server at http://127.0.0.1:54321 ... successful.\n", "Warning: Version mismatch. H2O is version 3.38.0.99999, but the h2o-python package is version 3.37.0.99999. This is a developer build, please contact your developer.\n" ] }, { "data": { "text/html": [ "\n", " \n", "
\n", "<table>\n",
"<tr><td>H2O_cluster_uptime:</td><td>01 secs</td></tr>\n",
"<tr><td>H2O_cluster_timezone:</td><td>America/New_York</td></tr>\n",
"<tr><td>H2O_data_parsing_timezone:</td><td>UTC</td></tr>\n",
"<tr><td>H2O_cluster_version:</td><td>3.38.0.99999</td></tr>\n",
"<tr><td>H2O_cluster_version_age:</td><td>1 hour and 17 minutes</td></tr>\n",
"<tr><td>H2O_cluster_name:</td><td>H2O_from_python_kurkami_q2bo8y</td></tr>\n",
"<tr><td>H2O_cluster_total_nodes:</td><td>1</td></tr>\n",
"<tr><td>H2O_cluster_free_memory:</td><td>8.88 Gb</td></tr>\n",
"<tr><td>H2O_cluster_total_cores:</td><td>12</td></tr>\n",
"<tr><td>H2O_cluster_allowed_cores:</td><td>12</td></tr>\n",
"<tr><td>H2O_cluster_status:</td><td>locked, healthy</td></tr>\n",
"<tr><td>H2O_connection_url:</td><td>http://127.0.0.1:54321</td></tr>\n",
"<tr><td>H2O_connection_proxy:</td><td>{\"http\": null, \"https\": null}</td></tr>\n",
"<tr><td>H2O_internal_security:</td><td>False</td></tr>\n",
"<tr><td>Python_version:</td><td>3.10.4 final</td></tr>\n",
"</table>
\n" ], "text/plain": [ "-------------------------- ------------------------------\n", "H2O_cluster_uptime: 01 secs\n", "H2O_cluster_timezone: America/New_York\n", "H2O_data_parsing_timezone: UTC\n", "H2O_cluster_version: 3.38.0.99999\n", "H2O_cluster_version_age: 1 hour and 17 minutes\n", "H2O_cluster_name: H2O_from_python_kurkami_q2bo8y\n", "H2O_cluster_total_nodes: 1\n", "H2O_cluster_free_memory: 8.88 Gb\n", "H2O_cluster_total_cores: 12\n", "H2O_cluster_allowed_cores: 12\n", "H2O_cluster_status: locked, healthy\n", "H2O_connection_url: http://127.0.0.1:54321\n", "H2O_connection_proxy: {\"http\": null, \"https\": null}\n", "H2O_internal_security: False\n", "Python_version: 3.10.4 final\n", "-------------------------- ------------------------------" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "h2o.init(strict_version_check=False) # max_mem_size=10" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import data to H2O" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%\n", "Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%\n" ] } ], "source": [ "h2o_train_df = h2o.H2OFrame(train_df)\n", "del(train_df)\n", "\n", "h2o_train_df[treatment_column] = h2o_train_df[treatment_column].asfactor()\n", "h2o_train_df[response_column] = h2o_train_df[response_column].asfactor()\n", "h2o_train_df[response_column_lgwum] = h2o_train_df[response_column_lgwum].asfactor()\n", "h2o_train_df = h2o.assign(h2o_train_df, \"train_df\")\n", "\n", "h2o_test_df = h2o.H2OFrame(test_df)\n", "del(test_df)\n", "h2o_test_df[treatment_column] = h2o_test_df[treatment_column].asfactor()\n", "h2o_test_df[response_column] = h2o_test_df[response_column].asfactor()\n", "h2o_test_df[response_column_lgwum] = 
h2o_test_df[response_column_lgwum].asfactor()\n", "h2o_test_df = h2o.assign(h2o_test_df, \"test_df\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train H2O UpliftDRF model" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "upliftdrf Model Build progress: |████████████████████████████████████████████████| (done) 100%\n" ] }, { "data": { "text/html": [ "
H2OUpliftRandomForestEstimator : Uplift Distributed Random Forest\n",
       "Model Key: UpliftDRF_model_python_1665008325837_1\n",
       "
\n", "
\n", "<table>\n", "<caption>Model Summary: </caption>\n",
"<thead><tr><th></th><th>number_of_trees</th><th>number_of_internal_trees</th><th>model_size_in_bytes</th><th>min_depth</th><th>max_depth</th><th>mean_depth</th><th>min_leaves</th><th>max_leaves</th><th>mean_leaves</th></tr></thead>\n",
"<tbody><tr><td></td><td>20.0</td><td>40.0</td><td>175049.0</td><td>15.0</td><td>15.0</td><td>15.0</td><td>293.0</td><td>394.0</td><td>344.5</td></tr></tbody>\n",
"</table>
\n",
       "\n",
       "[tips]\n",
       "Use `model.show()` for more details.\n",
       "Use `model.explain()` to inspect the model.\n",
       "--\n",
       "Use `h2o.display.toggle_user_tips()` to switch on/off this section.
" ], "text/plain": [ "H2OUpliftRandomForestEstimator : Uplift Distributed Random Forest\n", "Model Key: UpliftDRF_model_python_1665008325837_1\n", "\n", "\n", "Model Summary: \n", " number_of_trees number_of_internal_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves\n", "-- ----------------- -------------------------- --------------------- ----------- ----------- ------------ ------------ ------------ -------------\n", " 20 40 175049 15 15 15 293 394 344.5\n", "\n", "[tips]\n", "Use `model.show()` for more details.\n", "Use `model.explain()` to inspect the model.\n", "--\n", "Use `h2o.display.toggle_user_tips()` to switch on/off this section." ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ntree = 20\n", "max_depth = 15\n", "metric=\"Euclidean\"\n", "\n", "h2o_uplift_model = H2OUpliftRandomForestEstimator(\n", " ntrees=ntree,\n", " max_depth=max_depth,\n", " min_rows=30,\n", " nbins=1000,\n", " sample_rate=0.80,\n", " score_each_iteration=True,\n", " treatment_column=treatment_column,\n", " uplift_metric=metric,\n", " auuc_nbins=1000,\n", " auuc_type=\"gain\",\n", " seed=42)\n", "\n", "h2o_uplift_model.train(y=response_column, x=feature_cols, training_frame=h2o_train_df)\n", "h2o_uplift_model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Predict and plot Uplift Score" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# Plot uplift score\n", "# source https://www.kaggle.com/code/hughhuyton/criteo-uplift-modelling/notebook\n", "def plot_uplift_score(uplift_score):\n", " plt.figure(figsize = (10,6))\n", " plt.xlim(-.05, .1)\n", " plt.hist(uplift_score, bins=1000, color=['#2077B4'])\n", " plt.xlabel('Uplift score')\n", " plt.ylabel('Number of observations in validation set')" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "upliftdrf 
prediction progress: |█████████████████████████████████████████████████| (done) 100%\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
uplift_predict p_y1_ct1 p_y1_ct0
0.0005053270.001459410.000954085
0.0005053270.001459410.000954085
0.0005053270.001459410.000954085
0.0328249 0.071748 0.0389231
0.0005053270.001459410.000954085
0.0007439870.001809470.00106548
0.0005053270.001459410.000954085
0.00146267 0.0345459 0.0330832
0.0005053270.001459410.000954085
0.0005053270.001459410.000954085
[2795919 rows x 3 columns]
" ], "text/plain": [ " uplift_predict p_y1_ct1 p_y1_ct0\n", "---------------- ---------- -----------\n", " 0.000505327 0.00145941 0.000954085\n", " 0.000505327 0.00145941 0.000954085\n", " 0.000505327 0.00145941 0.000954085\n", " 0.0328249 0.071748 0.0389231\n", " 0.000505327 0.00145941 0.000954085\n", " 0.000743987 0.00180947 0.00106548\n", " 0.000505327 0.00145941 0.000954085\n", " 0.00146267 0.0345459 0.0330832\n", " 0.000505327 0.00145941 0.000954085\n", " 0.000505327 0.00145941 0.000954085\n", "[2795919 rows x 3 columns]\n" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "h2o_uplift_pred = h2o_uplift_model.predict(h2o_test_df)\n", "h2o_uplift_pred" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plot_uplift_score(h2o_uplift_pred['uplift_predict'].as_data_frame().uplift_predict)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluate the model" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "perf_h2o = h2o_uplift_model.model_performance(h2o_test_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "### Area Under Uplift Curve (AUUC) calculation\n", "\n", "To calculate AUUC for big data, the predictions are binned to histograms. Due to this feature the results should be different compared to exact computation.\n", "\n", "To define AUUC, binned predictions are sorted from largest to smallest value. For every group the cumulative sum of observations statistic is calculated. The uplift is defined based on these statistics.\n", "\n", "\n", "#### Types of AUUC\n", "\n", "\n", "| AUUC type |                    Formula                    |\n", "|:----------:|:-------------------------------------------:|\n", "| **Qini** | $TY1 - CY1 * \\frac{T}{C}$ |\n", "| **Lift** | $\\frac{TY1}{T} - \\frac{CY1}{C}$ |\n", "| **Gain** | $(\\frac{TY1}{T} - \\frac{CY1}{C}) * (T + C)$ |\n", "\n", "\n", "Where:\n", "\n", "- **T** how many observations are in the treatment group (how many data rows in the bin have treatment_column label == 1)\n", "- **C** how many observations are in the control group (how many data rows in the bin have treatment_column label == 0)\n", "- **TY1** how many observations are in the treatment group and respond to the offer (how many data rows in the bin have treatment_column label == 1 and response_column label == 1)\n", "- **CY1** how many observations are in the control group and respond to the offer (how many data rows in the bin have treatment_column label == 0 and response_column label == 1)\n", "\n", "\n", "The resulting AUUC value is:\n", "\n", "- Not 
normalized.\n", "- The result can be a positive or a negative number.\n", "- A higher number means a better model.\n", "\n", "More information about normalization is in the **Normalized AUUC** section.\n", "\n", "\n", "For some observation groups the results can be NaN. In this case, the values for the NaN groups are linearly interpolated to calculate AUUC and to plot the uplift curve.\n" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", "
\n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
AUUC table (number of bins: 168): All types of AUUC value
uplift_typeqiniliftgain
AUUC value21585.48821570.025358625328.4067094
AUUC normalized0.88053300.02535860.8782349
AUUC random value15585.37521520.006558018335.7289893
\n", "
\n" ], "text/plain": [ "AUUC table (number of bins: 168): All types of AUUC value\n", "uplift_type qini lift gain\n", "----------------- -------- ---------- --------\n", "AUUC value 21585.5 0.0253586 25328.4\n", "AUUC normalized 0.880533 0.0253586 0.878235\n", "AUUC random value 15585.4 0.00655803 18335.7" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "perf_h2o.auuc_table()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cumulative Uplift curve plot\n", "\n", "To plot the uplift curve, use the ``plot_uplift`` method. Its ``metric`` parameter can be ``\"qini\"``, ``\"gain\"``, or ``\"lift\"``. The most popular is the Qini uplift curve, which is similar to the ROC curve. The Gain and Lift curves are also known from traditional binomial models. \n", "\n", "Based on these curves, you can decide how many observations (for example, customers) from the test dataset should receive an offer to get the optimal gain." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "perf_h2o.plot_uplift(metric=\"qini\")" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "perf_h2o.plot_uplift(metric=\"gain\")" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "perf_h2o.plot_uplift(metric=\"lift\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Qini value and Average Excess Cumulative Uplift (AECU)\n", "\n", "Qini value is calculated as the difference between the Qini AUUC and area under the random uplift curve (random AUUC). The random AUUC is computed as diagonal from zero to overall gain uplift. \n", "\n", "The Qini value can be generalized for all AUUC metric types. So AECU for Qini metric is the same as Qini value, but the AECU can be also calculated for Gain and Lift metric type. These values are stored in ``aecu_table``.\n" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", "
\n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", "\n", "\n", "\n", "
AECU values table: All types of AECU value
uplift_typeqiniliftgain
AECU value6000.11300050.01880056992.6777202
\n", "
\n" ], "text/plain": [ "AECU values table: All types of AECU value\n", "uplift_type qini lift gain\n", "------------- ------- --------- -------\n", "AECU value 6000.11 0.0188005 6992.68" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "perf_h2o.aecu_table()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Normalized AUUC\n", "\n", "To get the normalized AUUC, call the ``auuc_normalized`` method. The normalized AUUC is calculated from uplift values that are divided by the uplift value at the maximal number of treated observations. For example, the uplift values [10, 20, 30] are normalized to [1/3, 2/3, 1]. If this maximal value is negative, its absolute value is used as the normalization factor. The normalized AUUC can therefore also be negative or positive and can lie outside the (0, 1) interval. The normalized AUUC for ``auuc_type=\"lift\"`` is not defined, so the normalized AUUC equals the AUUC in this case. For the same reason, ``plot_uplift`` with ``metric=\"lift\"`` produces the same plot for ``normalize=False`` and ``normalize=True``.\n" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "perf_h2o.plot_uplift(metric=\"gain\", normalize=True)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.8782349075050278" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "perf_h2o.auuc_normalized()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Scoring histrory and importance of number of trees\n", "\n", "To speed up the calculation of AUUC, the predictions are binned into quantile histograms. To calculate precision AUUC the more bins the better. The more trees usually produce more various predictions and then the algorithm creates histograms with more bins. So the algorithm needs more iterations to get meaningful AUUC results. \n", "You can see in the scoring history table the number of bins as well as the result AUUC. There is also Qini value parameter, which reflects the number of bins and then is a better pointer of the model improvement. In the scoring history table below you can see the algorithm stabilized after building 6 trees. But it depends on data and model settings on how many trees are necessary." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
timestampdurationnumber_of_treestraining_auuc_nbinstraining_auuctraining_auuc_normalizedtraining_qini_value
02022-10-05 18:19:390.030 sec0.00NaNNaNNaN
12022-10-05 18:19:412.314 sec1.01929533.0388040.833748128.764676
22022-10-05 18:19:478.248 sec2.04226565.1145230.749961657.003459
32022-10-05 18:19:5314.429 sec3.07225346.4913620.7155581373.757472
42022-10-05 18:20:0021.083 sec4.011225200.1519040.7114272198.512852
52022-10-05 18:20:0627.502 sec5.016125482.4700080.7193972929.990639
62022-10-05 18:20:1334.462 sec6.022825976.9637490.7333573551.194616
72022-10-05 18:20:2141.969 sec7.030526589.2084580.7506414101.071634
82022-10-05 18:20:2950.484 sec8.038827380.9731230.7729934660.496077
92022-10-05 18:20:3858.928 sec9.047927672.1567770.7812144908.823402
102022-10-05 18:20:471 min 8.312 sec10.056128108.4987830.7935325200.407577
112022-10-05 18:20:571 min 18.075 sec11.063328493.1049290.8043905441.356926
122022-10-05 18:21:071 min 28.003 sec12.071428828.6974600.8138645641.677461
132022-10-05 18:21:171 min 38.369 sec13.077329066.7877520.8205865783.946316
142022-10-05 18:21:281 min 48.844 sec14.082929181.5610070.8238265855.454094
152022-10-05 18:21:391 min 59.705 sec15.088429331.7346390.8280655941.529970
162022-10-05 18:21:502 min 10.810 sec16.092329380.7531710.8294495973.314654
172022-10-05 18:22:012 min 22.097 sec17.094229485.8605950.8324176031.958517
182022-10-05 18:22:122 min 33.476 sec18.095429638.0005890.8367126112.091663
192022-10-05 18:22:242 min 45.046 sec19.096829690.0000050.8381806139.355404
202022-10-05 18:22:362 min 57.092 sec20.097829708.2202360.8386946150.467978
\n", "
" ], "text/plain": [ " timestamp duration number_of_trees \\\n", "0 2022-10-05 18:19:39 0.030 sec 0.0 \n", "1 2022-10-05 18:19:41 2.314 sec 1.0 \n", "2 2022-10-05 18:19:47 8.248 sec 2.0 \n", "3 2022-10-05 18:19:53 14.429 sec 3.0 \n", "4 2022-10-05 18:20:00 21.083 sec 4.0 \n", "5 2022-10-05 18:20:06 27.502 sec 5.0 \n", "6 2022-10-05 18:20:13 34.462 sec 6.0 \n", "7 2022-10-05 18:20:21 41.969 sec 7.0 \n", "8 2022-10-05 18:20:29 50.484 sec 8.0 \n", "9 2022-10-05 18:20:38 58.928 sec 9.0 \n", "10 2022-10-05 18:20:47 1 min 8.312 sec 10.0 \n", "11 2022-10-05 18:20:57 1 min 18.075 sec 11.0 \n", "12 2022-10-05 18:21:07 1 min 28.003 sec 12.0 \n", "13 2022-10-05 18:21:17 1 min 38.369 sec 13.0 \n", "14 2022-10-05 18:21:28 1 min 48.844 sec 14.0 \n", "15 2022-10-05 18:21:39 1 min 59.705 sec 15.0 \n", "16 2022-10-05 18:21:50 2 min 10.810 sec 16.0 \n", "17 2022-10-05 18:22:01 2 min 22.097 sec 17.0 \n", "18 2022-10-05 18:22:12 2 min 33.476 sec 18.0 \n", "19 2022-10-05 18:22:24 2 min 45.046 sec 19.0 \n", "20 2022-10-05 18:22:36 2 min 57.092 sec 20.0 \n", "\n", " training_auuc_nbins training_auuc training_auuc_normalized \\\n", "0 0 NaN NaN \n", "1 19 29533.038804 0.833748 \n", "2 42 26565.114523 0.749961 \n", "3 72 25346.491362 0.715558 \n", "4 112 25200.151904 0.711427 \n", "5 161 25482.470008 0.719397 \n", "6 228 25976.963749 0.733357 \n", "7 305 26589.208458 0.750641 \n", "8 388 27380.973123 0.772993 \n", "9 479 27672.156777 0.781214 \n", "10 561 28108.498783 0.793532 \n", "11 633 28493.104929 0.804390 \n", "12 714 28828.697460 0.813864 \n", "13 773 29066.787752 0.820586 \n", "14 829 29181.561007 0.823826 \n", "15 884 29331.734639 0.828065 \n", "16 923 29380.753171 0.829449 \n", "17 942 29485.860595 0.832417 \n", "18 954 29638.000589 0.836712 \n", "19 968 29690.000005 0.838180 \n", "20 978 29708.220236 0.838694 \n", "\n", " training_qini_value \n", "0 NaN \n", "1 128.764676 \n", "2 657.003459 \n", "3 1373.757472 \n", "4 2198.512852 \n", "5 2929.990639 \n", "6 3551.194616 \n", "7 
4101.071634 \n", "8 4660.496077 \n", "9 4908.823402 \n", "10 5200.407577 \n", "11 5441.356926 \n", "12 5641.677461 \n", "13 5783.946316 \n", "14 5855.454094 \n", "15 5941.529970 \n", "16 5973.314654 \n", "17 6031.958517 \n", "18 6112.091663 \n", "19 6139.355404 \n", "20 6150.467978 " ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "h2o_uplift_model.scoring_history()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comparison of the Tree-based Approach and Generalized Weighted Uplift (LGWUM)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "LGWUM (Kane et al., 2014) is one of several methods available for uplift modeling. It uses an approach better known as Class Variable Transformation. LGWUM assumes that positive uplift lies in treating treatment-group responders (TR) and control-group non-responders (CN), whilst avoiding treatment-group non-responders (TN) and control-group responders (CR). This can be expressed as:\n", "\n", "$Uplift_{LGWUM} = \\frac{P(TR)}{P(T)} + \\frac{P(CN)}{P(C)} - \\frac{P(TN)}{P(T)} - \\frac{P(CR)}{P(C)}$\n", "\n", "source: https://www.kaggle.com/code/hughhuyton/criteo-uplift-modelling/notebook" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "gbm Model Build progress: |██████████████████████████████████████████████████████| (done) 100%\n" ] }, { "data": { "text/html": [ "
H2OGradientBoostingEstimator : Gradient Boosting Machine\n",
       "Model Key: GBM_model_python_1665008325837_24\n",
       "
\n", "
\n", " \n", "
\n", " \n", " \n", " \n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", " \n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Model Summary:
number_of_treesnumber_of_internal_treesmodel_size_in_bytesmin_depthmax_depthmean_depthmin_leavesmax_leavesmean_leaves
20.080.04011303.015.015.015.01341.06950.03990.975
\n", "
\n", "
\n",
       "\n",
       "[tips]\n",
       "Use `model.show()` for more details.\n",
       "Use `model.explain()` to inspect the model.\n",
       "--\n",
       "Use `h2o.display.toggle_user_tips()` to switch on/off this section.
" ], "text/plain": [ "H2OGradientBoostingEstimator : Gradient Boosting Machine\n", "Model Key: GBM_model_python_1665008325837_24\n", "\n", "\n", "Model Summary: \n", " number_of_trees number_of_internal_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves\n", "-- ----------------- -------------------------- --------------------- ----------- ----------- ------------ ------------ ------------ -------------\n", " 20 80 4.0113e+06 15 15 15 1341 6950 3990.97\n", "\n", "[tips]\n", "Use `model.show()` for more details.\n", "Use `model.explain()` to inspect the model.\n", "--\n", "Use `h2o.display.toggle_user_tips()` to switch on/off this section." ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from h2o.estimators.gbm import H2OGradientBoostingEstimator\n", "\n", "h2o_gbm_lgwum = H2OGradientBoostingEstimator(ntrees=ntree,\n", " max_depth=max_depth,\n", " min_rows=30,\n", " nbins=1000,\n", " score_each_iteration=False,\n", " seed=42)\n", "\n", "h2o_gbm_lgwum.train(y=response_column_lgwum, x=feature_cols, training_frame=h2o_train_df)\n", "h2o_gbm_lgwum" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
predictp_cnp_crp_tnp_truplift_score
000.4631860.0390680.4586990.0390470.001324
100.4614040.0398720.4590650.039659-0.000040
200.4619670.0392860.4594310.0393150.000904
320.3947160.0582610.4895730.057449-0.047193
400.4635720.0390380.4583460.0390440.001654
.....................
279591420.2724730.2434910.2807080.203327-0.103698
279591520.3740560.0835780.4333720.1089940.036659
279591600.4631080.0390590.4586210.0392120.001971
279591700.4486330.0561530.4432690.051945-0.012692
279591800.4631350.0390640.4586490.0391520.001727
\n", "

2795919 rows × 6 columns

\n", "
" ], "text/plain": [ " predict p_cn p_cr p_tn p_tr uplift_score\n", "0 0 0.463186 0.039068 0.458699 0.039047 0.001324\n", "1 0 0.461404 0.039872 0.459065 0.039659 -0.000040\n", "2 0 0.461967 0.039286 0.459431 0.039315 0.000904\n", "3 2 0.394716 0.058261 0.489573 0.057449 -0.047193\n", "4 0 0.463572 0.039038 0.458346 0.039044 0.001654\n", "... ... ... ... ... ... ...\n", "2795914 2 0.272473 0.243491 0.280708 0.203327 -0.103698\n", "2795915 2 0.374056 0.083578 0.433372 0.108994 0.036659\n", "2795916 0 0.463108 0.039059 0.458621 0.039212 0.001971\n", "2795917 0 0.448633 0.056153 0.443269 0.051945 -0.012692\n", "2795918 0 0.463135 0.039064 0.458649 0.039152 0.001727\n", "\n", "[2795919 rows x 6 columns]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "uplift_predict_lgwum = h2o_gbm_lgwum.predict(h2o_test_df)\n", "\n", "result = uplift_predict_lgwum.as_data_frame()\n", "result.columns = ['predict', 'p_cn', 'p_cr', 'p_tn', 'p_tr']\n", "result['uplift_score'] = result.eval('\\\n", " p_cn/(p_cn + p_cr) \\\n", " + p_tr/(p_tn + p_tr) \\\n", " - p_tn/(p_tn + p_tr) \\\n", " - p_cr/(p_cn + p_cr)')\n", "result" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plot_uplift_score(result.uplift_score)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%\n" ] }, { "data": { "text/html": [ "
ModelMetricsBinomialUplift: \n",
       "** Reported on test data. **\n",
       "\n",
       "AUUC: 20878.6347962265\n",
       "AUUC normalized: 0.7239439144140724
\n", "
\n", " \n", "
\n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
AUUC table (number of bins: 79): All types of AUUC value
uplift_typeqiniliftgain
AUUC value17782.56991260.024070320878.6347962
AUUC normalized0.72540120.02407030.7239439
AUUC random value12420.20480840.005226214612.0004307
\n", "
\n", "
\n", "
Qini value: 5362.365104164344
\n", "
\n", " \n", "
\n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", "\n", "\n", "\n", "
AECU values table: All types of AECU value
uplift_typeqiniliftgain
AECU value5362.36510420.01884416266.6343655
\n", "
\n", "
" ], "text/plain": [ "ModelMetricsBinomialUplift: \n", "** Reported on test data. **\n", "\n", "AUUC: 20878.6347962265\n", "AUUC normalized: 0.7239439144140724\n", "\n", "AUUC table (number of bins: 79): All types of AUUC value\n", "uplift_type qini lift gain\n", "----------------- -------- ---------- --------\n", "AUUC value 17782.6 0.0240703 20878.6\n", "AUUC normalized 0.725401 0.0240703 0.723944\n", "AUUC random value 12420.2 0.00522619 14612\n", "\n", "Qini value: 5362.365104164344\n", "\n", "AECU values table: All types of AECU value\n", "uplift_type qini lift gain\n", "------------- ------- --------- -------\n", "AECU value 5362.37 0.0188441 6266.63" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lgwum_predict = h2o.H2OFrame(result['uplift_score'].tolist())\n", "perf_lgwum = h2o.make_metrics(lgwum_predict, h2o_test_df[response_column], treatment=h2o_test_df[treatment_column], auuc_type=\"gain\", auuc_nbins=81)\n", "perf_lgwum" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "perf_h2o.plot_uplift(metric=\"qini\")" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "perf_lgwum.plot_uplift(metric=\"qini\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Conclusion\n", "From the Qini curves, you can see the Uplift DRF algorithm performs better than the LGWUM algorithm. The main reason is, that the split in Uplift DRF can be more precious thanks to information about both treatment and control groups." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Uplift trees modeling sources:\n", "\n", "- N. J. Radcliffe, and P. D. Surry, “Real-World Uplift Modelling withSignificance-Based Uplift Trees”, Stochastic Solutions White Paper, 2011.\n", "\n", "- P. D. Surry, and N. J. Radcliffe, “Quality measures for uplift models”, 2011.\n", "\n", "## References\n", "\n", "- P. Rzepakowski, and S. Jaroszewicz, “Decision trees for uplift modeling with single and multiple treatments”, 2012.\n", "\n", "- Hugh Huyton, “Criteo Uplift Modelling“, 2021, https://www.kaggle.com/code/hughhuyton/criteo-uplift-modelling/notebook.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.4" } }, "nbformat": 4, "nbformat_minor": 4 }