{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 10 - Cost-Sensitive Churn\n", "\n", "[paper](http://download.springer.com/static/pdf/125/art%253A10.1186%252Fs40165-015-0014-6.pdf?originUrl=http%3A%2F%2Fdecisionanalyticsjournal.springeropen.com%2Farticle%2F10.1186%2Fs40165-015-0014-6&token2=exp=1462974790~acl=%2Fstatic%2Fpdf%2F125%2Fart%25253A10.1186%25252Fs40165-015-0014-6.pdf*~hmac=05041d990b7e5a5e70d6efc1fbb29c2a380465c6edc84be1feda1b6d49588a1a)\n", "[slides](http://www.slideshare.net/albahnsen/maximizing-a-churn-campaigns-profitability-with-cost-sensitive-predictive-analytics)\n", "\n", "Customer churn predictive modeling deals with predicting the probability of a customer defecting \n", "using historical, behavioral and socio-economic information. This tool is of great benefit to \n", "subscription-based companies, allowing them to maximize the results of retention campaigns. The \n", "problem of churn predictive modeling has been widely studied by the data mining and machine learning\n", "communities. It is usually tackled by using classification algorithms in order to learn the \n", "different patterns of both the churners and non-churners. Nevertheless, current state-of-the-art \n", "classification algorithms are not well aligned with commercial goals, in the sense that the models \n", "fail to include the real financial costs and benefits during the training and evaluation phases. 
In \n", "the case of churn, evaluating a model based on a traditional measure such as accuracy or predictive \n", "power does not yield the best results when measured by the actual financial cost, i.e., the \n", "investment per subscriber in a loyalty campaign and the financial impact of failing to detect a \n", "real churner versus wrongly predicting a non-churner as a churner.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The two main objectives of subscription-based companies are to acquire new subscribers and \n", "retain those they already have, mainly because profits are directly linked with the number of \n", "subscribers. In order to maximize the profit, companies must increase the customer base by \n", "increasing sales while decreasing the number of churners. Furthermore, it is common knowledge \n", "that retaining a customer is about five times less expensive than acquiring a new one, which creates pressure to have better and more effective churn campaigns.\n", "\n", "A typical churn campaign consists of identifying which customers in the current customer base are \n", "more likely to leave the company, and making them an offer in order to avoid that behavior.\n", "With this in mind, companies use intelligence to create and improve retention and collection\n", "strategies. In the first case, this usually implies an offer that can be either a discount or a \n", "free upgrade for a certain span of time. In both cases the company has to assume a cost for that \n", "offer; therefore, accurate prediction of the churners becomes important. The logic of this flow is \n", "shown in the following figure." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![fig1](../notebooks/images/ch5_fig1.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The churn campaign process starts with the sales that increase the customer base every month. \n", "However, every month there is also a group of customers that decide to leave the company for many \n", "reasons. The objective of a churn model is to identify those customers before they make the \n", "decision to defect.\n", "\n", "Using a churn model, those customers more likely to leave are predicted as churners and \n", "an offer is made in order to retain them. However, it is known that not all customers will accept \n", "the offer: when a customer is planning to defect, it is possible that the offer is not \n", "good enough to retain him or that the reason for defecting cannot be influenced by an offer.\n", "Using historical information, it is estimated that a customer will accept the offer with \n", "probability $\\gamma$.\n", "On the other hand, there is the case in which the churn model misclassifies a non-churner as a \n", "churner, also known as a false positive. In that case the customer will always accept the offer, which \n", "means an additional cost to the company, since those misclassified customers do not have any \n", "intention of leaving.\n", "\n", "In the case where the churn model predicts customers as non-churners, there is also the possibility \n", "of a misclassification: an actual churner is predicted as a non-churner. Since \n", "these customers do not receive an offer, they will leave the company; these cases are known as \n", "false negatives. Lastly, there is the case where the customers are actually non-churners; then \n", "there is no need to make a retention offer to these customers since they will continue to be part \n", "of the customer base.\n", "\n", "It can be seen that a churn campaign (or churn model) has three main points. 
First, avoid false \n", "positives, since there is a financial cost of making an offer where it is not needed. Second, find \n", "the right offer to give to those customers identified as churners. And lastly, decrease \n", "the number of false negatives." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the following figure, the financial impact of a churn model is shown. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![fig2](../notebooks/images/ch5_fig2.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we take \n", "into account the costs and not the profit in each case.\n", "When a customer is predicted to be a churner, an offer is made with the objective of avoiding \n", "the customer defecting. However, if a customer is actually a churner, he may or may not accept the \n", "offer, accepting it with probability $\\gamma_i$. If the customer accepts the offer, the financial impact is \n", "equal to the cost of the offer ($C_{o_i}$) plus the administrative cost of contacting the \n", "customer ($C_a$). On the other hand, if the customer declines the offer, the cost is the \n", "expected income that the client would otherwise generate, also called customer lifetime value \n", "($CLV_i$), plus $C_a$. Lastly, if the customer is not actually a churner, he will be happy to \n", "accept the offer and the cost will be $C_{o_i}$ plus $C_a$.\n", "\n", "In the case that the customer is predicted as a non-churner, there are two possible outcomes. \n", "Either the customer is not a churner, in which case the cost is zero, or the customer is a churner and the \n", "cost is $CLV_i$. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "| \t| Actual Positive ($y_i=1$) \t| Actual Negative \t($y_i=0$)|\n", "|---\t|:-:\t|:-:\t|\n", "| Predicted Positive ($c_i=1$)\t| $C_{TP_i}=\\gamma_iC_{o_i}+(1-\\gamma_i)(CLV_i+C_a)$\t| $C_{FP_i}=C_{o_i}+C_a$ |\n", "| Predicted Negative ($c_i=0$) \t| $C_{FN_i}=CLV_i$\t| $C_{TN_i}=0$\t|" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import zipfile\n", "with zipfile.ZipFile('../datasets/cost_sensitive_classification_churn.csv.zip', 'r') as z:\n", " f = z.open('cost_sensitive_classification_churn.csv')\n", " data = pd.read_csv(f, index_col=0)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
 "<table>\n",
 "  <thead>\n",
 "    <tr><th></th><th>x1</th><th>x2</th><th>x3</th><th>x4</th><th>x5</th><th>x6</th><th>x7</th><th>x8</th><th>x9</th><th>x10</th><th>...</th><th>x42</th><th>x43</th><th>x44</th><th>x45</th><th>x46</th><th>C_FP</th><th>C_FN</th><th>C_TP</th><th>C_TN</th><th>target</th></tr>\n",
 "    <tr><th>id</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
 "  </thead>\n",
 "  <tbody>\n",
 "    <tr><th>0</th><td>0.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>0.0</td><td>1.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>...</td><td>1.0</td><td>1.0</td><td>5.0</td><td>2.0</td><td>2.0</td><td>74.000000</td><td>1028.571429</td><td>121.828571</td><td>0.0</td><td>0.0</td></tr>\n",
 "    <tr><th>1</th><td>0.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>0.0</td><td>1.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>...</td><td>3.0</td><td>1.0</td><td>5.0</td><td>2.0</td><td>4.0</td><td>53.428571</td><td>1028.571429</td><td>82.742857</td><td>0.0</td><td>0.0</td></tr>\n",
 "    <tr><th>2</th><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>0.0</td><td>1.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>...</td><td>1.0</td><td>8.0</td><td>3.0</td><td>1.0</td><td>4.0</td><td>66.285714</td><td>1285.714286</td><td>102.928571</td><td>0.0</td><td>0.0</td></tr>\n",
 "    <tr><th>3</th><td>1.0</td><td>1.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>...</td><td>1.0</td><td>8.0</td><td>4.0</td><td>3.0</td><td>2.0</td><td>92.000000</td><td>1285.714286</td><td>151.785714</td><td>0.0</td><td>0.0</td></tr>\n",
 "    <tr><th>4</th><td>0.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>0.0</td><td>1.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>...</td><td>1.0</td><td>7.0</td><td>5.0</td><td>2.0</td><td>4.0</td><td>53.428571</td><td>1028.571429</td><td>82.742857</td><td>0.0</td><td>0.0</td></tr>\n",
 "  </tbody>\n",
 "</table>\n",
 "<p>5 rows × 51 columns</p>\n",
 "</div>" ], "text/plain": [ " x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 ... x42 x43 x44 \\\n", "id ... \n", "0 0.0 1.0 1.0 1.0 0.0 1.0 1.0 0.0 0.0 1.0 ... 1.0 1.0 5.0 \n", "1 0.0 1.0 1.0 1.0 0.0 1.0 1.0 0.0 0.0 1.0 ... 3.0 1.0 5.0 \n", "2 1.0 1.0 1.0 1.0 0.0 1.0 1.0 0.0 0.0 1.0 ... 1.0 8.0 3.0 \n", "3 1.0 1.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 ... 1.0 8.0 4.0 \n", "4 0.0 1.0 1.0 1.0 0.0 1.0 1.0 0.0 0.0 1.0 ... 1.0 7.0 5.0 \n", "\n", " x45 x46 C_FP C_FN C_TP C_TN target \n", "id \n", "0 2.0 2.0 74.000000 1028.571429 121.828571 0.0 0.0 \n", "1 2.0 4.0 53.428571 1028.571429 82.742857 0.0 0.0 \n", "2 1.0 4.0 66.285714 1285.714286 102.928571 0.0 0.0 \n", "3 3.0 2.0 92.000000 1285.714286 151.785714 0.0 0.0 \n", "4 2.0 4.0 53.428571 1028.571429 82.742857 0.0 0.0 \n", "\n", "[5 rows x 51 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0 0.952127\n", "1.0 0.047873\n", "Name: target, dtype: float64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.target.value_counts(normalize=True)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Features x1..x46, the churn label, and the per-example cost matrix\n", "# with columns in the order [C_FP, C_FN, C_TP, C_TN]\n", "X = data[['x' + str(i) for i in range(1, 47)]]\n", "y = data.target\n", "cost_mat = data[['C_FP','C_FN','C_TP','C_TN']].values" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection\n", "from sklearn.model_selection import train_test_split\n", "# Split features, labels and costs in one call so the rows stay aligned\n", "temp = train_test_split(X, y, cost_mat, test_size=0.33, random_state=42)\n", "X_train, X_test, y_train, y_test, cost_mat_train, cost_mat_test = temp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 10.1\n", "\n", "\n", "* Train 4 different models to predict target (Churn) using x1-x46 as features\n", "1. Logistic Regression\n", "2. Ensemble\n", "3. Under-sampling LR\n", "4. Under-sampling Ensemble" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 10.2\n", "\n", "* Calculate the savings of the different models\n", "* Compare the F1 score and the savings" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 10.3\n", "\n", "Using the probabilities of each model, estimate a Bayes minimum risk (BMR) classifier\n", "\n", "Compare the savings and the F1 score" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 10.4\n", "\n", "Estimate a CostSensitiveDecisionTreeClassifier" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 1 }