{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", " \n", "## [mlcourse.ai](mlcourse.ai) – Open Machine Learning Course \n", "###
Author: Alexander Nichiporenko, @AlexNich\n", " \n", "##
Predicting which customers will buy car insurance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 1. Feature and data explanation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Many of us have probably received a call from a company trying to sell us something. Typical examples:\n", "\n", "* You use a credit card, and the bank calls you with an offer of a loan;\n", "* You bought auto insurance, and the insurance company calls and offers you other types of insurance;\n", "* You have been with a mobile operator for a long time, and it calls you with a proposal to switch to a new, supposedly better value (oddly enough, more expensive) tariff;\n", "* You bought something from an online store, and after a while the store calls you to sell you another item;\n", "* Any situation related to the purchase of a new service, an additional service, or a more expensive service.\n", "\n", "In most cases, the client declines such offers because they simply do not need them. Calling the entire customer base is therefore slow and inefficient, so companies try to contact only those who are likely to accept the proposal. How can such customers be found? One approach is as follows:\n", "\n", "* Call a random sample of clients and record the results;\n", "* Find the customers in the remaining base who are most similar to those who accepted the offer;\n", "* Call these customers, thereby increasing the effectiveness of contacts.\n", "\n", "We will solve a similar problem. We have a dataset from a bank in the United States. Besides its usual services, this bank also provides car insurance. The bank organizes regular campaigns to attract new clients. The bank has data on potential customers, and its employees call them to advertise the available car insurance options. We are provided with general information about clients (age, job, etc.) 
as well as more specific information about the current insurance sales campaign (communication type, last contact day) and previous campaigns (attributes such as previous attempts and outcome). The task is to predict whether a customer will buy car insurance." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#import libraries\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns\n", "from matplotlib import pyplot as plt\n", "from sklearn.preprocessing import OneHotEncoder, StandardScaler\n", "from sklearn.pipeline import make_pipeline\n", "from sklearn.model_selection import cross_val_score, TimeSeriesSplit, GridSearchCV, train_test_split, KFold, learning_curve, validation_curve\n", "from sklearn.metrics import accuracy_score,classification_report,f1_score,roc_auc_score,roc_curve,precision_recall_curve\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.ensemble import RandomForestClassifier\n", "from xgboost import XGBClassifier\n", "plt.rcParams['figure.figsize'] = (20,20)\n", "#sns.set(style=\"darkgrid\");\n", "%matplotlib inline\n", "pd.options.display.max_columns=500" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at our dataset. You can download it here: https://www.kaggle.com/kondla/carinsurance" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = pd.read_csv('carInsurance_train.csv',index_col='Id')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have 4000 customers with 17 features." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our target variable is **'CarInsurance'**, which is binary (1/0). \"1\" means that the customer accepted the offer, \"0\" means that they did not.\n", "\n", "Overview of the eighteen columns (**Id** plus the seventeen features):\n", "\n", "- **Id** - Unique ID number;\n", "- **Age** - Age of the client;\n", "- **Job** - Job of the client. \"admin.\", \"blue-collar\", etc.\n", "- **Marital** - Marital status of the client \"divorced\", \"married\", \"single\";\n", "- **Education** - Education level of the client \"primary\", \"secondary\", etc.\n", "- **Default** - Has credit in default? \"yes\" - 1,\"no\" - 0\n", "- **Balance** - Average yearly balance, in USD\n", "- **HHInsurance** - Is the household insured? \"yes\" - 1,\"no\" - 0\n", "- **CarLoan** - Does the client have a car loan? \"yes\" - 1,\"no\" - 0\n", "- **Communication** - Contact communication type \"cellular\", \"telephone\", “NA”\n", "- **LastContactMonth** - Month of the last contact \"jan\", \"feb\", etc.\n", "- **LastContactDay** - Day of the last contact\n", "- **CallStart** - Start time of the last call (HH:MM:SS), e.g. 12:43:15\n", "- **CallEnd** - End time of the last call (HH:MM:SS), e.g. 12:43:15\n", "- **NoOfContacts** - Number of contacts performed during this campaign for this client; \n", "- **DaysPassed** - Number of days that passed after the client was last contacted in a previous campaign (numeric; -1 means the client was not previously contacted) \n", "- **PrevAttempts** - Number of contacts performed before this campaign for this client \n", "- **Outcome** - Outcome of the previous marketing campaign \"failure\", \"other\", \"success\", “NA”." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 2. Primary data analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's examine our data for missing values and outliers." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.info()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#divide features into categorical and numerical\n", "\n", "data['Default']=data['Default'].astype('object')\n", "data['HHInsurance']=data['HHInsurance'].astype('object')\n", "data['CarLoan']=data['CarLoan'].astype('object')\n", "data['LastContactDay']=data['LastContactDay'].astype('object')\n", "\n", "cat = []\n", "num = []\n", "for feature in data.drop(columns=['CarInsurance']).columns:\n", "    if data[feature].dtype == object:\n", "        cat.append(feature)\n", "    else:\n", "        num.append(feature)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print ('Number of categorical features:',len(cat))\n", "print ('Number of numerical features:',len(num))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data[data['Job'].isnull()].head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data[data['Education'].isnull()].head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data[data['Communication'].isnull()].head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data[data['Outcome'].isnull()].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, the dataset has some missing values: \n", "* Job and Education may be missing because customers didn't provide this information;\n", "* Communication may be missing because the bank didn't record the communication type;\n", "* Outcome has missing values because some customers haven't been offered anything before, so there is no outcome.\n", "\n", "We will fill the **NaNs** later." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.describe(include = ['object'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some values seem suspicious and may be outliers:\n", "\n", "* **max Age = 95 years**. A real survivor!\n", "* **max Balance = 98 417 USD**, while the mean is **1532 USD** and the 75th percentile is **1619 USD**. Maybe this person is very rich? Such a long tail is typical of income distributions.\n", "* **min Balance = -3058 USD**. Maybe this person spent all the credit money and did not repay it?\n", "* **max NoOfContacts = 43**. Did the bank really offer insurance to one person so many times within this campaign? It would be interesting to know whether they agreed.\n", "* **max DaysPassed = 854**. The bank has not called someone for more than two years?\n", "* **max PrevAttempts = 58**, while the mean is 0.72. \n", "\n", "Let's look at the rows with these strange values." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data[data['Age']==95].head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data[data['Balance']==98417].head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data[data['Balance']==-3058].head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data[data['DaysPassed']==854].head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data[data['NoOfContacts']==43].head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data[data['PrevAttempts']==58].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at these rows, we cannot say that the data definitely contains errors. 
Perhaps everything is correct. Later we will visualize the data and decide what to do with the suspicious values." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see the share of customers who bought car insurance." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data['CarInsurance'].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**40%** isn't bad! But I think the bank wants **100%**, so it calls customers several times. In ML terms, we can say that our two classes are fairly balanced." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's examine the influence of our features on the target variable. First, the numerical features." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.groupby(by=['CarInsurance'])[['Age']].agg(['mean','std','min','max'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.groupby(by=['CarInsurance'])[['NoOfContacts']].agg(['mean','std','min','max'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.groupby(by=['CarInsurance'])[['DaysPassed']].agg(['mean','std','min','max'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.groupby(by=['CarInsurance'])[['PrevAttempts']].agg(['mean','std','min','max'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.groupby(by=['CarInsurance'])[['Balance']].agg(['mean','std','min','max'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the tables above, we can see the following about customers who agreed to the insurance, on average:\n", "\n", "* The bank made more offers of this insurance\n", "* Such clients were offered 
an offer in another bank campaign on average more than two months ago, versus just over a month ago for those who did not agree\n", "* They were more often offered other bank offers\n", "* They have a slightly higher balance\n", "* They have fewer contacts from the bank for other campaigns\n", "\n", "To confirm these observations, we will build histograms and boxplots of the features further on.\n", "Now let's take a look at the categorical and binary features.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.crosstab(data['Education'],data['Job'],values=data['CarInsurance'],aggfunc='mean',margins=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.crosstab(data['Marital'],data['Education'],values=data['CarInsurance'],aggfunc='mean',margins=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.crosstab(data['Default'],data['Job'],values=data['CarInsurance'],aggfunc='mean',margins=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.crosstab(data['CarLoan'],data['Job'],values=data['CarInsurance'],aggfunc='mean',margins=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.crosstab(data['CarLoan'],data['HHInsurance'],values=data['CarInsurance'],aggfunc='mean',margins=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.crosstab(data['Communication'],data['Outcome'],values=data['CarInsurance'],aggfunc='mean',margins=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.crosstab(data['Communication'],data['LastContactMonth'],values=data['CarInsurance'],aggfunc='mean',margins=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ 
"pd.crosstab(data['LastContactDay'],data['LastContactMonth'],values=data['CarInsurance'],aggfunc='mean',margins=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at these crosstabs we can see:\n", "\n", "* The communication type doesn't affect the target variable\n", "* The campaign outcome depends on the month and the day\n", "* People with a car loan (CarLoan) rarely agree to the offer\n", "* People with household insurance (HHInsurance) rarely agree to the offer\n", "* People who agreed to other offers from the bank agree to insurance more often\n", "* People with credit in default (Default) rarely agree to the offer\n", "* Single people and people with tertiary education often agree to insurance \n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 3. Primary visual data analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's make visualizations of our features and their effect on the target variable." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#target distribution\n", "sns.countplot(x=data['CarInsurance'],palette=\"Accent\");\n", "plt.title('Target distribution');" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#distribution of categorical features\n", "\n", "plt.figure(figsize=(20,20))\n", "for i in range(1,len(cat[:11])):\n", "    plt.subplot(4,3,i)\n", "    sns.countplot(x=data[cat[i-1]],palette='Accent')\n", "    plt.xticks(rotation=90)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It can be seen that some values of categorical features (**\"Default=1\"** or some months) have a small number of examples. In general, such values are usually combined into one group to prevent overfitting, and in the binary case the column can be dropped." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#target variable versus categorical\n", "\n", "plt.figure(figsize=(20,20))\n", "for i in range(1,len(cat[:11])):\n", "    plt.subplot(4,3,i)\n", "    sns.barplot(x=data[cat[i-1]],y=data['CarInsurance'],palette='Accent')\n", "    plt.xticks(rotation=90)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These visualizations confirm the conclusions about how the target variable depends on the categorical features that we drew from the primary data analysis (see **Part 2**)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#histograms of numerical features and their scatterplots\n", "\n", "sns.pairplot(data[num], palette=\"Accent\");" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "corr_matrix = data[num].corr()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.heatmap(corr_matrix,cmap=\"Accent\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the scatterplots and the heatmap, it is clear that our numerical features have no visible correlations, and that all distributions except age are strongly right-skewed (most values sit near the left edge, with a long right tail)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#histograms of numerical features and their scatterplot\n", "\n", "sns.pairplot(data[num + ['CarInsurance']],hue='CarInsurance',palette=\"Accent\",diag_kind='kde');" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#boxplots depending on the target variable\n", "\n", "plt.figure(figsize=(20,10))\n", "for i in range(1,len(num)+1):\n", " plt.subplot(2,3,i)\n", " sns.boxplot(data=data, x=data['CarInsurance'],y=data[num[i-1]],palette=\"Accent\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#graphs depending on the target variable with a limit of 0.975 quantile for better visibility\n", "\n", "plt.figure(figsize=(20,10))\n", "for i in range(1,len(num)+1):\n", " plt.subplot(2,3,i)\n", " sns.boxplot(data=data, x=data['CarInsurance'],y=data[data[num[i-1]]