{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ " #

Predicting Customer Propensity using Classification Algorithms

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook, we will show you how to predict customer propensity to buy the product based on his/her interactions on the website. Using that propensity, if a certain threshold are reached we will then decide whether to assign an agent to offer a chat to the customer. We will also be using different classification algorithms to show how each algorithms are coded.\n", "\n", "Because we will only be using sample data for exercise purposes, the data set that we'll be using here are very small (500 rows of data). Thus, we might not get a real accurate predictions out of it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- - - - -" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Call to Action\n", "* Analyse real-time customer's actions on website\n", "* Predict the propensity score\n", "* Offer chat once propensity score exceeds threshold" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading and Viewing Data\n", "We will load the data file then checkout the summary statistics and columns for that file." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "SESSION_ID int64\n", "IMAGES int64\n", "REVIEWS int64\n", "FAQ int64\n", "SPECS int64\n", "SHIPPING int64\n", "BOUGHT_TOGETHER int64\n", "COMPARE_SIMILAR int64\n", "VIEW_SIMILAR int64\n", "WARRANTY int64\n", "SPONSORED_LINKS int64\n", "BUY int64\n", "dtype: object" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "import os\n", "from sklearn.model_selection import train_test_split, cross_val_score\n", "from sklearn.naive_bayes import GaussianNB\n", "from sklearn.linear_model import LogisticRegression, SGDClassifier\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.svm import SVC\n", "import sklearn.metrics\n", "\n", "customer_data = pd.read_csv(\"Data/browsing.csv\")\n", "\n", "# After loading the data, we look at the data types to make sure that the data has been loaded correctly.\n", "customer_data.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data contains information about the various links on the website that are clicked by the user during his browsing. This is past data that will be used to build the model.\n", "\n", "- Session ID : A unique identifier for the web browsing session\n", "- Buy : Whether the prospect ended up buying the product\n", "- Other columns : a boolean indicator to show whether the prospect visited that particular page or did the activity mentioned.\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SESSION_IDIMAGESREVIEWSFAQSPECSSHIPPINGBOUGHT_TOGETHERCOMPARE_SIMILARVIEW_SIMILARWARRANTYSPONSORED_LINKSBUY
0100100101000100
1100201100000010
2100310111000100
3100410001110000
4100511101010000
\n", "
" ], "text/plain": [ " SESSION_ID IMAGES REVIEWS FAQ SPECS SHIPPING BOUGHT_TOGETHER \\\n", "0 1001 0 0 1 0 1 0 \n", "1 1002 0 1 1 0 0 0 \n", "2 1003 1 0 1 1 1 0 \n", "3 1004 1 0 0 0 1 1 \n", "4 1005 1 1 1 0 1 0 \n", "\n", " COMPARE_SIMILAR VIEW_SIMILAR WARRANTY SPONSORED_LINKS BUY \n", "0 0 0 1 0 0 \n", "1 0 0 0 1 0 \n", "2 0 0 1 0 0 \n", "3 1 0 0 0 0 \n", "4 1 0 0 0 0 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# View the top records to understand how the data looks like.\n", "customer_data.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SESSION_IDIMAGESREVIEWSFAQSPECSSHIPPINGBOUGHT_TOGETHERCOMPARE_SIMILARVIEW_SIMILARWARRANTYSPONSORED_LINKSBUY
count500.000000500.000000500.0000500.000000500.0000500.000000500.000000500.000000500.000000500.000000500.000000500.000000
mean1250.5000000.5100000.52000.4400000.48000.5280000.5000000.5800000.4680000.5320000.5500000.370000
std144.4818330.5004010.50010.4968840.50010.4997150.5005010.4940530.4994750.4994750.4979920.483288
min1001.0000000.0000000.00000.0000000.00000.0000000.0000000.0000000.0000000.0000000.0000000.000000
25%1125.7500000.0000000.00000.0000000.00000.0000000.0000000.0000000.0000000.0000000.0000000.000000
50%1250.5000001.0000001.00000.0000000.00001.0000000.5000001.0000000.0000001.0000001.0000000.000000
75%1375.2500001.0000001.00001.0000001.00001.0000001.0000001.0000001.0000001.0000001.0000001.000000
max1500.0000001.0000001.00001.0000001.00001.0000001.0000001.0000001.0000001.0000001.0000001.000000
\n", "
" ], "text/plain": [ " SESSION_ID IMAGES REVIEWS FAQ SPECS SHIPPING \\\n", "count 500.000000 500.000000 500.0000 500.000000 500.0000 500.000000 \n", "mean 1250.500000 0.510000 0.5200 0.440000 0.4800 0.528000 \n", "std 144.481833 0.500401 0.5001 0.496884 0.5001 0.499715 \n", "min 1001.000000 0.000000 0.0000 0.000000 0.0000 0.000000 \n", "25% 1125.750000 0.000000 0.0000 0.000000 0.0000 0.000000 \n", "50% 1250.500000 1.000000 1.0000 0.000000 0.0000 1.000000 \n", "75% 1375.250000 1.000000 1.0000 1.000000 1.0000 1.000000 \n", "max 1500.000000 1.000000 1.0000 1.000000 1.0000 1.000000 \n", "\n", " BOUGHT_TOGETHER COMPARE_SIMILAR VIEW_SIMILAR WARRANTY \\\n", "count 500.000000 500.000000 500.000000 500.000000 \n", "mean 0.500000 0.580000 0.468000 0.532000 \n", "std 0.500501 0.494053 0.499475 0.499475 \n", "min 0.000000 0.000000 0.000000 0.000000 \n", "25% 0.000000 0.000000 0.000000 0.000000 \n", "50% 0.500000 1.000000 0.000000 1.000000 \n", "75% 1.000000 1.000000 1.000000 1.000000 \n", "max 1.000000 1.000000 1.000000 1.000000 \n", "\n", " SPONSORED_LINKS BUY \n", "count 500.000000 500.000000 \n", "mean 0.550000 0.370000 \n", "std 0.497992 0.483288 \n", "min 0.000000 0.000000 \n", "25% 0.000000 0.000000 \n", "50% 1.000000 0.000000 \n", "75% 1.000000 1.000000 \n", "max 1.000000 1.000000 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Do summary statistics analysis of the data to make sure the data is not skewed in any way.\n", "customer_data.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Perform Correlation Analysis" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "SESSION_ID 0.026677\n", "IMAGES 0.046819\n", "REVIEWS 0.404628\n", "FAQ -0.095136\n", "SPECS 0.009950\n", "SHIPPING -0.022239\n", "BOUGHT_TOGETHER -0.103562\n", "COMPARE_SIMILAR 0.190522\n", "VIEW_SIMILAR -0.096137\n", "WARRANTY 0.179156\n", "SPONSORED_LINKS 0.110328\n", "BUY 1.000000\n", "Name: BUY, dtype: float64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "customer_data.corr()['BUY']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at the correlations above we can see that features like REVIEWS, BOUGHT_TOGETHER, COMPARE_SIMILAR, WARRANTY and SPONSORED_LINKS have medium correlation to the target variable. We will reduce our feature set to only those features that have some good correlation." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "#Drop columns with low correlation\n", "predictors = customer_data[['REVIEWS','BOUGHT_TOGETHER','COMPARE_SIMILAR','WARRANTY','SPONSORED_LINKS']]\n", "targets = customer_data.BUY" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training and Testing Split\n", "\n", "We now split the model into training and testing data in the ratio of 70:30" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Predictor_Training : (350, 5) | Predictor_Testing : (150, 5)\n" ] } ], "source": [ "pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.3)\n", "\n", "print( \"Predictor_Training :\", pred_train.shape,\" | \", \"Predictor_Testing :\", pred_test.shape )\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Build Model and Check Accuracy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " \n", " Using Naïve Bayes Classifier\n", " \n", "

\n", "Definition: Naïve Bayes algorithm based on Bayes’ theorem with the assumption of independence between every pair of features. Naive Bayes classifiers work well in many real-world situations such as document classification and spam filtering.\n", "

\n", "Advantages: This algorithm requires a small amount of training data to estimate the necessary parameters. Naive Bayes classifiers are extremely fast compared to more sophisticated methods.\n", "

\n", "Disadvantages: Naive Bayes is is known to be a bad estimator.\n", "

\n", "
" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[76, 20],\n", " [24, 30]], dtype=int64)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nb = GaussianNB()\n", "nb.fit(pred_train, tar_train)\n", "\n", "nb_predictions = nb.predict(pred_test)\n", "\n", "#Analyze accuracy of nb predictions\n", "sklearn.metrics.confusion_matrix(tar_test, nb_predictions)\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.7066666666666667" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sklearn.metrics.accuracy_score(tar_test, nb_predictions)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 0.76 0.79 0.78 96\n", " 1 0.60 0.56 0.58 54\n", "\n", " micro avg 0.71 0.71 0.71 150\n", " macro avg 0.68 0.67 0.68 150\n", "weighted avg 0.70 0.71 0.70 150\n", "\n" ] } ], "source": [ "print(sklearn.metrics.classification_report(tar_test, nb_predictions))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " \n", " Using Logistic Regression Classifier\n", " \n", "

\n", "Definition: Logistic regression is a machine learning algorithm for classification. In this algorithm, the probabilities describing the possible outcomes of a single trial are modelled using a logistic function.\n", "

\n", "Advantages: Logistic regression is designed for this purpose (classification), and is most useful for understanding the influence of several independent variables on a single outcome variable.\n", "

\n", "Disadvantages: Works only when the predicted variable is binary, assumes all predictors are independent of each other, and assumes data is free of missing values.\n", "

\n", "
" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[77, 19],\n", " [28, 26]], dtype=int64)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lr = LogisticRegression(solver='liblinear')\n", "lr.fit(pred_train, tar_train)\n", "\n", "lr_predictions = lr.predict(pred_test)\n", "\n", "#Analyze accuracy of predictions\n", "sklearn.metrics.confusion_matrix(tar_test, lr_predictions)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.6866666666666666" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sklearn.metrics.accuracy_score(tar_test, lr_predictions)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 0.73 0.80 0.77 96\n", " 1 0.58 0.48 0.53 54\n", "\n", " micro avg 0.69 0.69 0.69 150\n", " macro avg 0.66 0.64 0.65 150\n", "weighted avg 0.68 0.69 0.68 150\n", "\n" ] } ], "source": [ "print(sklearn.metrics.classification_report(tar_test, lr_predictions))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " \n", " Using Stochastic Gradient Descent Classifier\n", " \n", "

\n", "Definition: Stochastic gradient descent is a simple and very efficient approach to fit linear models. It is particularly useful when the number of samples is very large. It supports different loss functions and penalties for classification.\n", "

\n", "Advantages: Efficiency and ease of implementation.\n", "

\n", "Disadvantages: Requires a number of hyper-parameters and it is sensitive to feature scaling.\n", "

\n", "
" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[84, 12],\n", " [35, 19]], dtype=int64)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sgd = SGDClassifier(max_iter=1000, shuffle=True, tol=1e-3, random_state=101)\n", "sgd.fit(pred_train, tar_train)\n", "\n", "sgd_predictions = sgd.predict(pred_test)\n", "\n", "#Analyze accuracy of sgd predictions\n", "sklearn.metrics.confusion_matrix(tar_test, sgd_predictions)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.6866666666666666" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sklearn.metrics.accuracy_score(tar_test, sgd_predictions)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 0.71 0.88 0.78 96\n", " 1 0.61 0.35 0.45 54\n", "\n", " micro avg 0.69 0.69 0.69 150\n", " macro avg 0.66 0.61 0.61 150\n", "weighted avg 0.67 0.69 0.66 150\n", "\n" ] } ], "source": [ "print(sklearn.metrics.classification_report(tar_test, sgd_predictions))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " \n", " Using K-Nearest Neighbors Classifier\n", " \n", "

\n", "Definition: Neighbours based classification is a type of lazy learning as it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the k nearest neighbours of each point.\n", "

\n", "Advantages: This algorithm is simple to implement, robust to noisy training data, and effective if training data is large.\n", "

\n", "Disadvantages: Need to determine the value of K and the computation cost is high as it needs to computer the distance of each instance to all the training samples.\n", "

\n", "
" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[75, 21],\n", " [20, 34]], dtype=int64)" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "knn = KNeighborsClassifier(n_neighbors=5)\n", "knn.fit(pred_train, tar_train)\n", "\n", "knn_predictions = knn.predict(pred_test)\n", "\n", "#Analyze accuracy of knn predictions\n", "sklearn.metrics.confusion_matrix(tar_test, knn_predictions)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.7266666666666667" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sklearn.metrics.accuracy_score(tar_test, knn_predictions)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 0.79 0.78 0.79 96\n", " 1 0.62 0.63 0.62 54\n", "\n", " micro avg 0.73 0.73 0.73 150\n", " macro avg 0.70 0.71 0.70 150\n", "weighted avg 0.73 0.73 0.73 150\n", "\n" ] } ], "source": [ "print(sklearn.metrics.classification_report(tar_test, knn_predictions))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " \n", " Using Decision Tree Classifier\n", " \n", "

\n", "Definition: Given a data of attributes together with its classes, a decision tree produces a sequence of rules that can be used to classify the data.\n", "

\n", "Advantages: Decision Tree is simple to understand and visualise, requires little data preparation, and can handle both numerical and categorical data.\n", "

\n", "Disadvantages: Decision tree can create complex trees that do not generalise well, and decision trees can be unstable because small variations in the data might result in a completely different tree being generated.\n", "

\n", "
" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[83, 13],\n", " [29, 25]], dtype=int64)" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dtree = DecisionTreeClassifier(max_depth=10, min_samples_leaf=15, random_state=101)\n", "dtree.fit(pred_train, tar_train)\n", "\n", "dtree_predictions = dtree.predict(pred_test)\n", "\n", "#Analyze accuracy of dtree predictions\n", "sklearn.metrics.confusion_matrix(tar_test,dtree_predictions)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.72" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sklearn.metrics.accuracy_score(tar_test, dtree_predictions)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 0.74 0.86 0.80 96\n", " 1 0.66 0.46 0.54 54\n", "\n", " micro avg 0.72 0.72 0.72 150\n", " macro avg 0.70 0.66 0.67 150\n", "weighted avg 0.71 0.72 0.71 150\n", "\n" ] } ], "source": [ "print(sklearn.metrics.classification_report(tar_test, dtree_predictions))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " \n", " Using Random Forest Classifier\n", " \n", "

\n", "Definition: Random forest classifier is a meta-estimator that fits a number of decision trees on various sub-samples of datasets and uses average to improve the predictive accuracy of the model and controls over-fitting. The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement.\n", "

\n", "Advantages: Reduction in over-fitting and random forest classifier is more accurate than decision trees in most cases.\n", "

\n", "Disadvantages: Slow real time prediction, difficult to implement, and complex algorithm.\n", "

\n", "
" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[74, 22],\n", " [23, 31]], dtype=int64)" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rf = RandomForestClassifier(n_estimators=100, oob_score=True, n_jobs=-1,\n", " random_state=101, max_features=None, min_samples_leaf=30)\n", "rf.fit(pred_train, tar_train)\n", "\n", "rf_predictions = rf.predict(pred_test)\n", "\n", "#Analyze accuracy of rf predictions\n", "sklearn.metrics.confusion_matrix(tar_test, rf_predictions)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.7" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sklearn.metrics.accuracy_score(tar_test, rf_predictions)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 0.76 0.77 0.77 96\n", " 1 0.58 0.57 0.58 54\n", "\n", " micro avg 0.70 0.70 0.70 150\n", " macro avg 0.67 0.67 0.67 150\n", "weighted avg 0.70 0.70 0.70 150\n", "\n" ] } ], "source": [ "print(sklearn.metrics.classification_report(tar_test, rf_predictions))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " \n", " Using Support Vector Machine Classifier\n", " \n", "

\n", "Definition: Support vector machine is a representation of the training data as points in space separated into categories by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.\n", "

\n", "Advantages: Effective in high dimensional spaces and uses a subset of training points in the decision function so it is also memory efficient.\n", "

\n", "Disadvantages: The algorithm does not directly provide probability estimates, these are calculated using an expensive five-fold cross-validation.\n", "

\n", "
" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[83, 13],\n", " [30, 24]], dtype=int64)" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "svm = SVC(C=1, kernel='linear', gamma='auto')\n", "svm.fit(pred_train, tar_train)\n", "svm_predictions = svm.predict(pred_test)\n", "\n", "#Analyze accuracy of svm predictions\n", "sklearn.metrics.confusion_matrix(tar_test, svm_predictions)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.722" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#sklearn.metrics.accuracy_score(tar_test, svm_predictions)\n", "scores = cross_val_score(svm, predictors, targets, cv=5)\n", "scores.mean()" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 0.73 0.86 0.79 96\n", " 1 0.65 0.44 0.53 54\n", "\n", " micro avg 0.71 0.71 0.71 150\n", " macro avg 0.69 0.65 0.66 150\n", "weighted avg 0.70 0.71 0.70 150\n", "\n" ] } ], "source": [ "print(sklearn.metrics.classification_report(tar_test, svm_predictions))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using probability as propensity score" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of doing a Yes/No prediction, we want to predict the probability of somebody who wants to buy, and we can do that by using a method called predict_proba.\n", "\n", "Let's say we use NB model:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.10621579226568478" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pred_prob=nb.predict_proba(pred_test)\n", "pred_prob[0,1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The probability above can be read as 11% chance that the customer will buy the product." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Real time predictions\n", "\n", "From the models we examine, let's say the best model to use for our real time prediction is the Naive Bayes.\n", "\n", "So when the customer starts visiting the pages one by one, we collect that list and then use it to compute the probability. We do that for every new click that comes in.\n", "\n", "So let us start. The prospect just came to your website. There are no significant clicks. Let us compute the probability. The array of values passed has the values for REVIEWS, BOUGHT_TOGETHER, COMPARE_SIMILAR, WARRANTY and SPONSORED_LINKS. So the array is all zeros to begin with" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "New visitor: propensity : [0.04544815]\n" ] } ], "source": [ "browsing_data = np.array([0,0,0,0,0]).reshape(1, -1)\n", "print(\"New visitor: propensity :\", nb.predict_proba(browsing_data)[:,1] )\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So the initial probability is 5%. Now, suppose the customer does a comparison of similar products. The array changes to include a 1 for that function. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## References\n", "* Ponnambalam, K. (2017, July 3). Predictive customer analytics. Retrieved from https://www.linkedin.com/learning/predictive-customer-analytics\n", "* Garg, R. (2018, January 19). 7 types of classification algorithms. Retrieved May 5, 2019.\n", "* Scikit Learn. (2018). API reference. Retrieved May 4, 2019." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next Topic: How to recommend products to customers based on item affinity?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- - - -" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 1 }