{ "cells": [ { "cell_type": "markdown", "id": "78ad8a32", "metadata": {}, "source": [ "
\n", "

\n", " Customer Churn - Machine Learning Challenge\n", "
\n", " \"Teradata\"\n", "

\n", "
" ] }, { "cell_type": "markdown", "id": "454c6ae9", "metadata": {}, "source": [ "

Introduction

\n", "\n", "

\n", "Customer churn is a concern for all companies, but the complexity makes it difficult to track. Customers may leave due to various reasons such dissatisfaction with service quality, pricing, customer service, or finding better alternatives from competitors. Although some churn may be expected, companies aim to retain their customers to avoid using additional resources to find new customers. Thus, with the help of Teradata Vantage, companies can achieve their goal of identifying the factors contributing to the churn, so they can take appropriate measures to retain customers. Vantage’s capabilities allow companies to analyze large amounts of customer data, such as usage patterns, billing information, demographics, and interactions, to find patterns that may indicate customers who are at risk of churning. Plus, Teradata’s machine learning and predictive analytics can be used to build models to predict customers which are likely to churn in the future. This information will give companies the chance to intervene, such as sending targeted marketing campaigns, personalized offers, improved customer service, or addressing customer concerns.

\n", "

Business Values

\n", "\n", "

Why Vantage?

\n", "

\n", "Traditional ML and AI development and deployment pipelines require users to manually combine various tools and techniques across the lifecycle. This leads to lengthy, fragile, manual, error-prone processes, often impossible to migrate out of the lab and into production in order to realize business value.
ClearScape Analytics helps to solve this “development to deployment gap” by providing highly scalable, performant, and easy-to-use analytic capabilities that address all aspects of the development lifecycle. The same tools and techniques that data scientists use in development can be seamlessly deployed into production using the same code, platform, and operational pipeline.

\n", "\n", "

\n", "Managing telco churn is complex and requires continuous monitoring, analysis, and proactive customer engagement strategies. By using data and advanced analytics, telecom companies can better understand customer behavior and preferences, and take proactive measures to retain customers and maintain profitability.

\n", "\n", "

\n", "Let's demonstrate this use case with sample data using InDb analytics in Vantage which can pre-process and analyze huge amounts of data at scale. \n", "

" ] }, { "cell_type": "markdown", "id": "22173c1b-a4eb-4cd7-b0ae-ba68bc39aba2", "metadata": {}, "source": [ "
\n", "

1. Connect to Vantage, import Python packages, and explore the dataset

\n" ] }, { "cell_type": "code", "execution_count": null, "id": "5199b325-11f8-4dbd-a13f-fadf334c58c6", "metadata": { "tags": [ "to_do", "ID_1" ] }, "outputs": [], "source": [ "#Complete\n", "#import libraries\n", "import matplotlib.pyplot as plt \n", "import getpass\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "warnings.simplefilter(action='ignore', category=DeprecationWarning)\n", "warnings.simplefilter(action='ignore', category=RuntimeWarning)\n", "warnings.simplefilter(action='ignore', category=FutureWarning)\n", "\n", "# Your most important import goes here :) \n", "#from import *\n", "\n", "import plotly.express as px\n", "from plotly.subplots import make_subplots\n", "import plotly.graph_objects as go\n", "\n", "\n", "from sklearn.metrics import mean_absolute_error\n", "from sklearn.metrics import roc_auc_score\n", "from sklearn.metrics import roc_curve\n", "display.max_rows=5" ] }, { "cell_type": "markdown", "id": "113ada1d-a5ff-4d5e-9145-f0f03b26b3f2", "metadata": {}, "source": [ "

We will be prompted to provide the password. We will enter the password, press the Enter key, and then use the down arrow to go to the next cell.
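The startup notebook presumably handles this prompt and sets the `password` variable that `create_context` below expects; a minimal sketch of the idea, assuming a plain getpass prompt:

```python
import getpass

# Prompt for the password without echoing it to the screen
password = getpass.getpass("Enter your Vantage password: ")
```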

" ] }, { "cell_type": "code", "execution_count": null, "id": "eb0165cd-c7eb-40cc-8eac-84ad0ec3ba52", "metadata": {}, "outputs": [], "source": [ "%run -i ../startup.ipynb\n", "eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)\n", "print(eng)" ] }, { "cell_type": "markdown", "id": "6d8f8388-e318-4bf4-84c1-7dfcf2d1dd40", "metadata": {}, "source": [ "

Getting Data for This Demo

\n", "

We have provided data for this demo on cloud storage. We have the option of either running the demo using foreign tables to access the data without using any storage on our environment or downloading the data to local storage, which may yield somewhat faster execution. However, we need to consider available storage. There are two statements in the following cell, and one is commented out. We may switch which mode we choose by changing the comment string.
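For reference, the two modes would be toggled roughly like this (a sketch only; the 'DEMO_Telco_local' name is an assumption based on the demo's naming convention, so check run_procedure.py for the exact option):

```python
# Foreign tables: query the data in place on cloud storage (no local storage used)
# %run -i ../run_procedure.py "call get_data('DEMO_Telco_cloud');"

# Local copy: load the data into local storage for somewhat faster execution (assumed name)
# %run -i ../run_procedure.py "call get_data('DEMO_Telco_local');"
```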

" ] }, { "cell_type": "code", "execution_count": null, "id": "b3460a89-d570-488e-a97a-39423afa7b27", "metadata": {}, "outputs": [], "source": [ "%run -i ../run_procedure.py \"call get_data('DEMO_Telco_cloud');\"" ] }, { "cell_type": "markdown", "id": "bdd6dd8c", "metadata": {}, "source": [ "
\n", "

2. Data Exploration

" ] }, { "cell_type": "markdown", "id": "13288769-f1b3-40a5-8cad-95e5f4ae92fd", "metadata": {}, "source": [ "

Customer Churn

\n", "

Let us start by creating a \"Virtual DataFrame\" that points directly to the dataset in Vantage. We then begin our analysis by checking the shape of the DataFrame and examining the data types of all its columns.

" ] }, { "cell_type": "code", "execution_count": null, "id": "d40df274-d9cb-439b-93bb-343d317f052c", "metadata": {}, "outputs": [], "source": [ "tdf = DataFrame(in_schema(\"DEMO_Telco\", \"Customer_Churn\"))\n", "tdf" ] }, { "cell_type": "markdown", "id": "1d620292-c936-4546-89eb-59fd50c35221", "metadata": {}, "source": [ "

We can check the dimensions and column types of the data with the shape attribute and the info method.

" ] }, { "cell_type": "code", "execution_count": null, "id": "279ee6be-9288-41ae-b21a-f2389add4623", "metadata": {}, "outputs": [], "source": [ "print(\"Shape of the data: \", tdf.shape)\n", "tdf.info()" ] }, { "cell_type": "markdown", "id": "4d9927f9-f0f6-4f45-966d-7b7b2ca36f84", "metadata": {}, "source": [ "

As we can see from the above result, our dataset has 7043 rows and 21 columns.

" ] }, { "cell_type": "markdown", "id": "b03454cf-d47a-4edc-aea4-5b517b7da9d6", "metadata": {}, "source": [ "

Summary of Columns
\n", "

We can use the ColumnSummary function to quickly examine the columns, their datatypes, and a summary of NULL/non-NULL values for a given table.

" ] }, { "cell_type": "code", "execution_count": null, "id": "21eece2c-533a-40e3-bcad-4ed4bb2b6cf3", "metadata": {}, "outputs": [], "source": [ "from teradataml import ColumnSummary\n", "obj = ColumnSummary(data=tdf,\n", " target_columns=[':']\n", " )" ] }, { "cell_type": "code", "execution_count": null, "id": "7eeb1e86-c00a-48d5-8e3b-363118c847ce", "metadata": {}, "outputs": [], "source": [ "obj.result.head(21)" ] }, { "cell_type": "markdown", "id": "fa5ea57f-b8ba-44eb-9d9b-cf07d37b77b5", "metadata": {}, "source": [ "
\n", "

3. Data Preprocessing

" ] }, { "cell_type": "markdown", "id": "b58490d2-1f7a-4941-a62a-e99a4b7f7543", "metadata": {}, "source": [ "

Before the data can be used for model creation, we will need to do some data cleansing and transformation on it. We can do this InDb with Teradata Vantage's built-in functions.
We will use the \"You should figure it out\" function to showcase the distinct values and their corresponding counts for each specified column in the input DataFrame. This function provides a concise summary of categorical data, aiding in a quick understanding of the distribution of values within the specified columns.

" ] }, { "cell_type": "code", "execution_count": null, "id": "ddf75f76-eb84-4f9d-856b-4051f0df6d70", "metadata": { "tags": [ "to_do", "ID_2" ] }, "outputs": [], "source": [ "#Complete\n", "from teradataml import #Mistery_Function_here\n", "CatSum = #Mistery_Function_Here(data=tdf,target_columns=[\"MultipleLines\",\"InternetService\",\"OnlineSecurity\",\"OnlineBackup\",\"DeviceProtection\",\"TechSupport\",\"StreamingTV\",\"StreamingMovies\"])\n", "CatSum.result.sort(\"ColumnName\")" ] }, { "cell_type": "markdown", "id": "84680551-fbca-45e4-83e9-c5f90fa3078b", "metadata": {}, "source": [ "

\n", "As we can see from the sample data above and the categorical summary values, the columns

\n", "

are related to InternetService; wherever the InternetService value is \"No\", these columns have the value \"No internet service\". For our model, let us replace \"No internet service\" with \"No\" in these columns. We will do a similar operation to replace \"No phone service\" with \"No\".
We will use sqlalchemy's oreplace function to replace the respective strings with the desired values.

" ] }, { "cell_type": "code", "execution_count": null, "id": "9f0f4876-7f8d-4177-8afc-f7c474b6ac4a", "metadata": {}, "outputs": [], "source": [ "from sqlalchemy import func\n", "\n", "\n", "tdf = tdf.assign(oreplace_MultipleLines=func.oreplace(tdf.MultipleLines.expression, \"No phone service\",\"No\"),\n", " oreplace_OnlineSecurity=func.oreplace(tdf.OnlineSecurity.expression, \"No internet service\",\"No\"),\n", " oreplace_OnlineBackup=func.oreplace(tdf.OnlineBackup.expression, \"No internet service\",\"No\"),\n", " oreplace_DeviceProtection=func.oreplace(tdf.DeviceProtection.expression, \"No internet service\",\"No\"), oreplace_TechSupport=func.oreplace(tdf.TechSupport.expression, \"No internet service\",\"No\"),\n", " oreplace_StreamingTV=func.oreplace(tdf.StreamingTV.expression, \"No internet service\",\"No\"),\n", " oreplace_StreamingMovies=func.oreplace(tdf.StreamingMovies.expression, \"No internet service\",\"No\"))\n", "tdf" ] }, { "cell_type": "code", "execution_count": null, "id": "a5439705-8a1a-4060-b344-2f3245b18e57", "metadata": {}, "outputs": [], "source": [ "# now lets drop the extra columns, rename the columns in dataframe\n", "from teradataml.dataframe.sql_functions import case\n", "\n", "tdf2 = tdf.assign(drop_columns=True\n", " ,CustomerID=tdf.CustomerID \n", " ,Gender=tdf.Gender \n", " ,SeniorCitizen=tdf.SeniorCitizen\n", " ,Partner=tdf.Partner\n", " ,Dependents=tdf.Dependents\n", " ,Tenure=tdf.Tenure\n", " ,PhoneService=tdf.PhoneService \n", " ,MultipleLines=tdf.oreplace_MultipleLines \n", " ,InternetService=tdf.InternetService \n", " ,OnlineSecurity=tdf.oreplace_OnlineSecurity \n", " ,OnlineBackup=tdf.oreplace_OnlineBackup \n", " ,DeviceProtection=tdf.oreplace_DeviceProtection \n", " ,TechSupport=tdf.oreplace_TechSupport \n", " ,StreamingTV=tdf.oreplace_StreamingTV \n", " ,StreamingMovies=tdf.oreplace_StreamingMovies \n", " ,Contract=tdf.Contract \n", " ,PaperlessBilling=tdf.PaperlessBilling \n", " ,PaymentMethod=tdf.PaymentMethod \n", " ,MonthlyCharges=tdf.MonthlyCharges \n", " ,TotalCharges=tdf.TotalCharges \n", " ,Churn = case({ \"Yes\" : 1, \"No\" : 0},value=tdf.Churn,else_=0) ) " ] }, { "cell_type": "code", "execution_count": null, "id": "c5bddfa5-6a90-477b-8c38-aa2985736fe6", "metadata": {}, "outputs": [], "source": [ "tdf2" ] }, { "cell_type": "markdown", "id": "d256d17f-1c10-4bc2-978c-4b9a2a184437", "metadata": {}, "source": [ "

We need to encode categorical values

\n", "

From our categorical attributes we can see that there are limited distinct values in each of these columns. We will use Teradata's functions to convert the categorical attributes to numerical.

" ] }, { "cell_type": "code", "execution_count": null, "id": "bc112884-d0bf-462d-9024-003bb4c2640a", "metadata": { "tags": [ "to_do", "ID_3" ] }, "outputs": [], "source": [ "#Complete\n", "onehotfit_df = #Mistery_function_here(data=tdf2,\n", " is_input_dense=True,\n", " approach=\"auto\",\n", " target_column=[\"Gender\",\"Partner\",\"Dependents\",\"PhoneService\",\"MultipleLines\",\"OnlineSecurity\"\n", " ,\"OnlineBackup\",\"DeviceProtection\",\"TechSupport\",\"StreamingTV\",\"StreamingMovies\",\n", " \"PaperlessBilling\"],\n", " category_counts=[2,2,2,2,2,2,2,2,2,2,2,2])" ] }, { "cell_type": "markdown", "id": "2c136b69-41c1-42f1-bf47-89d1dae800a3", "metadata": {}, "source": [ "

\n", "The other categorical columns

\n", "

have more distinct values, so we can apply ordinal encoding to them.
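Ordinal encoding simply maps each distinct value to an integer code; a plain-Python illustration:

```python
# Illustrative only: each distinct contract value gets an integer code
contract_values = ["Month-to-month", "One year", "Two year"]
codes = {value: i for i, value in enumerate(contract_values)}
print(codes["Two year"])  # 2
```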

\n", " \n", " " ] }, { "cell_type": "code", "execution_count": null, "id": "0e6b2946-b138-4ad6-aaa7-dff619fcf1e7", "metadata": { "tags": [ "to_do", "ID_4" ] }, "outputs": [], "source": [ "#Complete\n", "ordinalfit_df = #Mistery_function_here(target_column=['InternetService','Contract','PaperlessBilling','PaymentMethod'],\n", " default_value=-1,\n", " data=tdf2)" ] }, { "cell_type": "code", "execution_count": null, "id": "3614b408-21c4-4b45-a699-0ad001b6938d", "metadata": {}, "outputs": [], "source": [ "ordinalfit_df.result" ] }, { "cell_type": "markdown", "id": "7a437c68-b650-4300-a3e8-cd39a5e21054", "metadata": {}, "source": [ "

Scale the numerical values

For the numerical attributes, we will use Teradata's functions to scale the specified input table columns, i.e., apply a chosen scale method (such as range, mean, or standard-deviation scaling) to the input columns.
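With the RANGE scale method used below, each value is rescaled into [0, 1] using the column's minimum and maximum:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$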

" ] }, { "cell_type": "code", "execution_count": null, "id": "059dd73a-3751-46a8-85bf-7a11a7eacb7c", "metadata": { "tags": [ "to_do", "ID_5" ] }, "outputs": [], "source": [ "#Complete\n", "scalefit_df = #Mistery_function_here(data=tdf2,\n", " target_columns=['MonthlyCharges','TotalCharges'],\n", " scale_method=\"RANGE\",\n", " miss_value=\"KEEP\",\n", " global_scale=False)" ] }, { "cell_type": "markdown", "id": "faec04de-8197-4c10-bcc5-3bc3f605e81e", "metadata": {}, "source": [ "

Putting it all together

We will use a Teradata function to apply all the transformations from the fit tables created above in one go.

" ] }, { "cell_type": "code", "execution_count": null, "id": "d108b118-0970-4c84-9371-df4ef125ce69", "metadata": { "tags": [ "to_do", "ID_6" ] }, "outputs": [], "source": [ "#Complete\n", "ColumnTransformer_out = #Mistery_function_here(fillrowid_column_name=\"output_value\",\n", " input_data=tdf2,\n", " onehotencoding_fit_data=onehotfit_df.result,\n", " ordinalencoding_fit_data=ordinalfit_df.result,\n", " scale_fit_data=scalefit_df.output) " ] }, { "cell_type": "code", "execution_count": null, "id": "4da18dae-fec0-4bba-b87c-ab7045fd21e1", "metadata": {}, "outputs": [], "source": [ "Transformed_data= ColumnTransformer_out.result.assign(drop_columns=True,\n", " Churn=ColumnTransformer_out.result.Churn,\n", " CustomerID=ColumnTransformer_out.result.CustomerID,\n", " SeniorCitizen=ColumnTransformer_out.result.SeniorCitizen,\n", " Tenure=ColumnTransformer_out.result.Tenure,\n", " InternetService=ColumnTransformer_out.result.InternetService,\n", " Contract=ColumnTransformer_out.result.Contract,\n", " PaperlessBilling=ColumnTransformer_out.result.PaperlessBilling,\n", " PaymentMethod=ColumnTransformer_out.result.PaymentMethod,\n", " MonthlyCharges=ColumnTransformer_out.result.MonthlyCharges,\n", " TotalCharges=ColumnTransformer_out.result.TotalCharges,\n", " Gender_0=ColumnTransformer_out.result.Gender_0,\n", " Gender_1=ColumnTransformer_out.result.Gender_1,\n", " Partner_0=ColumnTransformer_out.result.Partner_0,\n", " Partner_1=ColumnTransformer_out.result.Partner_1,\n", " Dependents_0=ColumnTransformer_out.result.Dependents_0,\n", " Dependents_1=ColumnTransformer_out.result.Dependents_1,\n", " PhoneService_0=ColumnTransformer_out.result.PhoneService_0,\n", " PhoneService_1=ColumnTransformer_out.result.PhoneService_1,\n", " MultipleLines_0=ColumnTransformer_out.result.MultipleLines_0,\n", " MultipleLines_1=ColumnTransformer_out.result.MultipleLines_1,\n", " OnlineSecurity_0=ColumnTransformer_out.result.OnlineSecurity_0,\n", " OnlineSecurity_1=ColumnTransformer_out.result.OnlineSecurity_1,\n", " OnlineBackup_0=ColumnTransformer_out.result.OnlineBackup_0,\n", " OnlineBackup_1=ColumnTransformer_out.result.OnlineBackup_1,\n", " DeviceProtection_0=ColumnTransformer_out.result.DeviceProtection_0,\n", " DeviceProtection_1=ColumnTransformer_out.result.DeviceProtection_1,\n", " TechSupport_0=ColumnTransformer_out.result.TechSupport_0,\n", " TechSupport_1=ColumnTransformer_out.result.TechSupport_1,\n", " StreamingTV_0=ColumnTransformer_out.result.StreamingTV_0,\n", " StreamingTV_1=ColumnTransformer_out.result.StreamingTV_1,\n", " StreamingMovies_0=ColumnTransformer_out.result.StreamingMovies_0,\n", " StreamingMovies_1=ColumnTransformer_out.result.StreamingMovies_1,\n", " PaperlessBilling_0=ColumnTransformer_out.result.PaperlessBilling_0,\n", " PaperlessBilling_1=ColumnTransformer_out.result.PaperlessBilling_1)\n", " \n", " " ] }, { "cell_type": "code", "execution_count": null, "id": "7e4c30b0-4989-4540-bf60-3e4631afeacd", "metadata": {}, "outputs": [], "source": [ "Transformed_data" ] }, { "cell_type": "code", "execution_count": null, "id": "f38c7f0f-6da8-4ef9-acaf-0774d29c92e8", "metadata": {}, "outputs": [], "source": [ "Transformed_data.shape" ] }, { "cell_type": "markdown", "id": "0edef2c8-e568-4626-9377-e189d66e3350", "metadata": {}, "source": [ "

We can see from the above how our data has been transformed from the original values.

" ] }, { "cell_type": "code", "execution_count": null, "id": "05c50e66-97a9-4fc9-8bd1-20654dc318fc", "metadata": {}, "outputs": [], "source": [ "# Copying the intermediate table to database\n", "Transformed_data.to_sql(\"Transformed_data\",primary_index = \"CustomerID\", if_exists = \"replace\")" ] }, { "cell_type": "markdown", "id": "37551d5e-2366-42cf-83f9-1a48ee438c6c", "metadata": {}, "source": [ "

Create train and test data

Now that we have transformed our data and it is fit for use in machine learning models, let us split the whole dataset into train and test sets for model training and scoring. We will use the TrainTestSplit function for this task.

" ] }, { "cell_type": "code", "execution_count": null, "id": "b22bf0d0-255e-4ea9-8bdc-ffac9ea02f34", "metadata": {}, "outputs": [], "source": [ "TrainTestSplit_out = TrainTestSplit(\n", " data = DataFrame('Transformed_data'),\n", " id_column = \"CustomerID\",\n", " train_size = 0.75,\n", " test_size = 0.25,\n", " seed = 21\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "19d382b8-7f46-43f8-aee9-5598d6f24ebf", "metadata": {}, "outputs": [], "source": [ "# Split into 2 virtual dataframes\n", "df_train = TrainTestSplit_out.result[TrainTestSplit_out.result['TD_IsTrainRow'] == 1].drop(['TD_IsTrainRow'], axis = 1)\n", "df_test = TrainTestSplit_out.result[TrainTestSplit_out.result['TD_IsTrainRow'] == 0].drop(['TD_IsTrainRow'], axis = 1)" ] }, { "cell_type": "markdown", "id": "88e87734-1eb6-47a6-87d4-3b2d0585088a", "metadata": {}, "source": [ "

We have finished preprocessing the data and created our training and test datasets; let's now create some predictive models." ] }, { "cell_type": "markdown", "id": "3624ff0f-2e80-450e-a76e-85398a8c73da", "metadata": {}, "source": [ "


\n", "

4. InDb Model Training and Scoring

" ] }, { "cell_type": "markdown", "id": "2893fa15-2812-473d-b91c-5949ba436461", "metadata": {}, "source": [ "

4.1 Logistic Regression

" ] }, { "cell_type": "markdown", "id": "f4561645-5edd-4e2b-9983-59e77ab4745a", "metadata": {}, "source": [ "

For our model, we will use logistic regression.
\n", " Logistic regression is a statistical algorithm used for binary classification problems. It is a type of supervised learning algorithm that predicts the probability of an input belonging to a certain class (e.g., positive or negative) based on its features.
Logistic regression works by modeling the relationship between the input features and the probability of belonging to a certain class using a logistic function. The logistic function takes the input feature values and maps them onto a probability scale between 0 and 1, which represents the probability of belonging to the positive class.
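Formally, the logistic (sigmoid) function maps a linear combination of the input features $x$ to a probability:

$$P(\text{churn} = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta^\top x)}}$$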
\n", " The GLM function is a generalized linear model (GLM) that performs regression and classification analysis on data sets.\n", "
Please refer to GLM for function elements and output." ] }, { "cell_type": "code", "execution_count": null, "id": "cc6639e3-2427-42d2-b302-08e18196b2b1", "metadata": {}, "outputs": [], "source": [ "df_train" ] }, { "cell_type": "code", "execution_count": null, "id": "e92723cc-1aec-4fee-97fe-96b3e86e7802", "metadata": {}, "outputs": [], "source": [ "from teradataml import GLM, TDGLMPredict\n", "\n", "glm_model = GLM(data = df_train,\n", " input_columns = ['1:8','10:33'], \n", " response_column = 'Churn',\n", " family = 'Binomial')" ] }, { "cell_type": "code", "execution_count": null, "id": "0e3b4239-40a5-4b2c-9589-7d974c574641", "metadata": {}, "outputs": [], "source": [ "glm_model.result" ] }, { "cell_type": "markdown", "id": "a24bb51b-7115-486e-957a-848bad5bc4d9", "metadata": {}, "source": [ "

We have created our model, let's do the predictions on the test dataset." ] }, { "cell_type": "code", "execution_count": null, "id": "5ebd11f4-bb9a-4b71-a6ca-7fe9f6fe9755", "metadata": {}, "outputs": [], "source": [ "glm_prediction = TDGLMPredict(newdata = df_test,\n", " id_column = 'CustomerID',\n", " object = glm_model.result,\n", " accumulate = 'Churn',\n", " output_prob=True,\n", " output_responses = ['0', '1'])" ] }, { "cell_type": "code", "execution_count": null, "id": "498799ce-2607-4cba-94e6-26a40abd7d0e", "metadata": {}, "outputs": [], "source": [ "out_glm = glm_prediction.result.assign(prediction = glm_prediction.result.prediction.cast(type_ = BYTEINT))\n", "out_glm = out_glm.assign(prediction = out_glm.prediction.cast(type_ = VARCHAR(2)))\n", "out_glm = out_glm.assign(Churn = out_glm.Churn.cast(type_ = VARCHAR(2)))\n", "out_glm" ] }, { "cell_type": "markdown", "id": "121887bb-4fb2-429f-99d7-120c6dd47e51", "metadata": {}, "source": [ "

The output above shows prob_1, i.e. the probability that the customer will churn, and prob_0, i.e. the probability that the customer will not churn. The prediction column uses these probabilities to assign a class label.
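Assuming the conventional 0.5 decision threshold (an assumption for illustration; see the TDGLMPredict documentation for the exact rule), the label follows from the probability like this:

```python
# Illustrative only: deriving a class label from the churn probability,
# assuming a 0.5 decision threshold
def to_label(prob_1: float, threshold: float = 0.5) -> int:
    return 1 if prob_1 >= threshold else 0

print(to_label(0.73))  # 1 -> predicted to churn
```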

" ] }, { "cell_type": "markdown", "id": "7fa0a4c0-c4f8-499b-9e60-8bc87684a5c8", "metadata": { "tags": [] }, "source": [ "
\n", "

4.2 Evaluation of Logistic Regression Model

\n", "

We will use the ClassificationEvaluator function to evaluate the trained GLM model on the test data. This will let us know how well our model performs on unseen data.

" ] }, { "cell_type": "code", "execution_count": null, "id": "683015ad-d54b-4a33-a71f-345a348ee912", "metadata": {}, "outputs": [], "source": [ "ClassificationEvaluator_glm = ClassificationEvaluator(\n", " data = out_glm,\n", " observation_column = 'Churn',\n", " prediction_column = 'prediction',\n", " labels = ['0', '1']\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "679b7a22-e08e-414a-aec3-e8ae6b1e0701", "metadata": {}, "outputs": [], "source": [ "ClassificationEvaluator_glm.output_data.head(10)" ] }, { "cell_type": "markdown", "id": "d281b3fa-0793-4e86-986b-26bda09833ec", "metadata": {}, "source": [ "

The above output shows the precision, recall, F1-score, and support values derived from the confusion matrix.

\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ColumnDescription
PrecisionThe positive predictive value. Refers to the fraction of relevant instances among\n", "the total retrieved instances.\n", " Precision answers the following question: what proportion of predicted Positives is truly Positive? \n", " Precision = (TP)/(TP+FP)
RecallRefers to the fraction of relevant instances retrieved over the total amount of\n", "relevant instances. Recall answers a different question: what proportion of actual Positives is correctly classified?\n", "Recall = (TP)/(TP+FN)
F1F1 score, defined as the harmonic mean of the precision and recall and is a number between 0 and 1. F1 score maintains a balance between the precision and recall for your classifier. \n", " F1 = 2*(precision*recall/precision+recall)
SupportThe number of times a label displays in the Observation Column.
\n", "

**TP**: True Positive, **FP**: False Positive, **TN**: True Negative, **FN**: False Negative
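For instance, with hypothetical confusion-matrix counts, the metrics are computed as follows:

```python
# Illustrative only: metrics from hypothetical confusion-matrix counts
TP, FP, FN = 80, 20, 40

precision = TP / (TP + FP)                          # 0.8
recall = TP / (TP + FN)                             # ~0.667
f1 = 2 * precision * recall / (precision + recall)  # ~0.727

print(precision, recall, f1)
```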

" ] }, { "cell_type": "markdown", "id": "6aaf3776-22c7-4697-a7f8-e5e334067b18", "metadata": { "tags": [] }, "source": [ "

We can also calculate the mean absolute error (MAE) and the AUC (Area Under the Curve) for the Receiver Operating Characteristic (ROC) curve.
Mean Absolute Error is the sum of the absolute differences between actual and predicted values, averaged over the number of observations.
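Formally, for $n$ observations with actual values $y_i$ and predictions $\hat{y}_i$:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$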

" ] }, { "cell_type": "code", "execution_count": null, "id": "937f54ea-2776-472d-896f-3a707d645be0", "metadata": {}, "outputs": [], "source": [ "glm_pred = glm_prediction.result.to_pandas()\n", "print(mean_absolute_error(glm_pred['Churn'], glm_pred['prob_1']))" ] }, { "cell_type": "markdown", "id": "35733201-e225-4ba5-b712-105992bf177e", "metadata": {}, "source": [ "

The ROC curve is a graph between TPR(True Positive Rate) and FPR(False Positive Rate). The area under the ROC curve is a metric of how well the model can distinguish between positive and negative classes. The higher the AUC, the better the model's performance in distinguishing between the positive and negative classes.

" ] }, { "cell_type": "code", "execution_count": null, "id": "5962c501-8e8e-4594-ab63-d6e0df2e07a8", "metadata": {}, "outputs": [], "source": [ "AUC = roc_auc_score(glm_pred['Churn'], glm_pred['prob_1'])\n", "AUC" ] }, { "cell_type": "code", "execution_count": null, "id": "483506fb-d59d-4df0-a2fe-9d1e86ffdf72", "metadata": {}, "outputs": [], "source": [ "fpr, tpr, thresholds = roc_curve(glm_pred['Churn'], glm_pred['prob_1'])\n", "plt.plot(fpr, tpr, color='orange', label='ROC. AUC = {}'.format(str(AUC)))\n", "plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--')\n", "plt.xlabel('False Positive Rate')\n", "plt.ylabel('True Positive Rate')\n", "plt.title('Receiver Operating Characteristic (ROC) Curve')\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "e91ddacb-4ca3-405c-93c8-c587788191c4", "metadata": {}, "source": [ "

Conclusion

" ] }, { "cell_type": "markdown", "id": "516e0588-5e98-4373-8f30-8e8f40898835", "metadata": {}, "source": [ "

In this demo, we have seen how we can do analysis and pre-processing of the data in Vantage using InDb functions. We have also created a commonly used predictive model for classification and predicted the customers that are likely to churn. " ] }, { "cell_type": "markdown", "id": "35ebb886-8da9-479a-8995-c6dd7ccebffd", "metadata": {}, "source": [ "


\n", "

5. Cleanup

" ] }, { "cell_type": "markdown", "id": "f0b01f3e-03fa-4a14-b388-02eeb210b8c1", "metadata": {}, "source": [ "

Work Tables

\n", "

\n", "We need to clean up our work tables to prevent errors next time." ] }, { "cell_type": "code", "execution_count": null, "id": "914cfbcf-f229-496c-be13-b63c62729291", "metadata": {}, "outputs": [], "source": [ "tables = ['Transformed_data']\n", "\n", "# Loop through the list of tables and execute the drop table command for each table\n", "for table in tables:\n", " try:\n", " db_drop_table(table_name = table)\n", " except:\n", " pass" ] }, { "cell_type": "markdown", "id": "98028152-2401-429a-9141-0488eaecb0f5", "metadata": {}, "source": [ "

Databases and Tables

\n", "

We will use the following code to clean up tables and databases created for this demonstration.

" ] }, { "cell_type": "code", "execution_count": null, "id": "7aae8012-ca09-4541-9389-1bb82f283ff1", "metadata": {}, "outputs": [], "source": [ "%run -i ../run_procedure.py \"call remove_data('DEMO_Telco');\" \n", "#Takes 10 seconds" ] }, { "cell_type": "code", "execution_count": null, "id": "9ad94d1e-d82d-4611-b5c7-4180397f6c94", "metadata": {}, "outputs": [], "source": [ "remove_context()" ] }, { "cell_type": "markdown", "id": "10724002-0091-4ef1-b091-71c0a2fdda5a", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "id": "3eee6695-147e-4b5b-a0e0-ae6a1d9629db", "metadata": {}, "source": [ "Required Materials\n", "

Let’s look at the elements we have available for reference for this notebook:

" ] }, { "cell_type": "markdown", "id": "90d6c2a3-92e9-4121-a46a-2beaba63cac2", "metadata": {}, "source": [ "

Filters:

\n", " \n", "

Related Resources:

\n", " " ] }, { "cell_type": "markdown", "id": "9094fe39-0a98-43d6-a62b-7c2d36acf654", "metadata": {}, "source": [ "

Reference Links:

\n", "" ] }, { "cell_type": "markdown", "id": "a8f9f644", "metadata": {}, "source": [ "Dataset:\n", "\n", "- `CustomerID`: unique id of customer\n", "- `Gender`: Whether the customer is a male or a female\n", "- `SeniorCitizen`:Whether the customer is a senior citizen or not (1, 0)\n", "- `Partner`:Whether the customer has a partner or not (Yes, No)\n", "- `Dependents`:Whether the customer has dependents or not (Yes, No)\n", "- `Tenure`:Number of months the customer has stayed with the company\n", "- `PhoneService`:Whether the customer has a phone service or not (Yes, No)\n", "- `MultipleLines`:Whether the customer has multiple lines or not (Yes, No, No phone service)\n", "- `InternetService`:Customer’s internet service provider (DSL, Fiber optic, No)\n", "- `OnlineSecurity`:Whether the customer has online security or not (Yes, No, No internet service)\n", "- `OnlineBackup`:Whether the customer has online backup or not (Yes, No, No internet service)\n", "- `DeviceProtection`:Whether the customer has device protection or not (Yes, No, No internet service)\n", "- `TechSupport`:Whether the customer has tech support or not (Yes, No, No internet service)\n", "- `StreamingTV`:Whether the customer has streaming TV or not (Yes, No, No internet service)\n", "- `StreamingMovies`:Whether the customer has streaming movies or not (Yes, No, No internet service)\n", "- `Contract`:The contract term of the customer (Month-to-month, One year, Two year)\n", "- `PaperlessBilling`:Whether the customer has paperless billing or not (Yes, No)\n", "- `PaymentMethod`:The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))\n", "- `MonthlyCharges`:The amount charged to the customer monthly\n", "- `TotalCharges`:The total amount charged to the customer\n", "- `Churn`:Whether the customer churned or not (Yes or No)" ] }, { "cell_type": "markdown", "id": "c30802a4-8141-47f6-971d-0bb79be6f5bf", "metadata": {}, "source": [ "" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.10" }, "toc-autonumbering": true }, "nbformat": 4, "nbformat_minor": 5 }