{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Telco Customer Churn para ICP4D" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Usaremos este 'notebook' para crear un modelo de 'machine learning' para predecir el 'CHURN' en los clientes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1.0 Instalar las paqueterías requeridas" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install --user watson-machine-learning-client --upgrade | tail -n 1\n", "!pip install --user pyspark==2.3.3 --upgrade|tail -n 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2.0 Cargar y limpiar los datos.\n", "Cargamos nuestros datos como un marco de datos (Data Frame) de pandas.\n", "\n", "* Resalta la celda inferior dándole clic.\n", "* Da clic en el ícono `01/00` \"Find and add data\" en la parte superior derecha de la 'Notebook'.\n", "* Si estás usando 'Virtualized data', comienza seleccionando la pestaña 'Files'. Ahora, selecciona tus datos virtualizados (ej. MYSCHEMA.BILLINGPRODUCTCUSTOMERS), da clic en 'Insert to code' y selecciona 'Insert Pandas DataFrame'.\n", "* Si estás usando esta 'Notebook' sin datos virtualizados, agrega localmente el archivo: `Telco-Customer-Churn.csv` Seleccionando la pestaña 'Files'. Después, selecciona el archivo: 'Telco-Customer-Churn.csv'. Da clic en 'Insert to code' y selecciona 'Insert Pandas DataFrame'.\n", "* El códico para traer los datos a este ambiente de 'Notebook' y crear el marco de datos de Pandas (Pandas DataFrame) será agregado el la celda inferior.\n", "* Correr la celda.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "# Coloca el cursor debajo e inserta el marco de datos de pandas(Pandas DataFrame) para los datos de la Telco.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Usaremos la convención de nombramiento de Pandas 'df' para nuestro marco de datos (DataFrame). Asegúrate de que la celda inferior use el mismo nombre para el marco de datos(DataFrame) usado en la celda superior. Para el archivo cargado de forma local, debe verse como: 'df_data_1' o 'df_data_2' o 'df_data_x'. Para el caso de los datos virtualizados, debe verse como: data_df_1 o data_df_2 o data_df_x." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Para datos virtualizados:\n", "# df = data_df_1\n", "\n", "# Para carga local:\n", "df = df_data_1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1 Desplegar la característica 'CustomerID' (columna)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = df.drop('ClienteID', axis=1)\n", "df.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2 Examinar los tipos de datos de las características (columnas)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3 Verificar la necesidad de convertir la columna 'TotalCharges' a numérico si es detectado como objeto\n", "\n", "Si el 'df.info' superior, muestra la columna \"TotalCharges\" como un objeto, necesitamos convertirlo a numérico. Si ya lo has hecho durante el ejericio previo para \"Visualización de datos con refinería de datos\", puedes saltar este paso `2.5`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "totalCharges = df.columns.get_loc(\"Cargos totales\")\n", "print(totalCharges)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "new_col = pd.to_numeric(df.iloc[:, totalCharges], errors='coerce')\n", "new_col" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.4 Modificamos nuestro marco de datos 'Data Frame' para reflejar el nuevo tipo de dato" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.iloc[:, totalCharges] = pd.Series(new_col)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "### 2.5 Los valores NaN deben ser removidos para crear un modelo más certero." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Revisar si tenemos valores 'NaN' ('Not a Number').\n", "df.isnull().values.any()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Configura la columna 'nan_column' al numero de columna de 'TotalCharges' (comenzando en 0)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nan_column = df.columns.get_loc(\"Cargos totales\")\n", "print(nan_column)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Maneja los valores faltantes para la columna 'nan_column' de 'Cargos totales'\n", "\n", "import numpy as np\n", "from sklearn.impute import SimpleImputer\n", "\n", "imp = SimpleImputer(missing_values=np.nan, strategy='mean')\n", "\n", "df.iloc[:, nan_column] = imp.fit_transform(df.iloc[:, nan_column].values.reshape(-1, 1))\n", "df.iloc[:, nan_column] = pd.Series(df.iloc[:, nan_column])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Verificar si tenemos valores 'NaN'.\n", "df.isnull().values.any()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.6 Visualizar los datos" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "import os\n", "import pandas as pd\n", "import numpy as np\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "from sklearn import preprocessing, svm\n", "from itertools import combinations\n", "from sklearn.preprocessing import PolynomialFeatures, LabelEncoder, StandardScaler\n", "import sklearn.feature_selection\n", "from sklearn.model_selection import train_test_split\n", "from collections import defaultdict\n", "from sklearn import metrics" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Conteo de frecuencia en la permanencia de la trama (plot tenure).\n", "sns.set(style=\"darkgrid\")\n", "sns.set_palette(\"hls\", 3)\n", "fig, ax = plt.subplots(figsize=(20,10))\n", "ax = sns.countplot(x=\"Permanencia\", hue=\"CHURN\", data=df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Conteo de frecuencia en la permanencia de la trama (plot tenure).\n", "sns.set(style=\"darkgrid\")\n", "sns.set_palette(\"hls\", 3)\n", "fig, ax = plt.subplots(figsize=(20,10))\n", "ax = sns.countplot(x=\"Contrato\", hue=\"CHURN\", data=df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Conteo de frecuencia en la permanencia de la trama (plot tenure). \n", "sns.set(style=\"darkgrid\")\n", "sns.set_palette(\"hls\", 3)\n", "fig, ax = plt.subplots(figsize=(20,10))\n", "ax = sns.countplot(x=\"Soporte tecnico\", hue=\"CHURN\", data=df)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Crear la cuadrícula para relaciones en pares.\n", "gr = sns.PairGrid(df, height=5, hue=\"CHURN\")\n", "gr = gr.map_diag(plt.hist)\n", "gr = gr.map_offdiag(plt.scatter)\n", "gr = gr.add_legend()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Configurar el tamaño de la trama.\n", "fig, ax = plt.subplots(figsize=(6,6))\n", "\n", "# Distribución de atributos\n", "a = sns.boxplot(orient=\"v\", palette=\"hls\", data=df.iloc[:, totalCharges], fliersize=14)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Distribución de datos de cargas totales.\n", "histogram = sns.distplot(df.iloc[:, totalCharges], hist=True)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tenure = df.columns.get_loc(\"Permanencia\")\n", "print(tenure)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Distribución de permanencia de datos (Tenure).\n", "histogram = sns.distplot(df.iloc[:, tenure], hist=True)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "monthly = df.columns.get_loc(\"Cargos mensuales\")\n", "print(monthly)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Distribución de datos de cargos mensuales.\n", "histogram = sns.distplot(df.iloc[:, monthly], hist=True)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Entender la distribución de datos\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3.0 Crear un modelo" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from pyspark.sql import SparkSession\n", "import pandas as pd\n", "import json\n", "\n", "spark = SparkSession.builder.getOrCreate()\n", "df_data = spark.createDataFrame(df)\n", "df_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.1 Dividir los datos en conjuntos de entrenamiento y de prueba." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "spark_df = df_data\n", "(train_data, test_data) = spark_df.randomSplit([0.8, 0.2], 24)\n", "\n", "print(\"Número de registros para entrenamiento: \" + str(train_data.count()))\n", "print(\"Número de registros para evaluación: \" + str(test_data.count()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2 Examinar el esquema de marco de datos (DataFrame Schema) de 'Spark'.\n", "Observa los tipos de datos para determinar los requerimientos para la ingeniería de características." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "spark_df.printSchema()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.3 Usar 'StringIndexer' para codificar una columna de etiquetas 'string' a una columna de etiquetas índice." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from pyspark.ml.classification import RandomForestClassifier\n", "from pyspark.ml.feature import StringIndexer, IndexToString, VectorAssembler\n", "from pyspark.ml.evaluation import BinaryClassificationEvaluator\n", "from pyspark.ml import Pipeline, Model\n", "\n", "\n", "si_gender = StringIndexer(inputCol = 'Genero', outputCol = 'gender_IX')\n", "si_Partner = StringIndexer(inputCol = 'Pareja', outputCol = 'Partner_IX')\n", "si_Dependents = StringIndexer(inputCol = 'Dependientes', outputCol = 'Dependents_IX')\n", "si_PhoneService = StringIndexer(inputCol = 'Servicio telefonico', outputCol = 'PhoneService_IX')\n", "si_MultipleLines = StringIndexer(inputCol = 'Lineas multiples', outputCol = 'MultipleLines_IX')\n", "si_InternetService = StringIndexer(inputCol = 'Servicio de internet', outputCol = 'InternetService_IX')\n", "si_OnlineSecurity = StringIndexer(inputCol = 'Seguridad en linea', outputCol = 'OnlineSecurity_IX')\n", "si_OnlineBackup = StringIndexer(inputCol = 'Respaldo en linea', outputCol = 'OnlineBackup_IX')\n", "si_DeviceProtection = StringIndexer(inputCol = 'Proteccion de dispositivo', outputCol = 'DeviceProtection_IX')\n", "si_TechSupport = StringIndexer(inputCol = 'Soporte tecnico', outputCol = 'TechSupport_IX')\n", "si_StreamingTV = StringIndexer(inputCol = 'StreamingTV', outputCol = 'StreamingTV_IX')\n", "si_StreamingMovies = StringIndexer(inputCol = 'StreamingMovies', outputCol = 'StreamingMovies_IX')\n", "si_Contract = StringIndexer(inputCol = 'Contrato', outputCol = 'Contract_IX')\n", "si_PaperlessBilling = StringIndexer(inputCol = 'Facturacion sin papel', outputCol = 'PaperlessBilling_IX')\n", "si_PaymentMethod = StringIndexer(inputCol = 'Metodo de pago', outputCol = 'PaymentMethod_IX')\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "si_Label = StringIndexer(inputCol=\"CHURN\", outputCol=\"label\").fit(spark_df)\n", "label_converter = IndexToString(inputCol=\"prediction\", outputCol=\"predictedLabel\", labels=si_Label.labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.4 Crear un vector." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "va_features = VectorAssembler(inputCols=['gender_IX', 'Adulto Mayor', 'Partner_IX', 'Dependents_IX', 'PhoneService_IX', 'MultipleLines_IX', 'InternetService_IX', \\\n", " 'OnlineSecurity_IX', 'OnlineBackup_IX', 'DeviceProtection_IX', 'TechSupport_IX', 'StreamingTV_IX', 'StreamingMovies_IX', \\\n", " 'Contract_IX', 'PaperlessBilling_IX', 'PaymentMethod_IX', 'Cargos totales', 'Cargos mensuales'], outputCol=\"features\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.5 Crear una 'pipeline', y ajustar un modelo usando 'RandomForestClassifier'. \n", "Montar todas las etapas a una 'pipeline'. No esperamos una regresión lineal limpia, así que usaremos 'RandomForestClassifier' para encontrar el mejor árbol de decisión para nuestros datos." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "classifier = RandomForestClassifier(featuresCol=\"features\")\n", "\n", "pipeline = Pipeline(stages=[si_gender, si_Partner, si_Dependents, si_PhoneService, si_MultipleLines, si_InternetService, si_OnlineSecurity, si_OnlineBackup, si_DeviceProtection, \\\n", " si_TechSupport, si_StreamingTV, si_StreamingMovies, si_Contract, si_PaperlessBilling, si_PaymentMethod, si_Label, va_features, \\\n", " classifier, label_converter])\n", "\n", "model = pipeline.fit(train_data)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictions = model.transform(test_data)\n", "evaluatorDT = BinaryClassificationEvaluator(rawPredictionCol=\"prediction\")\n", "area_under_curve = evaluatorDT.evaluate(predictions)\n", "\n", "#La evauación predeterminada es 'areaUnderROC'\n", "print(\"areaUnderROC = %g\" % area_under_curve)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 4.0 Guardar el modelo y probar los datos." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Agrega un nombre único para el MODEL_NAME." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "MODEL_NAME = \"mi-modelo mi-fecha\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.1 Guardar el modelo en 'ICP4D local Watson Machine Learning'.\n", "Reemplaza el URL, apikey e instance ID con tus credenciales de 'Watson Machine Learning'.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from watson_machine_learning_client import WatsonMachineLearningAPIClient\n", "\n", "wml_credentials = {\n", " \"url\": \"*****\",\n", " \"apikey\" : \"*****\",\n", " \"instance_id\": \"*****\",\n", " }\n", "\n", "client = WatsonMachineLearningAPIClient(wml_credentials)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "client.repository.list()\n", "client.repository.list_models()\n", "client.deployments.list()\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Guardar el modelo\n", "model_props = model_props = {client.repository.ModelMetaNames.NAME: MODEL_NAME}\n", "published_model = client.repository.store_model(model=model, pipeline=pipeline, meta_props=model_props, training_data=train_data)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Usar esta celda para cualquier limpieza de modelos y despliegues previos.\n", "client.repository.list_models()\n", "client.deployments.list()\n", "\n", "# client.repository.delete('GUID del modelo guardado')\n", "# client.deployments.delete('GUID del modelo desplegado')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.2 Escribe los datos de prueba sin ninguna etiqueta a un .csv, para usarlo después como puntuación por lotes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from project_lib import Project\n", "project = Project(\"\"\n", ",\n", "\"Project_ID\" ,\n", "\"Access_Token\" )\n", "write_score_CSV=test_data.toPandas().drop(['Churn'], axis=1)\n", "project.save_data(file_name = \n", "'TelcoCustomerSparkMLBatchScore.csv' ,data = write_score_CSV.to_csv(index=False))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.3 Escribe los datos de prueba a un .csv para usarlos posteriormente para la evaluación." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from project_lib import Project\n", "project = Project(\"\"\n", ",\n", "\"Project_ID\" ,\n", "\"Access_Token\" )\n", "write_eval_CSV=test_data.toPandas()\n", "project.save_data(file_name = \n", "'TelcoCustomerSparkMLEval.csv' ,data = write_eval_CSV.to_csv(index=False))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Felicidades, ya has creado un modelo basado en los datos de 'CHURN' de clientes, y lo has desplegado a 'Watson Machine Learning'!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.6", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 1 }