{ "cells": [ { "cell_type": "markdown", "id": "df47e79e", "metadata": {}, "source": [ "# Ejemplo: Comparando modelos utilizando una prueba de McNemar" ] }, { "cell_type": "markdown", "id": "84dab76c", "metadata": {}, "source": [ "## Introducción" ] }, { "cell_type": "markdown", "id": "bcfb8b20", "metadata": {}, "source": [ "Instalamos la librerias necesarias" ] }, { "cell_type": "code", "execution_count": null, "id": "8b8a79ca", "metadata": {}, "outputs": [], "source": [ "!wget https://raw.githubusercontent.com/santiagxf/E72102/master/docs/develop/modeling/selection/code/mcnemar.txt \\\n", " --quiet --no-clobber\n", "!pip install -r mcnemar.txt --quiet" ] }, { "cell_type": "markdown", "id": "a9c70f20", "metadata": {}, "source": [ "### Sobre el conjunto de datos del censo UCI\n", "\n", "El conjunto de datos del censo de la UCI es un conjunto de datos en el que cada registro representa a una persona. Cada registro contiene 14 columnas que describen a una una sola persona, de la base de datos del censo de Estados Unidos de 1994. Esto incluye información como la edad, el estado civil y el nivel educativo. La tarea es determinar si una persona tiene un ingreso alto (definido como ganar más de $50 mil al año). Esta tarea, dado el tipo de datos que utiliza, se usa a menudo en el estudio de equidad, en parte debido a los atributos comprensibles del conjunto de datos, incluidos algunos que contienen tipos sensibles como la edad y el género, y en parte también porque comprende una tarea claramente del mundo real." ] }, { "cell_type": "markdown", "id": "14671507", "metadata": {}, "source": [ "Descargamos el conjunto de datos" ] }, { "cell_type": "code", "execution_count": null, "id": "a5408998", "metadata": {}, "outputs": [], "source": [ "!wget https://santiagxf.blob.core.windows.net/public/datasets/uci_census.zip \\\n", " --quiet --no-clobber\n", "!mkdir -p datasets/uci_census\n", "!unzip -qq uci_census.zip -d datasets/uci_census" ] }, { "cell_type": "markdown", "id": "0742ec68", "metadata": {}, "source": [ "Preparando nuestros conjuntos de datos" ] }, { "cell_type": "code", "execution_count": 1, "id": "20c42cb1", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "train = pd.read_csv('datasets/uci_census/data/adult-train.csv')\n", "test = pd.read_csv('datasets/uci_census/data/adult-test.csv')\n", "\n", "X_train = train.drop(['income'], axis=1)\n", "y_train = train['income'].to_numpy()\n", "X_test = test.drop(['income'], axis=1)\n", "y_test = test['income'].to_numpy()" ] }, { "cell_type": "markdown", "id": "25e98327", "metadata": {}, "source": [ "## Preparación de los datos para el ejemplo" ] }, { "cell_type": "markdown", "id": "216af887", "metadata": {}, "source": [ "Realizaremos un pequeño preprocesamiento antes de entrenar el modelo:\n", "\n", "- Imputaremos los valores faltantes de las caracteristicas numéricas con la media\n", "- Imputaremos los valores faltantes de las caracteristicas categóricas con el valor `?`\n", "- Escalaremos los valores numericos utilizando un `StandardScaler`\n", "- Codificaremos las variables categóricas utilizando `OneHotEncoder`" ] }, { "cell_type": "code", "execution_count": 3, "id": "6416e7ca", "metadata": {}, "outputs": [], "source": [ "from typing import Tuple, List\n", "\n", "import sklearn\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.impute import SimpleImputer\n", "from sklearn.preprocessing import StandardScaler, OneHotEncoder\n", "from sklearn.compose import ColumnTransformer\n", "\n", "\n", "def prepare(X: pd.DataFrame) -> Tuple[np.ndarray, sklearn.compose.ColumnTransformer]:\n", " pipe_cfg = {\n", " 'num_cols': X.dtypes[X.dtypes == 'int64'].index.values.tolist(),\n", " 'cat_cols': X.dtypes[X.dtypes == 'object'].index.values.tolist(),\n", " }\n", " \n", " num_pipe = Pipeline([\n", " ('num_imputer', SimpleImputer(strategy='median')),\n", " ('num_scaler', StandardScaler())\n", " ])\n", " \n", " cat_pipe = Pipeline([\n", " ('cat_imputer', SimpleImputer(strategy='constant', fill_value='?')),\n", " ('cat_encoder', OneHotEncoder(handle_unknown='ignore', sparse=False))\n", " ])\n", " \n", " transformations = ColumnTransformer([\n", " ('num_pipe', num_pipe, pipe_cfg['num_cols']),\n", " ('cat_pipe', cat_pipe, pipe_cfg['cat_cols'])\n", " ])\n", " X = transformations.fit_transform(X)\n", " \n", " return X, transformations\n", "\n", "\n", "X_train_transformed, transformations = prepare(X_train)\n", "X_test_transformed = transformations.transform(X_test)" ] }, { "cell_type": "markdown", "id": "1d5e7904", "metadata": {}, "source": [ "## Definiendo nuestros modelos a comparar\n", "\n", "Para demostrar la técnica, utilizaremos dos clasificadores basados en `LightGBM`" ] }, { "cell_type": "code", "execution_count": 14, "id": "a20072c9", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LGBMClassifier(min_split_gain=2, n_jobs=2, reg_alpha=1)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from lightgbm import LGBMClassifier\n", "\n", "clf1 = LGBMClassifier(n_estimators=100, n_jobs=2)\n", "clf1.fit(X_train_transformed, y_train)\n", "clf2 = LGBMClassifier(n_estimators=100, reg_alpha=1, min_split_gain=2, n_jobs=2)\n", "clf2.fit(X_train_transformed, y_train)" ] }, { "cell_type": "markdown", "id": "1ea786a5", "metadata": {}, "source": [ "Evaluemos su performance:" ] }, { "cell_type": "code", "execution_count": 15, "id": "0a612a14", "metadata": {}, "outputs": [], "source": [ "clf1_pred = clf1.predict(X_test_transformed)\n", "clf2_pred = clf2.predict(X_test_transformed)" ] }, { "cell_type": "code", "execution_count": 22, "id": "ee8a95ae", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Modelo 1: 0.875\n", "Modelo 2: 0.874\n" ] } ], "source": [ "from sklearn.metrics import accuracy_score\n", "\n", "print(f'Modelo 1: {accuracy_score(clf1_pred, y_test):.3g}')\n", "print(f'Modelo 2: {accuracy_score(clf2_pred, y_test):.3g}')" ] }, { "cell_type": "markdown", "id": "4e6247a3", "metadata": {}, "source": [ "Pareciera que el modelo 1 tiene ligeramente una mejor performance. Verifiquemos si vale la pena:" ] }, { "cell_type": "markdown", "id": "eb9af338", "metadata": {}, "source": [ "## Procedimiento de McNemar" ] }, { "cell_type": "markdown", "id": "375b23bb", "metadata": {}, "source": [ "### Construimos una tabla de contingencia\n", "\n", "La prueba de McNemar se base en una matrix de contingencia de 2x2 donde en las filas tenemos las diferentes instancias de datos, y en las columnas tenemos un idicador mencionando si el modelo realizó una predicción correcta o no. Esto es equivalente a generar una matriz de confusión entre los ambos modelos." ] }, { "cell_type": "code", "execution_count": 16, "id": "7a48ace5", "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import confusion_matrix\n", "\n", "cont_table = confusion_matrix(clf1_pred, clf2_pred)" ] }, { "cell_type": "markdown", "id": "ca114c8d", "metadata": {}, "source": [ "### Computamos el valor estadístico" ] }, { "cell_type": "markdown", "id": "5b52b7f5", "metadata": {}, "source": [ "En base a esta tabla, el valor estádistico de McNemar se calcula, para dos modelos:" ] }, { "cell_type": "code", "execution_count": 17, "id": "1e74444b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "pvalue 0.1581559805552949\n", "statistic 1.991769547325103\n" ] } ], "source": [ "from statsmodels.stats.contingency_tables import mcnemar\n", "\n", "results = mcnemar(cont_table, exact=False)\n", "print(results)" ] }, { "cell_type": "markdown", "id": "30bdff42", "metadata": {}, "source": [ "Tomamos una decisión" ] }, { "cell_type": "code", "execution_count": 18, "id": "4104a056", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "No podemos rechazar la hipotesis de que ambos modelos cometen los mismos errores.\n" ] } ], "source": [ "if results.pvalue <= 0.05:\n", " print(\"Rechazamos la hipótesis nula en favor de la alternativa para concluir que los modelos no cometen mismos errores.\")\n", "else:\n", " print(\"No podemos rechazar la hipotesis de que ambos modelos cometen los mismos errores.\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.11" } }, "nbformat": 4, "nbformat_minor": 5 }