{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Population Stability Index (PSI) - (desviación covariable y de concepto)\n",
"==============================================================================================================="
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introducción"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"El índice de estabilidad de la población (PSI) es una métrica para medir cuánto ha cambiado la distribución de un predictor entre dos muestras distintas. Por lo general, PSI se usa para medir la estabilidad de los modelos o las cualidades de sus predictores. Es una métrica que encuentra sus origenes en los modelos de predicciones de riesgo crediticio. \n",
"\n",
"Dadas dos conjuntos de datos: origen y objetivo, el PSI se calculará mediante los siguientes pasos:\n",
"\n",
" * Se realiza un agrupamiento de los cuantiles de los predictores del conjunto original y objetivo.\n",
" * Se calcular el porcentaje de cada intervalo (Q), que viene dado por $$ Q = \\frac{recuento\\;de\\;muestras\\;en\\;intervalo}{número\\;total\\;de\\;muestras} $$\n",
" * Finalmente podemos calcular PSI como:\n",
"\n",
" $$ \\sum (Q_t - Q_s)*ln(\\frac{Q_t}{Q_s}) $$\n",
"\n",
"Este índice puede interpretarse de la siguiente manera:\n",
"\n",
"* **PSI < 0.1**: No existe un cambio significativo en las características de las muestras.\n",
"* **PSI > 0.1 y PSI < 0.2**: Hay un cambio moderado en las características de las muestras.\n",
"* **PSI > 0.2**: Existe un cambio significativo en las características de las muestras.\n"
]
},
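{
"cell_type": "markdown",
"metadata": {},
"source": [
"The steps above can be sketched with NumPy as in the next cell. This is only a minimal illustration under stated assumptions: the function name `psi_sketch`, its `n_bins` and `eps` parameters, and the choice to clip empty bins to a small value are all introduced here for the example. It is not the implementation of the library used later in this notebook, so its results may differ from that library's.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Illustrative sketch only: psi_sketch, n_bins and eps are assumed names, not a library API\n",
"def psi_sketch(source, target, n_bins=10, eps=1e-4):\n",
"    source = np.asarray(source, dtype=float)\n",
"    target = np.asarray(target, dtype=float)\n",
"    # Interior bin edges are the quantiles of the source sample\n",
"    inner_edges = np.quantile(source, np.linspace(0, 1, n_bins + 1)[1:-1])\n",
"    # Assign each value to a bin; values outside the source range fall in the first or last bin\n",
"    source_bins = np.searchsorted(inner_edges, source, side='right')\n",
"    target_bins = np.searchsorted(inner_edges, target, side='right')\n",
"    # Proportion of samples per bin (Q_s and Q_t)\n",
"    q_s = np.bincount(source_bins, minlength=n_bins) / source.size\n",
"    q_t = np.bincount(target_bins, minlength=n_bins) / target.size\n",
"    # Clip empty bins to avoid log(0) and division by zero (an assumption of this sketch)\n",
"    q_s = np.clip(q_s, eps, None)\n",
"    q_t = np.clip(q_t, eps, None)\n",
"    return float(np.sum((q_t - q_s) * np.log(q_t / q_s)))\n",
"\n",
"# Usage on synthetic data: a sample compared with itself, then with a shifted sample\n",
"rng = np.random.default_rng(0)\n",
"reference = rng.normal(0.0, 1.0, 1000)\n",
"shifted = rng.normal(1.0, 1.0, 1000)\n",
"print('PSI, identical samples:', psi_sketch(reference, reference))\n",
"print('PSI, shifted sample:', psi_sketch(reference, shifted))"
]
},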
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Ejemplo:\n",
"\n",
"Para visualizar el concepto de desviación, usaremos los datos de IRIS dataset para generar lotes con distribuciones distintas de los datos. Posteriormente veremos como la performance del modelo se degrada y como podríamos detectar este hecho utilizando la métrica PSI. El conjunto de datos de IRIS es parte de la biblioteca sklearn que constan de 3 tipos diferentes de longitud de pétalo y sépalo (Setosa, Versicolour y Virginica), descriptos por la longitud del sépalo, el ancho del sépalo, la longitud del pétalo y el ancho del pétalo:\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"from sklearn import datasets\n",
"\n",
"iris = datasets.load_iris()\n",
"X = iris.data[:,:2]\n",
"y = iris.target"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Como podemos observar, el conjunto de datos está balanceado, teniendo 50 observaciones para cada uno de los tipos de pétalos disponibles."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAD4CAYAAAD8Zh1EAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAAMrElEQVR4nO3dYYhl513H8e/P3ZRKG2mWnd2uTdJVWMRYaBKGGAlINU2Jqbh5YSQF6yKBRVFIQZC1guK71BdFBEGXNjhiWw20cZfY1m7XhiLU2ElMmoRN3VhiDFl2p6k2CYqS+vfFnMiymck9d2buzN6/3w8s95xzz537PDzZb27O3HuTqkKSNP++b6cHIEnaGgZdkpow6JLUhEGXpCYMuiQ1sXs7n2zv3r118ODB7XxKSZp7jz766LeramHSedsa9IMHD7K8vLydTylJcy/Jv4w5z0suktSEQZekJgy6JDVh0CWpCYMuSU0YdElqYtTbFpM8B7wCfA94raoWk+wB/hI4CDwH/EJV/dtshilJmmSaV+g/VVXXV9XisH8MOF1Vh4DTw74kaYds5pLLYWBp2F4C7tz0aCRJGzb2k6IFfClJAX9SVceB/VV1DqCqziXZt9YDkxwFjgJce+21Gx7owWN/veHH6s09d98HZ/JzXbPZcc3mz6zW7GJjg35LVb04RPtUkmfGPsEQ/+MAi4uL/u+RJGlGRl1yqaoXh9sLwIPATcD5JAcAhtsLsxqkJGmyiUFP8rYkV76+DXwAeAo4CRwZTjsCnJjVICVJk4255LIfeDDJ6+d/uqq+mOTrwANJ7gGeB+6a3TAlSZNMDHpVfQt47xrHXwJuncWgJEnT85OiktSEQZekJgy6JDVh0CWpCYMuSU0YdElqwqBLUhMGXZKaMOiS1IRBl6QmDLokNWHQJakJgy5JTRh0SWrCoEtSEwZdkpow6JLUhEGXpCYMuiQ1YdAlqQmDLklNGHRJasKgS1ITBl2SmjDoktSEQZekJgy6JDVh0CWpCYMuSU0YdElqwqBLUhMGXZKaGB30JLuS/GOSh4b9PUlOJTk73F41u2FKkiaZ5hX6vcCZi/aPAaer6hBwetiXJO2QUUFPcjXwQeATFx0+DCwN20vAnVs6MknSVMa+Qv8D4DeB/7no2P6qOgcw3O5b64FJjiZZTrK8srKymbFKkt7ExKAn+VngQlU9upEnqKrjVbVYVYsLCwsb+RGSpBF2jzjnFuDnktwBvBX4gSR/DpxPcqCqziU5AFyY5UAlSW9u4iv0qvqtqrq6qg4CdwN/W1W/CJwEjgynHQFOzGyUkqSJNvM+9PuA25KcBW4b9iVJO2TMJZf/U1UPAw8P2y8Bt279kCRJG+EnRSWpCYMuSU0YdElqwqBLUhMGXZKaMOiS1IRBl6QmDLokNWHQJakJgy5JTRh0SWrCoEtSEwZdkpow6JLUhEGXpCYMuiQ1YdAlqQmDLklNGHRJasKgS1ITBl2SmjDoktSEQZekJgy6JDVh0CWpCYMuSU0YdElqwqBLUhMGXZKaMOiS1IRBl6QmDLokNTEx6EnemuQfkjyR5Okkvzcc35PkVJKzw+1Vsx+uJGk9Y16h/xfw01X1XuB64PYkNwPHgNNVdQg4PexLknbIxKDXqleH3SuGPwUcBpaG40vAnbMYoCRpnFHX0JPsSvI4cAE4VVWPAPur6hzAcLtvZqOUJE00KuhV9b2quh64GrgpyXvGPkGSo0mWkyyvrKxscJiSpEmmepdLVf078DBwO3A+yQGA4fbCOo85XlWLVbW4sLCwudFKktY15l0uC0neMWx/P/B+4BngJHBkOO0IcGJGY5QkjbB7xDkHgKUku1j9F8ADVfVQkq8BDyS5B3geuGuG45QkTTAx6FX1DeCGNY6/BNw6i0FJkqbnJ0UlqQmDLklNGHRJasKgS1ITBl2SmjDoktSEQZekJgy6JDVh0CWpCYMuSU0YdElqwqBLUhMGXZKaMOiS1I
RBl6QmDLokNWHQJakJgy5JTRh0SWrCoEtSEwZdkpow6JLUhEGXpCYMuiQ1YdAlqQmDLklNGHRJasKgS1ITBl2SmjDoktSEQZekJgy6JDVh0CWpiYlBT3JNkq8kOZPk6ST3Dsf3JDmV5Oxwe9XshytJWs+YV+ivAb9RVT8K3Az8WpLrgGPA6ao6BJwe9iVJO2Ri0KvqXFU9Nmy/ApwB3gUcBpaG05aAO2c0RknSCFNdQ09yELgBeATYX1XnYDX6wL51HnM0yXKS5ZWVlU0OV5K0ntFBT/J24LPAR6rq5bGPq6rjVbVYVYsLCwsbGaMkaYRRQU9yBasx/1RVfW44fD7JgeH+A8CF2QxRkjTGmHe5BPgkcKaqPn7RXSeBI8P2EeDE1g9PkjTW7hHn3AJ8GHgyyePDsY8C9wEPJLkHeB64ayYjlCSNMjHoVfV3QNa5+9atHY4kaaP8pKgkNWHQJakJgy5JTRh0SWrCoEtSEwZdkpow6JLUhEGXpCYMuiQ1YdAlqQmDLklNGHRJasKgS1ITBl2SmjDoktSEQZekJgy6JDVh0CWpCYMuSU0YdElqwqBLUhMGXZKaMOiS1IRBl6QmDLokNWHQJakJgy5JTRh0SWrCoEtSEwZdkpow6JLUhEGXpCYmBj3J/UkuJHnqomN7kpxKcna4vWq2w5QkTTLmFfqfArdfcuwYcLqqDgGnh31J0g6aGPSq+irwnUsOHwaWhu0l4M6tHZYkaVobvYa+v6rOAQy3+9Y7McnRJMtJlldWVjb4dJKkSWb+S9GqOl5Vi1W1uLCwMOunk6T/tzYa9PNJDgAMtxe2bkiSpI3YaNBPAkeG7SPAia0ZjiRpo8a8bfEzwNeAH0nyQpJ7gPuA25KcBW4b9iVJO2j3pBOq6kPr3HXrFo9FkrQJflJUkpow6JLUhEGXpCYMuiQ1YdAlqQmDLklNGHRJasKgS1ITBl2SmjDoktSEQZekJgy6JDVh0CWpCYMuSU0YdElqwqBLUhMGXZKaMOiS1IRBl6QmDLokNWHQJakJgy5JTRh0SWrCoEtSEwZdkpow6JLUhEGXpCYMuiQ1YdAlqQmDLklNGHRJasKgS1ITBl2SmthU0JPcnuSbSZ5NcmyrBiVJmt6Gg55kF/BHwM8A1wEfSnLdVg1MkjSdzbxCvwl4tqq+VVX/DfwFcHhrhiVJmtbuTTz2XcC/XrT/AvDjl56U5ChwdNh9Nck3L7p7L/DtTYzhcjY3c8vHpjp9buY1pbmal2sGzNm8plizteb17jEP3EzQs8axesOBquPA8TV/QLJcVYubGMNlq+vcnNf86To35/VGm7nk8gJwzUX7VwMvbuLnSZI2YTNB/zpwKMkPJXkLcDdwcmuGJUma1oYvuVTVa0l+HfgbYBdwf1U9PeWPWfNSTBNd5+a85k/XuTmvS6TqDZe9JUlzyE+KSlITBl2SmtjWoCfZk+RUkrPD7VXrnPdckieTPJ5keTvHOI1JX32QVX843P+NJDfuxDg3YsTc3pfku8MaPZ7kd3ZinNNIcn+SC0meWuf+eV6vSXObu/UCSHJNkq8kOZPk6ST3rnHO3K3byHlNv2ZVtW1/gN8Hjg3bx4CPrXPec8De7RzbBuayC/hn4IeBtwBPANddcs4dwBdYfc/+zcAjOz3uLZzb+4CHdnqsU87rJ4EbgafWuX8u12vk3OZuvYZxHwBuHLavBP6pw9+zkfOaes22+5LLYWBp2F4C7tzm599KY7764DDwZ7Xq74F3JDmw3QPdgJZf61BVXwW+8yanzOt6jZnbXKqqc1X12LD9CnCG1U+pX2zu1m3kvKa23UHfX1XnYHVCwL51zivgS0keHb464HK01lcfXLogY865HI0d908keSLJF5L82PYMbabmdb3Gmuv1SnIQuAF45JK75nrd3mReMOWabeaj/+sN7svAO9e467en+DG3VNWLSfYBp5I8M7wCuZyM+eqDUV+PcBkaM+7HgHdX1atJ7gD+Cjg064HN2L
yu1xhzvV5J3g58FvhIVb186d1rPGQu1m3CvKZesy1/hV5V76+q96zx5wRw/vX/FBpuL6zzM14cbi8AD7J6CeByM+arD+b16xEmjruqXq6qV4ftzwNXJNm7fUOciXldr4nmeb2SXMFq9D5VVZ9b45S5XLdJ89rImm33JZeTwJFh+whw4tITkrwtyZWvbwMfANb8zf0OG/PVByeBXxp+C38z8N3XLzld5ibOLck7k2TYvonVf5Ze2vaRbq15Xa+J5nW9hjF/EjhTVR9f57S5W7cx89rImm35JZcJ7gMeSHIP8DxwF0CSHwQ+UVV3APuBB4d57AY+XVVf3OZxTlTrfPVBkl8Z7v9j4POs/gb+WeA/gF/eqfFOY+Tcfh741SSvAf8J3F3Dr+YvV0k+w+o7B/YmeQH4XeAKmO/1glFzm7v1GtwCfBh4Msnjw7GPAtfCXK/bmHlNvWZ+9F+SmvCTopLUhEGXpCYMuiQ1YdAlqQmDLklNGHRJasKgS1IT/wt7dWHNdjuTxwAAAABJRU5ErkJggg==",
"image/svg+xml": "\n\n\n\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"from collections import Counter\n",
"\n",
"def plot_distribution(y):\n",
" labels, q = zip(*sorted(Counter(y).items()))\n",
" plt.bar(labels, q)\n",
" plt.show()\n",
"\n",
"plot_distribution(y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Entrenaremos un modelo para resolver el problema. Primero, dividiremos los datos en conjuntos de entrenamiento y validación, como es costrumbre, para luego definir nuestro algoritmo de aprendizaje. En este caso, utilizaremos un simple SVM y lo entrenamos sobre los datos:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from sklearn import svm\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)\n",
"\n",
"model = svm.SVC(C=1.0, kernel='linear', gamma=0.5, probability=True)\n",
"model = model.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Verificamos la performance de nuestro modelo de clasificación:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" 0 1.00 1.00 1.00 16\n",
" 1 0.62 0.76 0.68 17\n",
" 2 0.69 0.53 0.60 17\n",
"\n",
" accuracy 0.76 50\n",
" macro avg 0.77 0.76 0.76 50\n",
"weighted avg 0.77 0.76 0.76 50\n",
"\n",
"F1: 0.7566315789473684\n"
]
}
],
"source": [
"from sklearn.metrics import classification_report\n",
"from sklearn.metrics import f1_score\n",
"\n",
"y_pred = model.predict(X_test)\n",
"print(classification_report(y_test, y_pred))\n",
"print(\"F1:\",f1_score(y_test, y_pred, average='weighted'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Simulando un cambio en la distribución de las clases\n",
"\n",
"La siguiente función nos permitirá alterar la distribución de las observaciones presentes en el set de datos, es decir, generará un nuevo conjunto de datos cuyas proporciones de las observaciones estarán alteradas por el parámetro `weights` que las especifica. Este parametro es un arreglo donde el primer valor corresponde a la proporción de la clase `1 (Setosa)`, el segundo a la `2 (Versicolour)` y el tercero a la `3 (Virginica)`."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"def simulate_samples(nsamples, X_source, y_source, weights):\n",
" totals = np.round(np.array(weights) * nsamples).astype(int)\n",
" indices = np.arange(y_source.size)\n",
" new_indices = []\n",
" for i, c in enumerate(np.unique(y_source)):\n",
" new_indices.extend(np.random.choice(indices[y_source==c], totals[i], replace=True))\n",
" \n",
" y_new = y_source[new_indices]\n",
" X_new = X_source[new_indices,:]\n",
" return(X_new, y_new)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Generemos un nuevo conjunto de datos con las proporciones 10%, 10% y 80%:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAD4CAYAAAD8Zh1EAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAAO1klEQVR4nO3dfaied33H8fdnaUTRDlNyGs9sYzYpY13BtByyjoJ01kqNY6kwwcK6MApxw0IFYWQOpv5Xhw9jMNyiLWabOgraNdT6kGUWEVzdaZe2KamLk8zVhuSoaFs2HK3f/XGuwiE9d+/rfjon92/vF9xcT7/7vr4/fu2nV69zPaSqkCTNv1/Y7AIkSdNhoEtSIwx0SWqEgS5JjTDQJakRF23kzrZv3167du3ayF1K0tx7+OGHf1hVC8PabWig79q1i+Xl5Y3cpSTNvST/2aedp1wkqREGuiQ1wkCXpEYY6JLUCANdkhphoEtSI3oHepItSf4tyf3d8iVJjiY51U23za5MSdIwoxyh3wGcXLN8EDhWVVcAx7plSdIm6RXoSS4D3gF8es3qfcDhbv4wcPNUK5MkjaTvnaJ/AfwxcPGadTuq6gxAVZ1Jcul6X0xyADgAsHPnzvErlTRzuw5+abNLaNbpO98x830MPUJP8tvAuap6eJwdVNWhqlqqqqWFhaGPIpAkjanPEfp1wO8k2Qu8EvjFJH8PnE2y2B2dLwLnZlmoJOnlDT1Cr6o/qarLqmoX8G7gn6vq94AjwP6u2X7gvplVKUkaapLr0O8EbkxyCrixW5YkbZKRHp9bVQ8CD3bzPwJumH5JkqRxeKeoJDXCQJekRhjoktQIA12SGmGgS1IjDHRJaoSBLkmNMNAlqREGuiQ1wkCXpEYY6JLUCANdkhphoEtSIwx0SWqEgS5JjTDQJakRfV4S/cok307yaJInkny4W/+hJD9Icrz77J19uZKkQfq8sehnwFuq6rkkW4FvJvlyt+0TVfXR2ZUnSepraKBXVQHPdYtbu0/NsihJ0uh6nUNPsiXJceAccLSqHuo23Z7ksSR3J9k2qyIlScP1CvSqeqGqdgOXAXuSXAV8EngjsBs4A3xsve8mOZBkOcnyysrKVIqWJL3USFe5VNVPgAeBm6rqbBf0Pwc+BewZ8J1DVbVUVUsLCwuT1itJGqDPVS4LSV7bzb8KeCvwZJLFNc3eCZyYSYWSpF76XOWyCBxOsoXV/wDcU1X3J/m7JLtZ/QPpaeA9M6tSkjRUn6tcHgOuXmf9rTOpSJI0Fu8UlaRGGOiS1AgDXZIaYaBLUiMMdElqhIEuSY0w0CWpEQa6JDXCQJekRhjoktQIA12SGmGgS1IjDHRJaoSBLkmNMNAlqREGuiQ1wkCXpEb0eafoK5N8O8mjSZ5I8uFu/SVJjiY51U23zb5cSdIgfY7Qfwa8pareBOwGbkpyLXAQOFZVVwDHumVJ0iYZGui16rlucWv3KWAfcLhbfxi4eRYFSpL66XUOPcmWJMeBc8DRqnoI2FFVZwC66aUDvnsgyXKS5ZWVlSmVLUk6X69Ar6oXqmo3cBmwJ8lVfXdQVYeqaqmqlhYWFsYsU5I0zEhXuVTVT4AHgZuAs0kWAbrpuWkXJ0nqr89VLgtJXtvNvwp4K/AkcATY3zXbD9w3oxolST1c1KPNInA4yRZW/wNwT1Xdn+RbwD1JbgO+D7xrhnVKkoYYGuhV9Rhw9TrrfwTcMIuiJEmj805RSWqEgS5JjTDQJakRBrokNcJAl6RGGOiS1AgDXZIaYaBLUiMMdElqhIEuSY0w0CWpEQa6JDXCQJekRhjoktQIA12SGmGgS1IjDHRJakSfd4penuTrSU4meSLJHd36DyX5QZLj3Wfv7MuVJA3S552izwPvr6pHklwMPJzkaLftE1X10dmVJ0nqq887Rc8AZ7r5Z5OcBF4/68IkSaMZ6Rx6kl2svjD6oW7V7UkeS3J3km0DvnMgyXKS5ZWVlc
mqlSQN1DvQk7wG+ALwvqp6Bvgk8EZgN6tH8B9b73tVdaiqlqpqaWFhYfKKJUnr6hXoSbayGuafraovAlTV2ap6oap+DnwK2DO7MiVJw/S5yiXAXcDJqvr4mvWLa5q9Ezgx/fIkSX31ucrlOuBW4PEkx7t1HwBuSbIbKOA08J4Z1CdJ6qnPVS7fBLLOpgemX44kaVzeKSpJjTDQJakRBrokNcJAl6RGGOiS1AgDXZIaYaBLUiMMdElqhIEuSY0w0CWpEQa6JDXCQJekRhjoktQIA12SGmGgS1IjDHRJaoSBLkmN6PNO0cuTfD3JySRPJLmjW39JkqNJTnXTbbMvV5I0SJ8j9OeB91fVrwHXAu9NciVwEDhWVVcAx7plSdImGRroVXWmqh7p5p8FTgKvB/YBh7tmh4GbZ1SjJKmHkc6hJ9kFXA08BOyoqjOwGvrApQO+cyDJcpLllZWVCcuVJA3SO9CTvAb4AvC+qnqm7/eq6lBVLVXV0sLCwjg1SpJ66BXoSbayGuafraovdqvPJlnsti8C52ZToiSpjz5XuQS4CzhZVR9fs+kIsL+b3w/cN/3yJEl9XdSjzXXArcDjSY536z4A3Anck+Q24PvAu2ZSoSSpl6GBXlXfBDJg8w3TLUeSNC7vFJWkRhjoktQIA12SGmGgS1IjDHRJaoSBLkmNMNAlqREGuiQ1wkCXpEYY6JLUCANdkhphoEtSIwx0SWqEgS5JjTDQJakRBrokNcJAl6RG9Hmn6N1JziU5sWbdh5L8IMnx7rN3tmVKkobpc4T+GeCmddZ/oqp2d58HpluWJGlUQwO9qr4B/HgDapEkTWCSc+i3J3msOyWzbVCjJAeSLCdZXllZmWB3kqSXM26gfxJ4I7AbOAN8bFDDqjpUVUtVtbSwsDDm7iRJw4wV6FV1tqpeqKqfA58C9ky3LEnSqMYK9CSLaxbfCZwY1FaStDEuGtYgyeeB64HtSZ4CPghcn2Q3UMBp4D2zK1GS1MfQQK+qW9ZZfdcMapEkTcA7RSWpEQa6JDXCQJekRhjoktQIA12SGmGgS1IjDHRJaoSBLkmNMNAlqREGuiQ1wkCXpEYY6JLUCANdkhphoEtSIwx0SWqEgS5JjTDQJakRQwM9yd1JziU5sWbdJUmOJjnVTbfNtkxJ0jB9jtA/A9x03rqDwLGqugI41i1LkjbR0ECvqm8APz5v9T7gcDd/GLh5umVJkkY17jn0HVV1BqCbXjqoYZIDSZaTLK+srIy5O0nSMDP/o2hVHaqqpapaWlhYmPXuJOn/rXED/WySRYBuem56JUmSxjFuoB8B9nfz+4H7plOOJGlcfS5b/DzwLeBXkzyV5DbgTuDGJKeAG7tlSdImumhYg6q6ZcCmG6ZciyRpAt4pKkmNMNAlqREGuiQ1wkCXpEYY6JLUCANdkhphoEtSIwx0SWqEgS5JjTDQJakRBrokNcJAl6RGGOiS1AgDXZIaYaBLUiMMdElqhIEuSY0Y+sail5PkNPAs8ALwfFUtTaMoSdLoJgr0zm9V1Q+n8DuSpAl4ykWSGjHpEXoBX0tSwN9U1aHzGyQ5ABwA2Llz59g72nXwS2N/Vy/v9J3vmMnvOmazM6sx03yb9Aj9uqq6Bng78N4kbz6/QVUdqqqlqlpaWFiYcHeSpEEmCvSqerqbngPuBfZMoyhJ0ujGDvQkr05y8YvzwNuAE9MqTJI0mknOoe8A7k3y4u98rqq+MpWqJEkjGzvQq+p7wJumWIskaQJetihJjTDQJakRBrokNcJAl6RGGOiS1AgDXZIaYaBLUiMMdElqhIEuSY0w0CWpEQa6JDXCQJekRhjoktQIA12SGmGgS1IjDHRJaoSBLkmNmCjQk9yU5DtJvpvk4LSKkiSNbpKXRG8B/gp4O3AlcEuSK6dVmCRpNJMcoe8BvltV36uq/wX+Adg3nbIkSaMa+yXRwOuB/1qz/BTwG+c3SnIAONAtPpfkO2s2bwd+OEENF7K56Vs+MlLzue
nXiOaqX44ZMGf9GmHM1uvXG/p8cZJAzzrr6iUrqg4Bh9b9gWS5qpYmqOGC1Wrf7Nf8abVv9uulJjnl8hRw+Zrly4CnJ/g9SdIEJgn0fwWuSPLLSV4BvBs4Mp2yJEmjGvuUS1U9n+R24KvAFuDuqnpixJ9Z91RMI1rtm/2aP632zX6dJ1UvOe0tSZpD3ikqSY0w0CWpERsa6EkuSXI0yaluum1Au9NJHk9yPMnyRtY4imGPPsiqv+y2P5bkms2ocxw9+nZ9kp92Y3Q8yZ9tRp2jSHJ3knNJTgzYPs/jNaxvczdeAEkuT/L1JCeTPJHkjnXazN249ezX6GNWVRv2Af4cONjNHwQ+MqDdaWD7RtY2Rl+2AP8B/ArwCuBR4Mrz2uwFvszqNfvXAg9tdt1T7Nv1wP2bXeuI/XozcA1wYsD2uRyvnn2bu/Hq6l4ErunmLwb+vYV/z3r2a+Qx2+hTLvuAw938YeDmDd7/NPV59ME+4G9r1b8Ar02yuNGFjqHJxzpU1TeAH79Mk3kdrz59m0tVdaaqHunmnwVOsnqX+lpzN249+zWyjQ70HVV1BlY7BFw6oF0BX0vycPfogAvReo8+OH9A+rS5EPWt+zeTPJrky0l+fWNKm6l5Ha++5nq8kuwCrgYeOm/TXI/by/QLRhyzSW79H1TcPwGvW2fTn47wM9dV1dNJLgWOJnmyOwK5kPR59EGvxyNcgPrU/Qjwhqp6Lsle4B+BK2Zd2IzN63j1MdfjleQ1wBeA91XVM+dvXucrczFuQ/o18phN/Qi9qt5aVVet87kPOPvi/wp103MDfuPpbnoOuJfVUwAXmj6PPpjXxyMMrbuqnqmq57r5B4CtSbZvXIkzMa/jNdQ8j1eSrayG3mer6ovrNJnLcRvWr3HGbKNPuRwB9nfz+4H7zm+Q5NVJLn5xHngbsO5f7jdZn0cfHAF+v/sr/LXAT1885XSBG9q3JK9Lkm5+D6v/LP1owyudrnkdr6Hmdby6mu8CTlbVxwc0m7tx69OvccZs6qdchrgTuCfJbcD3gXcBJPkl4NNVtRfYAdzb9eMi4HNV9ZUNrnOoGvDogyR/2G3/a+ABVv8C/13gv4E/2Kx6R9Gzb78L/FGS54H/Ad5d3Z/mL1RJPs/qlQPbkzwFfBDYCvM9XtCrb3M3Xp3rgFuBx5Mc79Z9ANgJcz1uffo18ph5678kNcI7RSWpEQa6JDXCQJekRhjoktQIA12SGmGgS1IjDHRJasT/Ab1/MBgp7iR/AAAAAElFTkSuQmCC",
"image/svg+xml": "\n\n\n\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"n = 50\n",
"weights = np.array([0.10, 0.10, 0.80]) # Nuevas distribuciones\n",
"X_new, y_new = simulate_samples(n, X_test, y_test, weights)\n",
"\n",
"plot_distribution(y_new)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Efecto\n",
"\n",
"Veamos cual es el efecto en la performance del modelo al cambiar esta distribución:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" 0 1.00 1.00 1.00 5\n",
" 1 0.21 1.00 0.34 5\n",
" 2 1.00 0.53 0.69 40\n",
"\n",
" accuracy 0.62 50\n",
" macro avg 0.74 0.84 0.68 50\n",
"weighted avg 0.92 0.62 0.69 50\n",
"\n",
"F1: 0.6853024307518374\n"
]
}
],
"source": [
"y_new_pred = model.predict(X_new)\n",
"print(classification_report(y_new, y_new_pred))\n",
"print(\"F1:\",f1_score(y_new, y_new_pred, average='weighted'))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Podemos ver que ahora la performance del modelo a decaido. Recuerdemos que la puntuación F1 original era ~0,76. lo que significa que el rendimiento de nuestro modelo se ha deteriorado como consecuncia del cambio de la distribución. Incluso, no necesariamente la performance del modelo pudo haber cambiado."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Instalamos una libreria que tenga la métrica de PSI implementada:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!git clone https://github.com/mwburke/population-stability-index\n",
"!mv population-stability-index psi"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Distancia entre train y test: 0.1\n",
"Distancia entre train y el nuevo set de datos: 0.32\n"
]
}
],
"source": [
"from psi.psi import calculate_psi\n",
"\n",
"psi_train_test = calculate_psi(X_train.flatten(), X_test.flatten(), buckettype='quantiles', buckets=10, axis=1)\n",
"psi_train_new = calculate_psi(X_train.flatten(), X_new.flatten(), buckettype='quantiles', buckets=10, axis=1)\n",
"\n",
"print('PSI entre train y test:', np.round(psi_train_test, 2))\n",
"print('PSI entre train y el nuevo set de datos:', np.round(psi_train_new, 2))\n"
]
}
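,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The PSI formula can also be applied directly to the class proportions, which is the quantity that was altered in this example. The cell below is a minimal sketch under stated assumptions: `psi_from_labels` and its `eps` parameter are hypothetical names introduced here, not part of the `psi` library, and the toy labels simply reproduce the class counts shown in the two classification reports (16/17/17 and 5/5/40). A value well above 0.2 would be read, under the rule of thumb from the introduction, as a significant change."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Hypothetical helper (not part of the psi library): PSI computed over class proportions\n",
"def psi_from_labels(y_source, y_target, eps=1e-4):\n",
"    y_source = np.asarray(y_source)\n",
"    y_target = np.asarray(y_target)\n",
"    classes = np.union1d(y_source, y_target)\n",
"    # Proportion of each class in each sample (Q_s and Q_t)\n",
"    q_s = np.array([np.mean(y_source == c) for c in classes])\n",
"    q_t = np.array([np.mean(y_target == c) for c in classes])\n",
"    # Clip to avoid log(0) when a class is missing from one sample (an assumption of this sketch)\n",
"    q_s = np.clip(q_s, eps, None)\n",
"    q_t = np.clip(q_t, eps, None)\n",
"    return float(np.sum((q_t - q_s) * np.log(q_t / q_s)))\n",
"\n",
"# Toy labels with the class counts from the reports: 16/17/17 (test set) and 5/5/40 (new data)\n",
"y_balanced = np.repeat([0, 1, 2], [16, 17, 17])\n",
"y_shifted = np.repeat([0, 1, 2], [5, 5, 40])\n",
"print('PSI of the class proportions:', round(psi_from_labels(y_balanced, y_shifted), 2))"
]
}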
],
"metadata": {
"interpreter": {
"hash": "bea38c2984299ac640e8421861d34b2e05ee614f6236d2975c05eeb77366835f"
},
"kernelspec": {
"display_name": "Python 3.8.5 64-bit ('base': conda)",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}