{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Machine learning with missing values\n\nHere we use simulated data to understanding the fundamentals of statistical\nlearning with missing values.\n\nThis notebook reveals why a HistGradientBoostingRegressor (\n:class:`sklearn.ensemble.HistGradientBoostingRegressor` ) is a good choice to\npredict with missing values.\n\nWe use simulations to control the missing-value mechanism, and inspect\nit's impact on predictive models. In particular, standard imputation\nprocedures can reconstruct missing values without distortion only if the\ndata is *missing at random*.\n\nA good introduction to the mathematics behind this notebook can be found in\nhttps://arxiv.org/abs/1902.06931\n\n.. topic:: **Missing values in categorical data**\n\n If a categorical column has missing values, the simplest approach is\n to create a specific category \"missing\" and assign missing values to\n this new category, to represent missingness in the classifier.\n Indeed, as we will see, imputation is not crucial for prediction.\n In the following we focus on continuous columns, where the discrete\n nature of a missing value poses more problems.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The fully-observed data: a toy regression problem\n\nWe consider a simple regression problem where X (the data) is bivariate\ngaussian, and y (the prediction target) is a linear function of the first\ncoordinate, with noise.\n\n### The data-generating mechanism\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n\ndef generate_without_missing_values(n_samples, rng=42):\n mean = [0, 0]\n cov = [[1, 0.9], [0.9, 1]]\n if not isinstance(rng, np.random.RandomState):\n rng = np.random.RandomState(rng)\n X = rng.multivariate_normal(mean, cov, size=n_samples)\n\n epsilon = 0.1 * rng.randn(n_samples)\n y = X[:, 0] + epsilon\n\n return X, y" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A quick plot reveals what the data looks like\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\nplt.rcParams['figure.figsize'] = (5, 4) # Smaller default figure size\n\nplt.figure()\nX_full, y_full = generate_without_missing_values(1000)\nplt.scatter(X_full[:, 0], X_full[:, 1], c=y_full)\nplt.colorbar(label='y')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Missing completely at random settings\n\nWe now consider missing completely at random settings (a special case\nof missing at random): the missingness is completely independent from\nthe values.\n\n### The missing-values mechanism\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def generate_mcar(n_samples, missing_rate=.5, rng=42):\n X, y = generate_without_missing_values(n_samples, rng=rng)\n if not isinstance(rng, np.random.RandomState):\n rng = np.random.RandomState(rng)\n\n M = rng.binomial(1, missing_rate, (n_samples, 2))\n np.putmask(X, M, np.nan)\n\n return X, y" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A quick plot to look at the data\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X, y = generate_mcar(500)\n\nplt.figure()\nplt.scatter(X_full[:, 0], X_full[:, 1], color='.8', ec='.5', label='All 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## The fully-observed data: a toy regression problem\n\nWe consider a simple regression problem where X (the data) is bivariate\nGaussian, and y (the prediction target) is a linear function of the first\ncoordinate, with noise.\n\n### The data-generating mechanism\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n\ndef generate_without_missing_values(n_samples, rng=42):\n    # X is bivariate Gaussian with strongly correlated coordinates\n    mean = [0, 0]\n    cov = [[1, 0.9], [0.9, 1]]\n    if not isinstance(rng, np.random.RandomState):\n        rng = np.random.RandomState(rng)\n    X = rng.multivariate_normal(mean, cov, size=n_samples)\n\n    # y is a noisy linear function of the first coordinate\n    epsilon = 0.1 * rng.randn(n_samples)\n    y = X[:, 0] + epsilon\n\n    return X, y" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "A quick plot reveals what the data looks like\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\nplt.rcParams['figure.figsize'] = (5, 4)  # Smaller default figure size\n\nplt.figure()\nX_full, y_full = generate_without_missing_values(1000)\nplt.scatter(X_full[:, 0], X_full[:, 1], c=y_full)\nplt.colorbar(label='y')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Missing completely at random settings\n\nWe now consider missing completely at random settings (a special case\nof missing at random): the missingness is completely independent of\nthe values.\n\n### The missing-values mechanism\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def generate_mcar(n_samples, missing_rate=.5, rng=42):\n    X, y = generate_without_missing_values(n_samples, rng=rng)\n    if not isinstance(rng, np.random.RandomState):\n        rng = np.random.RandomState(rng)\n\n    # Mask entries independently of the values: M == 1 means missing\n    M = rng.binomial(1, missing_rate, (n_samples, 2))\n    np.putmask(X, M, np.nan)\n\n    return X, y" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "A quick plot to look at the data\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X, y = generate_mcar(500)\n\nplt.figure()\nplt.scatter(X_full[:, 0], X_full[:, 1], color='.8', ec='.5',\n            label='All data')\nplt.scatter(X[:, 0], X[:, 1], c=y, label='Fully observed')\nplt.colorbar(label='y')\nplt.legend()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We can see that the distribution of the fully-observed data is the same\nas that of the original data\n\n### Conditional imputation with the IterativeImputer\n\nAs the data is MAR (missing at random), an imputer can use the\nconditional dependencies between the observed and the missing values to\nimpute the missing values.\n\nWe'll use the IterativeImputer, which gives good imputations, but is\nstill experimental and needs to be explicitly enabled\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# The import below is required to enable the experimental IterativeImputer\nfrom sklearn.experimental import enable_iterative_imputer  # noqa\nfrom sklearn import impute\niterative_imputer = impute.IterativeImputer()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let us try the imputer on the small dataset used for the visualization\nabove\n\n**The imputation is learned by fitting the imputer on the data with\nmissing values**\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "iterative_imputer.fit(X)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "**The data are imputed with the transform method**\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X_imputed = iterative_imputer.transform(X)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We can display the imputed data in the same way as in our previous\nvisualization\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plt.figure()\nplt.scatter(X_full[:, 0], X_full[:, 1], color='.8', ec='.5',\n            label='All data', alpha=.5)\nplt.scatter(X_imputed[:, 0], X_imputed[:, 1], c=y, marker='X',\n            label='Imputed')\nplt.colorbar(label='y')\nplt.legend()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We can see that the imputer did a fairly good job of recovering the\ndata distribution\n\n" ] },
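{ "cell_type": "markdown", "metadata": {}, "source": [ "We can also quantify the imputation error. Both generators default to\nthe same random seed, so regenerating the data before masking gives us\nthe ground truth for the masked entries (a quick added sanity check,\nnot part of the original analysis)\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Regenerate the pre-masking data with the same default seed (42)\nX_truth, _ = generate_without_missing_values(500)\n\n# Root mean squared error on the entries that were masked\nmissing_mask = np.isnan(X)\nnp.sqrt(np.mean((X_imputed[missing_mask] - X_truth[missing_mask]) ** 2))" ] },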
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Supervised learning: imputation and a linear model\n\nGiven that the relationship between the fully-observed X and y is\nlinear, it seems natural to use a linear model for prediction. It must,\nhowever, be adapted to missing values using imputation.\n\nTo use it in a supervised setting, we pipeline the imputer with a ridge\nregression, which is a good default linear model\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.pipeline import make_pipeline\nfrom sklearn.linear_model import RidgeCV\n\niterative_and_ridge = make_pipeline(impute.IterativeImputer(), RidgeCV())" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We can evaluate the model performance in a cross-validation loop\n(for better evaluation accuracy, we slightly increase the number of\nfolds to 10)\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn import model_selection\nscores_iterative_and_ridge = model_selection.cross_val_score(\n    iterative_and_ridge, X, y, cv=10)\n\nscores_iterative_and_ridge" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "**Computational cost**: One drawback of the IterativeImputer to keep in\nmind is that its computational cost can become prohibitive on large\ndatasets: it scales poorly with the number of samples and features.\n\n" ] },
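{ "cell_type": "markdown", "metadata": {}, "source": [ "To give a rough idea of this cost, we can time both imputers on our\ntiny dataset (a small illustrative benchmark added here; absolute times\nare machine-dependent and the gap grows with dataset size)\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from time import time\n\n# Indicative fit times on the small 500-sample dataset\nfor imputer in (impute.SimpleImputer(), impute.IterativeImputer()):\n    t0 = time()\n    imputer.fit(X)\n    print(f'{imputer.__class__.__name__}: {time() - t0:.4f} s')" ] },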
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Mean imputation: SimpleImputer\n\nWe can try a simpler imputer: imputation by the mean\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "mean_imputer = impute.SimpleImputer()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "A quick visualization reveals a larger distortion of the distribution\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X_imputed = mean_imputer.fit_transform(X)\nplt.figure()\nplt.scatter(X_full[:, 0], X_full[:, 1], color='.8', ec='.5',\n            label='All data', alpha=.5)\nplt.scatter(X_imputed[:, 0], X_imputed[:, 1], c=y, marker='X',\n            label='Imputed')\nplt.colorbar(label='y')\nplt.legend()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Evaluating it in a prediction pipeline\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "mean_and_ridge = make_pipeline(impute.SimpleImputer(), RidgeCV())\nscores_mean_and_ridge = model_selection.cross_val_score(\n    mean_and_ridge, X, y, cv=10)\n\nscores_mean_and_ridge" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Supervised learning without imputation\n\nThe HistGradientBoosting models are based on trees, which can directly\nhandle missing values\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# HistGradientBoostingRegressor is stable since scikit-learn 1.0; on\n# older versions it first requires:\n# from sklearn.experimental import enable_hist_gradient_boosting\nfrom sklearn.ensemble import HistGradientBoostingRegressor\nscore_hist_gradient_boosting = model_selection.cross_val_score(\n    HistGradientBoostingRegressor(), X, y, cv=10)\n\nscore_hist_gradient_boosting" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Recap: which pipeline predicts well on our small data?\n\nLet's plot the scores to see things better\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\nimport seaborn as sns\n\nscores = pd.DataFrame({'Mean imputation + Ridge': scores_mean_and_ridge,\n                       'IterativeImputer + Ridge': scores_iterative_and_ridge,\n                       'HistGradientBoostingRegressor':\n                           score_hist_gradient_boosting,\n                      })\n\nsns.boxplot(data=scores, orient='h')\nplt.title('Prediction accuracy\\n linear and small data\\n'\n          'Missing Completely at Random')\nplt.tight_layout()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "There is not much difference between the mean imputation and the more\nsophisticated IterativeImputer. A more thorough analysis would be\nnecessary, with more cross-validation runs.\n\n### Prediction performance with large datasets\n\nLet us compare models in regimes where there is plenty of data\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X, y = generate_mcar(n_samples=20000)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Iterative imputation and linear model\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "scores_iterative_and_ridge = model_selection.cross_val_score(\n    iterative_and_ridge, X, y, cv=10)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Mean imputation and linear model\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "scores_mean_and_ridge = model_selection.cross_val_score(\n    mean_and_ridge, X, y, cv=10)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "And now the HistGradientBoostingRegressor, which does not need\nimputation\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "score_hist_gradient_boosting = model_selection.cross_val_score(\n    HistGradientBoostingRegressor(), X, y, cv=10)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We plot the results\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "scores = pd.DataFrame({'Mean imputation + Ridge': scores_mean_and_ridge,\n                       'IterativeImputer + Ridge': scores_iterative_and_ridge,\n                       'HistGradientBoostingRegressor': score_hist_gradient_boosting,\n                      })\n\nsns.boxplot(data=scores, orient='h')\nplt.title('Prediction accuracy\\n linear and large data\\n'\n          'Missing Completely at Random')\nplt.tight_layout()" ] },
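{ "cell_type": "markdown", "metadata": {}, "source": [ "The boxplot can be complemented with the average score of each pipeline\n(a small numeric summary added here)\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Average cross-validation score of each pipeline\nscores.mean()" ] },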
{ "cell_type": "markdown", "metadata": {}, "source": [ "**When there is a reasonable amount of data, the\nHistGradientBoostingRegressor is the best strategy** even for a linear\ndata-generating mechanism, and even in MAR settings, which are\nfavorable to imputation + linear model [#]_.\n\n.. [#] Even in the case of a linear data-generating mechanism, the\n   optimal predictor on data imputed by a constant\n   is a piecewise affine function with 2^d regions (\n   http://proceedings.mlr.press/v108/morvan20a.html ). The larger the\n   dimensionality (number of features), the harder it is for a simple\n   model to approximate the function induced by an imperfect\n   imputation.\n\n|\n\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Missing not at random: censoring\n\nWe now consider missing not at random settings, in particular\nself-masking or censoring, where large values are more likely to be\nmissing.\n\n### The missing-values mechanism\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def generate_censored(n_samples, missing_rate=.4, rng=42):\n    X, y = generate_without_missing_values(n_samples, rng=rng)\n    if not isinstance(rng, np.random.RandomState):\n        rng = np.random.RandomState(rng)\n\n    # Self-masking: only values above 0.5 can be masked\n    B = rng.binomial(1, 2 * missing_rate, (n_samples, 2))\n    M = (X > 0.5) * B\n\n    np.putmask(X, M, np.nan)\n\n    return X, y" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "A quick plot to look at the data\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X, y = generate_censored(500, missing_rate=.4)\n\nplt.figure()\nplt.scatter(X_full[:, 0], X_full[:, 1], color='.8', ec='.5',\n            label='All data')\nplt.scatter(X[:, 0], X[:, 1], c=y, label='Fully observed')\nplt.colorbar(label='y')\nplt.legend()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Here the fully-observed data poorly reflects the distribution of all\nthe data\n\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Imputation fails to recover the distribution\n\nWith MNAR data, off-the-shelf imputation methods do not recover the\ninitial distribution:\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "iterative_imputer = impute.IterativeImputer()\nX_imputed = iterative_imputer.fit_transform(X)\n\nplt.figure()\nplt.scatter(X_full[:, 0], X_full[:, 1], color='.8', ec='.5',\n            label='All data', alpha=.5)\nplt.scatter(X_imputed[:, 0], X_imputed[:, 1], c=y, marker='X',\n            label='Imputed')\nplt.colorbar(label='y')\nplt.legend()" ] },
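{ "cell_type": "markdown", "metadata": {}, "source": [ "The downward shift that is visible in the plot can be quantified by\ncomparing per-feature means (a small added check)\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# The imputed data is shifted towards lower values than the full data\nprint('Mean of all data:    ', X_full.mean(axis=0).round(2))\nprint('Mean of imputed data:', X_imputed.mean(axis=0).round(2))" ] },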
{ "cell_type": "markdown", "metadata": {}, "source": [ "Recovering the initial data distribution would require much more mass on\nthe right and the top of the figure. The imputed data is shifted to\nlower values than the original data.\n\nNote also that as imputed values typically have lower X values than\ntheir fully-observed counterparts, the association between X and y is\nalso distorted. This is visible as the imputed values appear as lighter\ndiagonal lines.\n\nAn important consequence is that **the link between imputed X and y is no\nlonger linear**, although the original data-generating mechanism is\nlinear [#]_. For this reason, **it is often a good idea to use non-linear\nlearners in the presence of missing values**.\n\n.. [#] As mentioned above, even in the case of a linear\n   data-generating mechanism, imperfect imputation leads to complex\n   functions linking X to y (\n   http://proceedings.mlr.press/v108/morvan20a.html )\n\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Predictive pipelines\n\nLet us now evaluate predictive pipelines\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "scores = dict()\n\n# Iterative imputation and linear model\nscores['IterativeImputer + Ridge'] = model_selection.cross_val_score(\n    iterative_and_ridge, X, y, cv=10)\n\n# Mean imputation and linear model\nscores['Mean imputation + Ridge'] = model_selection.cross_val_score(\n    mean_and_ridge, X, y, cv=10)\n\n# Iterative imputation and non-linear model\niterative_and_gb = make_pipeline(impute.IterativeImputer(),\n                                 HistGradientBoostingRegressor())\nscores['IterativeImputer\\n+ HistGradientBoostingRegressor'] = model_selection.cross_val_score(\n    iterative_and_gb, X, y, cv=10)\n\n# Mean imputation and non-linear model\nmean_and_gb = make_pipeline(impute.SimpleImputer(),\n                            HistGradientBoostingRegressor())\nscores['Mean imputation\\n+ HistGradientBoostingRegressor'] = model_selection.cross_val_score(\n    mean_and_gb, X, y, cv=10)\n\n# And now the HistGradientBoostingRegressor, without imputation\nscores['HistGradientBoostingRegressor'] = model_selection.cross_val_score(\n    HistGradientBoostingRegressor(), X, y, cv=10)\n\n# We plot the results\nsns.boxplot(data=pd.DataFrame(scores), orient='h')\nplt.title('Prediction accuracy\\n linear and small data\\n'\n          'Missing not at Random')\nplt.tight_layout()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We can see that the imputation is not the most important step of the\npipeline [#]_; rather, **what matters is to use a powerful model**.\nHere the missingness carries information (if a value is missing, it is\nlarge), information that a model can use to predict better.\n\n.. [#] Note that there are fewer missing values in the example here\n   compared to the section above on MCAR, hence the absolute prediction\n   accuracies are not comparable.\n\n" ] },
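{ "cell_type": "markdown", "metadata": {}, "source": [ "One way to expose this information to a model that cannot handle\nmissing values natively is the ``add_indicator`` option of the imputers,\nwhich appends binary missing-value indicator columns to the imputed\nfeatures (a short added illustration, not part of the comparison above)\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Appending missingness indicators lets even a linear model use the\n# information carried by the missingness itself\nindicator_and_ridge = make_pipeline(\n    impute.SimpleImputer(add_indicator=True), RidgeCV())\nmodel_selection.cross_val_score(indicator_and_ridge, X, y, cv=10)" ] },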
{ "cell_type": "markdown", "metadata": {}, "source": [ ".. topic:: Prediction with missing values\n\n   The data above are very simple: a linear data-generating mechanism,\n   Gaussian features, and low dimensionality. Yet, they show the\n   importance of using non-linear models, in particular the\n   HistGradientBoostingRegressor, which natively deals with missing\n   values.\n\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Using a predictor for the fully-observed case\n\nLet us go back to the \"easy\" case of the missing completely at random\nsettings with plenty of data\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "n_samples = 20000\n\nX, y = generate_mcar(n_samples, missing_rate=.5)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Suppose we have been able to train a predictive model that works on\nfully-observed data:\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X_full, y_full = generate_without_missing_values(n_samples)\nfull_data_predictor = HistGradientBoostingRegressor()\nfull_data_predictor.fit(X_full, y_full)\n\nmodel_selection.cross_val_score(full_data_predictor, X_full, y_full)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The cross-validation reveals that the predictor achieves an excellent\nexplained variance; it is a near-perfect predictor on fully-observed\ndata\n\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now we turn to data with missing values. Given that our data is MAR\n(missing at random), we use imputation to build a completed dataset\nthat looks like the fully-observed data\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "iterative_imputer = impute.IterativeImputer()\nX_imputed = iterative_imputer.fit_transform(X)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The full-data predictor can be used on the imputed data\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn import metrics\nmetrics.r2_score(y, full_data_predictor.predict(X_imputed))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "This prediction is not as good as on the full data, but this is\nexpected, as missing values lead to a loss of information. We can\ncompare it to a model trained to predict on data with missing values\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X_train, y_train = generate_mcar(n_samples, missing_rate=.5)\nna_predictor = HistGradientBoostingRegressor()\nna_predictor.fit(X_train, y_train)\n\nmetrics.r2_score(y, na_predictor.predict(X))" ] },
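{ "cell_type": "markdown", "metadata": {}, "source": [ "For easier reading, we can print the two scores side by side (a small\nsummary added here, recomputing the numbers from the cells above)\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print('Full-data predictor on imputed data:   %.3f'\n      % metrics.r2_score(y, full_data_predictor.predict(X_imputed)))\nprint('Predictor trained with missing values: %.3f'\n      % metrics.r2_score(y, na_predictor.predict(X)))" ] },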
{ "cell_type": "markdown", "metadata": {}, "source": [ "Applying a model valid on the full data to imputed data works almost\nas well as a model trained for missing values. The small loss in\nperformance arises because the imputation is imperfect.\n\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### When the data-generating mechanism is non-linear\n\nWe now modify the example above a bit, to consider the situation where y\nis a non-linear function of X\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X, y = generate_mcar(n_samples, missing_rate=.5)\ny = y ** 2\n\n# Train a predictive model that works on fully-observed data:\nX_full, y_full = generate_without_missing_values(n_samples)\ny_full = y_full ** 2\nfull_data_predictor = HistGradientBoostingRegressor()\nfull_data_predictor.fit(X_full, y_full)\n\nmodel_selection.cross_val_score(full_data_predictor, X_full, y_full)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Once again, we have a near-perfect predictor on fully-observed data\n\nOn data with missing values:\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "iterative_imputer = impute.IterativeImputer()\nX_imputed = iterative_imputer.fit_transform(X)\n\nfrom sklearn import metrics\nmetrics.r2_score(y, full_data_predictor.predict(X_imputed))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The full-data predictor performs much worse here\n\nNow we use a model trained to predict on data with missing values\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X_train, y_train = generate_mcar(n_samples, missing_rate=.5)\ny_train = y_train ** 2\nna_predictor = HistGradientBoostingRegressor()\nna_predictor.fit(X_train, y_train)\n\nmetrics.r2_score(y, na_predictor.predict(X))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The model trained on data with missing values works significantly\nbetter than the one that was optimal for the fully-observed data.\n\n**Only for a linear mechanism is the model trained on the full data also\noptimal for perfectly imputed data**. When the function linking X to y\nhas curvature, this curvature turns uncertainty resulting from\nmissingness into bias [#]_.\n\n.. [#] The detailed mathematical analysis of prediction after\n   imputation can be found here: https://arxiv.org/abs/2106.00311\n\n|\n\n________\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.2" } }, "nbformat": 4, "nbformat_minor": 0 }