{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Imputation of missing values using Matrix Factorization\n", "\n", "This is a short example of imputing missing values with the package [cmfrec](https://github.com/david-cortes/cmfrec) (low-rank matrix factorization), which, among other things, can be used to impute tabular data using matrix factorization models.\n", "\n", "The basic idea is to produce an approximate factorization of a matrix $\\mathbf{X}$ with missing data as the product of two lower-dimensional matrices, chosen to minimize the squared error with respect to the non-missing entries:\n", "$$\n", "\\mathbf{X} \\approx \\mathbf{A} \\mathbf{B}^T\n", "$$\n", "\n", "Optionally, other terms can be added, such as centering ($\\mu$) and column intercepts ($\\mathbf{b}$) - e.g.:\n", "$$\n", "\\mathbf{X} \\approx \\mathbf{A} \\mathbf{B}^T + \\mu + \\mathbf{b}\n", "$$\n", "\n", "Missing values are then imputed according to the values that such a factorization (fitted to the non-missing values) would produce.\n", "\n", "Typically, values imputed in this way are of lower quality than those produced by iterative imputation methods, but for high-dimensional data, such a model is typically faster to fit and to produce imputations.\n", "\n", "Note that `cmfrec` is aimed primarily at recommender systems, and using it for imputation of tabular data benefits from changing the default parameters - particularly: do not add user (row) biases, and do not regularize the item biases (column intercepts). 
See the code below for an example of changing these parameters.\n", "\n", "For more information about the library and methods, see the package's webpage:\n", "https://github.com/david-cortes/cmfrec\n", "** *\n", "The example here is copy-pasted from SciKit-Learn's usage guide for their imputer, and just adds a few extra lines that do the job with this package.\n", "\n", "Original code was taken from this link:\n", "\n", "https://scikit-learn.org/stable/auto_examples/impute/plot_iterative_imputer_variants_comparison.html#sphx-glr-auto-examples-impute-plot-iterative-imputer-variants-comparison-py" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "### As of 2021-07-21, SciKit-Learn's example throws lots of convergence warnings\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")\n", "\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "\n", "# To use this experimental feature, we need to explicitly ask for it:\n", "from sklearn.experimental import enable_iterative_imputer # noqa\n", "from sklearn.datasets import fetch_california_housing\n", "from sklearn.impute import SimpleImputer\n", "from sklearn.impute import IterativeImputer\n", "from sklearn.linear_model import BayesianRidge\n", "from sklearn.tree import DecisionTreeRegressor\n", "from sklearn.ensemble import ExtraTreesRegressor\n", "from sklearn.neighbors import KNeighborsRegressor\n", "from sklearn.pipeline import make_pipeline\n", "from sklearn.model_selection import cross_val_score\n", "\n", "N_SPLITS = 5\n", "\n", "rng = np.random.RandomState(0)\n", "\n", "X_full, y_full = fetch_california_housing(return_X_y=True)\n", "# ~2k samples is enough for the purpose of the example.\n", "# Remove the following two lines for a slower run with different error bars.\n", "X_full = X_full[::10]\n", "y_full = y_full[::10]\n", "n_samples, n_features = X_full.shape\n", "\n", "# Estimate the score on the entire dataset, with no missing values\n", "br_estimator = BayesianRidge()\n", "score_full_data = pd.DataFrame(\n", " cross_val_score(\n", " br_estimator, X_full, y_full, scoring='neg_mean_squared_error',\n", " cv=N_SPLITS\n", " ),\n", " columns=['Full Data']\n", ")\n", "\n", "# Add a single missing value to each row\n", "X_missing = X_full.copy()\n", "y_missing = y_full\n", "missing_samples = np.arange(n_samples)\n", "missing_features = rng.choice(n_features, n_samples, replace=True)\n", "X_missing[missing_samples, missing_features] = np.nan\n", "\n", "# Estimate the score after imputation (mean and median 
strategies)\n", "score_simple_imputer = pd.DataFrame()\n", "for strategy in ('mean', 'median'):\n", " estimator = make_pipeline(\n", " SimpleImputer(missing_values=np.nan, strategy=strategy),\n", " br_estimator\n", " )\n", " score_simple_imputer[strategy] = cross_val_score(\n", " estimator, X_missing, y_missing, scoring='neg_mean_squared_error',\n", " cv=N_SPLITS\n", " )\n", "\n", " \n", "##### NEW ADDITION HERE #########\n", "# This is the piece of code that adds imputations with cmfrec\n", "from cmfrec import CMF_imputer\n", "estimator = make_pipeline(\n", " CMF_imputer(k=10, user_bias=False, lambda_=[0,0,10,10,10,10],\n", " verbose=False),\n", " br_estimator\n", ")\n", "score_simple_imputer[\"MF (cmfrec)\"] = cross_val_score(\n", " estimator, X_missing, y_missing, scoring='neg_mean_squared_error',\n", " cv=N_SPLITS\n", ")\n", "##### END OF NEW ADDITION #########\n", "\n", "\n", "# Estimate the score after iterative imputation of the missing values\n", "# with different estimators\n", "estimators = [\n", " BayesianRidge(),\n", " DecisionTreeRegressor(max_features='sqrt', random_state=0),\n", " ExtraTreesRegressor(n_estimators=10, random_state=0),\n", " KNeighborsRegressor(n_neighbors=15)\n", "]\n", "score_iterative_imputer = pd.DataFrame()\n", "for impute_estimator in estimators:\n", " estimator = make_pipeline(\n", " IterativeImputer(random_state=0, estimator=impute_estimator),\n", " br_estimator\n", " )\n", " score_iterative_imputer[impute_estimator.__class__.__name__] = \\\n", " cross_val_score(\n", " estimator, X_missing, y_missing, scoring='neg_mean_squared_error',\n", " cv=N_SPLITS\n", " )\n", "\n", "scores = pd.concat(\n", " [score_full_data, score_simple_imputer, score_iterative_imputer],\n", " keys=['Original', 'SimpleImputer', 'IterativeImputer'], axis=1\n", ")\n", "\n", "# plot california housing results\n", "fig, ax = plt.subplots(figsize=(13, 6))\n", "means = -scores.mean()\n", "errors = scores.std()\n", "means.plot.barh(xerr=errors, ax=ax)\n", 
"ax.set_title('California Housing Regression with Different Imputation Methods')\n", "ax.set_xlabel('MSE (smaller is better)')\n", "ax.set_yticks(np.arange(means.shape[0]))\n", "ax.set_yticklabels([\" w/ \".join(label) for label in means.index.tolist()])\n", "plt.tight_layout(pad=1)\n", "plt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python (OpenBLAS)", "language": "python", "name": "py3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.9" } }, "nbformat": 4, "nbformat_minor": 4 }