{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Lasso model selection: AIC-BIC / cross-validation\n\nThis example focuses on model selection for Lasso models that are\nlinear models with an L1 penalty for regression problems.\n\nIndeed, several strategies can be used to select the value of the\nregularization parameter: via cross-validation or using an information\ncriterion, namely AIC or BIC.\n\nIn what follows, we will discuss in details the different strategies.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataset\nIn this example, we will use the diabetes dataset.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.datasets import load_diabetes\n\nX, y = load_diabetes(return_X_y=True, as_frame=True)\nX.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition, we add some random features to the original data to\nbetter illustrate the feature selection performed by the Lasso model.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\nimport pandas as pd\n\nrng = np.random.RandomState(42)\nn_random_features = 14\nX_random = pd.DataFrame(\n rng.randn(X.shape[0], n_random_features),\n columns=[f\"random_{i:02d}\" for i in range(n_random_features)],\n)\nX = pd.concat([X, X_random], axis=1)\n# Show only a subset of the columns\nX[X.columns[::3]].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selecting Lasso via an information criterion\n:class:`~sklearn.linear_model.LassoLarsIC` provides a Lasso estimator that\nuses the Akaike information criterion (AIC) or the Bayes information\ncriterion (BIC) to select the optimal value of the regularization\nparameter alpha.\n\nBefore fitting the model, we will standardize the data with a\n:class:`~sklearn.preprocessing.StandardScaler`. 
In addition, we will\nmeasure the time to fit and tune the hyperparameter alpha in order to\ncompare with the cross-validation strategy.\n\nWe will first fit a Lasso model with the AIC criterion.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import time\n\nfrom sklearn.linear_model import LassoLarsIC\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.preprocessing import StandardScaler\n\nstart_time = time.time()\nlasso_lars_ic = make_pipeline(StandardScaler(), LassoLarsIC(criterion=\"aic\")).fit(X, y)\nfit_time = time.time() - start_time" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We store the AIC metric for each value of alpha used during `fit`.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "results = pd.DataFrame(\n    {\n        \"alphas\": lasso_lars_ic[-1].alphas_,\n        \"AIC criterion\": lasso_lars_ic[-1].criterion_,\n    }\n).set_index(\"alphas\")\nalpha_aic = lasso_lars_ic[-1].alpha_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we perform the same analysis using the BIC criterion.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "lasso_lars_ic.set_params(lassolarsic__criterion=\"bic\").fit(X, y)\nresults[\"BIC criterion\"] = lasso_lars_ic[-1].criterion_\nalpha_bic = lasso_lars_ic[-1].alpha_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can check which value of `alpha` leads to the minimum AIC and BIC.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def highlight_min(x):\n    x_min = x.min()\n    return [\"font-weight: bold\" if v == x_min else \"\" for v in x]\n\n\nresults.style.apply(highlight_min)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we can plot the AIC and BIC values for the different alpha values.\nThe vertical lines in the plot correspond to the alpha chosen for each\ncriterion. The selected alpha corresponds to the minimum of the AIC or BIC\ncriterion.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ax = results.plot()\nax.vlines(\n    alpha_aic,\n    results[\"AIC criterion\"].min(),\n    results[\"AIC criterion\"].max(),\n    label=\"alpha: AIC estimate\",\n    linestyle=\"--\",\n    color=\"tab:blue\",\n)\nax.vlines(\n    alpha_bic,\n    results[\"BIC criterion\"].min(),\n    results[\"BIC criterion\"].max(),\n    label=\"alpha: BIC estimate\",\n    linestyle=\"--\",\n    color=\"tab:orange\",\n)\nax.set_xlabel(r\"$\\alpha$\")\nax.set_ylabel(\"criterion\")\nax.set_xscale(\"log\")\nax.legend()\n_ = ax.set_title(\n    f\"Information criterion for model selection (training time {fit_time:.2f}s)\"\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Model selection with an information criterion is very fast. It relies on\ncomputing the criterion in-sample, i.e. on the training set provided to `fit`.\nBoth criteria estimate the model generalization error based on the training\nset error and penalize this overly optimistic error. However, this penalty\nrelies on a proper estimation of the degrees of freedom and the noise\nvariance. Both are derived for large samples (asymptotic results) and assume\nthe model is correct, i.e. that the data are actually generated by this model.\n\nThese models also tend to break when the problem is badly conditioned (more\nfeatures than samples). It is then required to provide an estimate of the\nnoise variance.\n\n" ] },
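{ "cell_type": "markdown", "metadata": {}, "source": [ "For reference, the two criteria have the usual form\n\n\\begin{align}\\mathrm{AIC} = -2 \\log(\\hat{L}) + 2 d, \\qquad\n\\mathrm{BIC} = -2 \\log(\\hat{L}) + \\log(n) \\, d,\\end{align}\n\nwhere $\\hat{L}$ is the maximized likelihood of the model, $n$ is the number of\nsamples, and $d$ is the number of effective degrees of freedom mentioned above.\nBoth reward goodness of fit through the likelihood term and penalize model\ncomplexity through $d$; since $\\log(n) > 2$ as soon as $n \\geq 8$, BIC penalizes\ncomplexity more strongly and thus tends to select sparser models.\n\n" ] },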
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Selecting Lasso via cross-validation\nThe Lasso estimator can be implemented with different solvers: coordinate\ndescent and least angle regression. They differ with regard to their\nexecution speed and sources of numerical errors.\n\nIn scikit-learn, two different estimators are available with integrated\ncross-validation: :class:`~sklearn.linear_model.LassoCV` and\n:class:`~sklearn.linear_model.LassoLarsCV`, which respectively solve the\nproblem with coordinate descent and least angle regression.\n\nIn the remainder of this section, we will present both approaches. For both\nalgorithms, we will use a 20-fold cross-validation strategy.\n\n### Lasso via coordinate descent\nLet's start by tuning the hyperparameter using\n:class:`~sklearn.linear_model.LassoCV`.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.linear_model import LassoCV\n\nstart_time = time.time()\nmodel = make_pipeline(StandardScaler(), LassoCV(cv=20)).fit(X, y)\nfit_time = time.time() - start_time" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n\nymin, ymax = 2300, 3800\nlasso = model[-1]\nplt.semilogx(lasso.alphas_, lasso.mse_path_, linestyle=\":\")\nplt.plot(\n    lasso.alphas_,\n    lasso.mse_path_.mean(axis=-1),\n    color=\"black\",\n    label=\"Average across the folds\",\n    linewidth=2,\n)\nplt.axvline(lasso.alpha_, linestyle=\"--\", color=\"black\", label=\"alpha: CV estimate\")\n\nplt.ylim(ymin, ymax)\nplt.xlabel(r\"$\\alpha$\")\nplt.ylabel(\"Mean square error\")\nplt.legend()\n_ = plt.title(\n    f\"Mean square error on each fold: coordinate descent (train time: {fit_time:.2f}s)\"\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lasso via least angle regression\nNow, let's tune the hyperparameter using\n:class:`~sklearn.linear_model.LassoLarsCV`.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.linear_model import LassoLarsCV\n\nstart_time = time.time()\nmodel = make_pipeline(StandardScaler(), LassoLarsCV(cv=20)).fit(X, y)\nfit_time = time.time() - start_time" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "lasso = model[-1]\nplt.semilogx(lasso.cv_alphas_, lasso.mse_path_, \":\")\nplt.semilogx(\n    lasso.cv_alphas_,\n    lasso.mse_path_.mean(axis=-1),\n    color=\"black\",\n    label=\"Average across the folds\",\n    linewidth=2,\n)\nplt.axvline(lasso.alpha_, linestyle=\"--\", color=\"black\", label=\"alpha: CV estimate\")\n\nplt.ylim(ymin, ymax)\nplt.xlabel(r\"$\\alpha$\")\nplt.ylabel(\"Mean square error\")\nplt.legend()\n_ = plt.title(f\"Mean square error on each fold: Lars (train time: {fit_time:.2f}s)\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Summary of cross-validation approach\nBoth algorithms give roughly the same results.\n\nLars computes the solution path only at each kink in the path. As a result,\nit is very efficient when there are only a few kinks, which is the case if\nthere are few features or samples. It is also able to compute the full path\nwithout setting any hyperparameter. In contrast, coordinate descent computes\nthe path points on a pre-specified grid (here we use the default). Thus it is\nmore efficient if the number of grid points is smaller than the number of\nkinks in the path. Such a strategy can be interesting if the number of\nfeatures is really large and there are enough samples to be selected in each\nof the cross-validation folds. In terms of numerical errors, Lars will\naccumulate more errors for heavily correlated variables, while the coordinate\ndescent algorithm will only sample the path on a grid.\n\nNote how the optimal value of alpha varies for each fold. This illustrates\nwhy nested cross-validation is a good strategy when evaluating the\nperformance of a method for which a parameter is chosen by cross-validation:\nthe selected value of the parameter may not be optimal for a final evaluation\non an unseen test set.\n\n" ] },
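{ "cell_type": "markdown", "metadata": {}, "source": [ "Below is a minimal sketch of such a nested scheme: the inner loop, handled by\n:class:`~sklearn.linear_model.LassoCV`, selects `alpha` on each training split,\nwhile the outer loop estimates the generalization error of the whole tuning\nprocedure. The choice of 5 outer folds and of the mean squared error as scoring\nmetric is arbitrary here.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.model_selection import cross_val_score\n\n# Outer loop: 5-fold CV scoring of the whole pipeline; inner loop: LassoCV's own\n# 20-fold CV selects alpha independently on each outer training split.\nnested_scores = cross_val_score(\n    make_pipeline(StandardScaler(), LassoCV(cv=20)),\n    X,\n    y,\n    cv=5,\n    scoring=\"neg_mean_squared_error\",\n)\nprint(f\"Nested CV MSE: {-nested_scores.mean():.1f} +/- {nested_scores.std():.1f}\")" ] },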
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\nIn this tutorial, we presented two approaches for selecting the best\nhyperparameter `alpha`: one strategy finds the optimal value of `alpha`\nusing only the training set and an information criterion, while the other\nis based on cross-validation.\n\nIn this example, both approaches work similarly. In-sample hyperparameter\nselection even has an edge in terms of computational performance. However,\nit can only be used when the number of samples is large enough compared to\nthe number of features.\n\nThat's why hyperparameter optimization via cross-validation is a safe\nstrategy: it works in different settings.\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }