{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# L1-based models for Sparse Signals\n\nThe present example compares three l1-based regression models on a synthetic\nsignal obtained from sparse and correlated features that are further corrupted\nwith additive gaussian noise:\n\n- a `lasso`;\n- an `automatic_relevance_determination`;\n- an `elastic_net`.\n\nIt is known that the Lasso estimates turn to be close to the model selection\nestimates when the data dimensions grow, given that the irrelevant variables are\nnot too correlated with the relevant ones. In the presence of correlated\nfeatures, Lasso itself cannot select the correct sparsity pattern [1]_.\n\nHere we compare the performance of the three models in terms of the $R^2$\nscore, the fitting time and the sparsity of the estimated coefficients when\ncompared with the ground-truth.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generate synthetic dataset\n\nWe generate a dataset where the number of samples is lower than the total\nnumber of features. This leads to an underdetermined system, i.e. the solution\nis not unique, and thus we cannot apply an `ordinary_least_squares` by\nitself. Regularization introduces a penalty term to the objective function,\nwhich modifies the optimization problem and can help alleviate the\nunderdetermined nature of the system.\n\nThe target `y` is a linear combination with alternating signs of sinusoidal\nsignals. Only the 10 lowest out of the 100 frequencies in `X` are used to\ngenerate `y`, while the rest of the features are not informative. 
This results\nin a high dimensional sparse feature space, where some degree of\nl1-penalization is necessary.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n\nrng = np.random.RandomState(0)\nn_samples, n_features, n_informative = 50, 100, 10\ntime_step = np.linspace(-2, 2, n_samples)\nfreqs = 2 * np.pi * np.sort(rng.rand(n_features)) / 0.01\nX = np.zeros((n_samples, n_features))\n\nfor i in range(n_features):\n    X[:, i] = np.sin(freqs[i] * time_step)\n\nidx = np.arange(n_features)\ntrue_coef = (-1) ** idx * np.exp(-idx / 10)\ntrue_coef[n_informative:] = 0  # sparsify coef\ny = np.dot(X, true_coef)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some of the informative features have close frequencies to induce\n(anti-)correlations.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "freqs[:n_informative]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A random phase is introduced using :func:`numpy.random.random_sample`\nand some gaussian noise (implemented by :func:`numpy.random.normal`)\nis added to both the features and the target.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "for i in range(n_features):\n    X[:, i] = np.sin(freqs[i] * time_step + 2 * (rng.random_sample() - 0.5))\n    X[:, i] += 0.2 * rng.normal(0, 1, n_samples)\n\ny += 0.2 * rng.normal(0, 1, n_samples)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Such sparse, noisy and correlated features can be obtained, for instance, from\nsensor nodes monitoring some environmental variables, as they typically register\nsimilar values depending on their positions (spatial correlations).\nWe can visualize the target.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n\nplt.plot(time_step, y)\nplt.ylabel(\"target signal\")\nplt.xlabel(\"time\")\n_ = plt.title(\"Superposition of sinusoidal signals\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We split the data into train and test sets for simplicity. In practice one\nshould use a :class:`~sklearn.model_selection.TimeSeriesSplit`\ncross-validation to estimate the variance of the test score. Here we set\n`shuffle=False` as we must not use training data that come after the test data\nwhen dealing with data that have a temporal relationship.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, shuffle=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the following, we compute the performance of three l1-based models in terms\nof the goodness of fit $R^2$ score and the fitting time. Then we make a\nplot to compare the sparsity of the estimated coefficients with respect to the\nground-truth coefficients and finally we analyze the previous results.\n\n## Lasso\n\nIn this example, we demo a :class:`~sklearn.linear_model.Lasso` with a fixed\nvalue of the regularization parameter `alpha`. In practice, the optimal\nparameter `alpha` should be selected by passing a\n:class:`~sklearn.model_selection.TimeSeriesSplit` cross-validation strategy to a\n:class:`~sklearn.linear_model.LassoCV`. 
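A minimal sketch of that model selection step, not executed here (the number of\nsplits is an arbitrary choice), could look like:\n\n    from sklearn.linear_model import LassoCV\n    from sklearn.model_selection import TimeSeriesSplit\n\n    # cross-validated choice of the regularization strength on the training data\n    lasso_cv = LassoCV(cv=TimeSeriesSplit(n_splits=5)).fit(X_train, y_train)\n    print(lasso_cv.alpha_)\n\n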
To keep the example simple and fast to\nexecute, we directly set the optimal value for `alpha` here.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from time import time\n\nfrom sklearn.linear_model import Lasso\nfrom sklearn.metrics import r2_score\n\nt0 = time()\nlasso = Lasso(alpha=0.14).fit(X_train, y_train)\nprint(f\"Lasso fit done in {(time() - t0):.3f}s\")\n\ny_pred_lasso = lasso.predict(X_test)\nr2_score_lasso = r2_score(y_test, y_pred_lasso)\nprint(f\"Lasso r^2 on test data : {r2_score_lasso:.3f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Automatic Relevance Determination (ARD)\n\nAn ARD regression is the bayesian version of the Lasso. It can produce\ninterval estimates for all of the parameters, including the error variance, if\nrequired. It is a suitable option when the signals have gaussian noise. See\nthe example `sphx_glr_auto_examples_linear_model_plot_ard.py` for a\ncomparison of :class:`~sklearn.linear_model.ARDRegression` and\n:class:`~sklearn.linear_model.BayesianRidge` regressors.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.linear_model import ARDRegression\n\nt0 = time()\nard = ARDRegression().fit(X_train, y_train)\nprint(f\"ARD fit done in {(time() - t0):.3f}s\")\n\ny_pred_ard = ard.predict(X_test)\nr2_score_ard = r2_score(y_test, y_pred_ard)\nprint(f\"ARD r^2 on test data : {r2_score_ard:.3f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ElasticNet\n\n:class:`~sklearn.linear_model.ElasticNet` is a middle ground between\n:class:`~sklearn.linear_model.Lasso` and :class:`~sklearn.linear_model.Ridge`,\nas it combines an L1 and an L2 penalty. The amount of regularization is\ncontrolled by the two hyperparameters `l1_ratio` and `alpha`. For\n`l1_ratio = 0` the penalty is pure L2 and the model is equivalent to a\n:class:`~sklearn.linear_model.Ridge`. Similarly, `l1_ratio = 1` is a pure L1\npenalty and the model is equivalent to a :class:`~sklearn.linear_model.Lasso`.\nFor `0 < l1_ratio < 1`, the penalty is a combination of L1 and L2.\n\n
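For reference, the objective minimized by :class:`~sklearn.linear_model.ElasticNet`\ncan be written, with $\\rho$ denoting `l1_ratio` and $n$ the number of training\nsamples (notation introduced here only for illustration), as\n\n$$\\min_w \\frac{1}{2 n} ||y - Xw||_2^2 + \\alpha \\rho ||w||_1 + \\frac{\\alpha (1 - \\rho)}{2} ||w||_2^2 .$$\n\n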
As done before, we train the model with fixed values for `alpha` and `l1_ratio`.\nTo select their optimal values one would use an\n:class:`~sklearn.linear_model.ElasticNetCV`, which is not shown here to keep the\nexample simple.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.linear_model import ElasticNet\n\nt0 = time()\nenet = ElasticNet(alpha=0.08, l1_ratio=0.5).fit(X_train, y_train)\nprint(f\"ElasticNet fit done in {(time() - t0):.3f}s\")\n\ny_pred_enet = enet.predict(X_test)\nr2_score_enet = r2_score(y_test, y_pred_enet)\nprint(f\"ElasticNet r^2 on test data : {r2_score_enet:.3f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plot and analysis of the results\n\nIn this section, we use a heatmap to visualize the sparsity of the true\nand estimated coefficients of the respective linear models.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\nimport pandas as pd\nimport seaborn as sns\nfrom matplotlib.colors import SymLogNorm\n\ndf = pd.DataFrame(\n    {\n        \"True coefficients\": true_coef,\n        \"Lasso\": lasso.coef_,\n        \"ARDRegression\": ard.coef_,\n        \"ElasticNet\": enet.coef_,\n    }\n)\n\nplt.figure(figsize=(10, 6))\nax = sns.heatmap(\n    df.T,\n    norm=SymLogNorm(linthresh=10e-4, vmin=-1, vmax=1),\n    cbar_kws={\"label\": \"coefficients' values\"},\n    cmap=\"seismic_r\",\n)\nplt.ylabel(\"linear model\")\nplt.xlabel(\"coefficients\")\nplt.title(\n    f\"Models' coefficients\\nLasso $R^2$: {r2_score_lasso:.3f}, \"\n    f\"ARD $R^2$: {r2_score_ard:.3f}, \"\n    f\"ElasticNet $R^2$: {r2_score_enet:.3f}\"\n)\nplt.tight_layout()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the present example :class:`~sklearn.linear_model.ElasticNet` yields the\nbest score and captures most of the predictive features, yet still fails to\nfind all the true components. Notice that both\n:class:`~sklearn.linear_model.ElasticNet` and\n:class:`~sklearn.linear_model.ARDRegression` result in a less sparse model\nthan a :class:`~sklearn.linear_model.Lasso`.\n\n## Conclusions\n\n:class:`~sklearn.linear_model.Lasso` is known to recover sparse data\neffectively but does not perform well with highly correlated features. Indeed,\nif several correlated features contribute to the target,\n:class:`~sklearn.linear_model.Lasso` would end up selecting a single one of\nthem. In the case of sparse yet non-correlated features, a\n:class:`~sklearn.linear_model.Lasso` model would be more suitable.\n\n:class:`~sklearn.linear_model.ElasticNet` introduces some sparsity on the\ncoefficients and shrinks their values towards zero. Thus, in the presence of\ncorrelated features that contribute to the target, the model is still able to\nreduce their weights without setting them exactly to zero. This results in a\nless sparse model than a pure :class:`~sklearn.linear_model.Lasso` and may\ncapture non-predictive features as well.\n\n:class:`~sklearn.linear_model.ARDRegression` is better when handling gaussian\nnoise, but is still unable to handle correlated features and requires more\ntime due to fitting a prior.\n\n## References\n\n.. [1] :doi:`\"Lasso-type recovery of sparse representations for\n high-dimensional data\" N. Meinshausen, B. 
Yu - The Annals of Statistics\n 2009, Vol. 37, No. 1, 246-270 <10.1214/07-AOS582>`\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }