{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Lasso on dense and sparse data\n\nWe show that linear_model.Lasso provides the same results for dense and sparse\ndata and that in the case of sparse data the speed is improved.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause\n\nfrom time import time\n\nfrom scipy import linalg, sparse\n\nfrom sklearn.datasets import make_regression\nfrom sklearn.linear_model import Lasso" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comparing the two Lasso implementations on Dense data\n\nWe create a linear regression problem that is suitable for the Lasso,\nthat is to say, with more features than samples. We then store the data\nmatrix in both dense (the usual) and sparse format, and train a Lasso on\neach. We compute the runtime of both and check that they learned the\nsame model by computing the Euclidean norm of the difference between the\ncoefficients they learned. Because the data is dense, we expect better\nruntime with a dense data format.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X, y = make_regression(n_samples=200, n_features=5000, random_state=0)\n# create a copy of X in sparse format\nX_sp = sparse.coo_matrix(X)\n\nalpha = 1\nsparse_lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=1000)\ndense_lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=1000)\n\nt0 = time()\nsparse_lasso.fit(X_sp, y)\nprint(f\"Sparse Lasso done in {(time() - t0):.3f}s\")\n\nt0 = time()\ndense_lasso.fit(X, y)\nprint(f\"Dense Lasso done in {(time() - t0):.3f}s\")\n\n# compare the regression coefficients\ncoeff_diff = linalg.norm(sparse_lasso.coef_ - dense_lasso.coef_)\nprint(f\"Distance between coefficients : {coeff_diff:.2e}\")\n\n#" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comparing the two Lasso implementations on Sparse data\n\nWe make the previous problem sparse by replacing all small values with 0\nand run the same comparisons as above. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Comparing the two Lasso implementations on Sparse data\n\nWe make the previous problem sparse by replacing all small values with 0\nand run the same comparisons as above. Because the data is now sparse, we\nexpect the implementation that uses the sparse data format to be faster.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# make a copy of the previous data\nXs = X.copy()\n# make Xs sparse by replacing the values lower than 2.5 with 0s\nXs[Xs < 2.5] = 0.0\n# create a copy of Xs in sparse format and convert it to CSC,\n# which supports efficient column-wise access\nXs_sp = sparse.coo_matrix(Xs)\nXs_sp = Xs_sp.tocsc()\n\n# compute the proportion of non-zero entries in the data matrix\nprint(f\"Matrix density: {(Xs_sp.nnz / float(X.size) * 100):.3f}%\")\n\nalpha = 0.1\nsparse_lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)\ndense_lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)\n\nt0 = time()\nsparse_lasso.fit(Xs_sp, y)\nprint(f\"Sparse Lasso done in {(time() - t0):.3f}s\")\n\nt0 = time()\ndense_lasso.fit(Xs, y)\nprint(f\"Dense Lasso done in {(time() - t0):.3f}s\")\n\n# compare the regression coefficients\ncoeff_diff = linalg.norm(sparse_lasso.coef_ - dense_lasso.coef_)\nprint(f\"Distance between coefficients: {coeff_diff:.2e}\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }