{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Model Complexity Influence\n\nDemonstrate how model complexity influences both prediction accuracy and\ncomputational performance.\n\nWe will be using two datasets:\n - `diabetes_dataset` for regression.\n This dataset consists of 10 measurements taken from diabetes patients.\n The task is to predict disease progression;\n - `20newsgroups_dataset` for classification. This dataset consists of\n newsgroup posts. The task is to predict on which topic (out of 20 topics)\n the post is written about.\n\nWe will model the complexity influence on three different estimators:\n - :class:`~sklearn.linear_model.SGDClassifier` (for classification data)\n which implements stochastic gradient descent learning;\n\n - :class:`~sklearn.svm.NuSVR` (for regression data) which implements\n Nu support vector regression;\n\n - :class:`~sklearn.ensemble.GradientBoostingRegressor` builds an additive\n model in a forward stage-wise fashion. Notice that\n :class:`~sklearn.ensemble.HistGradientBoostingRegressor` is much faster\n than :class:`~sklearn.ensemble.GradientBoostingRegressor` starting with\n intermediate datasets (`n_samples >= 10_000`), which is not the case for\n this example.\n\n\nWe make the model complexity vary through the choice of relevant model\nparameters in each of our selected models. Next, we will measure the influence\non both computational performance (latency) and predictive power (MSE or\nHamming Loss).\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport time\n\nimport matplotlib.pyplot as plt\nimport numpy as np\n\nfrom sklearn import datasets\nfrom sklearn.ensemble import GradientBoostingRegressor\nfrom sklearn.linear_model import SGDClassifier\nfrom sklearn.metrics import hamming_loss, mean_squared_error\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.svm import NuSVR\n\n# Initialize random generator\nnp.random.seed(0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load the data\n\nFirst we load both datasets.\n\n
<div class=\"alert alert-info\"><h4>Note</h4><p>We are using\n :func:`~sklearn.datasets.fetch_20newsgroups_vectorized` to download the 20\n newsgroups dataset. It returns ready-to-use features.</p></div>\n\n<div class=\"alert alert-info\"><h4>Note</h4><p>``X`` of the 20 newsgroups dataset is a sparse matrix, while ``X``\n of the diabetes dataset is a numpy array.</p></div>\n\n\n" ] },
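{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sanity check (an addition to this example, not part of the\noriginal workflow), the next cell confirms the note above: the 20 newsgroups\nfeatures are returned as a SciPy sparse matrix, while the diabetes features\nform a dense numpy array. ``scipy.sparse.issparse`` tells the two apart.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Optional check of the feature-matrix types discussed above.\n# `fetch_20newsgroups_vectorized` downloads the data on first use.\nfrom scipy.sparse import issparse\n\nX_diabetes, _ = datasets.load_diabetes(return_X_y=True)\nX_news, _ = datasets.fetch_20newsgroups_vectorized(subset=\"all\", return_X_y=True)\n\nprint(\"diabetes X is sparse:\", issparse(X_diabetes))\nprint(\"20 newsgroups X is sparse:\", issparse(X_news))" ] },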
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def generate_data(case):\n    \"\"\"Generate regression/classification data.\"\"\"\n    if case == \"regression\":\n        X, y = datasets.load_diabetes(return_X_y=True)\n        train_size = 0.8\n    elif case == \"classification\":\n        X, y = datasets.fetch_20newsgroups_vectorized(subset=\"all\", return_X_y=True)\n        train_size = 0.4  # to make the example run faster\n\n    X_train, X_test, y_train, y_test = train_test_split(\n        X, y, train_size=train_size, random_state=0\n    )\n\n    data = {\"X_train\": X_train, \"X_test\": X_test, \"y_train\": y_train, \"y_test\": y_test}\n    return data\n\n\nregression_data = generate_data(\"regression\")\nclassification_data = generate_data(\"classification\")" ] },
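{ "cell_type": "markdown", "metadata": {}, "source": [ "Before benchmarking, it can help to inspect what ``generate_data`` returned\n(this cell is an addition, not part of the original example). Both\ndictionaries share the same keys; only the underlying containers differ.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Print the train/test shapes for both tasks; the splits above used\n# train_size=0.8 for regression and train_size=0.4 for classification.\nfor name, data in [\n    (\"regression\", regression_data),\n    (\"classification\", classification_data),\n]:\n    print(\n        \"%s: X_train %s, X_test %s\"\n        % (name, data[\"X_train\"].shape, data[\"X_test\"].shape)\n    )" ] },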
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Benchmark influence\nNext, we can calculate the influence of the parameters on the given\nestimator. In each round, we will set the estimator with the new value of\n``changing_param`` and collect the prediction times, prediction\nperformance, and complexities to see how those changes affect the estimator.\nWe will calculate the complexity using ``complexity_computer`` passed as a\nparameter.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def benchmark_influence(conf):\n    \"\"\"\n    Benchmark influence of `changing_param` on both MSE and latency.\n    \"\"\"\n    prediction_times = []\n    prediction_powers = []\n    complexities = []\n    for param_value in conf[\"changing_param_values\"]:\n        conf[\"tuned_params\"][conf[\"changing_param\"]] = param_value\n        estimator = conf[\"estimator\"](**conf[\"tuned_params\"])\n\n        print(\"Benchmarking %s\" % estimator)\n        estimator.fit(conf[\"data\"][\"X_train\"], conf[\"data\"][\"y_train\"])\n        conf[\"postfit_hook\"](estimator)\n        complexity = conf[\"complexity_computer\"](estimator)\n        complexities.append(complexity)\n        start_time = time.time()\n        for _ in range(conf[\"n_samples\"]):\n            y_pred = estimator.predict(conf[\"data\"][\"X_test\"])\n        elapsed_time = (time.time() - start_time) / float(conf[\"n_samples\"])\n        prediction_times.append(elapsed_time)\n        pred_score = conf[\"prediction_performance_computer\"](\n            conf[\"data\"][\"y_test\"], y_pred\n        )\n        prediction_powers.append(pred_score)\n        print(\n            \"Complexity: %d | %s: %.4f | Pred. Time: %fs\\n\"\n            % (\n                complexity,\n                conf[\"prediction_performance_label\"],\n                pred_score,\n                elapsed_time,\n            )\n        )\n    return prediction_powers, prediction_times, complexities" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Choose parameters\n\nWe choose the parameters for each of our estimators by making\na dictionary with all the necessary values.\n``changing_param`` is the name of the parameter which will vary in each\nestimator.\nComplexity will be defined by the ``complexity_label`` and calculated using\n``complexity_computer``.\nAlso note that, depending on the estimator type, we pass\ndifferent data.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def _count_nonzero_coefficients(estimator):\n    a = estimator.coef_.toarray()\n    return np.count_nonzero(a)\n\n\nconfigurations = [\n    {\n        \"estimator\": SGDClassifier,\n        \"tuned_params\": {\n            \"penalty\": \"elasticnet\",\n            \"alpha\": 0.001,\n            \"loss\": \"modified_huber\",\n            \"fit_intercept\": True,\n            \"tol\": 1e-1,\n            \"n_iter_no_change\": 2,\n        },\n        \"changing_param\": \"l1_ratio\",\n        \"changing_param_values\": [0.25, 0.5, 0.75, 0.9],\n        \"complexity_label\": \"non-zero coefficients\",\n        \"complexity_computer\": _count_nonzero_coefficients,\n        \"prediction_performance_computer\": hamming_loss,\n        \"prediction_performance_label\": \"Hamming Loss (Misclassification Ratio)\",\n        \"postfit_hook\": lambda x: x.sparsify(),\n        \"data\": classification_data,\n        \"n_samples\": 5,\n    },\n    {\n        \"estimator\": NuSVR,\n        \"tuned_params\": {\"C\": 1e3, \"gamma\": 2**-15},\n        \"changing_param\": \"nu\",\n        \"changing_param_values\": [0.05, 0.1, 0.2, 0.35, 0.5],\n        \"complexity_label\": \"n_support_vectors\",\n        \"complexity_computer\": lambda x: len(x.support_vectors_),\n        \"data\": regression_data,\n        \"postfit_hook\": lambda x: x,\n        \"prediction_performance_computer\": mean_squared_error,\n        \"prediction_performance_label\": \"MSE\",\n        \"n_samples\": 15,\n    },\n    {\n        \"estimator\": GradientBoostingRegressor,\n        \"tuned_params\": {\n            \"loss\": \"squared_error\",\n            \"learning_rate\": 0.05,\n            \"max_depth\": 2,\n        },\n        \"changing_param\": \"n_estimators\",\n        \"changing_param_values\": [10, 25, 50, 75, 100],\n        \"complexity_label\": \"n_trees\",\n        \"complexity_computer\": lambda x: x.n_estimators,\n        \"data\": regression_data,\n        \"postfit_hook\": lambda x: x,\n        \"prediction_performance_computer\": mean_squared_error,\n        \"prediction_performance_label\": \"MSE\",\n        \"n_samples\": 15,\n    },\n]" ] },
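{ "cell_type": "markdown", "metadata": {}, "source": [ "To illustrate how a ``complexity_computer`` works (an extra cell, not part\nof the original example), we can fit a single :class:`~sklearn.svm.NuSVR` on\nthe regression data and read off the number of support vectors, which is the\ncomplexity measure used in the second configuration above: larger values of\n``nu`` keep more support vectors.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Fit one NuSVR per value of `nu` and report the complexity measure\n# used above: the number of support vectors kept by the model.\nfor nu in (0.1, 0.5):\n    model = NuSVR(nu=nu, C=1e3, gamma=2**-15)\n    model.fit(regression_data[\"X_train\"], regression_data[\"y_train\"])\n    print(\"nu=%.2f -> %d support vectors\" % (nu, len(model.support_vectors_)))" ] },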
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Run the code and plot the results\n\nWe have defined all the functions required to run our benchmark. Now, we\nloop over the different configurations that we defined previously and\nanalyze the plots obtained from the benchmark:\nRelaxing the `L1` penalty in the SGD classifier reduces the prediction error\nbut leads to an increase in the prediction time.\nWe can draw a similar analysis regarding the prediction time, which increases\nwith the number of support vectors in a Nu-SVR. However, we observe that\nthere is an optimal number of support vectors which reduces the prediction\nerror. Indeed, too few support vectors lead to an under-fitted model while\ntoo many support vectors lead to an over-fitted model.\nThe same conclusion can be drawn for the gradient-boosting model. The\nonly difference with the Nu-SVR is that having too many trees in the\nensemble is not as detrimental.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def plot_influence(conf, mse_values, prediction_times, complexities):\n    \"\"\"\n    Plot influence of model complexity on both accuracy and latency.\n    \"\"\"\n\n    fig = plt.figure()\n    fig.subplots_adjust(right=0.75)\n\n    # first axes (prediction error)\n    ax1 = fig.add_subplot(111)\n    line1 = ax1.plot(complexities, mse_values, c=\"tab:blue\", ls=\"-\")[0]\n    ax1.set_xlabel(\"Model Complexity (%s)\" % conf[\"complexity_label\"])\n    y1_label = conf[\"prediction_performance_label\"]\n    ax1.set_ylabel(y1_label)\n\n    ax1.spines[\"left\"].set_color(line1.get_color())\n    ax1.yaxis.label.set_color(line1.get_color())\n    ax1.tick_params(axis=\"y\", colors=line1.get_color())\n\n    # second axes (latency)\n    ax2 = fig.add_subplot(111, sharex=ax1, frameon=False)\n    line2 = ax2.plot(complexities, prediction_times, c=\"tab:orange\", ls=\"-\")[0]\n    ax2.yaxis.tick_right()\n    ax2.yaxis.set_label_position(\"right\")\n    y2_label = \"Time (s)\"\n    ax2.set_ylabel(y2_label)\n    ax1.spines[\"right\"].set_color(line2.get_color())\n    ax2.yaxis.label.set_color(line2.get_color())\n    ax2.tick_params(axis=\"y\", colors=line2.get_color())\n\n    plt.legend(\n        (line1, line2), (\"prediction error\", \"prediction latency\"), loc=\"upper center\"\n    )\n\n    plt.title(\n        \"Influence of varying '%s' on %s\"\n        % (conf[\"changing_param\"], conf[\"estimator\"].__name__)\n    )\n\n\nfor conf in configurations:\n    prediction_performances, prediction_times, complexities = benchmark_influence(conf)\n    plot_influence(conf, prediction_performances, prediction_times, complexities)\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n\nIn conclusion, we can draw the following insights:\n\n* a model which is more complex (or expressive) will require a longer\n training time;\n* a more complex model does not guarantee a lower prediction error.\n\nThese aspects are related to model generalization and to avoiding model\nunder-fitting or over-fitting.\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }