{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Balance model complexity and cross-validated score\n\nThis example demonstrates how to balance model complexity and cross-validated score by\nfinding a decent accuracy within 1 standard deviation of the best accuracy score while\nminimising the number of :class:`~sklearn.decomposition.PCA` components [1]_. It uses\n:class:`~sklearn.model_selection.GridSearchCV` with a custom refit callable to select\nthe optimal model.\n\nThe figure shows the trade-off between cross-validated score and the number\nof PCA components. The balanced case is when `n_components=10` and `accuracy=0.88`,\nwhich falls into the range within 1 standard deviation of the best accuracy\nscore.\n\n## References\n.. [1] Hastie, T., Tibshirani, R., Friedman, J. (2001). Model Assessment and\n Selection. The Elements of Statistical Learning (pp. 219-260). New York,\n NY, USA: Springer New York Inc.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport polars as pl\n\nfrom sklearn.datasets import load_digits\nfrom sklearn.decomposition import PCA\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import GridSearchCV, ShuffleSplit\nfrom sklearn.pipeline import Pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n\nWhen tuning hyperparameters, we often want to balance model complexity and\nperformance. The \"one-standard-error\" rule is a common approach: select the simplest\nmodel whose performance is within one standard error of the best model's performance.\nThis helps to avoid overfitting by preferring simpler models when their performance is\nstatistically comparable to more complex ones.\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Helper functions\n\nWe define two helper functions:\n\n1. `lower_bound`: Calculates the threshold for acceptable performance\n (best score - 1 std)\n\n2. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Helper functions\n\nWe define two helper functions:\n\n1. `lower_bound`: Calculates the threshold for acceptable performance\n   (best score - 1 std)\n\n2. `best_low_complexity`: Selects the model with the fewest PCA components whose\n   score is at least this threshold\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def lower_bound(cv_results):\n    \"\"\"\n    Calculate the lower bound within 1 standard deviation\n    of the best `mean_test_score`.\n\n    Parameters\n    ----------\n    cv_results : dict of numpy (masked) ndarrays\n        See attribute cv_results_ of `GridSearchCV`\n\n    Returns\n    -------\n    float\n        Lower bound within 1 standard deviation of the\n        best `mean_test_score`.\n    \"\"\"\n    best_score_idx = np.argmax(cv_results[\"mean_test_score\"])\n\n    return (\n        cv_results[\"mean_test_score\"][best_score_idx]\n        - cv_results[\"std_test_score\"][best_score_idx]\n    )\n\n\ndef best_low_complexity(cv_results):\n    \"\"\"\n    Balance model complexity with cross-validated score.\n\n    Parameters\n    ----------\n    cv_results : dict of numpy (masked) ndarrays\n        See attribute cv_results_ of `GridSearchCV`.\n\n    Returns\n    -------\n    int\n        Index of the model that has the fewest PCA components\n        while keeping its test score within 1 standard deviation of the best\n        `mean_test_score`.\n    \"\"\"\n    threshold = lower_bound(cv_results)\n    candidate_idx = np.flatnonzero(cv_results[\"mean_test_score\"] >= threshold)\n    best_idx = candidate_idx[\n        cv_results[\"param_reduce_dim__n_components\"][candidate_idx].argmin()\n    ]\n    return best_idx" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Set up the pipeline and parameter grid\n\nWe create a pipeline with two steps:\n\n1. Dimensionality reduction using PCA\n\n2. Classification using LogisticRegression\n\nWe'll search over different numbers of PCA components to find the optimal complexity.\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pipe = Pipeline(\n    [\n        (\"reduce_dim\", PCA(random_state=42)),\n        (\"classify\", LogisticRegression(random_state=42, C=0.01, max_iter=1000)),\n    ]\n)\n\nparam_grid = {\"reduce_dim__n_components\": [6, 8, 10, 15, 20, 25, 35, 45, 55]}" ] },
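{ "cell_type": "markdown", "metadata": {}, "source": [ "The `reduce_dim__n_components` key follows scikit-learn's `<step name>__<parameter>`\nconvention for addressing pipeline hyperparameters. As a quick, optional sanity check\n(a minimal sketch, not required for the search itself), we can confirm the grid key\nmatches a real pipeline parameter:\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Optional sanity check: list pipeline parameters whose name mentions\n# \"n_components\" to confirm the grid key is spelled correctly.\nprint(sorted(name for name in pipe.get_params() if \"n_components\" in name))" ] },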
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Perform the search with GridSearchCV\n\nWe use `GridSearchCV` with our custom `best_low_complexity` function as the refit\nparameter. This function will select the model with the fewest PCA components that\nstill performs within one standard deviation of the best model. Note that when\n`refit` is a callable, `GridSearchCV` sets `best_index_`, `best_params_`, and\n`best_estimator_` from the callable's choice, while `best_score_` is not available.\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "grid = GridSearchCV(\n    pipe,\n    # Use a non-stratified CV strategy to make sure that the inter-fold\n    # standard deviation of the test scores is informative.\n    cv=ShuffleSplit(n_splits=30, random_state=0),\n    n_jobs=1,  # increase this on your machine to use more physical cores\n    param_grid=param_grid,\n    scoring=\"accuracy\",\n    refit=best_low_complexity,\n    return_train_score=True,\n)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Load the digits dataset and fit the model\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X, y = load_digits(return_X_y=True)\ngrid.fit(X, y)" ] },
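{ "cell_type": "markdown", "metadata": {}, "source": [ "Because `refit` is a callable here, `grid.best_index_` and `grid.best_params_`\nreflect the one-standard-error pick rather than the raw best mean score. A small\ncomparison (a sketch that only reads fitted attributes) makes the difference\nexplicit:\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Compare the plain \"highest mean score\" model with the rule's pick.\nplain_best_idx = int(np.argmax(grid.cv_results_[\"mean_test_score\"]))\nprint(\n    \"Plain best n_components:\",\n    grid.cv_results_[\"param_reduce_dim__n_components\"][plain_best_idx],\n)\nprint(\"One-standard-error pick:\", grid.best_params_[\"reduce_dim__n_components\"])" ] },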
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Visualize the results\n\nWe plot the individual and mean train/test scores against the number of PCA\ncomponents, along with horizontal lines indicating the best score and the\none-standard-deviation threshold.\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "n_components = grid.cv_results_[\"param_reduce_dim__n_components\"]\n\n# Create a polars DataFrame for better data manipulation and visualization\nresults_df = pl.DataFrame(\n    {\n        \"n_components\": n_components,\n        \"mean_test_score\": grid.cv_results_[\"mean_test_score\"],\n        \"std_test_score\": grid.cv_results_[\"std_test_score\"],\n        \"mean_train_score\": grid.cv_results_[\"mean_train_score\"],\n        \"std_train_score\": grid.cv_results_[\"std_train_score\"],\n        \"mean_fit_time\": grid.cv_results_[\"mean_fit_time\"],\n        \"rank_test_score\": grid.cv_results_[\"rank_test_score\"],\n    }\n)\n\n# Sort by number of components\nresults_df = results_df.sort(\"n_components\")\n\n# Calculate the lower bound threshold\nlower = lower_bound(grid.cv_results_)\n\n# Get the best model information\nbest_index_ = grid.best_index_\nbest_components = n_components[best_index_]\nbest_score = grid.cv_results_[\"mean_test_score\"][best_index_]\n\n# Add a column to mark the selected model\nresults_df = results_df.with_columns(\n    pl.when(pl.col(\"n_components\") == best_components)\n    .then(pl.lit(\"Selected\"))\n    .otherwise(pl.lit(\"Regular\"))\n    .alias(\"model_type\")\n)\n\n# Get the number of CV splits from the results\nn_splits = sum(\n    1\n    for key in grid.cv_results_.keys()\n    if key.startswith(\"split\") and key.endswith(\"test_score\")\n)\n\n# Extract individual scores for each split (shape: n_candidates x n_splits)\ntest_scores = np.array(\n    [\n        [grid.cv_results_[f\"split{i}_test_score\"][j] for i in range(n_splits)]\n        for j in range(len(n_components))\n    ]\n)\ntrain_scores = np.array(\n    [\n        [grid.cv_results_[f\"split{i}_train_score\"][j] for i in range(n_splits)]\n        for j in range(len(n_components))\n    ]\n)\n\n# Calculate mean and std of test scores\nmean_test_scores = np.mean(test_scores, axis=1)\nstd_test_scores = np.std(test_scores, axis=1)\n\n# Find best score and threshold\nbest_mean_score = np.max(mean_test_scores)\nthreshold = best_mean_score - std_test_scores[np.argmax(mean_test_scores)]\n\n# Create a single figure for visualization\nplt.figure(figsize=(12, 8))\n\n# Plot individual points\nfor i, comp in enumerate(n_components):\n    # Plot individual test points\n    plt.scatter(\n        [comp] * n_splits,\n        test_scores[i],\n        alpha=0.2,\n        color=\"blue\",\n        s=20,\n        label=\"Individual test scores\" if i == 0 else \"\",\n    )\n    # Plot individual train points\n    plt.scatter(\n        [comp] * n_splits,\n        train_scores[i],\n        alpha=0.2,\n        color=\"green\",\n        s=20,\n        label=\"Individual train scores\" if i == 0 else \"\",\n    )\n\n# Plot mean lines with error bands\nplt.plot(\n    n_components,\n    np.mean(test_scores, axis=1),\n    \"-\",\n    color=\"blue\",\n    linewidth=2,\n    label=\"Mean test score\",\n)\nplt.fill_between(\n    n_components,\n    np.mean(test_scores, axis=1) - np.std(test_scores, axis=1),\n    np.mean(test_scores, axis=1) + np.std(test_scores, axis=1),\n    alpha=0.15,\n    color=\"blue\",\n)\n\nplt.plot(\n    n_components,\n    np.mean(train_scores, axis=1),\n    \"-\",\n    color=\"green\",\n    linewidth=2,\n    label=\"Mean train score\",\n)\nplt.fill_between(\n    n_components,\n    np.mean(train_scores, axis=1) - np.std(train_scores, axis=1),\n    np.mean(train_scores, axis=1) + np.std(train_scores, axis=1),\n    alpha=0.15,\n    color=\"green\",\n)\n\n# Add threshold lines\nplt.axhline(\n    best_mean_score,\n    color=\"#9b59b6\",  # Purple\n    linestyle=\"--\",\n    label=\"Best score\",\n    linewidth=2,\n)\nplt.axhline(\n    threshold,\n    color=\"#e67e22\",  # Orange\n    linestyle=\"--\",\n    label=\"Best score - 1 std\",\n    linewidth=2,\n)\n\n# Highlight selected model\nplt.axvline(\n    best_components,\n    color=\"#9b59b6\",  # Purple\n    alpha=0.2,\n    linewidth=8,\n    label=\"Selected model\",\n)\n\n# Set titles and labels\nplt.xlabel(\"Number of PCA components\", fontsize=12)\nplt.ylabel(\"Score\", fontsize=12)\nplt.title(\"Model Selection: Balancing Complexity and Performance\", fontsize=14)\nplt.grid(True, linestyle=\"--\", alpha=0.7)\nplt.legend(\n    bbox_to_anchor=(1.02, 1),\n    loc=\"upper left\",\n    borderaxespad=0,\n)\n\n# Set axis properties\nplt.xticks(n_components)\nplt.ylim((0.85, 1.0))\n\n# Adjust layout\nplt.tight_layout()" ] },
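{ "cell_type": "markdown", "metadata": {}, "source": [ "Since the refit callable returned an index, `grid.best_estimator_` is the pipeline\nrefitted on the full dataset with the selected number of components. A short check\n(a sketch using standard fitted attributes of the PCA API) confirms what was kept:\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Inspect the refitted pipeline: the PCA step carries the selected number of\n# components and its fitted explained-variance ratios.\nselected_pca = grid.best_estimator_.named_steps[\"reduce_dim\"]\nprint(\"Refitted PCA n_components:\", selected_pca.n_components_)\nprint(\n    \"Total explained variance ratio:\",\n    round(float(selected_pca.explained_variance_ratio_.sum()), 3),\n)" ] },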
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Print the results\n\nWe print information about the selected model, including its complexity and\nperformance. We also show a summary table of all models using polars.\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(\"Best model selected by the one-standard-error rule:\")\nprint(f\"Number of PCA components: {best_components}\")\nprint(f\"Accuracy score: {best_score:.4f}\")\nprint(f\"Best mean test accuracy: {best_mean_score:.4f}\")\nprint(f\"Accuracy threshold (best - 1 std): {lower:.4f}\")\n\n# Create a summary table with polars\nsummary_df = results_df.select(\n    pl.col(\"n_components\"),\n    pl.col(\"mean_test_score\").round(4).alias(\"test_score\"),\n    pl.col(\"std_test_score\").round(4).alias(\"test_std\"),\n    pl.col(\"mean_train_score\").round(4).alias(\"train_score\"),\n    pl.col(\"std_train_score\").round(4).alias(\"train_std\"),\n    pl.col(\"mean_fit_time\").round(3).alias(\"fit_time\"),\n    pl.col(\"rank_test_score\").alias(\"rank\"),\n)\n\n# Add a column to mark the selected model\nsummary_df = summary_df.with_columns(\n    pl.when(pl.col(\"n_components\") == best_components)\n    .then(pl.lit(\"*\"))\n    .otherwise(pl.lit(\"\"))\n    .alias(\"selected\")\n)\n\nprint(\"\\nModel comparison table:\")\nprint(summary_df)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n\nThe one-standard-error rule helps us select a simpler model (fewer PCA components)\nwhile maintaining performance statistically comparable to the best model.\nThis approach can help prevent overfitting and improve model interpretability\nand efficiency.\n\nIn this example, we've seen how to implement this rule using a custom refit\ncallable with :class:`~sklearn.model_selection.GridSearchCV`.\n\nKey takeaways:\n\n1. The one-standard-error rule provides a good rule of thumb to select simpler models\n\n2. Custom refit callables in :class:`~sklearn.model_selection.GridSearchCV` allow for\n   flexible model selection strategies\n\n3. Visualizing both train and test scores helps identify potential overfitting\n\nThis approach can be applied to other model selection scenarios where balancing\ncomplexity and performance is important, or in cases where a use-case specific\nselection of the \"best\" model is desired.\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Display the figure\nplt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.14" } }, "nbformat": 4, "nbformat_minor": 0 }