{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Custom refit strategy of a grid search with cross-validation\n\nThis examples shows how a classifier is optimized by cross-validation,\nwhich is done using the :class:`~sklearn.model_selection.GridSearchCV` object\non a development set that comprises only half of the available labeled data.\n\nThe performance of the selected hyper-parameters and trained model is\nthen measured on a dedicated evaluation set that was not used during\nthe model selection step.\n\nMore details on tools available for model selection can be found in the\nsections on `cross_validation` and `grid_search`.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The dataset\n\nWe will work with the `digits` dataset. The goal is to classify handwritten\ndigits images.\nWe transform the problem into a binary classification for easier\nunderstanding: the goal is to identify whether a digit is `8` or not.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn import datasets\n\ndigits = datasets.load_digits()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to train a classifier on images, we need to flatten them into vectors.\nEach image of 8 by 8 pixels needs to be transformed to a vector of 64 pixels.\nThus, we will get a final data array of shape `(n_images, n_pixels)`.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "n_samples = len(digits.images)\nX = digits.images.reshape((n_samples, -1))\ny = digits.target == 8\nprint(\n f\"The number of images is {X.shape[0]} and each image contains {X.shape[1]} pixels\"\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As presented in the introduction, the data will be split into a training\nand a testing set of equal size.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define our grid-search strategy\n\nWe will select a classifier by searching the best hyper-parameters on folds\nof the training set. To do this, we need to define\nthe scores to select the best candidate.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "scores = [\"precision\", \"recall\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also define a function to be passed to the `refit` parameter of the\n:class:`~sklearn.model_selection.GridSearchCV` instance. It will implement the\ncustom strategy to select the best candidate from the `cv_results_` attribute\nof the :class:`~sklearn.model_selection.GridSearchCV`. Once the candidate is\nselected, it is automatically refitted by the\n:class:`~sklearn.model_selection.GridSearchCV` instance.\n\nHere, the strategy is to short-list the models which are the best in terms of\nprecision and recall. From the selected models, we finally select the fastest\nmodel at predicting. 
Notice that the custom choices made in our strategy are completely\narbitrary.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n\n\ndef print_dataframe(filtered_cv_results):\n    \"\"\"Pretty print for filtered dataframe\"\"\"\n    for mean_precision, std_precision, mean_recall, std_recall, params in zip(\n        filtered_cv_results[\"mean_test_precision\"],\n        filtered_cv_results[\"std_test_precision\"],\n        filtered_cv_results[\"mean_test_recall\"],\n        filtered_cv_results[\"std_test_recall\"],\n        filtered_cv_results[\"params\"],\n    ):\n        print(\n            f\"precision: {mean_precision:0.3f} (\u00b1{std_precision:0.03f}),\"\n            f\" recall: {mean_recall:0.3f} (\u00b1{std_recall:0.03f}),\"\n            f\" for {params}\"\n        )\n    print()\n\n\ndef refit_strategy(cv_results):\n    \"\"\"Define the strategy to select the best estimator.\n\n    The strategy defined here is to filter out all results below a precision\n    threshold of 0.98, rank the remaining by recall and keep all models within\n    one standard deviation of the best recall. Once these models are selected,\n    we can select the fastest model to predict.\n\n    Parameters\n    ----------\n    cv_results : dict of numpy (masked) ndarrays\n        CV results as returned by the `GridSearchCV`.\n\n    Returns\n    -------\n    best_index : int\n        The index of the best estimator as it appears in `cv_results`.\n    \"\"\"\n    # Print the information about the grid-search for the different scores\n    precision_threshold = 0.98\n\n    cv_results_ = pd.DataFrame(cv_results)\n    print(\"All grid-search results:\")\n    print_dataframe(cv_results_)\n\n    # Filter out all results below the precision threshold\n    high_precision_cv_results = cv_results_[\n        cv_results_[\"mean_test_precision\"] > precision_threshold\n    ]\n\n    print(f\"Models with a precision higher than {precision_threshold}:\")\n    print_dataframe(high_precision_cv_results)\n\n    high_precision_cv_results = high_precision_cv_results[\n        [\n            \"mean_score_time\",\n            \"mean_test_recall\",\n            \"std_test_recall\",\n            \"mean_test_precision\",\n            \"std_test_precision\",\n            \"rank_test_recall\",\n            \"rank_test_precision\",\n            \"params\",\n        ]\n    ]\n\n    # Select the most performant models in terms of recall\n    # (within 1 sigma from the best)\n    best_recall_std = high_precision_cv_results[\"mean_test_recall\"].std()\n    best_recall = high_precision_cv_results[\"mean_test_recall\"].max()\n    best_recall_threshold = best_recall - best_recall_std\n\n    high_recall_cv_results = high_precision_cv_results[\n        high_precision_cv_results[\"mean_test_recall\"] > best_recall_threshold\n    ]\n    print(\n        \"Out of the previously selected high precision models, we keep all\\n\"\n        \"the models within one standard deviation of the highest recall model:\"\n    )\n    print_dataframe(high_recall_cv_results)\n\n    # From the best candidates, select the fastest model to predict\n    fastest_top_recall_high_precision_index = high_recall_cv_results[\n        \"mean_score_time\"\n    ].idxmin()\n\n    print(\n        \"\\nThe selected final model is the fastest to predict out of the previously\\n\"\n        \"selected subset of best models based on precision and recall.\\n\"\n        \"Its scoring time is:\\n\\n\"\n        f\"{high_recall_cv_results.loc[fastest_top_recall_high_precision_index]}\"\n    )\n\n    return fastest_top_recall_high_precision_index" ] },
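{ "cell_type": "markdown", "metadata": {}, "source": [ "Before plugging `refit_strategy` into the grid search, we can smoke-test it on a\nhand-built dict shaped like `cv_results_`. The numbers below are hypothetical and\nonly illustrate the input/output contract: the third candidate is filtered out on\nprecision, and among the remaining two the second has the best recall, so its\nindex is returned.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n\n# Hypothetical toy results for three candidates (illustration only)\ntoy_cv_results = {\n    \"mean_test_precision\": np.array([0.99, 0.985, 0.95]),\n    \"std_test_precision\": np.array([0.01, 0.01, 0.02]),\n    \"mean_test_recall\": np.array([0.80, 0.90, 0.95]),\n    \"std_test_recall\": np.array([0.05, 0.05, 0.03]),\n    \"rank_test_precision\": np.array([1, 2, 3]),\n    \"rank_test_recall\": np.array([3, 2, 1]),\n    \"mean_score_time\": np.array([0.002, 0.001, 0.003]),\n    \"params\": [{\"C\": 1}, {\"C\": 10}, {\"C\": 100}],\n}\n\n# Expected output: index 1 (best recall among the high-precision candidates)\nrefit_strategy(toy_cv_results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tuning hyper-parameters\n\nOnce we have defined our strategy to select the best model, we define the values\nof the hyper-parameters and create the grid-search instance:\n\n" ] }, { "cell_type": "code", "execution_count": 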
null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.model_selection import GridSearchCV\nfrom sklearn.svm import SVC\n\ntuned_parameters = [\n {\"kernel\": [\"rbf\"], \"gamma\": [1e-3, 1e-4], \"C\": [1, 10, 100, 1000]},\n {\"kernel\": [\"linear\"], \"C\": [1, 10, 100, 1000]},\n]\n\ngrid_search = GridSearchCV(\n SVC(), tuned_parameters, scoring=scores, refit=refit_strategy\n)\ngrid_search.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The parameters selected by the grid-search with our custom strategy are:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "grid_search.best_params_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we evaluate the fine-tuned model on the left-out evaluation set: the\n`grid_search` object **has automatically been refit** on the full training\nset with the parameters selected by our custom refit strategy.\n\nWe can use the classification report to compute standard classification\nmetrics on the left-out set:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.metrics import classification_report\n\ny_pred = grid_search.predict(X_test)\nprint(classification_report(y_test, y_pred))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

{ "cell_type": "markdown", "metadata": {}, "source": [ "<div class=\"alert alert-info\"><h4>Note</h4><p>The problem is too easy: the hyperparameter plateau is too flat and the\noutput model is the same for precision and recall with ties in quality.</p></div>
\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }