{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Successive Halving Iterations\n\nThis example illustrates how a successive halving search\n(:class:`~sklearn.model_selection.HalvingGridSearchCV` and\n:class:`~sklearn.model_selection.HalvingRandomSearchCV`)\niteratively chooses the best parameter combination out of\nmultiple candidates.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nfrom scipy.stats import randint\n\nfrom sklearn import datasets\nfrom sklearn.ensemble import RandomForestClassifier\nfrom sklearn.experimental import enable_halving_search_cv # noqa\nfrom sklearn.model_selection import HalvingRandomSearchCV" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We first define the parameter space and train a\n:class:`~sklearn.model_selection.HalvingRandomSearchCV` instance.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "rng = np.random.RandomState(0)\n\nX, y = datasets.make_classification(n_samples=400, n_features=12, random_state=rng)\n\nclf = RandomForestClassifier(n_estimators=20, random_state=rng)\n\nparam_dist = {\n \"max_depth\": [3, None],\n \"max_features\": randint(1, 6),\n \"min_samples_split\": randint(2, 11),\n \"bootstrap\": [True, False],\n \"criterion\": [\"gini\", \"entropy\"],\n}\n\nrsh = HalvingRandomSearchCV(\n estimator=clf, param_distributions=param_dist, factor=2, random_state=rng\n)\nrsh.fit(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now use the `cv_results_` attribute of the search estimator to inspect\nand plot the evolution of the search.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "results = pd.DataFrame(rsh.cv_results_)\nresults[\"params_str\"] = results.params.apply(str)\nresults.drop_duplicates(subset=(\"params_str\", \"iter\"), inplace=True)\nmean_scores = results.pivot(\n index=\"iter\", columns=\"params_str\", values=\"mean_test_score\"\n)\nax = mean_scores.plot(legend=False, alpha=0.6)\n\nlabels = [\n f\"iter={i}\\nn_samples={rsh.n_resources_[i]}\\nn_candidates={rsh.n_candidates_[i]}\"\n for i in range(rsh.n_iterations_)\n]\n\nax.set_xticks(range(rsh.n_iterations_))\nax.set_xticklabels(labels, rotation=45, multialignment=\"left\")\nax.set_title(\"Scores of candidates over iterations\")\nax.set_ylabel(\"mean test score\", fontsize=15)\nax.set_xlabel(\"iterations\", fontsize=15)\nplt.tight_layout()\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Number of candidates and amount of resource at each iteration\n\nAt the first iteration, a small amount of resources is used. The resource\nhere is the number of samples that the estimators are trained on. All\ncandidates are evaluated.\n\nAt the second iteration, only the best half of the candidates is evaluated.\nThe number of allocated resources is doubled: candidates are evaluated on\ntwice as many samples.\n\nThis process is repeated until the last iteration, where only 2 candidates\nare left. 
, { "cell_type": "markdown", "metadata": {}, "source": [ "## Number of candidates and amount of resources at each iteration\n\nAt the first iteration, a small amount of resources is used. The resource\nhere is the number of samples that the estimators are trained on. All\ncandidates are evaluated.\n\nAt the second iteration, only the best half of the candidates is evaluated.\nThe number of allocated resources is doubled: the surviving candidates are\nevaluated on twice as many samples.\n\nThis process is repeated until the last iteration, where only 2 candidates\nare left. The best candidate is the one with the best score at the last\niteration. The arithmetic behind this schedule is sketched below.\n\n" ] }
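, { "cell_type": "markdown", "metadata": {}, "source": [ "To make that arithmetic concrete, here is a hedged, plain-Python sketch of\nthe `factor=2` schedule (an illustration, not scikit-learn's exact internal\nrule; the rounding and the `max_resources_` cap used here are assumptions):\nat each iteration the candidate count is divided by the factor while the\nper-candidate resource count is multiplied by it.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# A hedged sketch of the factor-2 halving arithmetic, seeded with the first\n# entry of the schedule recorded on the fitted search. Compare the printed\n# values with rsh.n_candidates_ and rsh.n_resources_.\nimport math\n\nfactor = 2\nn_candidates = rsh.n_candidates_[0]\nn_resources = rsh.n_resources_[0]\nfor i in range(rsh.n_iterations_):\n    print(f\"iter {i}: ~{n_candidates} candidates on {n_resources} samples\")\n    # Keep roughly the best 1/factor of the candidates...\n    n_candidates = math.ceil(n_candidates / factor)\n    # ...and give each of them factor times more samples, up to the cap.\n    n_resources = min(n_resources * factor, rsh.max_resources_)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }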