{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", " \n", "## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course \n", "###
Author: Tatyana Kudasova, ODS Slack @kudasova\n", " \n", "##
Tutorial\n", "##
Nested cross-validation\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Why nested cross-validation?\n", "\n", "Often we want to tune the parameters of a model. That is, we want to find the value of a parameter that minimizes our loss function. The best way to do this, as we already know, is cross-validation.\n", "\n", "However, as Cawley and Talbot pointed out in their [2010 paper](http://jmlr.org/papers/volume11/cawley10a/cawley10a.pdf), since we used the test set to both select the values of the parameter and evaluate the model, we risk optimistically biasing our model evaluations. For this reason, if a test set is used to select model parameters, then we need a different test set to get an unbiased evaluation of that selected model. Mainly, we can think of model selection as another training procedure, and hence, we would need a decently-sized, independent test set that we have not seen before to get an unbiased estimate of the models’ performance. Often, this is not affordable. A good way to overcome this problem is to use nested cross-validation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Nested cross-validation explained\n", "\n", "The nested cross-validation has an inner cross-validation nested in an outer cross-validation. First, an inner cross-validation is used to tune the parameters and select the best model. Second, an outer cross-validation is used to evaluate the model selected by the inner cross-validation.\n", "\n", "\n", "\n", "Imagine that we have _N_ models and we want to use _L_-fold inner cross-validation to tune hyperparameters and K-fold outer cross validation to evaluate the models. Then the algorithm is as follows:\n", "\n", "1. Divide the dataset into _K_ cross-validation folds at random.\n", "2. For each fold _k=1,2,…,K_: (outer loop for evaluation of the model with selected hyperparameter)
\n", "\n", " 2.1. Let `test` be fold _k_
\n", " 2.2. Let `trainval` be all the data except those in fold _k_
\n", " 2.3. Randomly split `trainval` into _L_ folds
\n", " 2.4. For each fold _l=1,2,…L_: (inner loop for hyperparameter tuning)
\n", "> 2.4.1 Let `val` be fold _l_
\n", "> 2.4.2 Let `train` be all the data except those in `test` or `val`
\n", "> 2.4.3 Train each of _N_ models with each hyperparameter on `train`, and evaluate it on `val`. Keep track of the performance metrics
\n", "\n", " 2.5. For each hyperparameter setting, calculate the average metrics score over the _L_ folds, and choose the best hyperparameter setting.
\n", " 2.6. Train each of _N_ models with the best hyperparameter on `trainval`. Evaluate its performance on `test` and save the score for fold _k_
\n", " \n", "3. For each of _N_ models calculate the mean score over all _K_ folds, and report as the generalization error.\n", "\n", "In the picture above and the code below we chose _L = 2_ and _K = 5_, but you can choose different numbers.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Implementation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load required packages\n", "import numpy as np\n", "from sklearn import datasets\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.metrics import accuracy_score\n", "from sklearn.model_selection import (GridSearchCV, StratifiedKFold,\n", " cross_val_score, train_test_split)\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.svm import SVC\n", "from sklearn.tree import DecisionTreeClassifier" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data for this tutorial is [breast cancer data](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html) with 30 features and a binary target variable." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load the data\n", "dataset = datasets.load_breast_cancer()\n", "\n", "# Create X from the features\n", "X = dataset.data\n", "\n", "# Create y from the target\n", "y = dataset.target" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Making train set for Nested CV and test set for final model evaluation\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, train_size=0.8, test_size=0.2, random_state=1, stratify=y\n", ")\n", "\n", "# Initializing Classifiers\n", "clf1 = LogisticRegression(solver=\"liblinear\", random_state=1)\n", "clf2 = KNeighborsClassifier()\n", "clf3 = DecisionTreeClassifier(random_state=1)\n", "clf4 = SVC(kernel=\"rbf\", random_state=1)\n", "\n", "# Building the pipelines\n", "pipe1 = Pipeline([(\"std\", StandardScaler()), (\"clf1\", clf1)])\n", "\n", "pipe2 = Pipeline([(\"std\", StandardScaler()), (\"clf2\", clf2)])\n", "\n", "pipe4 = Pipeline([(\"std\", StandardScaler()), (\"clf4\", clf4)])\n", "\n", "\n", "# Setting up the parameter grids\n", "param_grid1 = [\n", " {\"clf1__penalty\": [\"l1\", \"l2\"], \"clf1__C\": np.power(10.0, np.arange(-4, 4))}\n", "]\n", "\n", "param_grid2 = [{\"clf2__n_neighbors\": list(range(1, 10)), \"clf2__p\": [1, 2]}]\n", "\n", "param_grid3 = [\n", " {\"max_depth\": list(range(1, 10)) + [None], \"criterion\": [\"gini\", \"entropy\"]}\n", "]\n", "\n", "param_grid4 = [\n", " {\n", " \"clf4__C\": np.power(10.0, np.arange(-4, 4)),\n", " \"clf4__gamma\": np.power(10.0, np.arange(-5, 0)),\n", " }\n", "]\n", "\n", "# Setting up multiple GridSearchCV objects as inner CV, 1 for each algorithm\n", "gridcvs = {}\n", "inner_cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=1)\n", "\n", "for pgrid, est, name in zip(\n", " (param_grid1, param_grid2, param_grid3, param_grid4),\n", " (pipe1, pipe2, clf3, pipe4),\n", " (\"Logit\", \"KNN\", \"DTree\", \"SVM\"),\n", "):\n", " gcv = GridSearchCV(\n", " estimator=est,\n", " param_grid=pgrid,\n", " scoring=\"accuracy\",\n", " n_jobs=1,\n", " cv=inner_cv,\n", " verbose=0,\n", " refit=True,\n", " )\n", " gridcvs[name] = gcv" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Making an outer CV\n", "outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)\n", "\n", "for name, gs_est in sorted(gridcvs.items()):\n", " nested_score = cross_val_score(gs_est, X=X_train, y=y_train, cv=outer_cv, n_jobs=1)\n", " print(\n", " \"%s | outer ACC %.2f%% +/- %.2f\"\n", " % (name, nested_score.mean() * 100, nested_score.std() * 100)\n", " )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Fitting a model to the whole training set using the \"best\" algorithm\n", "best_algo = gridcvs[\"SVM\"]\n", "\n", "best_algo.fit(X_train, y_train)\n", "train_acc = accuracy_score(y_true=y_train, y_pred=best_algo.predict(X_train))\n", "test_acc = accuracy_score(y_true=y_test, y_pred=best_algo.predict(X_test))\n", "\n", "print(\"Accuracy %.2f%% (average over CV train folds)\" % (100 * best_algo.best_score_))\n", "print(\"Best Parameters: %s\" % gridcvs[\"SVM\"].best_params_)\n", "print(\"Training Accuracy: %.2f%%\" % (100 * train_acc))\n", "print(\"Test Accuracy: %.2f%%\" % (100 * test_acc))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Conclusion\n", "\n", "In this tutorial we learned how to use nested cross-validation for hyperparameter tuning and model evaluation. Hope it will help you in your Kaggle competitions or your ML projects.\n", "\n", "Writing this tutorial we used the following sources:\n", "1. [Sebastian Rashka's article](https://sebastianraschka.com/blog/2018/model-evaluation-selection-part4.html)\n", "2. [And also his code from GitHub](https://github.com/rasbt/model-eval-article-supplementary/blob/master/code/nested_cv_code.ipynb.)\n", "3. [Weina Jin's article](https://weina.me/nested-cross-validation/)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }