{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center>\n",
    "<img src=\"../../img/ods_stickers.jpg\" />\n",
    "    \n",
    "## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course \n",
    "### <center> Author: Tatyana Kudasova, ODS Slack @kudasova\n",
    "    \n",
    "## <center> Tutorial\n",
    "## <center> Nested cross-validation\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  Why nested cross-validation?\n",
    "\n",
    "Often we want to tune the parameters of a model. That is, we want to find the value of a parameter that minimizes our loss function. The best way to do this, as we already know, is cross-validation.\n",
    "\n",
    "However, as Cawley and Talbot pointed out in their [2010 paper](http://jmlr.org/papers/volume11/cawley10a/cawley10a.pdf), since we used the test set to both select the values of the parameter and evaluate the model, we risk optimistically biasing our model evaluations. For this reason, if a test set is used to select model parameters, then we need a different test set to get an unbiased evaluation of that selected model. Mainly, we can think of model selection as another training procedure, and hence, we would need a decently-sized, independent test set that we have not seen before to get an unbiased estimate of the models’ performance. Often, this is not affordable. A good way to overcome this problem is to use nested cross-validation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Nested cross-validation explained\n",
    "\n",
    "The nested cross-validation has an inner cross-validation nested in an outer cross-validation. First, an inner cross-validation is used to tune the parameters and select the best model. Second, an outer cross-validation is used to evaluate the model selected by the inner cross-validation.\n",
    "\n",
    "<img src=\"../../img/nestedCV.png\" width=\"800\"/>\n",
    "\n",
    "Imagine that we have _N_ models and we want to use _L_-fold inner cross-validation to tune hyperparameters and K-fold outer cross validation to evaluate the models. Then the algorithm is as follows:\n",
    "\n",
    "1. Divide the dataset into _K_ cross-validation folds at random.\n",
    "2. For each fold _k=1,2,…,K_: (outer loop for evaluation of the model with selected hyperparameter) <br>\n",
    "\n",
    "  2.1. Let `test` be fold _k_ <br>\n",
    "  2.2. Let `trainval` be all the data except those in fold _k_ <br>\n",
    "  2.3. Randomly split `trainval` into _L_ folds <br>\n",
    "  2.4. For each fold _l=1,2,…L_: (inner loop for hyperparameter tuning) <br>\n",
    "> 2.4.1 Let `val` be fold _l_ <br>\n",
    "> 2.4.2 Let `train` be all the data except those in `test` or `val` <br>\n",
    "> 2.4.3 Train each of _N_ models with each hyperparameter on `train`, and evaluate it on `val`. Keep track of the performance metrics <br> \n",
    "\n",
    "  2.5. For each hyperparameter setting, calculate the average metrics score over the _L_ folds, and choose the best hyperparameter setting. <br>\n",
    "  2.6. Train each of _N_ models with the best hyperparameter on `trainval`. Evaluate its performance on `test` and save the score for fold _k_ <br>\n",
    "  \n",
    "3. For each of _N_ models calculate the mean score over all _K_ folds, and report as the generalization error.\n",
    "\n",
    "In the picture above and the code below we chose _L = 2_ and _K = 5_, but you can choose different numbers.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Implementation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load required packages\n",
    "import numpy as np\n",
    "from sklearn import datasets\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.model_selection import StratifiedKFold\n",
    "from sklearn.model_selection import cross_val_score\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn.svm import SVC\n",
    "from sklearn.metrics import accuracy_score"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data for this tutorial is [breast cancer data](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html) with 30 features and a binary target variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the data\n",
    "dataset = datasets.load_breast_cancer()\n",
    "\n",
    "# Create X from the features\n",
    "X = dataset.data\n",
    "\n",
    "# Create y from the target\n",
    "y = dataset.target"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Making train set for Nested CV and test set for final model evaluation\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y,\n",
    "                                                    train_size=0.8, \n",
    "                                                    test_size=0.2,\n",
    "                                                    random_state=1,\n",
    "                                                    stratify=y)\n",
    "\n",
    "# Initializing Classifiers\n",
    "clf1 = LogisticRegression(solver='liblinear', random_state=1)\n",
    "clf2 = KNeighborsClassifier()\n",
    "clf3 = DecisionTreeClassifier(random_state=1)\n",
    "clf4 = SVC(kernel='rbf', random_state=1)\n",
    "\n",
    "# Building the pipelines\n",
    "pipe1 = Pipeline([('std', StandardScaler()),\n",
    "                  ('clf1', clf1)])\n",
    "\n",
    "pipe2 = Pipeline([('std', StandardScaler()),\n",
    "                  ('clf2', clf2)])\n",
    "\n",
    "pipe4 = Pipeline([('std', StandardScaler()),\n",
    "                  ('clf4', clf4)])\n",
    "\n",
    "\n",
    "# Setting up the parameter grids\n",
    "param_grid1 = [{'clf1__penalty': ['l1', 'l2'],\n",
    "                'clf1__C': np.power(10., np.arange(-4, 4))}]\n",
    "\n",
    "param_grid2 = [{'clf2__n_neighbors': list(range(1, 10)),\n",
    "                'clf2__p': [1, 2]}]\n",
    "\n",
    "param_grid3 = [{'max_depth': list(range(1, 10)) + [None],\n",
    "                'criterion': ['gini', 'entropy']}]\n",
    "\n",
    "param_grid4 = [{'clf4__C': np.power(10., np.arange(-4, 4)),\n",
    "                'clf4__gamma': np.power(10., np.arange(-5, 0))}]\n",
    "\n",
    "# Setting up multiple GridSearchCV objects as inner CV, 1 for each algorithm\n",
    "gridcvs = {}\n",
    "inner_cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=1)\n",
    "\n",
    "for pgrid, est, name in zip((param_grid1, param_grid2,\n",
    "                             param_grid3, param_grid4),\n",
    "                            (pipe1, pipe2, clf3, pipe4),\n",
    "                            ('Logit', 'KNN', 'DTree', 'SVM')):\n",
    "    gcv = GridSearchCV(estimator=est,\n",
    "                       param_grid=pgrid,\n",
    "                       scoring='accuracy',\n",
    "                       n_jobs=1,\n",
    "                       cv=inner_cv,\n",
    "                       verbose=0,\n",
    "                       refit=True)\n",
    "    gridcvs[name] = gcv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "DTree | outer ACC 92.75% +/- 1.92\n",
      "KNN | outer ACC 96.26% +/- 1.32\n",
      "Logit | outer ACC 97.36% +/- 2.56\n",
      "SVM | outer ACC 97.58% +/- 0.82\n"
     ]
    }
   ],
   "source": [
    "# Making an outer CV\n",
    "outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)\n",
    "\n",
    "for name, gs_est in sorted(gridcvs.items()):\n",
    "    nested_score = cross_val_score(gs_est, \n",
    "                                   X=X_train, \n",
    "                                   y=y_train, \n",
    "                                   cv=outer_cv,\n",
    "                                   n_jobs=1)\n",
    "    print('%s | outer ACC %.2f%% +/- %.2f' % \n",
    "          (name, nested_score.mean() * 100, nested_score.std() * 100))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy 97.80% (average over CV train folds)\n",
      "Best Parameters: {'clf4__C': 100.0, 'clf4__gamma': 0.001}\n",
      "Training Accuracy: 98.68%\n",
      "Test Accuracy: 98.25%\n"
     ]
    }
   ],
   "source": [
    "# Fitting a model to the whole training set using the \"best\" algorithm\n",
    "best_algo = gridcvs['SVM']\n",
    "\n",
    "best_algo.fit(X_train, y_train)\n",
    "train_acc = accuracy_score(y_true=y_train, y_pred=best_algo.predict(X_train))\n",
    "test_acc = accuracy_score(y_true=y_test, y_pred=best_algo.predict(X_test))\n",
    "\n",
    "print('Accuracy %.2f%% (average over CV train folds)' %\n",
    "      (100 * best_algo.best_score_))\n",
    "print('Best Parameters: %s' % gridcvs['SVM'].best_params_)\n",
    "print('Training Accuracy: %.2f%%' % (100 * train_acc))\n",
    "print('Test Accuracy: %.2f%%' % (100 * test_acc))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Conclusion\n",
    "\n",
    "In this tutorial we learned how to use nested cross-validation for hyperparameter tuning and model evaluation. Hope it will help you in your Kaggle competitions or your ML projects.\n",
    "\n",
    "Writing this tutorial we used the following sources:\n",
    "1. [Sebastian Rashka's article](https://sebastianraschka.com/blog/2018/model-evaluation-selection-part4.html)\n",
    "2. [And also his code from GitHub](https://github.com/rasbt/model-eval-article-supplementary/blob/master/code/nested_cv_code.ipynb.)\n",
    "3. [Weina Jin's article](https://weina.me/nested-cross-validation/)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}