{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "LoSAze70D1bQ"
      },
      "source": [
        "# Machine Learning Textbook, 3rd Edition (머신 러닝 교과서 3판)"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "<table align=\"left\">\n",
        "  <td>\n",
        "    <a href=\"https://colab.research.google.com/github/rickiepark/python-machine-learning-book-3rd-edition/blob/master/ch06/HalvingGridSearchCV.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
        "  </td>\n",
        "</table>"
      ],
      "metadata": {
        "id": "P5NItMe5D2UB"
      }
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "uXNSPAWmYUnX"
      },
      "source": [
        "# HalvingGridSearchCV"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "execution": {
          "iopub.execute_input": "2021-10-23T05:55:44.478169Z",
          "iopub.status.busy": "2021-10-23T05:55:44.477403Z",
          "iopub.status.idle": "2021-10-23T05:55:46.015049Z",
          "shell.execute_reply": "2021-10-23T05:55:46.015557Z"
        },
        "id": "nDzhE-IDYUnc"
      },
      "outputs": [],
      "source": [
        "import pandas as pd\n",
        "\n",
        "# Load the Breast Cancer Wisconsin (Diagnostic) dataset; the file has no header row\n",
        "df = pd.read_csv('https://archive.ics.uci.edu/ml/'\n",
        "                 'machine-learning-databases'\n",
        "                 '/breast-cancer-wisconsin/wdbc.data', header=None)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "execution": {
          "iopub.execute_input": "2021-10-23T05:55:46.022646Z",
          "iopub.status.busy": "2021-10-23T05:55:46.021263Z",
          "iopub.status.idle": "2021-10-23T05:55:46.413172Z",
          "shell.execute_reply": "2021-10-23T05:55:46.412474Z"
        },
        "id": "I3JQoYQVYUnc"
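      },
      "outputs": [],
      "source": [
        "from sklearn.preprocessing import LabelEncoder\n",
        "\n",
        "# Column 1 holds the diagnosis label; columns 2 onward hold the features\n",
        "X = df.loc[:, 2:].values\n",
        "y = df.loc[:, 1].values\n",
        "# Encode the string labels as integers\n",
        "le = LabelEncoder()\n",
        "y = le.fit_transform(y)"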
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "execution": {
          "iopub.execute_input": "2021-10-23T05:55:46.419782Z",
          "iopub.status.busy": "2021-10-23T05:55:46.418612Z",
          "iopub.status.idle": "2021-10-23T05:55:46.436557Z",
          "shell.execute_reply": "2021-10-23T05:55:46.435772Z"
        },
        "id": "fttWQaNKYUnc"
      },
      "outputs": [],
      "source": [
        "from sklearn.model_selection import train_test_split\n",
        "\n",
        "# Hold out 20% of the data as a stratified test set\n",
        "X_train, X_test, y_train, y_test = \\\n",
        "    train_test_split(X, y,\n",
        "                     test_size=0.20,\n",
        "                     stratify=y,\n",
        "                     random_state=1)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Uu114ZpZYUnc"
      },
      "source": [
        "For comparison, we first print the results of running `GridSearchCV`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "execution": {
          "iopub.execute_input": "2021-10-23T05:55:46.446201Z",
          "iopub.status.busy": "2021-10-23T05:55:46.445400Z",
          "iopub.status.idle": "2021-10-23T05:55:49.508691Z",
          "shell.execute_reply": "2021-10-23T05:55:49.507886Z"
        },
        "id": "g2okAhotYUnd",
        "outputId": "fbecda49-6ba3-4244-9f3e-8afe70f872bb"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "0.9846859903381642\n",
            "{'svc__C': 100.0, 'svc__gamma': 0.001, 'svc__kernel': 'rbf'}\n",
            "1.7915383100509645\n"
          ]
        }
      ],
      "source": [
        "from sklearn.model_selection import GridSearchCV\n",
        "from sklearn.preprocessing import StandardScaler\n",
        "from sklearn.svm import SVC\n",
        "from sklearn.pipeline import make_pipeline\n",
        "import numpy as np\n",
        "\n",
        "# Standardize the features, then fit a support vector classifier\n",
        "pipe_svc = make_pipeline(StandardScaler(),\n",
        "                         SVC(random_state=1))\n",
        "\n",
        "param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]\n",
        "\n",
        "# 8 linear-kernel candidates plus 8 x 8 = 64 RBF candidates, 72 in total\n",
        "param_grid = [{'svc__C': param_range,\n",
        "               'svc__kernel': ['linear']},\n",
        "              {'svc__C': param_range,\n",
        "               'svc__gamma': param_range,\n",
        "               'svc__kernel': ['rbf']}]\n",
        "\n",
        "gs = GridSearchCV(estimator=pipe_svc,\n",
        "                  param_grid=param_grid,\n",
        "                  cv=10,\n",
        "                  n_jobs=-1)\n",
        "gs = gs.fit(X_train, y_train)\n",
        "print(gs.best_score_)\n",
        "print(gs.best_params_)\n",
        "# Sum of the per-candidate mean fit times, in seconds\n",
        "print(np.sum(gs.cv_results_['mean_fit_time']))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7w9kSe67YUne"
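      },
      "source": [
        "`HalvingGridSearchCV`, added in scikit-learn 0.24, searches iteratively: it first evaluates every parameter combination with a limited amount of resources, then keeps the best candidates and gives them more resources in the next round. This approach is called successive halving (SH). The `resource` parameter of `HalvingGridSearchCV` defines which resource grows from one iteration to the next. The default is `'n_samples'`, the number of training samples. Alternatively, you can name any parameter of the estimator being tuned that takes positive integer values, such as `n_estimators` of a random forest.\n",
        "\n",
        "The `factor` parameter sets the proportion of candidates selected in each iteration. The default is 3, so only the top-scoring 1/3 of the candidates move on to the next iteration, and the resources per candidate grow by the same factor. The `max_resources` parameter sets the maximum amount of resources any candidate may use. The default is `'auto'`, which equals the number of samples when `resource='n_samples'`.\n",
        "\n",
        "`min_resources` sets the minimum amount of resources each candidate uses in the first iteration. With `resource='n_samples'` and `min_resources='smallest'`, this is `cv` $\\times$ 2 for regression and `cv` $\\times$ (number of classes) $\\times$ 2 for classification. For any other resource it is 1. With `min_resources='exhaust'`, it is the larger of that value and the integer quotient of `max_resources` divided by `factor`\\*\\*(`n_required_iterations` - 1). The default is `'exhaust'`. Here `n_required_iterations` is $\\lfloor \\log_{\\text{factor}}(\\text{number of candidates}) \\rfloor + 1$.\n",
        "\n",
        "Finally, setting `aggressive_elimination=True` makes the search repeat the early iterations several times without increasing the resources, so that no more than `factor` candidates are left for the last iteration. The default is `False`.\n",
        "\n",
        "`HalvingGridSearchCV` is still experimental, so you must import `enable_halving_search_cv` from the `sklearn.experimental` package before you can use it. Setting `verbose=1` prints the details of each iteration."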
      ]
    },
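    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As an aside on the `resource` parameter described above, the following cell is a minimal sketch of a search that grows `n_estimators` of a random forest instead of the number of samples. The synthetic data from `make_classification`, the `max_depth` grid, `factor=2`, `cv=5`, and `max_resources=40` are illustrative assumptions and are not part of the original example. Two constraints are worth noting: `max_resources` has to be set explicitly when `resource` is not `'n_samples'`, and the resource parameter itself must not appear in `param_grid`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from sklearn.datasets import make_classification\n",
        "from sklearn.ensemble import RandomForestClassifier\n",
        "from sklearn.experimental import enable_halving_search_cv  # noqa: F401\n",
        "from sklearn.model_selection import HalvingGridSearchCV\n",
        "\n",
        "# Synthetic binary classification data (an assumption, for illustration only)\n",
        "X_demo, y_demo = make_classification(n_samples=300, n_features=10,\n",
        "                                     random_state=1)\n",
        "\n",
        "# n_estimators is the resource, so it must not be part of the grid\n",
        "rf_grid = {'max_depth': [2, 4, 8, None]}\n",
        "\n",
        "rf_search = HalvingGridSearchCV(\n",
        "    estimator=RandomForestClassifier(random_state=1),\n",
        "    param_grid=rf_grid,\n",
        "    resource='n_estimators',  # grow the number of trees in each iteration\n",
        "    max_resources=40,         # must be given when resource is not 'n_samples'\n",
        "    factor=2,\n",
        "    cv=5)\n",
        "rf_search = rf_search.fit(X_demo, y_demo)\n",
        "\n",
        "print('Trees per iteration:', rf_search.n_resources_)\n",
        "print('Candidates per iteration:', rf_search.n_candidates_)\n",
        "print(rf_search.best_params_)"
      ]
    },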
    {
      "cell_type": "code",
      "execution_count": 5,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "execution": {
          "iopub.execute_input": "2021-10-23T05:55:49.516202Z",
          "iopub.status.busy": "2021-10-23T05:55:49.515283Z",
          "iopub.status.idle": "2021-10-23T05:55:51.536611Z",
          "shell.execute_reply": "2021-10-23T05:55:51.535959Z"
        },
        "id": "57Jw0SQKYUne",
        "outputId": "92ec22c2-8a68-4371-ddb9-bdf19a01d8cb"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "n_iterations: 3\n",
            "n_required_iterations: 4\n",
            "n_possible_iterations: 3\n",
            "min_resources_: 40\n",
            "max_resources_: 455\n",
            "aggressive_elimination: False\n",
            "factor: 3\n",
            "----------\n",
            "iter: 0\n",
            "n_candidates: 72\n",
            "n_resources: 40\n",
            "Fitting 10 folds for each of 72 candidates, totalling 720 fits\n",
            "----------\n",
            "iter: 1\n",
            "n_candidates: 24\n",
            "n_resources: 120\n",
            "Fitting 10 folds for each of 24 candidates, totalling 240 fits\n",
            "----------\n",
            "iter: 2\n",
            "n_candidates: 8\n",
            "n_resources: 360\n",
            "Fitting 10 folds for each of 8 candidates, totalling 80 fits\n",
            "0.9803968253968254\n",
            "{'svc__C': 10.0, 'svc__gamma': 0.01, 'svc__kernel': 'rbf'}\n"
          ]
        }
      ],
      "source": [
        "# The experimental estimators must be enabled explicitly before importing them\n",
        "from sklearn.experimental import enable_halving_search_cv\n",
        "from sklearn.model_selection import HalvingGridSearchCV\n",
        "\n",
        "# verbose=1 prints the candidates and resources used in each iteration\n",
        "hgs = HalvingGridSearchCV(estimator=pipe_svc,\n",
        "                          param_grid=param_grid,\n",
        "                          cv=10,\n",
        "                          n_jobs=-1, verbose=1)\n",
        "hgs = hgs.fit(X_train, y_train)\n",
        "print(hgs.best_score_)\n",
        "print(hgs.best_params_)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "07UFjwdnYUne"
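      },
      "source": [
        "The output shows that the first iteration (iter: 0) cross-validates 72 candidates on 40 samples each. The best 72/3 = 24 candidates go on to the second iteration (iter: 1), which uses 40 * 3 = 120 samples. In the same way, the third iteration (iter: 2) evaluates 8 candidates on 360 samples. The final score is 98.0%, slightly lower than the 98.5% from `GridSearchCV`. The selected parameter combination is also different.\n",
        "\n",
        "Over the 3 iterations, `HalvingGridSearchCV` cross-validated 72 + 24 + 8 = 104 candidates in total. The mean fit time of each cross-validation is stored under `mean_fit_time` in the `cv_results_` attribute. Summing these values and comparing with `GridSearchCV` shows that the halving search is about 4 times faster (roughly 0.44 versus 1.79 seconds)."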
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "execution": {
          "iopub.execute_input": "2021-10-23T05:55:51.542520Z",
          "iopub.status.busy": "2021-10-23T05:55:51.541768Z",
          "iopub.status.idle": "2021-10-23T05:55:51.544673Z",
          "shell.execute_reply": "2021-10-23T05:55:51.545154Z"
        },
        "id": "qiACKBZQYUne",
        "outputId": "0defa711-9bf3-48b9-a51e-04259ba414ac",
        "scrolled": true
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "0.4361170530319214\n"
          ]
        }
      ],
      "source": [
        "print(np.sum(hgs.cv_results_['mean_fit_time']))"
      ]
    },
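    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The values in the header of the verbose output can be checked by hand. The following is a worked sketch, not a specification. It assumes that the `'exhaust'` rule divides `max_resources_` by `factor` raised to one less than the number of required iterations, and that the number of possible iterations is counted the same way from the ratio of maximum to minimum resources. The inputs come from this run: `cv=10`, 2 classes, 72 candidates, `factor=3`, and `max_resources_ = 455`. The symbols $r_{\\text{smallest}}$, $r_{\\min}$, $n_{\\text{required}}$, and $n_{\\text{possible}}$ are shorthand introduced here for the `'smallest'` value, `min_resources_`, `n_required_iterations`, and `n_possible_iterations`.\n",
        "\n",
        "$$\n",
        "\\begin{aligned}\n",
        "r_{\\text{smallest}} &= 10 \\times 2 \\times 2 = 40 \\\\\n",
        "n_{\\text{required}} &= \\lfloor \\log_{3} 72 \\rfloor + 1 = 4 \\\\\n",
        "r_{\\min} &= \\max\\left(40, \\lfloor 455 / 3^{4-1} \\rfloor\\right) = \\max(40, 16) = 40 \\\\\n",
        "n_{\\text{possible}} &= \\lfloor \\log_{3} \\lfloor 455 / 40 \\rfloor \\rfloor + 1 = \\lfloor \\log_{3} 11 \\rfloor + 1 = 3 \\\\\n",
        "n_{\\text{iterations}} &= \\min(4, 3) = 3\n",
        "\\end{aligned}\n",
        "$$\n",
        "\n",
        "Because only 3 of the 4 required iterations fit within 455 samples, the search stops at 8 candidates evaluated on 360 samples, and the best of those 8 is reported."
      ]
    },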
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qYUsdm-jYUnf"
      },
      "source": [
        "The number of samples and the number of candidates used in each iteration are stored in the `n_resources_` and `n_candidates_` attributes, respectively."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "execution": {
          "iopub.execute_input": "2021-10-23T05:55:51.550259Z",
          "iopub.status.busy": "2021-10-23T05:55:51.549579Z",
          "iopub.status.idle": "2021-10-23T05:55:51.552112Z",
          "shell.execute_reply": "2021-10-23T05:55:51.552564Z"
        },
        "id": "zjN6R6yTYUnf",
        "outputId": "1a81e2d8-8ba6-4440-b779-7e00e5c9008a"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Resources per iteration: [40, 120, 360]\n",
            "Candidates per iteration: [72, 24, 8]\n"
          ]
        }
      ],
      "source": [
        "print('Resources per iteration:', hgs.n_resources_)\n",
        "print('Candidates per iteration:', hgs.n_candidates_)"
      ]
    }
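    ,
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The two lists follow from the number of candidates, `factor`, `min_resources_`, and `max_resources_` alone. The helper in the next cell, `halving_schedule`, is a hypothetical function written for this notebook and is not part of scikit-learn. It is a minimal sketch that assumes the iteration rules described earlier, and it also shows how `aggressive_elimination=True` would replay the first iteration with unchanged resources until few enough candidates remain. Floating-point logarithms can be slightly off when the candidate count is an exact power of `factor`, which does not matter for the values used here."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from math import ceil, floor, log\n",
        "\n",
        "\n",
        "def halving_schedule(n_candidates, factor, min_resources, max_resources,\n",
        "                     aggressive_elimination=False):\n",
        "    # Hypothetical helper: returns (candidates, resources) per iteration\n",
        "    # Iterations needed until at most `factor` candidates are evaluated\n",
        "    n_required = 1 + floor(log(n_candidates, factor))\n",
        "    # Iterations that fit between min_resources and max_resources\n",
        "    n_possible = 1 + floor(log(max_resources // min_resources, factor))\n",
        "\n",
        "    if aggressive_elimination:\n",
        "        n_iterations = n_required\n",
        "        # Extra early rounds that are run at the minimum resources\n",
        "        n_extra = max(0, n_required - n_possible)\n",
        "    else:\n",
        "        n_iterations = min(n_required, n_possible)\n",
        "        n_extra = 0\n",
        "\n",
        "    candidates, resources = [], []\n",
        "    for itr in range(n_iterations):\n",
        "        power = max(0, itr - n_extra)\n",
        "        candidates.append(n_candidates)\n",
        "        resources.append(min(factor ** power * min_resources, max_resources))\n",
        "        # Keep the best 1/factor of the candidates, rounding up\n",
        "        n_candidates = ceil(n_candidates / factor)\n",
        "    return candidates, resources\n",
        "\n",
        "\n",
        "# Values taken from the run above: 72 candidates, factor 3, 40 to 455 samples\n",
        "print(halving_schedule(72, 3, 40, 455))\n",
        "# -> ([72, 24, 8], [40, 120, 360])\n",
        "print(halving_schedule(72, 3, 40, 455, aggressive_elimination=True))\n",
        "# -> ([72, 24, 8, 3], [40, 40, 120, 360])"
      ]
    }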
  ],
  "metadata": {
    "colab": {
      "name": "HalvingGridSearchCV.ipynb",
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3 (ipykernel)",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.7.3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}