{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Building a Simple Random Forest" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(1) Generate 1,000 training subsets from the *Sklearn moons* training data, each containing 100 randomly selected samples." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(2) First train a decision tree classifier on each subset, then check its accuracy on the test set." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(3) For each test sample, generate the predictions of the 1,000 decision trees and keep only the most frequent one. This yields a **majority vote prediction** for the test set." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(4) Evaluate these predictions on the test set." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Loading the data" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import make_moons\n", "\n", "X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Splitting into training and test sets" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", "    X, y, test_size=0.3, random_state=42\n", ")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training set: (7000, 2)\n", "Test set: (3000, 2)\n" ] } ], "source": [ "print('Training set:', X_train.shape)\n", "print('Test set:', X_test.shape)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 0.84441684, 1.2423668 ],\n", " [ 0.16320378, 0.82374035],\n", " [ 1.24805333, 0.05579093],\n", " [ 0.35703881, -0.01696228],\n", " [ 0.69022909, -0.25945021]])" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train[:5]" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { 
"text/plain": [ "array([0, 0, 1, 1, 1])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_train[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Hyperparameter search" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.model_selection import GridSearchCV\n", "\n", "# Candidate hyperparameter values for the decision tree\n", "params = {\n", "    'max_leaf_nodes': [2, 3, 4, 5, 6, 7],\n", "    'min_samples_split': [2, 3, 4],\n", "    'max_depth': [3, 5, 10, 15, 20]\n", "}\n", "grid_search_cv = GridSearchCV(\n", "    DecisionTreeClassifier(random_state=42),\n", "    params,\n", "    n_jobs=-1,\n", "    verbose=1,\n", "    cv=3\n", ")" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 3 folds for each of 90 candidates, totalling 270 fits\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=-1)]: Done 270 out of 270 | elapsed: 0.7s finished\n" ] }, { "data": { "text/plain": [ "GridSearchCV(cv=3, error_score='raise',\n", " estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,\n", " max_features=None, max_leaf_nodes=None,\n", " min_impurity_decrease=0.0, min_impurity_split=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0, presort=False, random_state=42,\n", " splitter='best'),\n", " fit_params=None, iid=True, n_jobs=-1,\n", " param_grid={'max_leaf_nodes': [2, 3, 4, 5, 6, 7], 'min_samples_split': [2, 3, 4], 'max_depth': [3, 5, 10, 15, 20]},\n", " pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',\n", " scoring=None, verbose=1)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grid_search_cv.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ 
"DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,\n", " max_features=None, max_leaf_nodes=4, min_impurity_decrease=0.0,\n", " min_impurity_split=None, min_samples_leaf=1,\n", " min_samples_split=2, min_weight_fraction_leaf=0.0,\n", " presort=False, random_state=42, splitter='best')" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Inspect the best-performing estimator found by the grid search.\n", "grid_search_cv.best_estimator_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Checking single-tree performance" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.856" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.metrics import accuracy_score\n", "\n", "y_pred = grid_search_cv.predict(X_test)\n", "accuracy_score(y_test, y_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Generating training subsets for the random forest" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import ShuffleSplit\n", "\n", "# 1,000 trees in total, each trained on 100 samples\n", "n_trees = 1000\n", "n_instances = 100\n", "\n", "mini_sets = []\n", "\n", "rs = ShuffleSplit(\n", "    n_splits=n_trees,\n", "    test_size=len(X_train) - n_instances,\n", "    random_state=42\n", ")" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1000" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check the number of splits.\n", "len(list(rs.split(X_train)))" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training set: 100\n", "Test set: 6900\n" ] } ], "source": [ "# The training part of each split is used.\n", "print('Training set:', len(list(rs.split(X_train))[0][0]))\n", "# The test part of each split will not be used.\n", "print('Test set:', len(list(rs.split(X_train))[0][1]))" ] }, { "cell_type": "code", 
"execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# Keep only the training indices of each split as one mini training set.\n", "for mini_train_index, _ in rs.split(X_train):\n", "    X_mini_train = X_train[mini_train_index]\n", "    y_mini_train = y_train[mini_train_index]\n", "    mini_sets.append((X_mini_train, y_mini_train))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Training 1,000 individual models" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.8281106666666667" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "from sklearn.base import clone\n", "\n", "# One unfitted copy of the best estimator per tree\n", "forest = [clone(grid_search_cv.best_estimator_) for _ in range(n_trees)]\n", "\n", "accuracy_scores = []\n", "\n", "# Train each of the 1,000 trees on its own mini training set.\n", "for tree, (X_mini_train, y_mini_train) in zip(forest, mini_sets):\n", "    # Fit the individual model.\n", "    tree.fit(X_mini_train, y_mini_train)\n", "    # Evaluate the fitted model on the test set.\n", "    y_pred = tree.predict(X_test)\n", "    accuracy_scores.append(accuracy_score(y_test, y_pred))\n", "\n", "np.mean(accuracy_scores)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The single tree was trained on all 7,000 samples, so its accuracy (0.856) is higher than that of trees trained on only 100 samples.\n", "\n", "The mean accuracy of the 1,000 individual learners, each trained on 100 samples, is accordingly lower (about 0.828)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. 
Collecting predictions from the 1,000 individual models and ensembling them" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "Y_pred = np.empty([n_trees, len(X_test)], dtype=np.uint8)\n", "\n", "# Collect the test-set predictions of each of the 1,000 models, one row per tree.\n", "for tree_index, tree in enumerate(forest):\n", "    Y_pred[tree_index] = tree.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import mode\n", "\n", "# Majority vote: the most frequent prediction per test sample\n", "y_pred_majority_votes, n_votes = mode(Y_pred, axis=0)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[1, 1, 0, ..., 0, 1, 1]], dtype=uint8)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_pred_majority_votes" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.8613333333333333" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "accuracy_score(y_test, y_pred_majority_votes.reshape([-1]))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" } }, "nbformat": 4, "nbformat_minor": 2 }