{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Grid search\n", "> This chapter introduces you to a popular automated hyperparameter tuning methodology called Grid Search. You will learn what it is, how it works and practice undertaking a Grid Search using Scikit Learn. You will then learn how to analyze the output of a Grid Search & gain practical experience doing this. This is the Summary of lecture \"Hyperparameter Tuning in Python\", via datacamp.\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Machine_Learning]\n", "- image: " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "from pprint import pprint" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introducing Grid Search" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Build Grid Search functions\n", "In data science it is a great idea to try building algorithms, models and processes 'from scratch' so you can really understand what is happening at a deeper level. Of course there are great packages and libraries for this work (and we will get to that very soon!) but building from scratch will give you a great edge in your data science work.\n", "\n", "In this exercise, you will create a function to take in 2 hyperparameters, build models and return results. You will use this function in a future exercise." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "credit_card = pd.read_csv('./dataset/credit-card-full.csv')\n", "# To change categorical variable with dummy variables\n", "credit_card = pd.get_dummies(credit_card, columns=['SEX', 'EDUCATION', 'MARRIAGE'], drop_first=True)\n", "\n", "X = credit_card.drop(['ID', 'default payment next month'], axis=1)\n", "y = credit_card['default payment next month']\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import GradientBoostingClassifier\n", "from sklearn.metrics import accuracy_score\n", "\n", "# Create the function\n", "def gbm_grid_search(learn_rate, max_depth):\n", " # Create the model\n", " model = GradientBoostingClassifier(learning_rate=learn_rate, max_depth=max_depth)\n", " \n", " # Use the model to make predictions\n", " predictions = model.fit(X_train, y_train).predict(X_test)\n", " \n", " # Return the hyperparameters and score\n", " return ([learn_rate, max_depth, accuracy_score(y_test, predictions)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Iteratively tune multiple hyperparameters\n", "In this exercise, you will build on the function you previously created to take in 2 hyperparameters, build a model and return the results. You will now use that to loop through some values and then extend this function and loop with another hyperparameter." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[0.01, 2, 0.8214444444444444],\n", " [0.01, 4, 0.8198888888888889],\n", " [0.01, 6, 0.8172222222222222],\n", " [0.1, 2, 0.8205555555555556],\n", " [0.1, 4, 0.8204444444444444],\n", " [0.1, 6, 0.8204444444444444],\n", " [0.5, 2, 0.8188888888888889],\n", " [0.5, 4, 0.8042222222222222],\n", " [0.5, 6, 0.7894444444444444]]\n" ] } ], "source": [ "# Create the relevant lists\n", "results_list = []\n", "learn_rate_list = [0.01, 0.1, 0.5]\n", "max_depth_list = [2, 4, 6]\n", "\n", "# Create the for loop\n", "for learn_rate in learn_rate_list:\n", " for max_depth in max_depth_list:\n", " results_list.append(gbm_grid_search(learn_rate, max_depth))\n", " \n", "# Print the results\n", "pprint(results_list)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Extend the function input\n", "def gbm_grid_search_extended(learn_rate, max_depth, subsample):\n", " # Extend the model creation section\n", " model = GradientBoostingClassifier(learning_rate=learn_rate, max_depth=max_depth,\n", " subsample=subsample)\n", " \n", " predictions = model.fit(X_train, y_train).predict(X_test)\n", " \n", " # Extend the return part\n", " return([learn_rate, max_depth, subsample, accuracy_score(y_test, predictions)])" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[0.01, 2, 0.8214444444444444],\n", " [0.01, 4, 0.8198888888888889],\n", " [0.01, 6, 0.8172222222222222],\n", " [0.1, 2, 0.8205555555555556],\n", " [0.1, 4, 0.8204444444444444],\n", " [0.1, 6, 0.8204444444444444],\n", " [0.5, 2, 0.8188888888888889],\n", " [0.5, 4, 0.8042222222222222],\n", " [0.5, 6, 0.7894444444444444],\n", " [0.01, 2, 0.4, 0.8192222222222222],\n", " [0.01, 2, 0.6, 0.8208888888888889],\n", " [0.01, 4, 0.4, 0.8183333333333334],\n", " [0.01, 4, 0.6, 0.8195555555555556],\n", " [0.01, 6, 0.4, 0.8177777777777778],\n", " [0.01, 6, 0.6, 0.8196666666666667],\n", " [0.1, 2, 0.4, 0.821],\n", " [0.1, 2, 0.6, 0.8201111111111111],\n", " [0.1, 4, 0.4, 0.8207777777777778],\n", " [0.1, 4, 0.6, 0.8196666666666667],\n", " [0.1, 6, 0.4, 0.8155555555555556],\n", " [0.1, 6, 0.6, 0.8183333333333334],\n", " [0.5, 2, 0.4, 0.8128888888888889],\n", " [0.5, 2, 0.6, 0.8156666666666667],\n", " [0.5, 4, 0.4, 0.7945555555555556],\n", " [0.5, 4, 0.6, 0.8065555555555556],\n", " [0.5, 6, 0.4, 0.7714444444444445],\n", " [0.5, 6, 0.6, 0.7743333333333333]]\n" ] } ], "source": [ "# Create the new list to test\n", "subsample_list = [0.4, 0.6]\n", "\n", "for learn_rate in learn_rate_list:\n", " for max_depth in max_depth_list:\n", " # Extend the for loop\n", " for subsample in subsample_list:\n", " # Extend the results to include the new hyperparameter\n", " results_list.append(gbm_grid_search_extended(learn_rate, max_depth, subsample))\n", " \n", "# Print the results\n", "pprint(results_list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Grid Search with Scikit Learn\n", "- Steps in a Grid Search\n", " 1. An algorithm to tune the hyperparameters (or estimator)\n", " 2. Defining which hyperparameters to tune\n", " 3. Defining a range of values for each hyperparameter\n", " 4. Setting a cross-validatoin scheme\n", " 5. Defining a score function so we can decide which square on our grid was 'the best'\n", " 6. Include extra useful information or functions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### GridSearchCV with Scikit Learn\n", "The `GridSearchCV` module from Scikit Learn provides many useful features to assist with efficiently undertaking a grid search. You will now put your learning into practice by creating a `GridSearchCV` object with certain parameters.\n", "\n", "The desired options are:\n", "\n", "- A Random Forest Estimator, with the split criterion as 'entropy'\n", "- 5-fold cross validation\n", "- The hyperparameters `max_depth` (2, 4, 8, 15) and `max_features` ('auto' vs 'sqrt')\n", "- Use `roc_auc` to score the models\n", "- Use 4 cores for processing in parallel\n", "- Ensure you refit the best model and return training scores" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "GridSearchCV(cv=5, estimator=RandomForestClassifier(criterion='entropy'),\n", " n_jobs=4,\n", " param_grid={'max_depth': [2, 4, 8, 15],\n", " 'max_features': ['auto', 'sqrt']},\n", " return_train_score=True, scoring='roc_auc')\n" ] } ], "source": [ "from sklearn.model_selection import GridSearchCV\n", "from sklearn.ensemble import RandomForestClassifier\n", "\n", "# Create a Random Forest Classifier with specified criterion\n", "rf_class = RandomForestClassifier(criterion='entropy')\n", "\n", "# Create the parametergrid\n", "param_grid = {\n", " 'max_depth':[2, 4, 8, 15],\n", " 'max_features':['auto', 'sqrt']\n", "}\n", "\n", "# Create a GridSearchCV object\n", "grid_rf_class = GridSearchCV(\n", " estimator=rf_class,\n", " param_grid=param_grid,\n", " scoring='roc_auc',\n", " n_jobs=4,\n", " cv=5,\n", " refit=True,\n", " return_train_score=True\n", ")\n", "\n", "print(grid_rf_class)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Understanding a grid search output\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exploring the grid search results\n", "You will now explore the `cv_results_` property of the GridSearchCV object defined in the video. This is a dictionary that we can read into a pandas DataFrame and contains a lot of useful information about the grid search we just undertook.\n", "\n", "A reminder of the different column types in this property:\n", "\n", "- `time_` columns\n", "- `param_` columns (one for each hyperparameter) and the singular params column (with all hyperparameter settings)\n", "- a `train_score` column for each cv fold including the mean_train_score and std_train_score columns\n", "- a `test_score` column for each cv fold including the mean_test_score and std_test_score columns\n", "- a `rank_test_score` column with a number from 1 to n (number of iterations) ranking the rows based on their `mean_test_score`" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " mean_fit_time std_fit_time mean_score_time std_score_time \\\n", "0 0.633362 0.012019 0.022155 0.000390 \n", "1 0.641156 0.005938 0.021984 0.000733 \n", "2 1.075554 0.013504 0.025514 0.000493 \n", "3 1.078453 0.010363 0.025199 0.000147 \n", "4 1.944722 0.012599 0.033582 0.000308 \n", "5 1.945425 0.028632 0.034088 0.000777 \n", "6 3.206863 0.097319 0.053727 0.002878 \n", "7 3.147933 0.044344 0.051127 0.000389 \n", "\n", " param_max_depth param_max_features \\\n", "0 2 auto \n", "1 2 sqrt \n", "2 4 auto \n", "3 4 sqrt \n", "4 8 auto \n", "5 8 sqrt \n", "6 15 auto \n", "7 15 sqrt \n", "\n", " params split0_test_score \\\n", "0 {'max_depth': 2, 'max_features': 'auto'} 0.762140 \n", "1 {'max_depth': 2, 'max_features': 'sqrt'} 0.764059 \n", "2 {'max_depth': 4, 'max_features': 'auto'} 0.770780 \n", "3 {'max_depth': 4, 'max_features': 'sqrt'} 0.771145 \n", "4 {'max_depth': 8, 'max_features': 'auto'} 0.777029 \n", "5 {'max_depth': 8, 'max_features': 'sqrt'} 0.775533 \n", "6 {'max_depth': 15, 'max_features': 'auto'} 0.764570 \n", "7 {'max_depth': 15, 'max_features': 'sqrt'} 0.768063 \n", "\n", " split1_test_score split2_test_score ... mean_test_score std_test_score \\\n", "0 0.765329 0.761279 ... 0.766216 0.006197 \n", "1 0.763412 0.762643 ... 0.766549 0.006738 \n", "2 0.767779 0.767878 ... 0.772387 0.005807 \n", "3 0.767072 0.768481 ... 0.772561 0.006485 \n", "4 0.774657 0.774855 ... 0.778609 0.005532 \n", "5 0.773319 0.775794 ... 0.777846 0.005482 \n", "6 0.767184 0.773903 ... 0.771833 0.005240 \n", "7 0.770179 0.775381 ... 0.773331 0.003961 \n", "\n", " rank_test_score split0_train_score split1_train_score \\\n", "0 8 0.769725 0.770268 \n", "1 7 0.769879 0.770792 \n", "2 5 0.780189 0.780651 \n", "3 4 0.780138 0.781665 \n", "4 1 0.829683 0.829852 \n", "5 2 0.830955 0.830563 \n", "6 6 0.977494 0.973979 \n", "7 3 0.975494 0.973033 \n", "\n", " split2_train_score split3_train_score split4_train_score \\\n", "0 0.770262 0.769099 0.766369 \n", "1 0.771637 0.767746 0.765142 \n", "2 0.781025 0.780711 0.777592 \n", "3 0.781055 0.780815 0.777521 \n", "4 0.829912 0.829288 0.829285 \n", "5 0.830177 0.831415 0.827246 \n", "6 0.974724 0.974242 0.976732 \n", "7 0.976973 0.974084 0.973877 \n", "\n", " mean_train_score std_train_score \n", "0 0.769145 0.001453 \n", "1 0.769039 0.002340 \n", "2 0.780034 0.001249 \n", "3 0.780239 0.001444 \n", "4 0.829604 0.000270 \n", "5 0.830071 0.001471 \n", "6 0.975434 0.001412 \n", "7 0.974692 0.001388 \n", "\n", "[8 rows x 22 columns]\n", " params\n", "0 {'max_depth': 2, 'max_features': 'auto'}\n", "1 {'max_depth': 2, 'max_features': 'sqrt'}\n", "2 {'max_depth': 4, 'max_features': 'auto'}\n", "3 {'max_depth': 4, 'max_features': 'sqrt'}\n", "4 {'max_depth': 8, 'max_features': 'auto'}\n", "5 {'max_depth': 8, 'max_features': 'sqrt'}\n", "6 {'max_depth': 15, 'max_features': 'auto'}\n", "7 {'max_depth': 15, 'max_features': 'sqrt'}\n", " mean_fit_time std_fit_time mean_score_time std_score_time \\\n", "4 1.944722 0.012599 0.033582 0.000308 \n", "\n", " param_max_depth param_max_features \\\n", "4 8 auto \n", "\n", " params split0_test_score \\\n", "4 {'max_depth': 8, 'max_features': 'auto'} 0.777029 \n", "\n", " split1_test_score split2_test_score ... mean_test_score std_test_score \\\n", "4 0.774657 0.774855 ... 0.778609 0.005532 \n", "\n", " rank_test_score split0_train_score split1_train_score \\\n", "4 1 0.829683 0.829852 \n", "\n", " split2_train_score split3_train_score split4_train_score \\\n", "4 0.829912 0.829288 0.829285 \n", "\n", " mean_train_score std_train_score \n", "4 0.829604 0.00027 \n", "\n", "[1 rows x 22 columns]\n" ] } ], "source": [ "grid_rf_class.fit(X_train, y_train)\n", "\n", "# Read the cv_results property into adataframe & print it out\n", "cv_results_df = pd.DataFrame(grid_rf_class.cv_results_)\n", "print(cv_results_df)\n", "\n", "# Extract and print the column with a dictionary of hyperparameters used\n", "column = cv_results_df.loc[:, [\"params\"]]\n", "print(column)\n", "\n", "# Extract and print the row that had the best mean test score\n", "best_row = cv_results_df[cv_results_df['rank_test_score'] == 1]\n", "print(best_row)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Analyzing the best results\n", "At the end of the day, we primarily care about the best performing 'square' in a grid search. Luckily Scikit Learn's `gridSearchCV` objects have a number of parameters that provide key information on just the best square (or row in `cv_results_`).\n", "\n", "Three properties you will explore are:\n", "\n", "- `best_score_` – The score (here ROC_AUC) from the best-performing square.\n", "- `best_index_` – The index of the row in `cv_results_` containing information on the best-performing square.\n", "- `best_params_` – A dictionary of the parameters that gave the best score, for example 'max_depth': 10" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.7786085423910816\n", " mean_fit_time std_fit_time mean_score_time std_score_time \\\n", "4 1.944722 0.012599 0.033582 0.000308 \n", "\n", " param_max_depth param_max_features \\\n", "4 8 auto \n", "\n", " params split0_test_score \\\n", "4 {'max_depth': 8, 'max_features': 'auto'} 0.777029 \n", "\n", " split1_test_score split2_test_score ... mean_test_score std_test_score \\\n", "4 0.774657 0.774855 ... 0.778609 0.005532 \n", "\n", " rank_test_score split0_train_score split1_train_score \\\n", "4 1 0.829683 0.829852 \n", "\n", " split2_train_score split3_train_score split4_train_score \\\n", "4 0.829912 0.829288 0.829285 \n", "\n", " mean_train_score std_train_score \n", "4 0.829604 0.00027 \n", "\n", "[1 rows x 22 columns]\n", "8\n" ] } ], "source": [ "# Print out the ROC_AUC score from the best-performing square\n", "best_score = grid_rf_class.best_score_\n", "print(best_score)\n", "\n", "# Create a variable from the row related to the best-performing square\n", "cv_results_df = pd.DataFrame(grid_rf_class.cv_results_)\n", "best_row = cv_results_df.loc[[grid_rf_class.best_index_]]\n", "print(best_row)\n", "\n", "# Get the max_depth parameter from the best-performing square and print\n", "best_max_depth = grid_rf_class.best_params_['max_depth']\n", "print(best_max_depth)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using the best results\n", "While it is interesting to analyze the results of our grid search, our final goal is practical in nature; we want to make predictions on our test set using our estimator object.\n", "\n", "We can access this object through the `best_estimator_` property of our grid search object.\n", "\n", "In this exercise we will take a look inside the `best_estimator_` property and then use this to make predictions on our test set for credit card defaults and generate a variety of scores. Remember to use `predict_proba` rather than `predict` since we need probability values rather than class labels for our roc_auc score. We use a slice `[:,1]` to get probabilities of the positive class." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "[1 0 0 1 0]\n", "Confusion Matrix \n", " [[6685 323]\n", " [1292 700]]\n", "ROC-AUC Score \n", " 0.7767071783137115\n" ] } ], "source": [ "from sklearn.metrics import confusion_matrix, roc_auc_score\n", "\n", "# See what type of object the best_estimator_property is\n", "print(type(grid_rf_class.best_estimator_))\n", "\n", "# Create an array of predictions directly using the best_estimator_property\n", "predictions = grid_rf_class.best_estimator_.predict(X_test)\n", "\n", "# Take a look to confirm it worked, this should be an array of 1's and 0's\n", "print(predictions[0:5])\n", "\n", "# Now create a confusion matrix\n", "print(\"Confusion Matrix \\n\", confusion_matrix(y_test, predictions))\n", "\n", "# Get the ROC-AUC score\n", "predictions_proba = grid_rf_class.best_estimator_.predict_proba(X_test)[:, 1]\n", "print(\"ROC-AUC Score \\n\", roc_auc_score(y_test, predictions_proba))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `.best_estimator_` property is a really powerful property to understand for streamlining your machine learning model building process. You now can run a grid search and seamlessly use the best model from that search to make predictions." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }