{ "cells": [ { "cell_type": "markdown", "source": [ "# Tunning XGBoost, Catboost and Lightgbm Classifiers Hyperparameters with Hyperopt" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "**Abstract**. This notebook is an example on how to use `hyperopt` to train `XGBoost` and `CatBoost` classifiers. A cross-validation object is also defined." ], "metadata": {} }, { "cell_type": "markdown", "source": [ "# The Dataset" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "For this notebook, we wil use the breast cancer dataset" ], "metadata": {} }, { "cell_type": "code", "execution_count": 1, "source": [ "from sklearn.datasets import load_breast_cancer\n", "import pandas as pd\n", "\n", "data = load_breast_cancer()\n", "X = data['data']\n", "y = data['target']" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "To prepare for training and validation, we will split this dataset into training and test" ], "metadata": {} }, { "cell_type": "code", "execution_count": 2, "source": [ "from sklearn.model_selection import StratifiedShuffleSplit\n", "\n", "tran_size = 0.75\n", "n_splits = 1\n", "\n", "sss = StratifiedShuffleSplit(n_splits=n_splits, train_size=tran_size)\n", "for train_index, test_index in sss.split(X, y):\n", " X_train, X_test = X[train_index], X[test_index]\n", " y_train, y_test = y[train_index], y[test_index]" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "# A note about the hyperparameters of `XGBClassifier` (X), and `CatBoostClassifier` (C)" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "Some basic parameters:\n", "\n", "* `learning_rate` [X/C]: learning rate (alias: `eta` )\n", "* `max_depth` [X/C]: maximum depth of trees\n", "* `n_estimators` [X/C]: no. of boosting iterations\n", "* `min_child_weight` [X]: minimum sum of instance (hessian) weight needed in a child\n", "* `min_child_samples` [C]: minimum no. of data in one leaf\n", "* `subsample` [X/C]: subsample ratio of the training instances (note that for CatBoost this parameter can be used only if either Poisson or Bernoulli bootstrap_type is * `selected`)\n", "* `colsample_bytree` [X]: subsample ratio of columns in tree building\n", "* `colsample_bylevel` [X/C]: subsample ratio of columns for each level in tree building\n", "* `colsample_bynode` [X]: subsample ratio of columns for each node\n", "* `tree_method` [X]: tree construction method\n", "* `boosting_type` [C]: Ordered for ordered boosting or Plain for classic\n", "* `early_stopping_rounds` [X/C]: parameter for fit() — stop the training if one metric of a validation data does not improve in last early_stopping_rounds rounds\n", "* `eval_metric` [X/C]: evaluation metrics for validation data" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "# A note on `hyperopt`" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "There two options for the search algortihm: \n", "* `hyperopt.tpe.suggest` \n", "* `hyperopt.rand.suggest`\n", "\n", "The search space is defined using uniform, loguniform, normal distribution along with its quantized variations. For instance, `uniform('x', -1, 1)` defines a search space with label `x` that will be sampled uniformly between -1 and 1. The expressions currently recognized by hyperopt’s optimization algorithms are:\n", "\n", "* `hp.choice(label, options)`: index of an option\n", "* `hp.randint(label, upper)`: random integer within $[0, \\text{upper})$\n", "* `hp.uniform(label, low, high)`: uniform value between low/high\n", "* `hp.quniform(label, low, high, q)`: `round(uniform(.)/q)*q` (note that the value is giving a float instead of integer)\n", "* `hp.loguniform(label, low, high)`: `exp(uniform(low, high)/q)*q`\n", "* `hp.qloguniform(label, low, high, q)`: `round(loguniform())`\n", "* `hp.normal(label, mu, sigma)`: sampling from normal distribution\n", "* `hp.qnormal(label, mu, sigma, q)`: `round(normal(nu, sigma)/q)*q`\n", "* `hp.lognormal(label, mu, sigma)`: `exp(normal(mu, sigma)`\n", "* `hp.qlognormal(label, mu, sigma, q)` : `round(exp(normal(.))/q)*q`" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "# Finding optimal hyperparameters of `XGBClassifier` using `hyperopt`" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "The search space for the hyperparameters is defined by the following distributions" ], "metadata": {} }, { "cell_type": "code", "execution_count": 4, "source": [ "from hyperopt import hp\n", "\n", "space = {\n", " 'max_depth' : hp.randint('max_depth', 10),\n", " 'learning_rate' : hp.quniform('learning_rate', 0.01, 0.5, 0.01),\n", " 'n_estimators' : hp.randint('n_estimators', 250),\n", " 'gamma' : hp.quniform('gamma', 0, 0.50, 0.01),\n", " 'min_child_weight' : hp.quniform('min_child_weight', 1, 10, 1),\n", " 'subsample' : hp.quniform('subsample', 0.1, 1, 0.01),\n", " 'colsample_bytree' : hp.quniform('colsample_bytree', 0.1, 1.0, 0.01)\n", " }" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "A first cost function is defined for the `XGBoostClassifier` and it is compose of three steps\n", "1. Instatiate the estimator with a selected set of hyperparameters and fit\n", "2. Evaluate the estimator according to chosen performance metrics\n", "3. Calculate the loss (ie, how close the chosen performance metrics are close to their optimal value) and return this value" ], "metadata": {} }, { "cell_type": "code", "execution_count": 5, "source": [ "import warnings\n", "\n", "import numpy as np\n", "from hyperopt import STATUS_FAIL, STATUS_OK, Trials, fmin, tpe\n", "from sklearn.metrics import f1_score, log_loss, roc_auc_score\n", "from xgboost import XGBClassifier\n", "\n", "def xgb_cost(space, X_train, y_train):\n", "\n", " warnings.filterwarnings(action='ignore')\n", "\n", " # 1. Instantiate estimator with selected hyperparamters and fit\n", " classifier = XGBClassifier(n_estimators = space['n_estimators']\n", " ,max_depth = int(space['max_depth'])\n", " ,learning_rate = space['learning_rate']\n", " ,gamma = space['gamma']\n", " ,min_child_weight = space['min_child_weight']\n", " ,subsample = space['subsample']\n", " ,colsample_bytree = space['colsample_bytree']\n", " ,use_label_encoder=False\n", " ,eval_metric=\"logloss\"\n", " )\n", " classifier.fit(X_train, y_train)\n", "\n", " # 2. Evaluate the estimator according to chose performance metrics\n", " y_proba_train = classifier.predict_proba(X_train)[:, 1]\n", " y_class_train = classifier.predict(X_train)\n", "\n", " rocauc_train = roc_auc_score(y_train, y_proba_train)\n", " f1_train = f1_score(y_train, y_class_train)\n", " logloss_train = log_loss(y_train, y_class_train)\n", "\n", " # 3. Calculate the loss \n", " loss = ((1.0 - rocauc_train)**2 + (1.0 - f1_train)**2 + logloss_train**2)\n", "\n", " return {'loss': np.sqrt(loss), 'status': STATUS_OK }\n" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "In the next step, the objective function is defined from the cost function and the minization problem is solved. The output are the optimal hyperparameters found in the search space." ], "metadata": {} }, { "cell_type": "code", "execution_count": 6, "source": [ "objective = lambda x: xgb_cost(x, X_train=X_train, y_train=y_train)\n", "trials = Trials()\n", "\n", "best = fmin(fn=objective,\n", " space=space,\n", " algo=tpe.suggest,\n", " max_evals=100,\n", " trials=trials)\n", "\n", "print(\"Optimal hyperparameters: \", best)" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "100%|██████████| 100/100 [00:08<00:00, 11.28trial/s, best loss: 9.992007221626413e-16]\n", "Best: {'colsample_bytree': 0.68, 'gamma': 0.24, 'learning_rate': 0.2, 'max_depth': 3, 'min_child_weight': 1.0, 'n_estimators': 38, 'subsample': 0.9400000000000001}\n" ] } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "# Using `hyperopt` for Cross Validation" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "Here, a cross validation class is defined" ], "metadata": {} }, { "cell_type": "code", "execution_count": 54, "source": [ "import sys\n", "import traceback\n", "from collections.abc import Iterable\n", "from typing import Union\n", "\n", "from hyperopt import STATUS_FAIL, STATUS_OK, Trials, fmin, hp, tpe\n", "import pandas as pd\n", "import numpy as np\n", "import scipy as sp\n", "from sklearn.model_selection import KFold, StratifiedKFold\n", "from sklearn.metrics import get_scorer\n", "\n", "class BayesSearchCV:\n", "\n", " def __init__(self, estimator, param_distributions: dict, scoring: dict\n", " ,n_iter: int=10, weights_matrix: np.ndarray=None, cv: Union[int, Iterable]=5, random_state: int=None\n", " ,algo=tpe.suggest, trials: Trials=Trials()) -> None:\n", " \"\"\"Use Bayesian optimisation to search for hyperpameters and selects best estimator based on validation sets\n", "\n", " Args:\n", " estimator: Estimator object\n", " param_distributions (dict): Search space containing hyperparameters\n", " scoring (dict): {metric: opt_value} Dict of performance metrics to measure the estimator performance and their corresponding optimal values.\n", " Select one from sklearn.metrics.SCORERS.keys()\n", " n_iter (int, optional): Max number of iterations. Defaults to 10.\n", " weights_matrix (np.ndarray, optional): Symmetric positive definite matrix used to calculate the quadratic loss function\n", " cv (int or Iterable, optional): int, cross-validation generator or an iterable. Defaults to 5.\n", " random_state (int, optional): Pseudo random number generator state used for random uniform sampling. Defaults to None.\n", " algo (optional): Algorithm to for distribution search. Defaults to tpe.suggest.\n", " trials (Trials, optional): [description]. Defaults to Trials().\n", " \"\"\"\n", " self.estimator = estimator\n", " self.param_distributions = param_distributions\n", " self.n_iter = n_iter\n", " self.random_state = random_state\n", " self.weights_matrix = weights_matrix or np.identity(len(scoring))\n", " self.cv = cv\n", " self.algo = algo\n", " self.trials = trials\n", " self.scoring = scoring\n", "\n", " def fit(self, X: pd.DataFrame, y=None) -> None:\n", " \"\"\"Find optimal hyperparameters and fit estimator\n", "\n", " Args:\n", " X (pd.DataFrame): Predictors\n", " y (pd.DataFrame): Target\n", " \"\"\"\n", " self.cv_results_ = pd.DataFrame()\n", " self.min_loss = np.inf\n", "\n", " if not self._check_spd(self.weights_matrix):\n", " msg = f\"Expected weights matrix to be symmetric positive definite.\"\n", " raise ValueError(msg)\n", "\n", " for iteration, (train_index, val_index) in enumerate(self._get_splits(X, y)):\n", " X_train, X_val = X[train_index], X[val_index]\n", " y_train, y_val = y[train_index], y[val_index]\n", "\n", " objective = lambda space: self._cost(X=X_train, y=y_train, hyperparameters=space)\n", "\n", " try:\n", " hyperparameters = fmin(fn=objective, space=self.param_distributions\n", " ,algo=self.algo, max_evals=self.n_iter\n", " ,trials=self.trials)\n", "\n", " except KeyError:\n", " exc_info = sys.exc_info()\n", " traceback.print_exception(*exc_info)\n", " return {'status': STATUS_FAIL,\n", " 'exception': str(sys.exc_info())}\n", "\n", " estimator = self._instantiate_estimator(X_train, y_train, hyperarameters=hyperparameters)\n", " loss_df, current_loss = self._cost(X_val, y_val, hyperparameters, estimator=estimator, return_loss_df=True)\n", " loss_df[\"cv_iteration\"] = iteration\n", "\n", " if current_loss < self.min_loss:\n", " self.min_loss = current_loss\n", " self.best_estimator_ = estimator\n", " self.best_hyperparameters_ = hyperparameters\n", "\n", " self.cv_results_ = pd.concat([self.cv_results_ , loss_df.copy()])\n", "\n", " self.cv_results_.rename(columns={col: f\"{col}_loss\" for col in self.cv_results_.columns if col != \"loss\"}, inplace=True)\n", " self.cv_results_.sort_values(by=\"loss\", inplace=True)\n", " self.cv_results_.reset_index(inplace=True, drop=True)\n", "\n", "\n", " def _get_splits(self, X: pd.DataFrame, y=None):\n", " \"\"\"Instantiate and/or get training and validation datasets\n", "\n", " Args:\n", " X (pd.DataFrame): Predictor\n", " y (pd.DataFrame): Target\n", "\n", " Yields:\n", " [type]: Train and test indices\n", "\n", " Raises:\n", " NotImplementedError: Only KFold and StratifiedKFold are implemented\n", " \"\"\"\n", "\n", " if isinstance(self.cv, int): \n", " self.cv = KFold(n_splits=self.cv, random_state=self.random_state)\n", "\n", " elif isinstance(self.cv, StratifiedKFold):\n", " pass\n", "\n", " else:\n", " msg = f\"Cross validation not yet implemented for type {type(self.cv)}\"\n", " NotImplementedError(msg)\n", " \n", " for train_index, test_index in self.cv.split(X, y):\n", " yield train_index, test_index\n", "\n", "\n", " def _cost(self, X: pd.DataFrame, y: pd.DataFrame, hyperparameters: dict\n", " ,estimator=None, return_loss_df: bool=False) -> dict:\n", " \"\"\"Evaluates the cost function for the trained estimator using a quadratic loss function\n", "\n", " Args:\n", " X (pd.DataFrame): Predictor\n", " y (pd.DataFrame): Target\n", " hyperarameters (dict): Estimator hyperparameters\n", " return_loss_df (bool): Returns fit loss data frame\n", "\n", " Returns:\n", " (dict)\n", " \"\"\"\n", " loss_dict = {metric_name: [] for metric_name in self.scoring}\n", "\n", " if not estimator:\n", " estimator = self._instantiate_estimator(X, y, hyperparameters)\n", "\n", " for p_metric, opt_value in self.scoring.items():\n", " scorer = get_scorer(p_metric)\n", " loss_dict[p_metric].append((opt_value - scorer(estimator, X, y))**2)\n", "\n", " loss_df = pd.DataFrame.from_dict(loss_dict)\n", " loss = loss_df.values.dot(self.weights_matrix.dot(loss_df.T.values))\n", " loss_df[\"loss\"] = loss\n", "\n", " if return_loss_df:\n", " return loss_df, loss\n", "\n", " return {'loss': np.sqrt(loss), 'status': STATUS_OK}\n", "\n", "\n", " def _instantiate_estimator(self, X: pd.DataFrame, y: pd.DataFrame\n", " ,hyperarameters: dict):\n", " \"\"\"Instantiate estimator with selected hyperparameters\n", "\n", " Args:\n", " X (pd.DataFrame): Predictors\n", " y (pd.DataFrame): Target\n", " hyperarameters (dict): Estimator hyperparameters\n", "\n", " Returns:\n", " [type]: Estimator\n", " \"\"\"\n", " estimator_cls = self.estimator.__class__\n", " estimator = estimator_cls(**hyperarameters)\n", " estimator.fit(X, y)\n", " return estimator\n", "\n", "\n", " @staticmethod\n", " def _check_spd(m: np.ndarray, rtol: float=1e-6, atol: float=1e-9) -> bool:\n", " \"\"\"Checks if a matrix is symmetric positive definite\n", "\n", " Args:\n", " m (np.ndarray): Matrix\n", " rtol (float): Relative tolerance to verify if m is symmetric\n", " atol (float): Absolute tolerance to verify if m is symmetric\n", "\n", " Returns:\n", " (bool): True if matrix is symmetric positive definite\n", " \"\"\"\n", "\n", " try:\n", " # Check if matrix is positive definite\n", " np.linalg.cholesky(m)\n", "\n", " except np.linalg.linalg.LinAlgError as err:\n", " if 'Matrix is not positive definite' in err.message:\n", " return False\n", "\n", " else:\n", " raise \n", "\n", " else:\n", " # Now that m is positive definite, check if it is symmetric\n", " return np.allclose(m, m.T, rtol=rtol, atol=atol)" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Example with `XGBClassifier`" ], "metadata": {} }, { "cell_type": "code", "execution_count": 45, "source": [ "space = {\n", " 'max_depth' : hp.randint('max_depth', 10),\n", " 'learning_rate' : hp.quniform('learning_rate', 0.01, 0.5, 0.01),\n", " 'n_estimators' : hp.randint('n_estimators', 250),\n", " 'gamma' : hp.quniform('gamma', 0, 0.50, 0.01),\n", " 'min_child_weight' : hp.quniform('min_child_weight', 1, 10, 1),\n", " 'subsample' : hp.quniform('subsample', 0.1, 1, 0.01),\n", " 'colsample_bytree' : hp.quniform('colsample_bytree', 0.1, 1.0, 0.01),\n", " 'eval_metric': \"logloss\"\n", " }\n", "\n", "cv = BayesSearchCV(XGBClassifier(learning_rate=0.1), param_distributions=space, scoring={\"roc_auc\": 1, \"f1\": 1, \"neg_log_loss\": 0}, cv=StratifiedKFold(), n_iter=100)\n", "cv.fit(X_train, y_train)\n", "print(\"Loss value according to each performance metric:\\n\", cv.cv_results_)\n", "print(\"Optimal hyperparameters: \", cv.best_hyperparameters_)" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "100%|██████████| 100/100 [00:09<00:00, 10.47trial/s, best loss: 5.735027908132748e-05]\n", "[15:52:35] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n", "100%|██████████| 100/100 [00:00