{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Introduction\n", "XGBoost classification covers two cases, binary and multiclass:\n", "\n", "(1) Binary classification follows the same idea as [logistic regression](https://nbviewer.jupyter.org/github/zhulei227/ML_Notes/blob/master/notebooks/02_%E7%BA%BF%E6%80%A7%E6%A8%A1%E5%9E%8B_%E9%80%BB%E8%BE%91%E5%9B%9E%E5%BD%92.ipynb): the raw score is passed through a `sigmoid` function and cross-entropy is used as the loss, so a single group of regression trees is enough;\n", "\n", "(2) Multiclass classification follows the same idea as [gbm_classifier](https://nbviewer.jupyter.org/github/zhulei227/ML_Notes/blob/master/notebooks/10_06_%E9%9B%86%E6%88%90%E5%AD%A6%E4%B9%A0_boosting_gbm_classifier.ipynb): several groups of regression trees are trained at the same time, one group per class, their outputs are passed through `softmax`, and cross-entropy is again used as the loss.\n", "\n", "Below, the loss function and its first and second derivatives are derived once more for the multiclass case, where $n$ is the number of classes, $\\hat{y}$ is the vector of raw scores, and $y$ is the one-hot label.\n", "\n", "The softmax transform:\n", "\n", "\n", "$$\n", "\\mathrm{softmax}(\\hat{y})=\\mathrm{softmax}([\\hat{y}_1,\\hat{y}_2,...,\\hat{y}_n])=\\frac{1}{\\sum_{i=1}^n e^{\\hat{y}_i}}[e^{\\hat{y}_1},e^{\\hat{y}_2},...,e^{\\hat{y}_n}]\n", "$$\n", "\n", "Cross-entropy:\n", "\n", "$$\n", "\\mathrm{cross\\_entropy}(y,p)=-\\sum_{i=1}^n y_i\\log p_i\n", "$$\n", "\n", "Substituting $p_i=\\frac{e^{\\hat{y}_i}}{\\sum_{j=1}^n e^{\\hat{y}_j}}$ gives the loss function:\n", "\n", "$$\n", "L(\\hat{y},y)=-\\sum_{i=1}^n y_i\\log \\frac{e^{\\hat{y}_i}}{\\sum_{j=1}^n e^{\\hat{y}_j}}\\\\\n", "=-\\sum_{i=1}^n y_i(\\hat{y}_i-\\log\\sum_{j=1}^n e^{\\hat{y}_j})\\\\\n", "=\\log\\sum_{i=1}^n e^{\\hat{y}_i}-\\sum_{i=1}^n y_i\\hat{y}_i\\quad(\\text{since } y \\text{ is one-hot, } \\sum_{i=1}^n y_i=1)\n", "$$\n", "\n", "The first derivative is therefore:\n", "\n", "$$\n", "\\frac{\\partial L(\\hat{y},y)}{\\partial \\hat{y}}=\\mathrm{softmax}([\\hat{y}_1,\\hat{y}_2,...,\\hat{y}_n])-[y_1,y_2,...,y_n]\\\\\n", "=\\mathrm{softmax}(\\hat{y})-y\n", "$$\n", "\n", "The second derivative, taken elementwise (only the diagonal of the Hessian is kept):\n", "\n", "$$\n", "\\frac{\\partial^2 L(\\hat{y},y)}{\\partial \\hat{y}^2}=\\mathrm{softmax}(\\hat{y})\\odot(1-\\mathrm{softmax}(\\hat{y}))\n", "$$\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Implementation" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "os.chdir('../')\n", "from ml_models.ensemble import XGBoostBaseTree\n", "from ml_models import utils\n", "import copy\n", "import numpy as np\n", "import matplotlib.pyplot 
as plt\n", "%matplotlib inline\n", "\n", "\"\"\"\n", "Implementation of XGBoost classification trees, packaged in ml_models.ensemble\n", "\"\"\"\n", "\n", "\n", "class XGBoostClassifier(object):\n", "    def __init__(self, base_estimator=None, n_estimators=10, learning_rate=1.0):\n", "        \"\"\"\n", "        :param base_estimator: base learner\n", "        :param n_estimators: number of boosting iterations (base learners)\n", "        :param learning_rate: learning rate; shrinks the weight of later base learners to reduce overfitting\n", "        \"\"\"\n", "        self.base_estimator = base_estimator\n", "        self.n_estimators = n_estimators\n", "        self.learning_rate = learning_rate\n", "        if self.base_estimator is None:\n", "            self.base_estimator = XGBoostBaseTree()\n", "        # Homogeneous learners: copy the single estimator n_estimators times\n", "        if type(base_estimator) != list:\n", "            estimator = self.base_estimator\n", "            self.base_estimator = [copy.deepcopy(estimator) for _ in range(0, self.n_estimators)]\n", "        # Heterogeneous learners: use the supplied list as-is\n", "        else:\n", "            self.n_estimators = len(self.base_estimator)\n", "\n", "        # Expanded into class_num groups of learners during fit\n", "        self.expand_base_estimators = []\n", "\n", "    def fit(self, x, y):\n", "        # Convert y to one-hot encoding\n", "        class_num = np.amax(y) + 1\n", "        y_cate = np.zeros(shape=(len(y), class_num))\n", "        y_cate[np.arange(len(y)), y] = 1\n", "\n", "        # Expand the learners: one group per class\n", "        self.expand_base_estimators = [copy.deepcopy(self.base_estimator) for _ in range(class_num)]\n", "\n", "        # The initial model is assumed to predict 0\n", "        y_pred_score_ = np.zeros(shape=(x.shape[0], class_num))\n", "        # Compute the first and second derivatives\n", "        g = utils.softmax(y_pred_score_) - y_cate\n", "        h = utils.softmax(y_pred_score_) * (1 - utils.softmax(y_pred_score_))\n", "        # Train the subsequent models\n", "        for index in range(0, self.n_estimators):\n", "            y_pred_score = []\n", "            for class_index in range(0, class_num):\n", "                self.expand_base_estimators[class_index][index].fit(x, g[:, class_index], h[:, class_index])\n", "                y_pred_score.append(self.expand_base_estimators[class_index][index].predict(x))\n", "            y_pred_score_ += np.c_[y_pred_score].T * self.learning_rate\n", "            g = utils.softmax(y_pred_score_) - y_cate\n", "            h = utils.softmax(y_pred_score_) * (1 - utils.softmax(y_pred_score_))\n", "\n", "    def predict_proba(self, x):\n", "        # 
TODO: parallelize\n", "        y_pred_score = []\n", "        for class_index in range(0, len(self.expand_base_estimators)):\n", "            estimator_of_index = self.expand_base_estimators[class_index]\n", "            # Scale every tree by learning_rate, matching the accumulation in fit\n", "            y_pred_score.append(\n", "                np.sum(\n", "                    [self.learning_rate * estimator_of_index[i].predict(x) for i in\n", "                     range(0, self.n_estimators)]\n", "                    , axis=0)\n", "            )\n", "        return utils.softmax(np.c_[y_pred_score].T)\n", "\n", "    def predict(self, x):\n", "        return np.argmax(self.predict_proba(x), axis=1)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Generate synthetic data\n", "from sklearn.datasets import make_classification\n", "data, target = make_classification(n_samples=100, n_features=2, n_classes=2, n_informative=1, n_redundant=0,\n", "                                   n_repeated=0, n_clusters_per_class=1, class_sep=.5, random_state=21)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "classifier = XGBoostClassifier()\n", "classifier.fit(data, target)\n", "utils.plot_decision_function(data, target, classifier)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 2 }