{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. Loss functions\n",
    "\n",
    "This section introduces regression with xgboost. xgboost implements five regression objectives: squarederror, logistic, poisson, gamma, and tweedie. The first two are derived and implemented below; the remaining three are left for the next section.\n",
    "\n",
    "#### squarederror\n",
    "A regression model whose loss function is the squared error:\n",
    "\n",
    "$$\n",
    "L(y,\\hat{y})=\\frac{1}{2}(y-\\hat{y})^2\n",
    "$$\n",
    "\n",
    "Its first and second derivatives with respect to $\\hat{y}$ are therefore:\n",
    "\n",
    "$$\n",
    "\\begin{aligned}\n",
    "\\frac{\\partial L(y,\\hat{y})}{\\partial \\hat{y}}&=\\hat{y}-y\\\\\n",
    "\\frac{\\partial^2 L(y,\\hat{y})}{\\partial \\hat{y}^2}&=1\n",
    "\\end{aligned}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### logistic\n",
    "\n",
    "Because this is a regression task, the target $y$ is also passed through the `sigmoid` function (written $\\sigma(\\cdot)$). The loss function is the cross-entropy between $\\sigma(y)$ and $\\sigma(\\hat{y})$; the leading minus sign is what makes it a quantity to minimize and what yields the gradient below:\n",
    "\n",
    "$$\n",
    "L(y,\\hat{y})=-\\left[(1-\\sigma(y))\\log(1-\\sigma(\\hat{y}))+\\sigma(y)\\log(\\sigma(\\hat{y}))\\right]\n",
    "$$\n",
    "\n",
    "Using $\\sigma'(z)=\\sigma(z)(1-\\sigma(z))$, the first and second derivatives are:\n",
    "\n",
    "$$\n",
    "\\begin{aligned}\n",
    "\\frac{\\partial L(y,\\hat{y})}{\\partial \\hat{y}}&=\\sigma(\\hat{y})-\\sigma(y)\\\\\n",
    "\\frac{\\partial^2 L(y,\\hat{y})}{\\partial \\hat{y}^2}&=\\sigma(\\hat{y})(1-\\sigma(\\hat{y}))\n",
    "\\end{aligned}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Implementation\n",
    "The overall procedure is similar to GBDT regression, except that the first- and second-order derivatives are recomputed at every iteration and the base learner is replaced by the xgboost regression tree from the previous section."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "os.chdir('../')\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "from ml_models.ensemble import XGBoostBaseTree\n",
    "from ml_models import utils\n",
    "import copy\n",
    "import numpy as np\n",
    "\n",
    "\"\"\"\n",
    "Implementation of the xgboost regressor, packaged in ml_models.ensemble\n",
    "\"\"\"\n",
    "\n",
    "class XGBoostRegressor(object):\n",
    "    def __init__(self, base_estimator=None, n_estimators=10, learning_rate=1.0, loss='squarederror'):\n",
    "        \"\"\"\n",
    "        :param base_estimator: base learner, or a list of base learners\n",
    "        :param n_estimators: number of base learners (boosting iterations)\n",
    "        :param learning_rate: learning rate; shrinks the contribution of each base learner to reduce overfitting\n",
    "        :param loss: loss function; supports squarederror and logistic\n",
    "        \"\"\"\n",
    "        self.base_estimator = base_estimator\n",
    "        self.n_estimators = n_estimators\n",
    "        self.learning_rate = learning_rate\n",
    "        if self.base_estimator is None:\n",
    "            # default base learner: XGBoostBaseTree with its default settings\n",
    "            self.base_estimator = XGBoostBaseTree()\n",
    "        # homogeneous base learners: copy the single estimator n_estimators times\n",
    "        if not isinstance(base_estimator, list):\n",
    "            estimator = self.base_estimator\n",
    "            self.base_estimator = [copy.deepcopy(estimator) for _ in range(0, self.n_estimators)]\n",
    "        # heterogeneous base learners: one boosting round per list entry\n",
    "        else:\n",
    "            self.n_estimators = len(self.base_estimator)\n",
    "        self.loss = loss\n",
    "\n",
    "    def _get_gradient_hess(self, y, y_pred):\n",
    "        \"\"\"\n",
    "        Compute the first- and second-order derivatives of the loss\n",
    "        :param y: true values\n",
    "        :param y_pred: current predictions\n",
    "        :return: (gradient, hessian)\n",
    "        \"\"\"\n",
    "        if self.loss == 'squarederror':\n",
    "            return y_pred - y, np.ones_like(y)\n",
    "        elif self.loss == 'logistic':\n",
    "            return utils.sigmoid(y_pred) - utils.sigmoid(y), utils.sigmoid(y_pred) * (1 - utils.sigmoid(y_pred))\n",
    "        else:\n",
    "            raise ValueError('unsupported loss: %s' % self.loss)\n",
    "\n",
    "    def fit(self, x, y):\n",
    "        # start from a float array so the in-place updates also work for integer targets\n",
    "        y_pred = np.zeros_like(y, dtype=float)\n",
    "        g, h = self._get_gradient_hess(y, y_pred)\n",
    "        for index in range(0, self.n_estimators):\n",
    "            self.base_estimator[index].fit(x, g, h)\n",
    "            y_pred += self.base_estimator[index].predict(x) * self.learning_rate\n",
    "            g, h = self._get_gradient_hess(y, y_pred)\n",
    "\n",
    "    def predict(self, x):\n",
    "        # every tree is scaled by learning_rate, matching the accumulation in fit\n",
    "        rst_np = self.learning_rate * np.sum(\n",
    "            [estimator.predict(x) for estimator in self.base_estimator],\n",
    "            axis=0)\n",
    "        return rst_np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Test data\n",
    "data = np.linspace(1, 10, num=100)\n",
    "target = np.sin(data) + np.random.random(size=100)  # add noise\n",
    "data = data.reshape((-1, 1))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "\n",
      "text/plain": [
       ""
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "model = XGBoostRegressor(loss='squarederror')\n",
    "model.fit(data, target)\n",
    "plt.scatter(data, target)\n",
    "plt.plot(data, model.predict(data), color='r')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "\n",
      "text/plain": [
       ""
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "model = XGBoostRegressor(loss='logistic')\n",
    "model.fit(data, target)\n",
    "plt.scatter(data, target)\n",
    "plt.plot(data, model.predict(data), color='r')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}