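{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Introduction\n", "\n", "Dropout is very effective at preventing overfitting in deep learning, and Google has even patented it. The paper *DART: Dropouts meet Multiple Additive Regression Trees* applies dropout to GBDT; the resulting technique is called DART. In short, some of the already-built trees are temporarily dropped during training so that the contributions of the trees become more balanced (the earliest trees usually contribute the most), which helps prevent overfitting.\n", "\n", "### 2. Procedure\n", "Each training round has two steps: \n", "\n", "(1) Randomly drop $k$ of the $n$ trees built so far, compute the negative gradient from the prediction of the remaining $n-k$ trees, and train a new regression tree to fit that negative gradient; \n", "\n", "(2) Normalize. Because some trees were dropped, the new tree's prediction overshoots the fitting target once the dropped trees are added back. To compensate, multiply the dropped trees by a weight of $\\frac{k}{k+1}$ and the new tree by a weight of $\\frac{1}{k+1}$.\n", "\n", "### 3. Implementation\n", "The implementation is simple: it only takes small changes to GBDTRegressor and GBDTClassifier." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "os.chdir('../')\n", "from ml_models.tree import CARTRegressor\n", "import copy\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "\n", "\"\"\"\n", "DART regression model, packaged in ml_models.ensemble\n", "\"\"\"\n", "\n", "\n", "class DARTRegressor(object):\n", "    def __init__(self, base_estimator=None, n_estimators=10, loss='ls', huber_threshold=1e-1,\n", "                 quantile_threshold=0.5, dropout=0.5):\n", "        \"\"\"\n", "        :param base_estimator: base learner; heterogeneous learners are allowed and are passed as a list such as [estimator1,estimator2,...,estimator10], in which case n_estimators is ignored;\n", "            in the homogeneous case, the single estimator is copied n_estimators times\n", "        :param n_estimators: number of base learners (boosting iterations)\n", "        :param loss: loss function: ls is squared error, lae is absolute error, huber is the Huber loss, quantile is the quantile loss\n", "        :param huber_threshold: Huber loss threshold, only used when loss=huber\n", "        :param quantile_threshold: quantile loss threshold, only used when loss=quantile\n", "        :param dropout: probability that each model is dropped\n", "        \"\"\"\n", "        self.base_estimator = base_estimator\n", "        self.n_estimators = n_estimators\n", "        if self.base_estimator is None:\n", "            # default: a shallow CART regression tree\n", "            self.base_estimator = CARTRegressor(max_depth=2)\n", "        # homogeneous learners\n", "        if not isinstance(self.base_estimator, list):\n", "            estimator = self.base_estimator\n", "            self.base_estimator = [copy.deepcopy(estimator) for _ in range(0, self.n_estimators)]\n", "        # heterogeneous learners\n", "        else:\n", "            self.n_estimators = 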
len(self.base_estimator)\n", "        self.loss = loss\n", "        self.huber_threshold = huber_threshold\n", "        self.quantile_threshold = quantile_threshold\n", "        self.dropout = dropout\n", "        # weight of each fitted model\n", "        self.weights = []\n", "\n", "    def _get_gradient(self, y, y_pred):\n", "        if self.loss == 'ls':\n", "            return y - y_pred\n", "        elif self.loss == 'lae':\n", "            return (y - y_pred > 0).astype(int) * 2 - 1\n", "        elif self.loss == 'huber':\n", "            return np.where(np.abs(y - y_pred) > self.huber_threshold,\n", "                            self.huber_threshold * ((y - y_pred > 0).astype(int) * 2 - 1), y - y_pred)\n", "        elif self.loss == \"quantile\":\n", "            return np.where(y - y_pred > 0, self.quantile_threshold, self.quantile_threshold - 1)\n", "\n", "    def _dropout(self, y_pred):\n", "        # choose the indices of the models to drop\n", "        dropout_indices = []\n", "        no_dropout_indices = []\n", "        for index in range(0, len(y_pred)):\n", "            if np.random.random() <= self.dropout:\n", "                dropout_indices.append(index)\n", "            else:\n", "                no_dropout_indices.append(index)\n", "        # always drop at least one model\n", "        if len(dropout_indices) == 0:\n", "            np.random.shuffle(no_dropout_indices)\n", "            dropout_indices.append(no_dropout_indices.pop())\n", "        k = len(dropout_indices)\n", "        # rescale the weights of the dropped models by k / (k + 1)\n", "        for index in dropout_indices:\n", "            self.weights[index] *= (1.0 * k / (k + 1))\n", "        # return the prediction of the kept models and the number of dropped models\n", "        y_pred_result = np.zeros_like(y_pred[0])\n", "        for no_dropout_index in no_dropout_indices:\n", "            y_pred_result += y_pred[no_dropout_index] * self.weights[no_dropout_index]\n", "        return y_pred_result, k\n", "\n", "    def fit(self, x, y):\n", "        # fit the first model\n", "        self.base_estimator[0].fit(x, y)\n", "        self.weights.append(1.0)\n", "        y_pred = [self.base_estimator[0].predict(x)]\n", "        new_y_pred, k = self._dropout(y_pred)\n", "        new_y = self._get_gradient(y, new_y_pred)\n", "        for index in range(1, self.n_estimators):\n", "            self.base_estimator[index].fit(x, new_y)\n", "            self.weights.append(1.0 * (1 / (k + 1)))\n", "            y_pred.append(self.base_estimator[index].predict(x))\n", "            # skip dropout after the last model: no new model follows, so rescaling would only shrink the ensemble\n", "            if index < self.n_estimators - 1:\n", "                new_y_pred, k = 
self._dropout(y_pred)\n", "            new_y = self._get_gradient(y, new_y_pred)\n", "\n", "    def predict(self, x):\n", "        # weighted sum of the predictions of all models\n", "        return np.sum(\n", "            [self.base_estimator[i].predict(x) * self.weights[i] for i in range(self.n_estimators)],\n", "            axis=0)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "data = np.linspace(1, 10, num=100)\n", "target = np.sin(data) + np.random.random(size=100)  # add noise\n", "data = data.reshape((-1, 1))" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "model = DARTRegressor(base_estimator=CARTRegressor())\n", "model.fit(data, target)\n", "plt.scatter(data, target)\n", "plt.plot(data, model.predict(data), color='r')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from ml_models import utils\n", "\"\"\"\n", "DART classification model, packaged in ml_models.ensemble\n", "\"\"\"\n", "\n", "class DARTClassifier(object):\n", "    def __init__(self, base_estimator=None, n_estimators=10, dropout=0.5):\n", "        \"\"\"\n", "        :param base_estimator: base learner; heterogeneous learners are allowed and are passed as a list such as [estimator1,estimator2,...,estimator10], in which case n_estimators is ignored;\n", "            in the homogeneous case, the single estimator is copied n_estimators times\n", "        :param n_estimators: number of base learners (boosting iterations)\n", "        :param dropout: dropout probability\n", "        \"\"\"\n", "        self.base_estimator = base_estimator\n", "        self.n_estimators = n_estimators\n", "        self.dropout = dropout\n", "        if self.base_estimator is None:\n", "            # default: a shallow CART regression tree\n", "            self.base_estimator = CARTRegressor(max_depth=2)\n", "        # homogeneous learners\n", "        if not isinstance(self.base_estimator, list):\n", "            estimator = self.base_estimator\n", "            self.base_estimator = [copy.deepcopy(estimator) for _ in range(0, self.n_estimators)]\n", "        # heterogeneous learners\n", "        else:\n", "            
self.n_estimators = len(self.base_estimator)\n", "\n", "        # one group of learners per class, filled in by fit\n", "        self.expand_base_estimators = []\n", "\n", "        # per-class model weights\n", "        self.weights = None\n", "\n", "    def _dropout(self, y_pred_score_):\n", "        y_pred_score_results = []\n", "        ks = []\n", "        for class_index in range(0, self.class_num):\n", "            # choose the indices of the models to drop for this class\n", "            dropout_indices = []\n", "            no_dropout_indices = []\n", "            for index in range(0, len(y_pred_score_[class_index])):\n", "                if np.random.random() <= self.dropout:\n", "                    dropout_indices.append(index)\n", "                else:\n", "                    no_dropout_indices.append(index)\n", "            # always drop at least one model\n", "            if len(dropout_indices) == 0:\n", "                np.random.shuffle(no_dropout_indices)\n", "                dropout_indices.append(no_dropout_indices.pop())\n", "            k = len(dropout_indices)\n", "            # rescale the weights of the dropped models by k / (k + 1)\n", "            for index in dropout_indices:\n", "                self.weights[class_index][index] *= (1.0 * k / (k + 1))\n", "            # return the prediction of the kept models and the number of dropped models\n", "            y_pred_result = np.zeros_like(y_pred_score_[class_index][0])\n", "            for no_dropout_index in no_dropout_indices:\n", "                y_pred_result += y_pred_score_[class_index][no_dropout_index] * self.weights[class_index][\n", "                    no_dropout_index]\n", "            y_pred_score_results.append(y_pred_result)\n", "            ks.append(k)\n", "        return y_pred_score_results, ks\n", "\n", "    def fit(self, x, y):\n", "        # one-hot encode y\n", "        class_num = np.amax(y) + 1\n", "        self.class_num = class_num\n", "        y_cate = np.zeros(shape=(len(y), class_num))\n", "        y_cate[np.arange(len(y)), y] = 1\n", "\n", "        self.weights = [[] for _ in range(0, class_num)]\n", "\n", "        # replicate the learners once per class\n", "        self.expand_base_estimators = [copy.deepcopy(self.base_estimator) for _ in range(class_num)]\n", "\n", "        # fit the first model\n", "        y_pred_score_ = [[] for _ in range(0, self.class_num)]\n", "        # TODO: parallelize\n", "        for class_index in range(0, class_num):\n", "            self.expand_base_estimators[class_index][0].fit(x, y_cate[:, class_index])\n", "            y_pred_score_[class_index].append(self.expand_base_estimators[class_index][0].predict(x))\n", "            self.weights[class_index].append(1.0)\n", "        y_pred_result, ks = 
self._dropout(y_pred_score_)\n", "        y_pred_result = np.c_[y_pred_result].T\n", "        # compute the negative gradient\n", "        new_y = y_cate - utils.softmax(y_pred_result)\n", "        # fit the remaining models\n", "        for index in range(1, self.n_estimators):\n", "            for class_index in range(0, class_num):\n", "                self.expand_base_estimators[class_index][index].fit(x, new_y[:, class_index])\n", "                y_pred_score_[class_index].append(self.expand_base_estimators[class_index][index].predict(x))\n", "                self.weights[class_index].append(1.0 / (ks[class_index] + 1))\n", "            # skip dropout after the last model: no new model follows, so rescaling would only shrink the ensemble\n", "            if index < self.n_estimators - 1:\n", "                y_pred_result, ks = self._dropout(y_pred_score_)\n", "                y_pred_result = np.c_[y_pred_result].T\n", "                new_y = y_cate - utils.softmax(y_pred_result)\n", "\n", "    def predict_proba(self, x):\n", "        # TODO: parallelize\n", "        y_pred_score = []\n", "        for class_index in range(0, len(self.expand_base_estimators)):\n", "            estimator_of_index = self.expand_base_estimators[class_index]\n", "            # weighted sum of the predictions of all models for this class\n", "            y_pred_score.append(\n", "                np.sum(\n", "                    [self.weights[class_index][i] * estimator_of_index[i].predict(x) for i in\n", "                     range(self.n_estimators)],\n", "                    axis=0)\n", "            )\n", "        return utils.softmax(np.c_[y_pred_score].T)\n", "\n", "    def predict(self, x):\n", "        return np.argmax(self.predict_proba(x), axis=1)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# generate synthetic data\n", "from sklearn.datasets import make_classification\n", "data, target = make_classification(n_samples=100, n_features=2, n_classes=2, n_informative=1, n_redundant=0,\n", "                                   n_repeated=0, n_clusters_per_class=1, class_sep=.5, random_state=21)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from ml_models.linear_model import LinearRegression\n", "classifier = 
DARTClassifier(base_estimator=[LinearRegression(), LinearRegression(), LinearRegression(), CARTRegressor(max_depth=2)])\n", "classifier.fit(data, target)\n", "utils.plot_decision_function(data, target, classifier)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4. Discussion\n", "\n", "(1) DART can be viewed as a method that sits between random forests and GBDT: when no trees are dropped it reduces to GBDT, and when every tree is dropped (`dropout=1`) it behaves like a random forest. Note that the implementation above always drops at least one tree, so `dropout=0` does not reproduce plain GBDT exactly; \n", "\n", "(2) Also note that when xgboost uses dart, the randomness it introduces can make early stopping unstable." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }