{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Algorithm\n", "AdaBoost for regression closely follows the classification version. The main differences are how the error rate is computed, how each base model's weight is computed, and how the sample weights are updated. The algorithm is given directly below.\n", "\n", "Input: training set $T=\\{(x_1,y_1),(x_2,y_2),...,(x_N,y_N)\\}$, where $x_i\\in R^n,y_i\\in R,i=1,2,...,N$  \n", "\n", "Output: final regression model $G(x)$  \n", "\n", ">(1) Initialize the weight distribution over the training data:  \n", "$$\n", "D_1=(w_{11},...,w_{1i},...,w_{1N}),w_{1i}=\\frac{1}{N},i=1,2,...,N\n", "$$  \n", "\n", ">(2) For $m=1,2,...,M:$  \n", "\n", ">>(2.1) Fit a base regression model $G_m(x)$ on the training set weighted by the distribution $D_m$  \n", "\n", ">>(2.2) Compute the error rate of $G_m(x)$ on the training set:  \n", "\n", ">>>(2.2.1) Compute the maximum error on the training set: $E_m=\\max\\mid y_i-G_m(x_i)\\mid,i=1,2,...,N$  \n", "\n", ">>>(2.2.2) Compute the relative error of each sample; three definitions are available:  \n", ">>>> a) Linear error: $e_{mi}=\\frac{\\mid y_i-G_m(x_i)\\mid}{E_m},i=1,2,...,N$  \n", "\n", ">>>> b) Square error: $e_{mi}=\\frac{(y_i-G_m(x_i))^2}{E_m^2},i=1,2,...,N$  \n", "\n", ">>>> c) Exponential error: $e_{mi}=1-\\exp(\\frac{-\\mid y_i-G_m(x_i)\\mid}{E_m}),i=1,2,...,N$\n", "\n", ">>>(2.2.3) Compute the error rate: $e_m=\\sum_{i=1}^N w_{mi}e_{mi}$\n", "\n", ">>(2.3) Compute the weight coefficient of $G_m(x)$: $\\alpha_m=\\frac {e_m}{1-e_m}$  \n", "\n", ">>(2.4) Update the training sample weights:  \n", "$$\n", "w_{m+1,i}=\\frac{w_{mi}}{Z_m}\\alpha_m^{1-e_{mi}},i=1,2,...,N\n", "$$  \n", "where $Z_m=\\sum_{i=1}^N w_{mi}\\alpha_m^{1-e_{mi}}$ is the normalization factor  \n", "\n", ">(3) Final strong learner:  \n", "\n", "$$\n", "G(x)=\\sum_{m=1}^M \\frac {\\ln\\frac{1}{\\alpha_m}}{L}G_m(x),L=\\sum_{m=1}^M \\ln\\frac{1}{\\alpha_m}\n", "$$\n", "\n", "### 2. Implementation" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "os.chdir('../')\n", "from ml_models.tree import CARTRegressor\n", "import copy\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "\n", "class AdaBoostRegressor(object):\n", "    def __init__(self, base_estimator=None, n_estimators=10, learning_rate=1.0):\n", "        \"\"\"\n", "        :param base_estimator: base learner; heterogeneous learners are allowed. For the heterogeneous case pass a list such as [estimator1,estimator2,...,estimator10], in which case n_estimators is ignored;\n", "            in the homogeneous case a single estimator is copied n_estimators times\n", "        :param n_estimators: number of base learners (boosting iterations)\n", "        :param learning_rate: 
learning rate; shrinks the weights of later base learners to reduce overfitting\n", "        \"\"\"\n", "        self.base_estimator = base_estimator\n", "        self.n_estimators = n_estimators\n", "        self.learning_rate = learning_rate\n", "        if self.base_estimator is None:\n", "            # Default to a decision stump\n", "            self.base_estimator = CARTRegressor(max_depth=2)\n", "        # Homogeneous base learners: copy the single estimator n_estimators times\n", "        if type(base_estimator) != list:\n", "            estimator = self.base_estimator\n", "            self.base_estimator = [copy.deepcopy(estimator) for _ in range(0, self.n_estimators)]\n", "        # Heterogeneous base learners: one per list entry\n", "        else:\n", "            self.n_estimators = len(self.base_estimator)\n", "\n", "        # Weights of the fitted estimators\n", "        self.estimator_weights = []\n", "\n", "        # Index of the weak learner at the final median value\n", "        self.median_index = None\n", "\n", "    def fit(self, x, y):\n", "        n_sample = x.shape[0]\n", "        # Reset so that fit can be called more than once\n", "        self.estimator_weights = []\n", "        sample_weights = np.asarray([1.0] * n_sample)\n", "        for index in range(0, self.n_estimators):\n", "            self.base_estimator[index].fit(x, y, sample_weight=sample_weights)\n", "\n", "            errors = np.abs(self.base_estimator[index].predict(x) - y)\n", "            error_max = np.max(errors)\n", "\n", "            # Linear relative error; other error types can be added in the same way\n", "            linear_errors = errors / error_max\n", "            # Weighted error rate (sample_weights sum to n_sample, hence the division)\n", "            error_rate = np.dot(linear_errors, sample_weights / n_sample)\n", "\n", "            # Weight coefficient alpha; 1e-10 guards against division by zero\n", "            alpha_rate = error_rate / (1.0 - error_rate + 1e-10)\n", "            self.estimator_weights.append(alpha_rate)\n", "\n", "            # Update the sample weights, then renormalize them to sum to n_sample\n", "            sample_weights = sample_weights * np.power(alpha_rate, 1 - linear_errors)\n", "            sample_weights = sample_weights / np.sum(sample_weights) * n_sample\n", "\n", "        # Turn the alphas into estimator weights ln(1/alpha), apply the learning-rate decay, and normalize\n", "        self.estimator_weights = np.log(1 / np.asarray(self.estimator_weights))\n", "        for i in range(0, self.n_estimators):\n", "            self.estimator_weights[i] *= np.power(self.learning_rate, i)\n", "        self.estimator_weights /= np.sum(self.estimator_weights)\n", "\n", "    def predict(self, x):\n", "        return np.sum(\n", "            [self.estimator_weights[i] * self.base_estimator[i].predict(x) for i in\n", "             range(0, self.n_estimators)],\n", "            axis=0)" ] }, { "cell_type": "code", 
"execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Build the data\n", "data = np.linspace(1, 10, num=100)\n", "target = np.sin(data) + np.random.random(size=100)  # add noise\n", "data = data.reshape((-1, 1))" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Train the model\n", "model=AdaBoostRegressor(base_estimator=CARTRegressor(max_bins=20),n_estimators=10)\n", "model.fit(data,target)\n", "plt.scatter(data, target)\n", "plt.plot(data, model.predict(data), color='r')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }