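{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import os\n", "os.chdir('../')\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. The Maximum Entropy Principle\n", "The idea behind maximum entropy is simple: everything beyond the known facts is treated as “equally likely”, and entropy is a natural quantitative measure of how close a distribution is to “equally likely”. Entropy is defined as: \n", "\n", "$$\n", "H(p)=-\\sum_{i}p_i \\log p_i\n", "$$ \n", "\n", "Here the distribution $p$ ranges over outcomes indexed by $i$, and outcome $i$ has probability $p_i$. The plot below shows the entropy of a binary random variable:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "p=np.linspace(0.1,0.9,90)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# entropy (natural log) of a binary variable whose first outcome has probability p\n", "def entropy(p):\n", "    return -np.log(p)*p-np.log(1-p)*(1-p)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.plot(p,entropy(p))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Entropy reaches its maximum when both outcomes have probability 0.5, so maximizing entropy pushes a distribution toward “equally likely”. Entropy also has a convenient property: it is a concave function, so maximizing it is a convex optimization problem. \n", "\n", "“Known facts” can be expressed as constraints. For example, for a random variable with 4 possible values where $p_1+p_2=0.4$ is known, the problem can be stated as: \n", "\n", "$$\n", "\\max_{p} -\\sum_{i=1}^4 p_i\\log p_i \\\\\n", "s.t. 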
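p_1+p_2=0.4\\\\\n", "p_i\\geq 0,i=1,2,3,4\\\\\n", "\\sum_i p_i=1\n", "$$ \n", "Clearly, the optimal solution spreads the probability mass evenly within each constrained group: $p_1=0.2,p_2=0.2,p_3=0.3,p_4=0.3$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. The Maximum Entropy Model\n", "The maximum entropy model applies the maximum entropy principle to classification. It assumes the classifier is a conditional distribution $P(Y|X)$: given an input $X$, it outputs $Y$ with probability $P(Y|X)$. The objective of the maximum entropy model is then the conditional entropy: \n", "\n", "$$\n", "H(P)=-\\sum_{x,y}\\tilde{P}(x)P(y|x)\\log P(y|x)\n", "$$ \n", "\n", "Here $\\tilde{P}(x)$ is the empirical distribution of the marginal $P(X)$, $\\tilde{P}(x)=\\frac{v(X=x)}{N}$, where $v(X=x)$ is the number of times input $x$ appears in the training set and $N$ is the total number of training samples. \n", "\n", "The “known facts” of the maximum entropy model are imposed through the following equality: \n", "\n", "$$\n", "\\sum_{x,y}\\tilde{P}(x)P(y|x)f(x,y)=\\sum_{x,y}\\tilde{P}(x,y)f(x,y)\n", "$$\n", "\n", "For convenience, denote the left-hand side by $E_P(f)$ and the right-hand side by $E_{\\tilde{P}}(f)$. The equality says that the expectation of a function $f(x,y)$ under the model $P(Y|X)$ together with the empirical distribution $\\tilde{P}(X)$ equals its expectation under the empirical joint distribution $\\tilde{P}(X,Y)$ (here $\\tilde{P}(x,y)=\\frac{v(X=x,Y=y)}{N}$). \n", "The constraint information is therefore carried by $f(x,y)$, which is defined as: \n", "$$\n", "f(x,y)=\\left\\{\\begin{matrix}\n", "1 & x \\text{ and } y \\text{ satisfy a given fact}\\\\ \n", "0 & \\text{otherwise}\n", "\\end{matrix}\\right.\n", "$$ \n", "\n", "The maximum entropy model can therefore be read as: maximize the conditional entropy, subject to the model's expectations of certain facts matching those observed in the training set. With $n$ constraints, the model is: \n", "\n", "$$\n", "\\max_P -\\sum_{x,y}\\tilde{P}(x)P(y|x)\\log P(y|x) \\\\\n", "s.t. E_P(f_i)=E_{\\tilde{P}}(f_i),i=1,2,...,n\\\\\n", "\\sum_y P(y|x)=1\n", "$$ \n", "\n", "Following the usual convention for optimization problems, this can be rewritten as a minimization: \n", "\n", "$$\n", "\\min_P \\sum_{x,y}\\tilde{P}(x)P(y|x)\\log P(y|x) \\\\\n", "s.t. 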
E_P(f_i)-E_{\\tilde{P}}(f_i)=0,i=1,2,...,n\\\\\n", "\\sum_y P(y|x)-1=0\n", "$$ \n", "\n", "Since the objective is convex and the constraints are affine, the optimum of the primal problem can be obtained by solving the dual problem. First introduce Lagrange multipliers $w_0,w_1,...,w_n$ and define the Lagrangian $L(P,w)$: \n", "\n", "$$\n", "L(P,w)=-H(P)+w_0(1-\\sum_yP(y|x))+\\sum_{i=1}^nw_i(E_{\\tilde{P}}(f_i)-E_P(f_i))\n", "$$ \n", "\n", "The primal problem is then equivalent to: \n", "$$\n", "\\min_P\\max_w L(P,w)\n", "$$ \n", "and its dual problem is: \n", "$$\n", "\\max_w\\min_P L(P,w)\n", "$$ \n", "\n", "First solve the inner problem $\\min_P L(P,w)$. For any $w$, $L(P,w)$ is convex in $P$, because $-H(P)$ is convex in $P$ and the remaining term $w_0(1-\\sum_yP(y|x))+\\sum_{i=1}^nw_i(E_{\\tilde{P}}(f_i)-E_P(f_i))$ is affine in $P(y|x)$. Setting the partial derivative of $L(P,w)$ with respect to $P$ to zero therefore gives the optimal $P(y|x)$, denoted $P_w(y|x)$: \n", "$$\n", "\\frac{\\partial L(P,w)}{\\partial P(y|x)}=\\sum_{x,y}\\tilde{P}(x)(\\log P(y|x)+1)-\\sum_yw_0-\\sum_{i=1}^n\\sum_{x,y}\\tilde{P}(x)f_i(x,y)w_i\\\\\n", "=\\sum_{x,y}\\tilde{P}(x)(\\log P(y|x)+1-w_0-\\sum_{i=1}^nw_if_i(x,y))\\\\\n", "=0\n", "$$ \n", "\n", "For every sample $(x,y)$ in the training set we have $\\tilde{P}(x)(\\log P(y|x)+1-w_0-\\sum_{i=1}^nw_if_i(x,y))=0$. Since $\\tilde{P}(x)>0$ ($x$ is a training sample, so its empirical probability is positive), it follows that $\\log P(y|x)+1-w_0-\\sum_{i=1}^nw_if_i(x,y)=0$, and hence: \n", "$$\n", "P_w(y|x)=\\exp(\\sum_{i=1}^nw_if_i(x,y)+w_0-1)\\\\\n", "=\\frac{\\exp(\\sum_{i=1}^nw_if_i(x,y))}{\\exp(1-w_0)}\\\\\n", " =\\frac{\\exp(\\sum_{i=1}^nw_if_i(x,y))}{\\sum_{y^{'}} \\exp(\\sum_{i=1}^nw_if_i(x,y^{'}))}\n", "$$ \n", "\n", "This is the expression of the maximum entropy model (the last step uses $\\sum_y P(y|x)=1$), and $w$ holds the model parameters. As you may have noticed, the maximum entropy model is simply a linear function wrapped in a **softmax**, roughly as shown in the figure below: \n", "![avatar](./source/05_最大熵模型.svg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, substitute $L(P_w,w)$ into the outer $\\max$ to solve for the optimal parameters $w^*$: \n", "\n", "$$\n", "w^*=\\arg\\max_w L(P_w,w)\n", "$$ \n", "\n", "Let us derive the gradient update rule for the model: \n", "$$\n", "L(P_w,w)=\\sum_{x,y}\\tilde{P}(x)P_w(y|x)\\log P_w(y|x)+\\sum_{i=1}^nw_i\\sum_{x,y}(\\tilde{P}(x,y)f_i(x,y)-\\tilde{P}(x)P_w(y|x)f_i(x,y))\\\\\n", "=\\sum_{x,y}\\tilde{P}(x,y)\\sum_{i=1}^nw_if_i(x,y)+\\sum_{x,y}\\tilde{P}(x)P_w(y|x)(\\log P_w(y|x)-\\sum_{i=1}^nw_if_i(x,y))\\\\\n", 
"=\\sum_{x,y}\\tilde{P}(x,y)\\sum_{i=1}^nw_if_i(x,y)-\\sum_{x,y}\\tilde{P}(x)P_w(y|x)\\log(\\sum_{y^{'}}\\exp(\\sum_{i=1}^nw_if_i(x,y^{'})))\\\\\n", "=\\sum_{x,y}\\tilde{P}(x,y)\\sum_{i=1}^nw_if_i(x,y)-\\sum_{x}\\tilde{P}(x)\\log(\\sum_{y^{'}}\\exp(\\sum_{i=1}^nw_if_i(x,y^{'})))\\\\\n", "=\\sum_{x,y}\\tilde{P}(x,y)w^Tf(x,y)-\\sum_{x}\\tilde{P}(x)\\log(\\sum_{y^{'}}\\exp(w^Tf(x,y^{'})))\n", "$$ \n", "Here, going from the third-to-last line to the second-to-last line uses $\\sum_yP(y|x)=1$, and the last line uses the vector notation $w=[w_1,w_2,...,w_n]^T,f(x,y)=[f_1(x,y),f_2(x,y),...,f_n(x,y)]^T$. Therefore: \n", "$$\n", "\\frac{\\partial L(P_w,w)}{\\partial w}=\\sum_{x,y}\\tilde{P}(x,y)f(x,y)-\\sum_x\\tilde{P}(x)\\sum_y\\frac{\\exp(w^Tf(x,y))f(x,y)}{\\sum_{y^{'}}\\exp(w^Tf(x,y^{'}))} \n", "$$ \n", "\n", "The update rule for $w$ follows naturally (gradient ascent, since the dual problem maximizes over $w$): \n", "$$\n", "w=w+\\eta\\frac{\\partial L(P_w,w)}{\\partial w}\n", "$$ \n", "where $\\eta$ is the learning rate." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. A Closer Look at Feature Functions\n", "The gradient update rule is now derived, but $f(x,y)$ may still feel vague: what does **“satisfy a given fact”** mean? It depends on the learning goal, which determines what counts as a **“fact”**. Consider the task of deciding whether the Chinese word “打” is a measure word (“dozen”) or a verb. Suppose we have collected the following corpus: \n", "\n", "| Sentence / $x$ | Target / $y$ |\n", "|-|-|\n", "| $x_1:$ 一打火柴 (a dozen matches) | $y_1:$ measure word |\n", "| $x_2:$ 三打啤酒 (three dozen beers) | $y_2:$ measure word |\n", "| $x_3:$ 打电话 (make a phone call) | $y_3:$ verb |\n", "| $x_4:$ 打篮球 (play basketball) | $y_4:$ verb | \n", "\n", "By inspection, we can design the following two feature functions to recognize the measure-word case and the verb case respectively: \n", "$$\n", "f_1(x,y)=\\left\\{\\begin{matrix}\n", "1 & \\text{“打” is preceded by a number}, y=\\text{measure word}\\\\ \n", "0 & \\text{otherwise}\n", "\\end{matrix}\\right.\n", "$$ \n", "\n", "$$\n", "f_2(x,y)=\\left\\{\\begin{matrix}\n", "1 & \\text{“打” is followed by a noun and not preceded by a number}, y=\\text{verb}\\\\ \n", "0 & \\text{otherwise}\n", "\\end{matrix}\\right.\n", "$$ \n", "\n", "Of course, you could also design feature functions like these for recognizing the measure word: \n", "$$\n", "f(x,y)=\\left\\{\\begin{matrix}\n", "1 & \\text{“打” is preceded by “一” and followed by “火柴”}\\\\ \n", "0 & \\text{otherwise}\n", "\\end{matrix}\\right.\n", "$$ \n", "\n", "$$\n", "f(x,y)=\\left\\{\\begin{matrix}\n", "1 & \\text{“打” is preceded by “三” and followed by “啤酒”}\\\\ \n", "0 & \\text{otherwise}\n", "\\end{matrix}\\right.\n", "$$ \n", 
"However, feature functions like these weaken the model's ability to generalize. Given “三打火柴” (three dozen matches), the latter design cannot tell that “打” is a measure word, whereas the first design handles it well. Better generalization therefore requires better feature functions, and designing them usually relies on human experience. For natural language processing tasks such as the example above, useful rules of thumb are fairly easy to summarize, but in other settings it is often hard to work out general rules by hand, so for those problems we need more **“general”** feature functions. \n", "#### A Simple Feature Function Design\n", "A simple choice is to use combinations of “a feature of $x$ takes a certain value” and “$y$ is a certain class” as feature functions (continuous features can be binned first). This gives two kinds of feature functions: \n", "\n", "(1) Discrete: \n", "$$\n", "f(x,y)=\\left\\{\\begin{matrix}\n", "1 & x_i=\\text{some value},y=\\text{some class}\\\\ \n", "0 & \\text{otherwise}\n", "\\end{matrix}\\right.\n", "$$ \n", "\n", "(2) Continuous: \n", "$$\n", "f(x,y)=\\left\\{\\begin{matrix}\n", "1 & \\text{value}_1\\leq x_i< \\text{value}_2,y=\\text{some class}\\\\ \n", "0 & \\text{otherwise}\n", "\\end{matrix}\\right.\n", "$$ " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4. Implementation\n", "For demonstration, first build the training and test data." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# load the iris data and split it into training and test sets\n", "from sklearn import datasets\n", "from sklearn import model_selection\n", "from sklearn.metrics import f1_score\n", "\n", "iris = datasets.load_iris()\n", "data = iris['data']\n", "target = iris['target']\n", "X_train, X_test, y_train, y_test = model_selection.train_test_split(data, target, test_size=0.2,random_state=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make binning convenient, we wrap it in a DataBinWrapper class and use it to transform X_train and X_test (the class lives in ml_models.wrapper_models)." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "class DataBinWrapper(object):\n", "    def __init__(self, max_bins=10):\n", "        # number of bins\n", "        self.max_bins = max_bins\n", "        # bin edges for each feature of x\n", "        self.XrangeMap = None\n", "\n", "    def fit(self, x):\n", "        n_sample, n_feature = x.shape\n", "        # compute percentile-based bin edges, feature by feature\n", "        self.XrangeMap = [[] for _ in range(0, n_feature)]\n", "        for index in range(0, n_feature):\n", "            tmp = x[:, index]\n", "            for percent in range(1, self.max_bins):\n", "                percent_value = np.percentile(tmp, (1.0 * percent / self.max_bins) * 100.0 // 1)\n", "                self.XrangeMap[index].append(percent_value)\n", "\n", "    def transform(self, x):\n", "        \"\"\"\n", "        
Map each feature value of x to its bin index\n", "        :param x:\n", "        :return:\n", "        \"\"\"\n", "        if x.ndim == 1:\n", "            return np.asarray([np.digitize(x[i], self.XrangeMap[i]) for i in range(0, x.size)])\n", "        else:\n", "            return np.asarray([np.digitize(x[:, i], self.XrangeMap[i]) for i in range(0, x.shape[1])]).T" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "data_bin_wrapper=DataBinWrapper(max_bins=10)\n", "data_bin_wrapper.fit(X_train)\n", "X_train=data_bin_wrapper.transform(X_train)\n", "X_test=data_bin_wrapper.transform(X_test)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[7, 6, 8, 7],\n", " [3, 5, 5, 6],\n", " [2, 8, 2, 2],\n", " [6, 5, 6, 7],\n", " [7, 2, 8, 8]], dtype=int64)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train[:5,:]" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[5, 2, 7, 9],\n", " [5, 0, 4, 3],\n", " [3, 9, 1, 2],\n", " [9, 3, 9, 7],\n", " [1, 8, 2, 2]], dtype=int64)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_test[:5,:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because feature functions can take many forms, we decouple them from the model and build a SimpleFeatureFunction class (more complex feature function classes defined later must expose the same method names; this class lives in ml_models.linear_model)." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "class SimpleFeatureFunction(object):\n", "    def __init__(self):\n", "        \"\"\"\n", "        Set of feature functions, each stored as\n", "        {\n", "            (x_index,x_value,y_index)\n", "        }\n", "        \"\"\"\n", "        self.feature_funcs = set()\n", "\n", "    # build the feature functions from the training data\n", "    def build_feature_funcs(self, X, y):\n", "        n_sample, _ = X.shape\n", "        for index in range(0, n_sample):\n", "            x = X[index, :].tolist()\n", "            for feature_index in range(0, len(x)):\n", "                self.feature_funcs.add(tuple([feature_index, x[feature_index], y[index]]))\n", "\n", "    # total number of feature functions\n", "    def 
get_feature_funcs_num(self):\n", "        return len(self.feature_funcs)\n", "\n", "    # indices of the feature functions that fire for (x, y)\n", "    def match_feature_funcs_indices(self, x, y):\n", "        match_indices = []\n", "        index = 0\n", "        for feature_index, feature_value, feature_y in self.feature_funcs:\n", "            if feature_y == y and x[feature_index] == feature_value:\n", "                match_indices.append(index)\n", "            index += 1\n", "        return match_indices" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we implement the MaxEnt class, starting with a softmax function (ml_models.utils)." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "def softmax(x):\n", "    if x.ndim == 1:\n", "        return np.exp(x) / np.exp(x).sum()\n", "    else:\n", "        return np.exp(x) / np.exp(x).sum(axis=1, keepdims=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The MaxEnt class itself (ml_models.linear_model):" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "from ml_models import utils\n", "class MaxEnt(object):\n", "    def __init__(self, feature_func, epochs=5, eta=0.01):\n", "        self.feature_func = feature_func\n", "        self.epochs = epochs\n", "        self.eta = eta\n", "\n", "        self.class_num = None\n", "        \"\"\"\n", "        Empirical joint distribution:\n", "        {\n", "            (x_0,x_1,...,x_p,y_index):p\n", "        }\n", "        \"\"\"\n", "        self.Pxy = {}\n", "        \"\"\"\n", "        Empirical marginal distribution:\n", "        {\n", "            (x_0,x_1,...,x_p):p\n", "        }\n", "        \"\"\"\n", "        self.Px = {}\n", "\n", "        \"\"\"\n", "        w[i]-->feature_func[i]\n", "        \"\"\"\n", "        self.w = None\n", "\n", "    def init_params(self, X, y):\n", "        \"\"\"\n", "        Initialize the empirical distributions and the weights\n", "        :return:\n", "        \"\"\"\n", "        n_sample, n_feature = X.shape\n", "        self.class_num = np.max(y) + 1\n", "\n", "        # count occurrences for the joint and marginal distributions\n", "        for index in range(0, n_sample):\n", "            range_indices = X[index, :].tolist()\n", "\n", "            if self.Px.get(tuple(range_indices)) is None:\n", "                self.Px[tuple(range_indices)] = 1\n", "            else:\n", "                self.Px[tuple(range_indices)] += 1\n", "\n", "            if self.Pxy.get(tuple(range_indices + [y[index]])) is 
None:\n", "                self.Pxy[tuple(range_indices + [y[index]])] = 1\n", "            else:\n", "                self.Pxy[tuple(range_indices + [y[index]])] += 1\n", "\n", "        # turn the counts into empirical probabilities\n", "        for key, value in self.Pxy.items():\n", "            self.Pxy[key] = 1.0 * self.Pxy[key] / n_sample\n", "        for key, value in self.Px.items():\n", "            self.Px[key] = 1.0 * self.Px[key] / n_sample\n", "\n", "        # initialize the weights, one per feature function\n", "        self.w = np.zeros(self.feature_func.get_feature_funcs_num())\n", "\n", "    def _sum_exp_w_on_all_y(self, x):\n", "        \"\"\"\n", "        sum over y of exp(self._sum_exp_w_on_y(x, y)), i.e. the softmax normalizer\n", "        :param x:\n", "        :return:\n", "        \"\"\"\n", "        sum_w = 0\n", "        for y in range(0, self.class_num):\n", "            tmp_w = self._sum_exp_w_on_y(x, y)\n", "            sum_w += np.exp(tmp_w)\n", "        return sum_w\n", "\n", "    def _sum_exp_w_on_y(self, x, y):\n", "        # despite the name, this returns the raw score w^T f(x, y), not its exponential\n", "        tmp_w = 0\n", "        match_feature_func_indices = self.feature_func.match_feature_funcs_indices(x, y)\n", "        for match_feature_func_index in match_feature_func_indices:\n", "            tmp_w += self.w[match_feature_func_index]\n", "        return tmp_w\n", "\n", "    def fit(self, X, y):\n", "        self.eta = max(1.0 / np.sqrt(X.shape[0]), self.eta)\n", "        self.init_params(X, y)\n", "        x_y = np.c_[X, y]\n", "        for epoch in range(self.epochs):\n", "            count = 0\n", "            np.random.shuffle(x_y)\n", "            for index in range(x_y.shape[0]):\n", "                count += 1\n", "                x_point = x_y[index, :-1]\n", "                y_point = x_y[index, -1:][0]\n", "                # empirical joint probability of (x, y)\n", "                p_xy = self.Pxy.get(tuple(x_point.tolist() + [y_point]))\n", "                # empirical marginal probability of x\n", "                p_x = self.Px.get(tuple(x_point))\n", "                # gradient of w for this sample\n", "                dw = np.zeros(shape=self.w.shape)\n", "                match_feature_func_indices = self.feature_func.match_feature_funcs_indices(x_point, y_point)\n", "                if len(match_feature_func_indices) == 0:\n", "                    continue\n", "                if p_xy is not None:\n", "                    for match_feature_func_index in match_feature_func_indices:\n", "                        dw[match_feature_func_index] = p_xy\n", "                if p_x is not None:\n", "                    sum_w = self._sum_exp_w_on_all_y(x_point)\n", "                    for match_feature_func_index in match_feature_func_indices:\n", "                        dw[match_feature_func_index] 
-= p_x * np.exp(self._sum_exp_w_on_y(x_point, y_point)) / (\n", "                            1e-7 + sum_w)\n", "                # gradient ascent step\n", "                self.w += self.eta * dw\n", "                # print training progress\n", "                if count % (X.shape[0] // 4) == 0:\n", "                    print(\"processing:\\tepoch:\" + str(epoch + 1) + \"/\" + str(self.epochs) + \",percent:\" + str(\n", "                        count) + \"/\" + str(X.shape[0]))\n", "\n", "    def predict_proba(self, x):\n", "        \"\"\"\n", "        Predict the probability distribution over y\n", "        :param x:\n", "        :return:\n", "        \"\"\"\n", "        y = []\n", "        for x_point in x:\n", "            y_tmp = []\n", "            for y_index in range(0, self.class_num):\n", "                match_feature_func_indices = self.feature_func.match_feature_funcs_indices(x_point, y_index)\n", "                tmp = 0\n", "                for match_feature_func_index in match_feature_func_indices:\n", "                    tmp += self.w[match_feature_func_index]\n", "                y_tmp.append(tmp)\n", "            y.append(y_tmp)\n", "        return utils.softmax(np.asarray(y))\n", "\n", "    def predict(self, x):\n", "        return np.argmax(self.predict_proba(x), axis=1)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "processing:\tepoch:1/5,percent:30/120\n", "processing:\tepoch:1/5,percent:60/120\n", "processing:\tepoch:1/5,percent:90/120\n", "processing:\tepoch:1/5,percent:120/120\n", "processing:\tepoch:2/5,percent:30/120\n", "processing:\tepoch:2/5,percent:60/120\n", "processing:\tepoch:2/5,percent:90/120\n", "processing:\tepoch:2/5,percent:120/120\n", "processing:\tepoch:3/5,percent:30/120\n", "processing:\tepoch:3/5,percent:60/120\n", "processing:\tepoch:3/5,percent:90/120\n", "processing:\tepoch:3/5,percent:120/120\n", "processing:\tepoch:4/5,percent:30/120\n", "processing:\tepoch:4/5,percent:60/120\n", "processing:\tepoch:4/5,percent:90/120\n", "processing:\tepoch:4/5,percent:120/120\n", "processing:\tepoch:5/5,percent:30/120\n", "processing:\tepoch:5/5,percent:60/120\n", "processing:\tepoch:5/5,percent:90/120\n", "processing:\tepoch:5/5,percent:120/120\n", "f1: 0.9295631904327557\n" ] } ], "source": [ "# build the feature function object\n", 
"feature_func=SimpleFeatureFunction()\n", "feature_func.build_feature_funcs(X_train,y_train)\n", "\n", "maxEnt = MaxEnt(feature_func=feature_func)\n", "maxEnt.fit(X_train, y_train)\n", "y = maxEnt.predict(X_test)\n", "\n", "print('f1:', f1_score(y_test, y, average='macro'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the analysis above, the complexity of the feature functions determines the complexity of the model. Below we add more complex feature functions to strengthen MaxEnt. The feature functions above only relate a single feature to the target; we now also consider the relation between pairs of features and the target: \n", "\n", "$$\n", "f(x,y)=\\left\\{\\begin{matrix}\n", "1 & x_i=\\text{some value},x_j=\\text{some value},y=\\text{some class}\\\\ \n", "0 & \\text{otherwise}\n", "\\end{matrix}\\right.\n", "$$ \n", "\n", "We can then define a new UserDefineFeatureFunction class (**note: its method names must match those of SimpleFeatureFunction**)." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "class UserDefineFeatureFunction(object):\n", "    def __init__(self):\n", "        \"\"\"\n", "        Set of feature functions; pairwise ones are stored as\n", "        {\n", "            (x_index1,x_value1,x_index2,x_value2,y_index)\n", "        }\n", "        \"\"\"\n", "        self.feature_funcs = set()\n", "\n", "    # build the single-feature and pairwise feature functions\n", "    def build_feature_funcs(self, X, y):\n", "        n_sample, _ = X.shape\n", "        for index in range(0, n_sample):\n", "            x = X[index, :].tolist()\n", "            for feature_index in range(0, len(x)):\n", "                self.feature_funcs.add(tuple([feature_index, x[feature_index], y[index]]))\n", "                for new_feature_index in range(0,len(x)):\n", "                    if feature_index!=new_feature_index:\n", "                        self.feature_funcs.add(tuple([feature_index, x[feature_index],new_feature_index,x[new_feature_index],y[index]]))\n", "\n", "    # total number of feature functions\n", "    def get_feature_funcs_num(self):\n", "        return len(self.feature_funcs)\n", "\n", "    # indices of the feature functions that fire for (x, y)\n", "    def match_feature_funcs_indices(self, x, y):\n", "        match_indices = []\n", "        index = 0\n", "        for item in self.feature_funcs:\n", "            if len(item)==5:\n", "                feature_index1, feature_value1,feature_index2,feature_value2, feature_y=item\n", "                if feature_y == y and x[feature_index1] == feature_value1 and x[feature_index2]==feature_value2:\n", "                    match_indices.append(index)\n", "            else:\n", "                feature_index1, 
feature_value1, feature_y=item\n", "                if feature_y == y and x[feature_index1] == feature_value1:\n", "                    match_indices.append(index)\n", "            index += 1\n", "        return match_indices" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "processing:\tepoch:1/5,percent:30/120\n", "processing:\tepoch:1/5,percent:60/120\n", "processing:\tepoch:1/5,percent:90/120\n", "processing:\tepoch:1/5,percent:120/120\n", "processing:\tepoch:2/5,percent:30/120\n", "processing:\tepoch:2/5,percent:60/120\n", "processing:\tepoch:2/5,percent:90/120\n", "processing:\tepoch:2/5,percent:120/120\n", "processing:\tepoch:3/5,percent:30/120\n", "processing:\tepoch:3/5,percent:60/120\n", "processing:\tepoch:3/5,percent:90/120\n", "processing:\tepoch:3/5,percent:120/120\n", "processing:\tepoch:4/5,percent:30/120\n", "processing:\tepoch:4/5,percent:60/120\n", "processing:\tepoch:4/5,percent:90/120\n", "processing:\tepoch:4/5,percent:120/120\n", "processing:\tepoch:5/5,percent:30/120\n", "processing:\tepoch:5/5,percent:60/120\n", "processing:\tepoch:5/5,percent:90/120\n", "processing:\tepoch:5/5,percent:120/120\n", "f1: 0.957351290684624\n" ] } ], "source": [ "# evaluate the model with the pairwise feature functions\n", "feature_func=UserDefineFeatureFunction()\n", "feature_func.build_feature_funcs(X_train,y_train)\n", "\n", "maxEnt = MaxEnt(feature_func=feature_func)\n", "maxEnt.fit(X_train, y_train)\n", "y = maxEnt.predict(X_test)\n", "\n", "print('f1:', f1_score(y_test, y, average='macro'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Based on our understanding of the data, we can keep adding new feature functions to strengthen the model; only the two methods build_feature_funcs and match_feature_funcs_indices need to change (**but keep the number of feature functions under control**). \n", "To summarize the pros and cons of MaxEnt: the clear advantage is that arbitrarily complex custom feature functions can be plugged in; the clear drawbacks are that training is slow, and good feature functions require prior knowledge that is hard to obtain intuitively for some tasks." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": 
{ "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }