{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 第4章 朴素贝叶斯\n", "朴素贝叶斯的基本思想是: 基于特征条件独立假设学习输入和输出的联合概率分布,然后,基于此模型,对给定的输入$x$,利用贝叶斯订立求出后验概率最大的输出$y$,根据《统计学习方法》书中的规划,这里将简述朴素贝叶斯的学习和分类、朴素贝叶斯的参数估计方法。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4.1 朴素贝叶斯的学习和分类\n", "### 4.1.1简述\n", "输入:训练数据集\n", "$$T=\\{ \\left( x_1,y_1 \\right),\\left( x_2,y_2 \\right),\\left( x_i,y_i \\right),...,\\left( x_N,y_N \\right) \\}$$\n", "\n", "朴素贝叶斯的目的是: 通过训练数据集$T$学习到联合概率分布$P(X,Y)$,具体来讲,这个联合概率分布的求解根据联合概率密度求解的公式$P(x|y)=\\frac{P(x,y)}{P_Y(y)}$又分为两个部分:\n", ">先验概率分布:\n", "$$P(Y=c_k),k=1,2,...,K$$\n", ">条件概率分布:\n", "$$P(X=x|Y=c_k)=P(X^\\left(1\\right)=x^\\left(1\\right),...,X^\\left(n\\right)=x^\\left(n\\right)|Y=c_k),k=1,2,...,K$$\n", "\n", "其中,条件独立性假设\n", ">$$ P\\left( X=x|Y=c_k \\right)=P( X^\\left(1\\right)=x^\\left(1\\right),...,X^\\left(n\\right)=x^\\left(n\\right) )\n", "=\\prod_{j=1}^n P \\left(X^\\left(j \\right)=x^\\left( j \\right)| Y=c_k\\right)$$\n", "\n", "关于**生成模型**和**判别模型**: \n", "根据上面的式子,这里由于要求解的是$P(x|y)=\\frac{P(x,y)}{P_Y(y)}$,实质上是一种生成数据的机制,所以属于**生成模型**,可能有的同学会听说过生成对抗网络(GAN)的概念,没错,里面的生成也是生成模型,在GAN中,生成模型也是用来生成数据的,但是可能它生成的数据更加复杂,不再单单是一个高斯分布了,很有可能是一张图什么的,这个是后话了,这里只要知道朴素贝叶斯是生成模型、它学习的是一种生成数据的机制,就可以了。但是话又说回来了,你都知道如何生成数据了,当然就可以知道如何对数据进行区分了,不是么哈哈~" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.1.2算法描述\n", "好了,这里简言之,直接描述朴素贝叶斯的算法过程 \n", "输入: \n", ">* 训练数据$T=\\{ \\left( x_1,y_1 \\right),\\left( x_2,y_2 \\right),\\left( x_i,y_i \\right),...,\\left( x_N,y_N \\right) \\}$,其中,$x_i = \\left(x_i^\\left(1\\right),x_i^\\left(2\\right),...,x_i^\\left(n\\right) \\right)^T$; \n", "* $x_i^\\left(j\\right)$是第$i$个样本的第$j$个特征,$x_i^\\left(j\\right) \\in \\left\\{a_j1,a_j2,...,a_{jS_j} \\right\\}$,$a_{jl}$是第$j$个特征可能取的第$l$个值,$j=1,2,...,n$,$l=1,2,...,S_j$,$y_i \\in \\left\\{c_1,c_2,...,c_K \\right\\}$; \n", "* 实例$x$;\n", "\n", "输出: \n", ">* 实例$x$的分类结果" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "算法过程: \n", ">1. 计算先验概率以及条件概率 \n", "$$P\\left( Y = c_k \\right) = \\frac{\\sum_{i=1}^N I(y_i=c_i)}{N},k=1,2,...,K$$ \n", "$$P\\left( X^\\left(j\\right)=a_{jl} | Y=c_k \\right) = \\frac{\\sum_{i=1}^N I(x^\\left(j\\right)=a_{jl},y_i=c_k)}{\\sum_{i=1}^N I(y_i=c_i)}$$ \n", "$$j=1,2,...,n; l=1,2,...,S_j,k=1,2,...,K$$ \n", "2. 对于给定的实例$x$,计算\n", "$$P(Y=c_k)\\prod_{j=1}^n P(X^\\left(j\\right) = x^\\left(j\\right)|Y=c_k),k=1,2,...,K $$\n", "3. 确定实例$x$的类\n", "$$y=\\arg\\underset{c_k}{\\max} P(Y=c_k) P(X^\\left(j\\right) = x^\\left(j\\right)|Y=c_k)$$\n", "\n", "### 4.1.3编写代码测试朴素贝叶斯算法" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# 给定数据集,其中SML分别替换为1,2,3\n", "import numpy as np\n", "x = np.asarray([[1,1],[1,2],[1,2],[1,1],[1,1],\n", " [2,1],[2,2],[2,2],[2,3],[2,3],\n", " [3,3],[3,2],[3,2],[3,3],[3,3]])\n", "y = np.asarray([-1,-1,1,1,-1,\n", " -1,-1,1,1,1,\n", " 1,1,1,1,-1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ">计算$P(Y=1)$和$P(Y=-1)$" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "P(Y=1)= 0.6 , P(Y=-1)= 0.4\n" ] } ], "source": [ "AllNum = np.size(y)\n", "# P(Y=1)\n", "PosY = np.sum(y==1)/AllNum\n", "#P(Y=-1)\n", "NegY = np.sum(y==-1)/AllNum\n", "print('P(Y=1)=',PosY,', P(Y=-1)=',NegY)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ">计算$P\\left( X^\\left(j\\right)=a_{jl} | Y=c_k \\right) = \\frac{\\sum_{i=1}^N I(x^\\left(j\\right)=a_{jl},y_i=c_k)}{\\sum_{i=1}^N I(y_i=c_i)}$" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "P(Y = -1 ) = 0.4\n", "P(X特征维度= 0 特征值= 1 | Y= -1 )= 0.5\n", "P(X特征维度= 0 特征值= 2 | Y= -1 )= 0.3333333333333333\n", "P(X特征维度= 0 特征值= 3 | Y= -1 )= 0.16666666666666666\n", "P(X特征维度= 1 特征值= 1 | Y= -1 )= 0.5\n", "P(X特征维度= 1 特征值= 2 | Y= -1 )= 0.3333333333333333\n", "P(X特征维度= 1 特征值= 3 | Y= -1 )= 0.16666666666666666\n", "P(Y = 1 ) = 0.6\n", "P(X特征维度= 0 特征值= 1 | Y= 1 )= 0.2222222222222222\n", "P(X特征维度= 0 特征值= 2 | Y= 1 )= 0.3333333333333333\n", "P(X特征维度= 0 特征值= 3 | Y= 1 )= 0.4444444444444444\n", "P(X特征维度= 1 特征值= 1 | Y= 1 )= 0.1111111111111111\n", "P(X特征维度= 1 特征值= 2 | Y= 1 )= 0.4444444444444444\n", "P(X特征维度= 1 特征值= 3 | Y= 1 )= 0.4444444444444444\n" ] } ], "source": [ "# y 取值范围\n", "y_ = np.unique(y)\n", "# x 的维度\n", "feanum = np.size(x,1)\n", "min_result_dictionary = {}\n", "for k in range(len(y_)):# 对于每个Y类别\n", " P_Y = np.sum(y==y_[k])/np.size(y) # 计算P(Y=k)的值\n", " print('P(Y =',y_[k],') =',P_Y)\n", " k_x = x[y==y_[k],:]# 保留在P(Y=k)条件下所有的x样本\n", " for j in range(feanum): # 对于每个特征\n", " fea_val_range = np.unique(x[:,j])# 该维度下属性值的取值范围\n", " for l in range(len(fea_val_range)): # 对于特征下的第l个取值\n", " P_X_Y = np.sum(k_x[:,j] == fea_val_range[l])/np.sum(y==y_[k])\n", " print('P(X特征维度=',j,'特征值=',fea_val_range[l],'| Y=',y_[k],')=',P_X_Y)\n", " split = ','\n", " key = split.join([str(y_[k]),str(j),str(fea_val_range[l])])\n", " min_result_dictionary[key] = P_X_Y\n", "# min_result_dictionary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ">当给定$x=(2,S)^T$,也即$x=(2,1)^T$,计算$P(Y=c_k)\\prod_{j=1}^n P(X^\\left(j\\right) = x^\\left(j\\right)|Y=c_k)$:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "testx = np.array([2,1])\n", "labelvalue =[]\n", "for k in range(len(y_)):\n", " P_Y = np.sum(y==y_[k])/np.size(y) # 计算P(Y=k)的值\n", " j_p = 1# 记录每个类别下的概率值\n", " for j in range(feanum):\n", " split = ','\n", " key = split.join([str(y_[k]),str(j),str(testx[j])])\n", " j_p = j_p * min_result_dictionary[key]\n", " labelvalue.append(P_Y*j_p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ">返回值最大时候得到的类别标签$y=\\arg\\underset{c_k}{\\max} P(Y=c_k) P(X^\\left(j\\right) = x^\\left(j\\right)|Y=c_k)$" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-1" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_[labelvalue.index(max(labelvalue))]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "综上可以得知,实际上朴素贝叶斯是在给定每个类别标签的情况下,查看这个数据“合理性”有多高,“合理性”越高,说明越有可能是这个类别,而这个“合理性”的度量,就是使用了概率来衡量,最后动过一个$\\arg \\max$操作,得到了对应的类别标签。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.1.4贝叶斯估计\n", "\n", "用心的读者也许会发现,如果用以上的方法(实际是“极大似然估计”)可能会得到概率值为0的情况,然后会影响后续一系列操作(如求最大值等)。根据《统计学习方法》一书,我们在这里加一个步骤,将原来的式子做小的修改:\n", "$$P\\left( X^\\left(j\\right)=a_{jl} | Y=c_k \\right) = \\frac{\\sum_{i=1}^N I(x^\\left(j\\right)=a_{jl},y_i=c_k)}{\\sum_{i=1}^N I(y_i=c_i)} \\rightarrow P\\left( X^\\left(j\\right)=a_{jl} | Y=c_k \\right) = \\frac{\\sum_{i=1}^N I(x^\\left(j\\right)=a_{jl},y_i=c_k)+\\lambda}{\\sum_{i=1}^N I(y_i=c_i)+S_j\\lambda}$$ \n", "其中,$\\lambda \\geq 0$,当$\\lambda = 0$的时候等同于原始的极大似然估计,而通常,这里取$\\lambda = 1$,称为**拉普拉斯平滑**,这时,对于$l=1,2,...,S_j$,$k=1,2,...,K$,有\n", "$$P_\\lambda(X^\\left(j\\right)=a_{jl}|Y=c_k)>0$$\n", "$$\\sum_{l=1}^{S_j}P_\\lambda(X^\\left(j\\right)=a_{jl}|Y=c_k) = 1$$\n", "\n", "这两个式子说明拉普拉斯平滑之后的式子确实是一个概率分布(即有每个概率大于0,总概率和等于1),此时,得到鲜艳概率贝叶斯估计为:\n", "$$P_\\lambda(Y=c_k)=\\frac{\\sum_{i=1}^N I(y_i=c_k)+\\lambda}{N+K\\lambda}$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "此时我们修改上面的代码,运行可以得到如下结果" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "P(Y = -1 ) = 0.4117647058823529\n", "P(X特征维度= 0 特征值= 1 | Y= -1 )= 0.4444444444444444\n", "P(X特征维度= 0 特征值= 2 | Y= -1 )= 0.3333333333333333\n", "P(X特征维度= 0 特征值= 3 | Y= -1 )= 0.2222222222222222\n", "P(X特征维度= 1 特征值= 1 | Y= -1 )= 0.4444444444444444\n", "P(X特征维度= 1 特征值= 2 | Y= -1 )= 0.3333333333333333\n", "P(X特征维度= 1 特征值= 3 | Y= -1 )= 0.2222222222222222\n", "P(Y = 1 ) = 0.5882352941176471\n", "P(X特征维度= 0 特征值= 1 | Y= 1 )= 0.25\n", "P(X特征维度= 0 特征值= 2 | Y= 1 )= 0.3333333333333333\n", "P(X特征维度= 0 特征值= 3 | Y= 1 )= 0.4166666666666667\n", "P(X特征维度= 1 特征值= 1 | Y= 1 )= 0.16666666666666666\n", "P(X特征维度= 1 特征值= 2 | Y= 1 )= 0.4166666666666667\n", "P(X特征维度= 1 特征值= 3 | Y= 1 )= 0.4166666666666667\n" ] } ], "source": [ "# y 取值范围\n", "y_ = np.unique(y)\n", "# x 的维度\n", "feanum = np.size(x,1)\n", "lmda = 1\n", "min_result_dictionary = {}\n", "for k in range(len(y_)):# 对于每个Y类别\n", " P_Y = (np.sum(y==y_[k])+lmda)/(np.size(y)+lmda*len(y_)) # 计算P(Y=k)的值\n", " print('P(Y =',y_[k],') =',P_Y)\n", " k_x = x[y==y_[k],:]# 保留在P(Y=k)条件下所有的x样本\n", " for j in range(feanum): # 对于每个特征\n", " fea_val_range = np.unique(x[:,j])# 该维度下属性值的取值范围\n", " for l in range(len(fea_val_range)): # 对于特征下的第l个取值\n", " P_X_Y = (np.sum(k_x[:,j] == fea_val_range[l])+lmda)/(np.sum(y==y_[k])+lmda*len(fea_val_range))\n", " print('P(X特征维度=',j,'特征值=',fea_val_range[l],'| Y=',y_[k],')=',P_X_Y)\n", " split = ','\n", " key = split.join([str(y_[k]),str(j),str(fea_val_range[l])])\n", " min_result_dictionary[key] = P_X_Y\n", "# min_result_dictionary" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.05925925925925926, 0.03333333333333333]\n" ] } ], "source": [ "testx = np.array([2,1])\n", "labelvalue =[]\n", "for k in range(len(y_)):\n", " P_Y = np.sum(y==y_[k])/np.size(y) # 计算P(Y=k)的值\n", " j_p = 1# 记录每个类别下的概率值\n", " for j in range(feanum):\n", " split = ','\n", " key = split.join([str(y_[k]),str(j),str(testx[j])])\n", " j_p = j_p * min_result_dictionary[key]\n", " labelvalue.append(P_Y*j_p)\n", "print(labelvalue)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ">返回值最大时候得到的类别标签" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-1" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_[labelvalue.index(max(labelvalue))]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "注:上面由于浮点精度问题,这里的结果与《统计学习方法》书中的结果存在一点差异,但是原理就是这样了!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## 4.2 定义自己的朴素贝叶斯算法" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# 4.2定义自己的朴素贝叶斯算法\n", "import numpy as np\n", "class myNaiveBayes():\n", " def train(self,x,y,lmda=None):\n", " \"\"\"\n", " x: 训练集 \n", " y: 类别标签向量\n", " lmda: 拉普拉斯平滑参数值,默认为1\n", " \"\"\"\n", " self.x = x\n", " self.y = y\n", " self.lmda = lmda\n", " model = []\n", " # y 取值范围\n", " y_ = np.unique(y)\n", " # x 的维度\n", " feanum = np.size(x,1)\n", " min_result_dictionary = {}\n", " P_Y = np.empty(len(y_))\n", " for k in range(len(y_)):# 对于每个Y类别\n", " P_Y[k] = np.sum(y==y_[k])/np.size(y) # 计算P(Y=k)的值\n", " k_x = x[y==y_[k],:]# 保留在P(Y=k)条件下所有的x样本\n", " for j in range(feanum): # 对于每个特征\n", " fea_val_range = np.unique(x[:,j])# 该维度下属性值的取值范围\n", " for l in range(len(fea_val_range)): # 对于特征下的第l个取值\n", " P_X_Y = (np.sum(k_x[:,j] == fea_val_range[l])+lmda)/(np.sum(y==y_[k])+lmda*len(fea_val_range))\n", " split = ','\n", " key = split.join([str(y_[k]),str(j),str(fea_val_range[l])])\n", " min_result_dictionary[key] = P_X_Y\n", " \n", " model.append(feanum) # 特征维度数目 0\n", " model.append(y_) # 类别取值范围 1\n", " model.append(min_result_dictionary) # dictionary保存的概率值 2\n", " model.append(np.size(y)) # 样本个数 3\n", " model.append(P_Y)\n", " return model\n", " \n", " def predict(self,x,model):\n", " \"\"\"\n", " x 应该与train中的x具有相同的维度数\n", " y 为预测类别标签\n", " \"\"\"\n", " y = []\n", " self.model = model\n", " self.x = x\n", " if self.x.ndim==1:\n", " self.x = self.x.reshape(1,len(x))\n", " for i in range(len(self.x)):\n", " testx = np.asarray(self.x[i,:])\n", " labelvalue =[]\n", " for k in range(len(self.model[1])):\n", " P_Y = model[4][k]\n", " j_p = 1# 记录每个类别下的概率值\n", " for j in range(model[0]):\n", " split = ','\n", " key = split.join([str(model[1][k]),str(j),str(testx[j])])\n", " j_p = j_p * model[2][key]\n", " labelvalue.append(P_Y*j_p)\n", " y.append(y_[labelvalue.index(max(labelvalue))])\n", " return y\n", " \n", " def accuracy(self,y,realy):\n", " self.y = y\n", " self.realy = realy\n", " return 1-sum(np.sign(np.abs(y-realy)))/len(y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ">定义课本的15个实例" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "x = np.asarray([[1,1],[1,2],[1,2],[1,1],[1,1],\n", " [2,1],[2,2],[2,2],[2,3],[2,3],\n", " [3,3],[3,2],[3,2],[3,3],[3,3]])\n", "y = np.asarray([-1,-1,1,1,-1,\n", " -1,-1,1,1,1,\n", " 1,1,1,1,-1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ">验证课本的15个实例" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "正确率: 0.7333333333333334\n" ] } ], "source": [ "mynb = myNaiveBayes()\n", "model = mynb.train(x,y,lmda=1)\n", "newy = mynb.predict(x,model)\n", "print('正确率:',mynb.accuracy(y,newy))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ">课本测试用例" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[-1]\n" ] } ], "source": [ "temp = np.asarray([2,1])\n", "newy = mynb.predict(temp,model)\n", "print(newy)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 2 }