{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "训练数据集\n", "\\begin{align*} \\\\& T = \\left\\{ \\left( x_{1}, y_{1} \\right), \\left( x_{2}, y_{2} \\right), \\cdots, \\left( x_{N}, y_{N} \\right) \\right\\} \\end{align*} \n", "由$P \\left( X, Y \\right)$独立同分布产生。其中,$x_{i} \\in \\mathcal{X} \\subseteq R^{n}, y_{i} \\in \\mathcal{Y} = \\left\\{ c_{1}, c_{2}, \\cdots, c_{K} \\right\\}, i = 1, 2, \\cdots, N$,$x_{i}$为第$i$个特征向量(实例),$y_{i}$为$x_{i}$的类标记,\n", "$X$是定义在输入空间$\\mathcal{X}$上的随机向量,$Y$是定义在输出空间$\\mathcal{Y}$上的随机变量。$P \\left( X, Y \\right)$是$X$和$Y$的联合概率分布。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "条件独立性假设\n", "\\begin{align*} \\\\& P \\left( X = x | Y = c_{k} \\right) = P \\left( X^{\\left( 1 \\right)} = x^{\\left( 1 \\right)} , \\cdots, X^{\\left( n \\right)} = x^{\\left( n \\right)} | Y = c_{k}\\right) \n", "\\\\ & \\quad\\quad\\quad\\quad\\quad\\quad = \\prod_{j=1}^{n} P \\left( X^{\\left( j \\right)} = x^{\\left( j \\right)} | Y = c_{k} \\right) \\end{align*} \n", "即,用于分类的特征在类确定的条件下都是条件独立的。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "由\n", "\\begin{align*} \\\\& P \\left( X = x, Y = c_{k} \\right) = P \\left(X = x | Y = c_{k} \\right) P \\left( Y = c_{k} \\right)\n", "\\\\ & P \\left( X = x, Y = c_{k} \\right) = P \\left( Y = c_{k}| X = x \\right) P \\left( X = x \\right)\\end{align*} \n", "得\n", "\\begin{align*} \\\\& P \\left(X = x | Y = c_{k} \\right) P \\left( Y = c_{k} \\right) = P \\left( Y = c_{k}| X = x \\right) P \\left( X = x \\right)\n", "\\\\ & P \\left( Y = c_{k}| X = x \\right) = \\dfrac{P \\left(X = x | Y = c_{k} \\right) P \\left( Y = c_{k} \\right)}{P \\left( X = x \\right)} \n", "\\\\ & \\quad\\quad\\quad\\quad\\quad\\quad = \\dfrac{P \\left(X = x | Y = c_{k} \\right) P \\left( Y = c_{k} \\right)}{\\sum_{Y} P \\left( X = x, Y = c_{k} \\right)}\n", "\\\\ & \\quad\\quad\\quad\\quad\\quad\\quad = \\dfrac{P \\left(X = x | Y = c_{k} \\right) P \\left( Y = c_{k} \\right)}{\\sum_{Y} P \\left(X = x | Y = c_{k} \\right) P \\left( Y = c_{k} \\right)}\n", "\\\\ & \\quad\\quad\\quad\\quad\\quad\\quad = \\dfrac{ P \\left( Y = c_{k} \\right)\\prod_{j=1}^{n} P \\left( X^{\\left( j \\right)} = x^{\\left( j \\right)} | Y = c_{k} \\right)}{\\sum_{Y} P \\left( Y = c_{k} \\right)\\prod_{j=1}^{n} P \\left( X^{\\left( j \\right)} = x^{\\left( j \\right)} | Y = c_{k} \\right)}\\end{align*} \n", "朴素贝叶斯分类器可表示为\n", "\\begin{align*} \\\\& y = f \\left( x \\right) = \\arg \\max_{c_{k}} \\dfrac{ P \\left( Y = c_{k} \\right)\\prod_{j=1}^{n} P \\left( X^{\\left( j \\right)} = x^{\\left( j \\right)} | Y = c_{k} \\right)}{\\sum_{Y} P \\left( Y = c_{k} \\right)\\prod_{j=1}^{n} P \\left( X^{\\left( j \\right)} = x^{\\left( j \\right)} | Y = c_{k} \\right)}\n", "\\\\ & \\quad\\quad\\quad = \\arg \\max_{c_{k}} P \\left( Y = c_{k} \\right)\\prod_{j=1}^{n} P \\left( X^{\\left( j \\right)} = x^{\\left( j \\right)} | Y = c_{k} \\right)\\end{align*} " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "朴素贝叶斯模型参数的极大似然估计 \n", "1. 先验概率$P \\left( Y = c_{k} \\right)$的极大似然估计 \n", "\\begin{align*} \\\\& P \\left( Y = c_{k} \\right) = \\dfrac{\\sum_{i=1}^{N} I \\left( y_{i} = c_{k} \\right)}{N} \\quad k = 1, 2, \\cdots, K\\end{align*} \n", "2. 设第$j$个特征$x^{\\left( j \\right)}$可能取值的集合为$\\left\\{ a_{j1}, a_{j2}, \\cdots, a_{j S_{j}} \\right\\}$,条件概率$P \\left( X^{\\left( j \\right)} = a_{jl} | Y = c_{k} \\right)$的极大似然估计\n", "\\begin{align*} \\\\& P \\left( X^{\\left( j \\right)} = a_{jl} | Y = c_{k} \\right) = \\dfrac{\\sum_{i=1}^{N} I \\left(x_{i}^{\\left( j \\right)}=a_{jl}, y_{i} = c_{k} \\right)}{\\sum_{i=1}^{N} I \\left( y_{i} = c_{k} \\right)}\n", "\\\\ & j = 1, 2, \\cdots, n;\\quad l = 1, 2, \\cdots, S_{j};\\quad k = 1, 2, \\cdots, K\\end{align*} \n", "其中,$x_{i}^{\\left( j \\right)}$是第$i$个样本的第$j$个特征;$a_{jl}$是第$j$个特征可能取的第$l$个值;$I$是指示函数。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "朴素贝叶斯算法: \n", "输入:线性可分训练数据集$T = \\left\\{ \\left( x_{1}, y_{1} \\right), \\left( x_{2}, y_{2} \\right), \\cdots, \\left( x_{N}, y_{N} \\right) \\right\\}$,其中$x_{i}= \\left( x_{i}^{\\left(1\\right)},x_{i}^{\\left(2\\right)},\\cdots, x_{i}^{\\left(n\\right)} \\right)^{T}$,$x_{i}^{\\left( j \\right)}$是第$i$个样本的第$j$个特征,$x_{i}^{\\left( j \\right)} \\in \\left\\{ a_{j1}, a_{j2}, \\cdots, a_{j S_{j}} \\right\\}$,$a_{jl}$是第$j$个特征可能取的第$l$个值,$j = 1, 2, \\cdots, n; l = 1, 2, \\cdots, S_{j},y_{i} \\in \\left\\{ c_{1}, c_{2}, \\cdots, c_{K} \\right\\}$;实例$x$; \n", "输出:实例$x$的分类\n", "1. 计算先验概率及条件概率\n", "\\begin{align*} \\\\ & P \\left( Y = c_{k} \\right) = \\dfrac{\\sum_{i=1}^{N} I \\left( y_{i} = c_{k} \\right)}{N} \\quad k = 1, 2, \\cdots, K\n", "\\\\ & P \\left( X^{\\left( j \\right)} = a_{jl} | Y = c_{k} \\right) = \\dfrac{\\sum_{i=1}^{N} I \\left(x_{i}^{\\left( j \\right)}=a_{jl}, y_{i} = c_{k} \\right)}{\\sum_{i=1}^{N} I \\left( y_{i} = c_{k} \\right)}\n", "\\\\ & j = 1, 2, \\cdots, n;\\quad l = 1, 2, \\cdots, S_{j};\\quad k = 1, 2, \\cdots, K\\end{align*} \n", "2. 对于给定的实例$x=\\left( x^{\\left( 1 \\right)}, x^{\\left( 2 \\right)}, \\cdots, x^{\\left( n \\right)}\\right)^{T}$,计算\n", "\\begin{align*} \\\\ & P \\left( Y = c_{k} \\right)\\prod_{j=1}^{n} P \\left( X^{\\left( j \\right)} = x^{\\left( j \\right)} | Y = c_{k} \\right) \\quad k=1,2,\\cdots,K\\end{align*} \n", "3. 确定实例$x$的类别 \n", "\\begin{align*} \\\\& y = f \\left( x \\right) = \\arg \\max_{c_{k}} P \\left( Y = c_{k} \\right)\\prod_{j=1}^{n} P \\left( X^{\\left( j \\right)} = x^{\\left( j \\right)} | Y = c_{k} \\right) \\end{align*} " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "朴素贝叶斯模型参数的贝叶斯估计 \n", "1. 条件概率的贝叶斯估计\n", "\\begin{align*} \\\\& P_{\\lambda} \\left( X^{\\left( j \\right)} = a_{jl} | Y = c_{k} \\right) = \\dfrac{\\sum_{i=1}^{N} I \\left(x_{i}^{\\left( j \\right)}=a_{jl}, y_{i} = c_{k} \\right) + \\lambda}{\\sum_{i=1}^{N} I \\left( y_{i} = c_{k} \\right) + S_{j} \\lambda} \\end{align*} \n", "式中$\\lambda \\geq 0$。当$\\lambda = 0$时,是极大似然估计;当$\\lambda = 1$时,称为拉普拉斯平滑。 \n", "2. 先验概率的贝叶斯估计\n", "\\begin{align*} \\\\& P \\left( Y = c_{k} \\right) = \\dfrac{\\sum_{i=1}^{N} I \\left( y_{i} = c_{k} \\right) + \\lambda}{N + K \\lambda}\\end{align*} " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.11" } }, "nbformat": 4, "nbformat_minor": 0 }