{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# SVM\n",
    "\n",
    "Chapter 12. Support Vector Machines and Flexible Discriminants\n",
    "\n",
    "We finally arrive at the famous SVM. Around 2010 I had already looked into it out of curiosity and read plenty of explanations online. My impression back then was that it is simply hyperplane classification: before learning, map the data into a high-dimensional space so that the classes can be separated by a hyperplane. That understanding was not wrong, but it stayed at the surface. *That is the worst state to be in: you think you understand, yet you have only scratched the surface.*\n",
    "\n",
    "The essence of the SVM lies in the support vectors and in the rigorous mathematics behind the algorithm (solving a convex optimization problem), which make it a school of its own."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Outline\n",
    "\n",
    "1. *Support Vector Classifier*. Note that there is no feature mapping yet; we work directly with a hyperplane $\\{x: f(x)=x^T\\beta + \\beta_0 = 0\\}$.\n",
    "    - If the two classes are perfectly separable, maximize the margin $M$\n",
    "    $$ \\max_{\\beta, \\beta_0, \\|\\beta\\|=1 } M \\\\\n",
    "    s.t. \\ y_i(x_i^T \\beta + \\beta_0) \\geq M, \\quad i=1,2,\\cdots,N\n",
    "    $$\n",
    "    \n",
    "    - If the classes overlap, introduce slack variables $\\xi_i$. When a point lies on the correct side of its margin, $\\xi_i = 0$; when it falls inside the margin, $\\xi_i > 0$ (note that it may still be classified correctly, as long as $\\xi_i < 1$).\n",
    "    $$ \\max_{\\beta, \\beta_0, \\|\\beta\\|=1 } M \\\\\n",
    "    s.t. \\ y_i(x_i^T \\beta + \\beta_0) \\geq M(1-\\xi_i), \\quad i=1,2,\\cdots,N \\\\\n",
    "    \\sum_{i=1}^N \\xi_i \\leq constant\n",
    "    $$\n",
    "    \n",
    "    - Setting $M = 1/\\|\\beta\\|$, the problem can be rewritten as\n",
    "    $$ \\min \\|\\beta\\| \\\\\n",
    "    s.t. \\ y_i(x_i^T \\beta + \\beta_0) \\geq 1-\\xi_i, \\quad i=1,2,\\cdots,N \\\\\n",
    "    \\xi_i\\geq 0, \\ \\sum_{i=1}^N \\xi_i \\leq constant\n",
    "    $$\n",
    "    *First strength of the SVM: the optimal solution depends only on the points that lie on or inside the margin, each of which contributes a support vector; points safely outside the margin ($\\xi_i = 0$) play no role. The fewer such points, the faster the estimation.* (A numerical check appears in the first code cell after this outline.)\n",
    "\n",
    "1. *Computing the Support Vector Classifier*.\n",
    "    *Second strength of the SVM: a mathematically rigorous optimization procedure.*\n",
    "    - Recast the problem above as the convex optimization problem\n",
    "    $$ \\min_{\\beta, \\beta_0} \\frac{1}{2}\\|\\beta\\|^2 + C \\sum_{i=1}^N \\xi_i \\\\\n",
    "    s.t. \\ y_i(x_i^T \\beta + \\beta_0) \\geq 1-\\xi_i, \\quad i=1,2,\\cdots,N \\\\\n",
    "    \\xi_i\\geq 0\n",
    "    $$\n",
    "    - Then apply the standard convex-optimization machinery: Lagrangian, dual problem, KKT conditions. Only part of the results are listed here:\n",
    "    \\begin{align}\n",
    "    \\alpha_i [y_i(x_i^T\\beta + \\beta_0) - (1-\\xi_i)] = & 0 \\\\\n",
    "    \\hat{\\beta} = & \\sum_{i=1}^N \\hat{\\alpha}_i y_i x_i\n",
    "    \\end{align}\n",
    "    *The observations with nonzero $\\hat{\\alpha}_i$ are the support vectors.* Note that for points strictly outside the margin, $y_i(x_i^T\\beta + \\beta_0) > 1$ and $\\xi_i = 0$, so complementary slackness forces $\\alpha_i = 0$. (The dual problem is solved directly in a code cell below.)\n",
    "    \n",
    "1. *Support Vector Machines*: apply a *kernel* $K(x,x') = \\langle h(x), h(x') \\rangle$ to the support vector classifier,\n",
    "    \\begin{align} f(x) =& h(x)^T\\beta+\\beta_0 \\\\\n",
    "    =& \\sum_{i=1}^N \\hat{\\alpha}_i y_i K(x,x_i) + \\hat{\\beta}_0\n",
    "    \\end{align}\n",
    "    For background on kernels see [Chapter 6](./KernelSmoothing.ipynb). *The beauty of the kernel is that no matter how many dimensions the mapped space has, even infinitely many, every computation reduces to the finite $N \\times N$ matrix of inner products.* (A numerical check of this kernel expansion also follows the outline.)\n",
    "    \n",
    "1. *Generalizing Linear Discriminant Analysis*, extending LDA to:\n",
    "    - Flexible Discriminant Analysis. I did not fully follow this part; the authors describe it as LDA on basis expansions. In any case it minimizes the average squared residual\n",
    "    $$ ASR = \\frac{1}{N} \\sum_{l=1}^L \\left[ \\sum_{i=1}^N (\\theta_l(g_i) - \\eta_l(x_i))^2 + \\lambda J(\\eta_l)\\right]\n",
    "    $$\n",
    "    - Penalized Discriminant Analysis: when there are already many predictors and no basis expansion is needed, use PDA.\n",
    "    $$ ASR = \\frac{1}{N} \\sum_{l=1}^L \\left[ \\sum_{i=1}^N (\\theta_l(g_i) - h^T(x_i)\\beta_l )^2 + \\lambda\\beta_l^T \\mathbf{\\Omega}\\beta_l \\right]\n",
    "    $$\n",
    "    - Mixture Discriminant Analysis: when a single Gaussian per class is not enough, use two or more Gaussians per class, all sharing the same covariance matrix.\n",
    "    \n",
    "    *Fine, I admit that handling LDA fluently matters. But these LDA variants do not belong in the same conversation as the SVM. To the authors they are all related to LDA, yet FDA, PDA, and MDA lack the same rigorous mathematical backbone; they feel more like patches.*"
   ]
  },
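  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The outline above states that the fitted coefficient vector is built from the support vectors alone, $\\hat{\\beta} = \\sum_{i=1}^N \\hat{\\alpha}_i y_i x_i$. The cell below is a minimal numerical check of that identity rather than code from the book: it uses scikit-learn's `SVC` (my choice of tool), whose `dual_coef_` attribute stores the products $\\hat{\\alpha}_i y_i$ for the support vectors only; the toy dataset, `C` value, and random seed are arbitrary illustration choices."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check that the linear SVC solution is assembled from support vectors only.\n",
    "# Toy data, C value and random seed are arbitrary illustration choices.\n",
    "import numpy as np\n",
    "from sklearn.datasets import make_blobs\n",
    "from sklearn.svm import SVC\n",
    "\n",
    "X, y = make_blobs(n_samples=60, centers=2, cluster_std=1.2, random_state=0)\n",
    "\n",
    "clf = SVC(kernel='linear', C=1.0)\n",
    "clf.fit(X, y)\n",
    "\n",
    "# dual_coef_ holds alpha_i * y_i (with y_i in {-1, +1}) for the support vectors.\n",
    "alpha_times_y = clf.dual_coef_.ravel()\n",
    "beta_hat = alpha_times_y @ clf.support_vectors_   # sum_i alpha_i y_i x_i\n",
    "\n",
    "print('training points          :', X.shape[0])\n",
    "print('support vectors          :', clf.support_vectors_.shape[0])\n",
    "print('beta from dual coef      :', beta_hat)\n",
    "print('beta reported by sklearn :', clf.coef_.ravel())\n",
    "print('identical up to tolerance:', np.allclose(beta_hat, clf.coef_.ravel()))"
   ]
  },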
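  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The \"Computing the Support Vector Classifier\" step works through the Lagrangian dual: maximize $\\sum_i \\alpha_i - \\frac{1}{2}\\sum_{i,j}\\alpha_i\\alpha_j y_i y_j \\langle x_i, x_j\\rangle$ subject to $0 \\le \\alpha_i \\le C$ and $\\sum_i \\alpha_i y_i = 0$. The book gives no code, so the sketch below solves this small quadratic program with a generic solver (`scipy.optimize.minimize` with SLSQP). It is only meant to make the dual concrete, not to be an efficient SVM trainer, and the data and constants are made up for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: solve the soft-margin SVC dual with a generic constrained optimizer.\n",
    "# For illustration only; production SVM code uses dedicated QP/SMO solvers.\n",
    "import numpy as np\n",
    "from scipy.optimize import minimize\n",
    "from sklearn.datasets import make_blobs\n",
    "\n",
    "X, y01 = make_blobs(n_samples=40, centers=2, cluster_std=1.0, random_state=1)\n",
    "y = np.where(y01 == 0, -1.0, 1.0)            # recode labels to {-1, +1}\n",
    "N, C = X.shape[0], 10.0\n",
    "\n",
    "Q = (y[:, None] * y[None, :]) * (X @ X.T)    # Q_ij = y_i y_j <x_i, x_j>\n",
    "\n",
    "def neg_dual(alpha):\n",
    "    # Negative dual objective L_D(alpha), to be minimized.\n",
    "    return 0.5 * alpha @ Q @ alpha - alpha.sum()\n",
    "\n",
    "res = minimize(neg_dual, x0=np.zeros(N),\n",
    "               jac=lambda a: Q @ a - np.ones(N),\n",
    "               bounds=[(0.0, C)] * N,\n",
    "               constraints={'type': 'eq', 'fun': lambda a: a @ y},\n",
    "               method='SLSQP')\n",
    "alpha = res.x\n",
    "\n",
    "beta = (alpha * y) @ X                       # beta_hat = sum_i alpha_i y_i x_i\n",
    "sv = alpha > 1e-6                            # support vectors: nonzero alpha\n",
    "on_margin = sv & (alpha < C - 1e-6)          # points on the margin pin down beta_0\n",
    "pick = on_margin if on_margin.any() else sv\n",
    "beta0 = np.mean(y[pick] - X[pick] @ beta)\n",
    "\n",
    "print('support vectors  :', int(sv.sum()), 'of', N)\n",
    "print('beta :', beta, ' beta0:', beta0)\n",
    "print('training accuracy:', np.mean(np.sign(X @ beta + beta0) == y))"
   ]
  },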
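  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Item 3 of the outline gives the kernel form of the fitted function, $f(x) = \\sum_i \\hat{\\alpha}_i y_i K(x, x_i) + \\hat{\\beta}_0$, with the sum running over the support vectors only. The cell below checks this numerically for a radial-basis kernel; it again relies on scikit-learn (`SVC` with `kernel='rbf'` plus `rbf_kernel`), which is my tooling choice rather than anything prescribed by the book, and the dataset, $\\gamma$ and `C` values are arbitrary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check that the RBF-SVM decision function equals\n",
    "#   f(x) = sum_i alpha_i y_i K(x, x_i) + beta_0   (sum over support vectors only).\n",
    "import numpy as np\n",
    "from sklearn.datasets import make_moons\n",
    "from sklearn.metrics.pairwise import rbf_kernel\n",
    "from sklearn.svm import SVC\n",
    "\n",
    "X, y = make_moons(n_samples=200, noise=0.2, random_state=0)\n",
    "gamma = 0.5                                  # K(x, x') = exp(-gamma * ||x - x'||^2)\n",
    "\n",
    "clf = SVC(kernel='rbf', gamma=gamma, C=1.0)\n",
    "clf.fit(X, y)\n",
    "\n",
    "X_new = X[:5]                                # a few points to evaluate\n",
    "K = rbf_kernel(X_new, clf.support_vectors_, gamma=gamma)   # kernel against SVs only\n",
    "f_manual = K @ clf.dual_coef_.ravel() + clf.intercept_[0]  # sum_i alpha_i y_i K + beta_0\n",
    "\n",
    "print('manual f(x) :', np.round(f_manual, 4))\n",
    "print('sklearn f(x):', np.round(clf.decision_function(X_new), 4))\n",
    "print('match       :', np.allclose(f_manual, clf.decision_function(X_new)))"
   ]
  },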
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Details\n",
    "\n",
    "### SVM\n",
    "\n",
    "1. Commonly used kernels:\n",
    "    - $d$th-degree polynomial: $K(x,x') = (1+\\langle x, x' \\rangle)^d$,\n",
    "    - Radial basis: $K(x,x') = \\exp(-\\gamma \\|x-x'\\|^2)$,\n",
    "    - Neural network: $K(x,x') = \\tanh(\\kappa_1 \\langle x, x' \\rangle + \\kappa_2)$.\n",
    "\n",
    "1. The Sequential Minimal Optimization (SMO) algorithm. I came across it online ([reference blog post](http://www.cnblogs.com/jerrylead/archive/2011/03/18/1988419.html)); it is the classic algorithm for training SVMs. Unfortunately the authors do not mention it, which suggests they belong to the statistics camp rather than the mathematics camp.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}