{ "cells": [ { "cell_type": "markdown", "id": "eb50e2ce", "metadata": {}, "source": [ "**实验五:支持向量机**" ] }, { "cell_type": "code", "execution_count": 3, "id": "95f5ebed", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import math\n", "import cvxopt\n", "from cvxopt import matrix\n", "from cvxopt import solvers" ] }, { "cell_type": "markdown", "id": "714b16e8", "metadata": {}, "source": [ "**第一部分:函数介绍**" ] }, { "cell_type": "markdown", "id": "d3531f8a", "metadata": {}, "source": [ "二次规划问题是形式如下的一类最优化问题:\n", "$$\n", "\\begin{align}\n", "\\min_x \\quad &\\frac{1}{2}x^TPx+q^Tx \\\\\n", "s.t. \\quad &Gx\\leq h \\\\\n", " &Ax=b\n", "\\end{align}\n", "$$\n", "对于这一类问题可以使用[cvxopt](https://cvxopt.org/userguide/coneprog.html#quadratic-programming)库的solvers.qp()函数进行求解。" ] }, { "cell_type": "markdown", "id": "20b0f664", "metadata": {}, "source": [ "以下是一个例子(参考[Solving a quadratic program](https://cvxopt.org/examples/tutorial/qp.html)):\n", "$$\n", "\\begin{align}\n", "\\min_x \\quad &2x_1^2+x_2^2+x_1x_2+x_1+x_2 \\\\\n", "s.t. \\quad &x_1\\geq 0 \\\\\n", " &x_2\\geq 0 \\\\\n", " &x_1+x_2=1\n", "\\end{align}\n", "$$\n", "为了使用solvers.qp()函数,我们需要知道在该二次规划问题中的$P,q,G,h,A,b$矩阵分别是什么。\n", "在该优化问题中,\n", "\n", "* $P:=\\begin{bmatrix}\n", " 4 & 1 \\\\ 1 & 2\n", " \\end{bmatrix}$,\n", "* $q:=\\begin{bmatrix}\n", " 1 \\\\ 1\n", " \\end{bmatrix}$,\n", "* $G:=\\begin{bmatrix}\n", " -1 & 0 \\\\ 0 & -1\n", " \\end{bmatrix}$,\n", "* $h:=\\begin{bmatrix}\n", " 0 \\\\ 0\n", " \\end{bmatrix}$,\n", "* $A:=\\begin{bmatrix}\n", " 1 & 1\n", " \\end{bmatrix}$,\n", "* $b:=\\begin{bmatrix}\n", " 1\n", " \\end{bmatrix}$,\n", " \n", "把这些参数送入solvers.qp()函数中即可求出解。" ] }, { "cell_type": "code", "execution_count": 4, "id": "9e5b5531", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " pcost dcost gap pres dres\n", " 0: 1.8889e+00 7.7778e-01 1e+00 3e-16 2e+00\n", " 1: 1.8769e+00 1.8320e+00 4e-02 2e-16 6e-02\n", " 2: 1.8750e+00 1.8739e+00 1e-03 2e-16 5e-04\n", " 3: 1.8750e+00 1.8750e+00 1e-05 6e-17 5e-06\n", " 4: 1.8750e+00 1.8750e+00 1e-07 2e-16 5e-08\n", "Optimal solution found.\n", "[ 2.50e-01]\n", "[ 7.50e-01]\n", "\n" ] } ], "source": [ "# Tips1: cvxopt库中的matrix只接受double类型的数据\n", "# Tips2: matrix使用列表作为参数创建矩阵和numpy.array使用列表作为参数创建矩阵是不同的\n", "# print(matrix([[1.0, 1.0]]))\n", "# print(np.array([[1.0, 1.0]]))\n", "# print(matrix(np.array([[1.0, 1.0]])))\n", "Q = 2*matrix([ [2, .5], [.5, 1] ])\n", "p = matrix([1.0,1.0])\n", "G = matrix([[-1.0,0.0],[0.0,-1.0]])\n", "h = matrix([0.0,0.0])\n", "A = matrix([1.0, 1.0], (1,2))\n", "b = matrix(1.0)\n", "sol=solvers.qp(Q, p, G, h, A, b)\n", "print(sol['x'])" ] }, { "cell_type": "markdown", "id": "2384abb1", "metadata": {}, "source": [ "**第二部分:实验任务**" ] }, { "cell_type": "markdown", "id": "68091bb5", "metadata": {}, "source": [ "1.线性可分支持向量机与硬间隔最大化" ] }, { "cell_type": "markdown", "id": "ad91cf52", "metadata": {}, "source": [ "1) 读入数据集'dataset1.csv',把数据类型都转换成np.double类型,并画出数据集的散点图,给正样本(y为+1)和负样本(y为-1)分别标上不同的颜色。" ] }, { "cell_type": "code", "execution_count": null, "id": "1b25ddd4", "metadata": {}, "outputs": [], "source": [ "# ---- Your code here ----\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "bbcf4613", "metadata": {}, "source": [ "2) 求解如下对偶问题(参考课件):\n", "$$\n", "\\begin{align}\n", "\\min_\\alpha \\quad &\\frac{1}{2}\\sum_{i=1}^m\\sum_{j=1}^m\\alpha_i\\alpha_jy_iy_j\\pmb{x}_i^T\\pmb{x}_j-\\sum_{i=1}^m\\alpha_i \\\\\n", "s.t. \\quad &\\sum_{i=1}^m\\alpha_iy_i=0 \\\\\n", " &\\pmb{\\alpha}\\geq \\pmb{0}\n", "\\end{align}\n", "$$\n", "\n", "这个优化问题是一个二次规划问题。\n", "* $P是一个m\\times m的矩阵,其中P_{ij}=y_iy_j\\pmb{x}_i^T\\pmb{x}_j$,\n", "* $q是一个m\\times 1的所有值都为-1的列向量,即q:=\\begin{bmatrix}\n", " -1 & -1 & \\cdots & -1\n", " \\end{bmatrix}^T$,\n", "* $G:=\\begin{bmatrix}\n", " -1 & 0 & \\cdots & 0 \\\\\n", " 0 & -1 & \\cdots & 0 \\\\\n", " \\vdots & \\vdots & \\ddots &0 \\\\\n", " 0 & 0 & 0 & -1\n", " \\end{bmatrix}_{m\\times m}=-\\pmb{I},\\pmb{I}为单位矩阵,$\n", "* $h是一个m\\times 1的零向量,即h:=\\begin{bmatrix}\n", " 0 & 0 & \\cdots & 0\n", " \\end{bmatrix}^T$,\n", "* $A:=\\begin{bmatrix}\n", " y_1 & y_2 & \\cdots & y_m\n", " \\end{bmatrix}^T$,\n", "* $b:=\\begin{bmatrix}\n", " 0\n", " \\end{bmatrix},一个标量$\n", " \n", "把上述参数送入求解器solvers.qp()中即可得到最优解$\\alpha^*$。 \n", " \n", "附:$P$矩阵的一个计算方法:\n", "设$X=\\begin{bmatrix}\n", " x_{11} & x_{12} \\\\\n", " x_{21} & x_{22} \\\\\n", " \\vdots & \\vdots \\\\\n", " x_{m1} & x_{m2}\n", " \\end{bmatrix}$,\n", " $Y=\\begin{bmatrix}\n", " y_{1} \\\\\n", " y_{2} \\\\\n", " \\vdots \\\\\n", " y_{m}\n", " \\end{bmatrix}$,\n", " \n", "计算$X'=\\begin{bmatrix}\n", " x_{11}y_1 & x_{12}y_1 \\\\\n", " x_{21}y_2 & x_{22}y_2 \\\\\n", " \\vdots & \\vdots \\\\\n", " x_{m1}y_m & x_{m2}y_m\n", " \\end{bmatrix}=X*Y(注意这里是星乘)$\n", " \n", "则$P=X'X'^T$。" ] }, { "cell_type": "code", "execution_count": null, "id": "1537fe6b", "metadata": {}, "outputs": [], "source": [ "# ---- Your code here ----\n", "#如果求解报错可以尝试在solvers.qp()中添加参数kktsolver='ldl'\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "825c6609", "metadata": {}, "source": [ "3) 求出$\\pmb{\\omega}^*=\\sum_{i=1}^m\\alpha_i^*y_i\\pmb{x}_i$和$b^*=y_j-\\pmb{\\omega}^{*T}\\pmb{x_j}$, 其中$j$为$\\alpha^*$中的一个正分量$\\alpha_j^*>0$的下标。注意:由于求解器求出来的是一个近似解,所以$\\alpha^*$中很多实际上为0的分量会略大于0,这时候可以设置一个阈值把非常靠近0的那些分量筛去,再从剩下的分量中选取一个正分量来计算$b^*$,或者也可以直接取$\\alpha^*$中最大的分量来计算$b^*$。" ] }, { "cell_type": "code", "execution_count": null, "id": "388c3510", "metadata": {}, "outputs": [], "source": [ "# ---- Your code here ----\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "ad94df88", "metadata": {}, "source": [ "4) 画出数据集的散点图,给正样本(y为+1)和负样本(y为-1)分别标上不同的颜色,再为支持向量(训练数据中$\\alpha_j^*>0$的对应的样本)标上不同的颜色,并画出决策边界$\\pmb{\\omega}^{*T}\\pmb{x}+b=0$和间隔边界$\\pmb{\\omega}^{*T}\\pmb{x}+b=1$与$\\pmb{\\omega}^{*T}\\pmb{x}+b=-1$。" ] }, { "cell_type": "code", "execution_count": null, "id": "36428566", "metadata": {}, "outputs": [], "source": [ "# ---- Your code here ----\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "f33d9ee3", "metadata": {}, "source": [ "2.线性支持向量机与软间隔最大化" ] }, { "cell_type": "markdown", "id": "bdbeff31", "metadata": {}, "source": [ "1) 读入数据集'dataset2.csv',把数据类型都转换成np.double类型,并画出数据集的散点图,给正样本(y为+1)和负样本(y为-1)分别标上不同的颜色。" ] }, { "cell_type": "code", "execution_count": null, "id": "5caeaadf", "metadata": {}, "outputs": [], "source": [ "# ---- Your code here ----\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "3d108a81", "metadata": {}, "source": [ "2) 选择一个参数C,求解如下对偶问题(参考课件):\n", "$$\n", "\\begin{align}\n", "\\min_\\alpha \\quad &\\frac{1}{2}\\sum_{i=1}^m\\sum_{j=1}^m\\alpha_i\\alpha_jy_iy_j\\pmb{x}_i^T\\pmb{x}_j-\\sum_{i=1}^m\\alpha_i \\\\\n", "s.t. \\quad &\\sum_{i=1}^m\\alpha_iy_i=0 \\\\\n", " &\\pmb{0}\\leq \\pmb{\\alpha}\\leq C \n", "\\end{align}\n", "$$\n", "* $G:=\\begin{bmatrix}\n", " -1 & 0 & \\cdots & 0 \\\\\n", " 0 & -1 & \\cdots & 0 \\\\\n", " \\vdots & \\vdots & \\ddots &0 \\\\\n", " 0 & 0 & 0 & -1 \\\\\n", " 1 & 0 & \\cdots & 0 \\\\\n", " 0 & 1 & \\cdots & 0 \\\\\n", " \\vdots & \\vdots & \\ddots &0 \\\\\n", " 0 & 0 & 0 & 1\n", " \\end{bmatrix}_{2m\\times m}=\\begin{bmatrix}\n", " -\\pmb{I} \\\\\n", " \\pmb{I}\n", " \\end{bmatrix},\\pmb{I}为单位矩阵,$\n", "* $h:=\\begin{bmatrix}\n", " 0 \\\\\n", " 0 \\\\\n", " \\vdots \\\\\n", " 0 \\\\\n", " C \\\\\n", " C \\\\\n", " \\vdots \\\\\n", " C\n", " \\end{bmatrix}_{2m\\times 1}, 即一个m\\times 1的零列向量与一个m\\times 1的分量全为C的列向量上下拼接$,\n", "* $P,q,A,b$与硬间隔优化问题中的矩阵相同。" ] }, { "cell_type": "code", "execution_count": null, "id": "de031f65", "metadata": {}, "outputs": [], "source": [ "# ---- Your code here ----\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "7093e867", "metadata": {}, "source": [ "3) 求出$\\pmb{\\omega}^*=\\sum_{i=1}^m\\alpha_i^*y_i\\pmb{x}_i$和$b^*=y_j-\\pmb{\\omega}^{*T}\\pmb{x_j}$, 其中$j$为$\\alpha^*$中的一个正分量$0<\\alpha_j^*" ] }, { "cell_type": "code", "execution_count": null, "id": "7a7801c9", "metadata": {}, "outputs": [], "source": [ "# ---- Your code here ----\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "7b97abec", "metadata": {}, "source": [ "4) 画出数据集的散点图,给正样本(y为+1)和负样本(y为-1)分别标上不同的颜色,再为支持向量(训练数据中$\\alpha_j^*>0$的对应的样本)标上不同的颜色,并画出决策边界$\\pmb{\\omega}^{*T}\\pmb{x}+b=0$和间隔边界$\\pmb{\\omega}^{*T}\\pmb{x}+b=1$与$\\pmb{\\omega}^{*T}\\pmb{x}+b=-1$。" ] }, { "cell_type": "code", "execution_count": null, "id": "4446b508", "metadata": {}, "outputs": [], "source": [ "# ---- Your code here ----\n", "\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "e20c4293", "metadata": {}, "source": [ "3.非线性支持向量机与核函数" ] }, { "cell_type": "markdown", "id": "f830c3b3", "metadata": {}, "source": [ "[Raisin Dataset](https://www.kaggle.com/datasets/muratkokludataset/raisin-dataset)是一个葡萄干的数据集,总共有900个样本,每个样本包含7个(都是连续的)特征以及1个标签,每个标签只有两种可能取值。本次实验已经按照8:2的比例划分成了训练数据集'Raisin_train.csv'以及测试数据集'Raisin_test.csv',且每个数据集都已经做了特征归一化处理以及把标签的值替换成了+1和-1。" ] }, { "cell_type": "markdown", "id": "7a6c8ca9", "metadata": {}, "source": [ "1) 读入训练数据集'Raisin_train.csv',把数据类型都转换成np.double类型。" ] }, { "cell_type": "code", "execution_count": null, "id": "5b020812", "metadata": {}, "outputs": [], "source": [ "# ---- Your code here ----\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "89cf5ba7", "metadata": {}, "source": [ "2) 选择一个核函数$K(\\pmb{x},\\pmb{z})$以及参数C,求解如下对偶问题(参考课件):\n", "$$\n", "\\begin{align}\n", "\\min_\\alpha\\quad &\\frac{1}{2}\\sum_{i=1}^m\\sum_{j=1}^m\\alpha_i\\alpha_jy_iy_jK(\\pmb{x}_i,\\pmb{x}_j)-\\sum_{i=1}^m\\alpha_i \\\\\n", "s.t. \\quad &\\sum_{i=1}^m\\alpha_iy_i=0 \\\\\n", " &\\pmb{0}\\leq \\pmb{\\alpha}\\leq C \n", "\\end{align}\n", "$$\n", "\n", "相较于软间隔最大化的优化问题,该优化问题仅需要对矩阵$P$做改动。\n", "从以下常用的核函数中选择一个作为该优化问题中的$K$(参数自己进行调整):\n", "* 线性核:$K(\\pmb{x},\\pmb{z})=\\pmb{x}^T\\pmb{z}$\n", "* 多项式核:$K(\\pmb{x},\\pmb{z})=(\\pmb{x}^T\\pmb{z}+1)^p$\n", "* 高斯核:$K(\\pmb{x},\\pmb{z})=exp(-\\frac{\\parallel \\pmb{x}-\\pmb{z} \\parallel^2}{2\\sigma^2})$\n", "* 拉普拉斯核:$K(\\pmb{x},\\pmb{z})=exp(-\\frac{\\parallel \\pmb{x}-\\pmb{z} \\parallel}{\\sigma})$\n", "* Sigmoid核:$K(\\pmb{x},\\pmb{z})=tanh(\\beta\\pmb{x}^T\\pmb{z}+\\theta)$\n", "\n", "则$P是一个m\\times m的矩阵,其中P_{ij}=y_iy_jK(\\pmb{x_i},\\pmb{x_j})$。" ] }, { "cell_type": "code", "execution_count": null, "id": "2716c531", "metadata": {}, "outputs": [], "source": [ "# ---- Your code here ----\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "1a1a084b", "metadata": {}, "source": [ "3) 求出$b^*=y_j-\\sum_{i=1}^m \\alpha_i^*y_iK(\\pmb{x_i},\\pmb{x_j})$, 其中$j$为$\\alpha^*$中的一个正分量$0<\\alpha_j^*" ] }, { "cell_type": "code", "execution_count": null, "id": "51cca8c9", "metadata": {}, "outputs": [], "source": [ "# ---- Your code here ----\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "16b41092", "metadata": {}, "source": [ "4) 读入测试数据集'Raisin_test.csv',用分类决策函数$f(\\pmb{x})=sign(\\sum_{i=1}^m \\alpha_i^*y_iK(\\pmb{x}_i,\\pmb{x})+b^*)$(注意这里的$m,\\alpha_i^*,y_i,\\pmb{x}_i$是训练集的, $\\pmb{x}$是测试集的)进行预测,输出预测准确率。" ] }, { "cell_type": "code", "execution_count": null, "id": "d2e528d4", "metadata": {}, "outputs": [], "source": [ "# ---- Your code here ----\n", "\n", "\n", "\n", "\n", "\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 5 }