{ "cells": [ { "cell_type": "markdown", "id": "498d70fa", "metadata": {}, "source": [ "# 线性回归(二)" ] }, { "cell_type": "code", "execution_count": null, "id": "226badd7", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "id": "70d2906a", "metadata": {}, "source": [ "1) 使用pandas库的read_csv()函数(可以参考[pandas的官方文档](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html))将训练数据集'train.csv'和测试数据集'test.csv'载入到Dataframe对象中。" ] }, { "cell_type": "code", "execution_count": 1, "id": "f59640f4", "metadata": {}, "outputs": [], "source": [ "# Your code here\n", "\n", "#读取数据集\n", "#使用函数 pd.read_csv\n", "\n", "\n", "#转化成numpy矩阵\n", "#使用函数 np.array\n", "\n" ] }, { "cell_type": "markdown", "id": "1bccb84d", "metadata": {}, "source": [ "2) 假设模型为一元线性回归模型$\\hat{y}=wx+b$, 损失函数为$l(w,b)=\\frac{1}{2}\\sum_{i=1}^m(\\hat{y}^{(i)}-y^{(i)})^2$, 其中$\\hat{y}^{(i)}$表示第$i$个样本的预测值,$y^{(i)}$表示第$i$个样本的实际标签值, $m$为训练集中样本的个数。求出使得损失函数最小化的参数$w$和$b$。" ] }, { "cell_type": "markdown", "id": "e011976d", "metadata": {}, "source": [ "方法① \n", "\n", "将$l(w,b)$分别对$w$和$b$求导,得到\n", "$$\n", "\\frac{\\partial l(w,b)}{\\partial w}=w\\sum_{i=1}^m x_i^2 -\\sum_{i=1}^m (y_i-b)x_i,\n", "$$\n", "$$\n", "\\frac{\\partial l(w,b)}{\\partial b}=mb -\\sum_{i=1}^m (y_i-wx_i),\n", "$$\n", "令上述两式为零即可得到$w$和$b$的解析解:\n", "$$\n", "w=\\frac{\\sum_{i=1}^m y_i (x_i-\\bar{x})}{\\sum_{i=1}^m x_i^2-\\frac{1}{m}(\\sum_{i=1}^m x_i)^2},\n", "$$\n", "$$\n", "b=\\frac{1}{m}\\sum_{i=1}^m(y_i-wx_i),\n", "$$\n", "其中$\\bar{x}=\\frac{1}{m}\\sum_{i=1}^m x_i$为$x$的均值。\n", "\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "210a8d62", "metadata": {}, "outputs": [], "source": [ "# Your code here\n", "#使用函数 np.sum\n", "\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "1a04bc78", "metadata": {}, "source": [ "方法② 梯度下降法。手动实现梯度下降法(不使用机器学习框架,如PyTorch、TensorFlow等)来进行模型的训练。算法步骤如下:1.初始化模型参数$w$和$b$的值;2.在负梯度的方向上更新参数(批量梯度下降、小批量随机梯度下降或者随机梯度下降均可),并不断迭代这一步骤,更新公式(以小批量随机梯度下降为例)可以写成:$$w\\gets w-\\frac{\\eta}{\\left|B\\right|}\\sum_{i\\in{B}}x^{(i)}(wx^{(i)}+b-y^{(i)}),$$ 和$$b\\gets b-\\frac{\\eta}{\\left|B\\right|}\\sum_{i\\in{B}}(wx^{(i)}+b-y^{(i)}),$$ 其中$\\eta$表示学习率,$B$表示每次迭代中随机抽样的小批量,$\\left|B\\right|$则表示$B$中的样本数量。3. 终止条件为迭代次数达到某一上限或者参数更新的幅度小于某个阈值。" ] }, { "cell_type": "code", "execution_count": null, "id": "bad649ed", "metadata": {}, "outputs": [], "source": [ "# Your code here\n", "#使用 for循环,函数 np.sum等\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "844f2a87", "metadata": {}, "source": [ "方法③ \n", "\n", "用矩阵表示,假设数据集有$m$个样本,特征有$n$维$。X=\\left[ \\begin{matrix} x_{11} & x_{12} & \\cdots & x_{1n} & 1 \\\\\n", " x_{21} & x_{22} & \\cdots & x_{2n} & 1 \\\\\n", " \\vdots & \\vdots & & \\vdots & \\vdots \\\\\n", " x_{m1} & x_{m2} & \\cdots & x_{mn} & 1 \\end{matrix} \\right]$,\n", " 实际标签$Y=\\left[ \\begin{matrix} y_{1} \\\\\n", " y_{2} \\\\\n", " \\vdots \\\\\n", " y_{m}\\end{matrix} \\right]$,\n", " 参数$B=\\left[ \\begin{matrix} w_{1} \\\\\n", " w_{2} \\\\\n", " \\vdots \\\\\n", " w_{n} \\\\\n", " b\\end{matrix} \\right]$,则解析解为$B^*=(X^T X)^{-1}X^T Y$。推导过程可参考[这篇文章](https://zhuanlan.zhihu.com/p/74157986)。" ] }, { "cell_type": "code", "execution_count": null, "id": "7f634cfa", "metadata": {}, "outputs": [], "source": [ "# Your code here\n", "#使用函数 np.c_ np.ones np.linalg.inv\n", "\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "eb16c177", "metadata": {}, "source": [ "3) 使用求解出来的线性回归模型对测试数据集'test.csv'进行预测,输出可视化结果(比如用seaborn或者matplotlib等可视化库来画出测试数据的散点图以及训练好的模型函数图像)。" ] }, { "cell_type": "code", "execution_count": null, "id": "b19a5413", "metadata": {}, "outputs": [], "source": [ "# Your code here\n", "#A=np.array([[1,2,3],[4,5,6],[7,8,9]])\n", "#x = A[0, :] #从一个矩阵中提取出一行作为一个向量\n", "#y1 = np.array([2, 3, 5])\n", "#plt.plot(x, y1) #画出折线图\n", "#y2 = np.array([2.5, 2.8, 5.3])\n", "#plt.plot(x, y2, '.') #画出散点图\n", "#plt.show()\n", "\n" ] }, { "cell_type": "markdown", "id": "f227579c", "metadata": {}, "source": [ "4) 在训练数据集'train2.csv'上求一个三元线性回归模型$\\hat{y}=w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3$的使得损失函数$l(w_0,w_1,w_2,w_3)=\\frac{1}{2}\\sum_{i=1}^m(\\hat{y}^{(i)}-y^{(i)})^2$最小的参数$w_0,w_1,w_2$以及$w_3$。并在测试数据集'test2.csv'上进行预测,输出预测结果的均方误差$MSE(\\hat{y},y)=\\frac{1}{n}\\sum^n_{i=1}(y^{(i)}-\\hat{y}^{(i)})^2$, $n$为测试集中样本个数。" ] }, { "cell_type": "markdown", "id": "2a1f6178", "metadata": {}, "source": [ "方法① 同2)中的方法③。" ] }, { "cell_type": "markdown", "id": "f4ea8a93", "metadata": {}, "source": [ "方法② 类似2)中的方法②。算法步骤如下:1.初始化模型参数$w_0,w_1,w_2,w_3$的值;2.在负梯度的方向上更新参数(批量梯度下降、小批量随机梯度下降或者随机梯度下降均可),并不断迭代这一步骤,更新公式(以小批量随机梯度下降为例)可以写成:$$w_j\\gets w_j-\\frac{\\eta}{\\left|B\\right|}\\sum_{i\\in{B}}x_j^{(i)}(w_0 + w_1 x_1^{(i)}+w_2 x_2^{(i)}+w_3 x_3^{(i)}-y^{(i)}), j=0,1,2,3,$$ 其中$x_0^{(i)}=1$, 其中$\\eta$表示学习率,$B$表示每次迭代中随机抽样的小批量,$\\left|B\\right|$则表示$B$中的样本数量。3. 终止条件为迭代次数达到某一上限或者参数更新的幅度小于某个阈值。" ] }, { "cell_type": "code", "execution_count": null, "id": "115f58e2", "metadata": {}, "outputs": [], "source": [ "# Your code here\n", "\n", "\n", "\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 5 }