{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Stochastic Gradient Descent and One-Hot Encoding" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The era of big data has arrived almost without our noticing. Imagine that your training set is 100 GB or larger: how would you train a model on it? To answer this question, this section introduces stochastic gradient descent, online learning, one-hot encoding, and the hashing trick." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Key Topics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Stochastic gradient descent\n", "- Online learning\n", "- One-hot encoding\n", "- The hashing trick" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Stochastic Gradient Descent" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Gradient descent is an optimization algorithm. Because it is relatively easy to understand, it is often the first optimization algorithm people meet when learning machine learning. It is, however, only one of the most basic optimizers, and on complex models or data it often struggles to reach a good solution." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The main idea is simple: approach a minimum of a function step by step by moving in the direction of steepest descent. The direction in which a function increases fastest at a point is given by its gradient there, that is, the vector of partial derivatives (in one dimension, the slope). Moving in the opposite direction, the direction of steepest descent, therefore decreases the function value fastest at each step." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The idea is like the skiing shown in the figure above: if you want to reach the foot of the mountain as quickly as possible, you choose the steepest way down." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### A Worked Example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To see how gradient descent works, let us go through an example. First import the modules the experiment needs." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "import re\n", "import warnings\n", "\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns\n", "from scipy.sparse import csr_matrix\n", "from sklearn.datasets import fetch_20newsgroups, load_files\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.metrics import (accuracy_score, classification_report,\n", "                             confusion_matrix, log_loss, roc_auc_score,\n", "                             roc_curve)\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import LabelEncoder, OneHotEncoder\n", "from tqdm import tqdm_notebook\n", "\n", "%matplotlib inline\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The experiment uses the SOCR dataset, which records the weight and height of each person." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load the dataset." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead>\n", "<tr><th></th><th>Index</th><th>Height</th><th>Weight</th></tr>\n", "</thead>\n", "<tbody>\n", "<tr><th>0</th><td>1</td><td>65.78331</td><td>112.9925</td></tr>\n", "<tr><th>1</th><td>2</td><td>71.51521</td><td>136.4873</td></tr>\n", "<tr><th>2</th><td>3</td><td>69.39874</td><td>153.0269</td></tr>\n", "<tr><th>3</th><td>4</td><td>68.21660</td><td>142.3354</td></tr>\n", "<tr><th>4</th><td>5</td><td>67.78781</td><td>144.2971</td></tr>\n", "</tbody>\n", "</table>" ], "text/plain": [ "   Index    Height    Weight\n", "0      1  65.78331  112.9925\n", "1      2  71.51521  136.4873\n", "2      3  69.39874  153.0269\n", "3      4  68.21660  142.3354\n", "4      5  67.78781  144.2971" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_demo = pd.read_csv(\"../../data/weights_heights.csv\")  # load the dataset\n", "data_demo.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To see the relationship between weight and height, draw a scatter plot of the data." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0, 0.5, 'Height in inches')" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "[Figure: scatter plot of Height in inches against Weight in lb]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.scatter(data_demo[\"Weight\"], data_demo[\"Height\"])\n", "plt.xlabel(\"Weight in lb\")\n", "plt.ylabel(\"Height in inches\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset contains $n$ samples. The vector $x$ holds each person's weight and $y$ holds each person's height. Assuming that $y$ depends linearly on $x$, we can define a simple linear regression model:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " $$y_i = w_0 + w_1 x_i$$ " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "where $y_i$ is the height of the $i$-th person and $x_i$ is the weight of the $i$-th person." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The goal of training is to find weights $w_0$ and $w_1$ that minimize the sum of squared differences between the heights predicted by the model $y_i = w_0 + w_1 x_i$ and the true heights. In the formula below, $SE(w_0,w_1)$ is called the loss function." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$SE(w_0, w_1) = \\frac{1}{2}\\sum_{i=1}^{n}(y_i - (w_0 + w_1x_{i}))^2 \\rightarrow \\min_{w_0, w_1}$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We minimize the loss function with gradient descent: take the partial derivatives of $SE(w_0,w_1)$ with respect to the weights $w_0$ and $w_1$, then update the weights with the formulas below, where $\\eta$ is the learning rate:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$w_0^{(t+1)} = w_0^{(t)} -\\eta \\frac{\\partial SE}{\\partial w_0} |_{t}$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$w_1^{(t+1)} = w_1^{(t)} -\\eta \\frac{\\partial SE}{\\partial w_1} |_{t} $$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since $\\frac{\\partial SE}{\\partial w_0} = -\\sum_{i=1}^{n}(y_i - w_0 - w_1x_i)$ and $\\frac{\\partial SE}{\\partial w_1} = -\\sum_{i=1}^{n}(y_i - w_0 - w_1x_i)x_i$, substituting the partial derivatives gives:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$ w_0^{(t+1)} = w_0^{(t)} + \\eta \\sum_{i=1}^{n}(y_i - w_0^{(t)} - w_1^{(t)}x_i)$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$ w_1^{(t+1)} = w_1^{(t)} + \\eta \\sum_{i=1}^{n}(y_i - w_0^{(t)} - w_1^{(t)}x_i)x_i$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The mathematics of gradient descent is covered in detail in the numerical computation chapter of [Deep Learning](http://www.deeplearningbook.org/contents/numerical.html).  \n", "We leave aside local minima, saddle points, the choice of learning rate, and similar issues for now. When the dataset is small, this optimization runs without trouble. But what happens when the training set is very large? 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "显然,梯度下降存在一个问题,即梯度计算需要用到训练集中的每个样本。换句话说,该算法需要大量迭代才能找到最小值,并且每次迭代都需要使用训练样本的全部数据来进行运算。当训练数据集非常庞大时,则其将需要发费巨大的计算才能完成一次更新迭代。要训练一个模型,就要付出巨大的时间代价。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "为了解决上述梯度下降存在的问题, 随机梯度下降算法被提出。相比于梯度下降算法,随机梯度下降每次迭代仅用一些小样本来进行运算,然后迭代更新权重,也就是每次迭代只从训练样本里抽取一部分数据,而不是所有的数据。这也极大的提高了计算效率,因为每次迭代的计算样本变少了。如果每次只取一个样本,则权重更新可表达为下式:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$w_0^{(t+1)} = w_0^{(t)} + \\eta (y_i - w_0^{(t)} - w_1^{(t)}x_i)$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$ w_1^{(t+1)} = w_1^{(t)} + \\eta (y_i - w_0^{(t)} - w_1^{(t)}x_i)x_i $$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "当然,随机梯度下降算法也带来了一个问题。就是随机梯度下降并不能保证在每次迭代中都会朝着最佳的方向前进。因为每次迭代取的只是一小批数据,而这一小批数据并不一定等同于整体数据,通过这小批数据所计算得到的梯度方向不一定为全局的最佳方向。因此,可能需要更多的迭代才能收敛。吴恩达在他的 [ 机器学习课程](https://www.coursera.org/learn/machine-learning) 中很好地说明了这一点。 " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "上图是某函数的等值线图, $\\theta_0$ 和 $\\theta_1$ 对应于 $w_0$ 和 $w_1$ 。优化过程是找到此函数的全局最小值。 在随机梯度下降方法中,随着迭代次数的增加,权重的更新方向会更难预测,如图中的紫线所示。但是,不论是随机梯度下降还是梯度下降算法,最终结果都会收敛于同一个全局最小值点。而随机梯度下降算法则要快得多。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 在线学习方法" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "随机梯度下降为训练具有高达数百 GB 的大量数据的分类器和回归器提供了实现途径。因为每次迭代只需要拿取小批量的数据,而不是全部数据,因此大大提高训练速度。但其仍然存在一个问题。如果训练数据为 100G 或者更大,对于现在的普通电脑来说,一次读取全部数据到内存是不可能的,会出现内存爆满的情况。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "为解决这一问题,在线学习方法被提出,在线学习的思想是将训练数据集 $(X,y)$ 存储在电脑的硬盘中而不将其加载到运行内存中,然后在训练模型时逐个读取,并更新模型的权重:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$w_0^{(t+1)} = w_0^{(t)} + \\eta (y_i - w_0^{(t)} - w_1^{(t)}x_i)$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$w_1^{(t+1)} = w_1^{(t)} + \\eta (y_i - w_0^{(t)} - 
w_1^{(t)}x_i)x_i$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "这里我们不对随机梯度下降算法原理进行深入讨论,如果你感兴趣可以参考 [ 凸优化](https://www.amazon.com/Convex-Optimization-Stephen-Boyd/dp/0521833787) 这本书。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "在 scikit-learn 中,使用随机梯度下降算法来进行优化的分类器和回归器在 `sklearn.linear_model` 中,并命名为 `SGDClassifier` 和 `SGDRegressor`。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 类别型特征处理" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "目前,许多分类和回归算法是在欧几里德空间中操作的。这意味着,输入数据特征要用数值表示。 但是,在实际数据中,往往包含离散的类别特征,例如:是/否或 1 月/ 2 月/ ... / 12 月。如果将这些类别型特征输入都模型中,模型可能无法运行。 那应该如何去处理这种类别型的数据呢?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "为了解释说明这个问题。选择 UCI 的 [ bank marketing](https://archive.ics.uci.edu/ml/datasets/bank+marketing) 数据集来进行实验,因为该数据集中大部分的特征均为类别型特征。先读取数据集。" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead>\n", "<tr><th></th><th>age</th><th>job</th><th>marital</th><th>education</th><th>default</th><th>housing</th><th>loan</th><th>contact</th><th>month</th><th>day_of_week</th><th>duration</th><th>campaign</th><th>pdays</th><th>previous</th><th>poutcome</th><th>emp.var.rate</th><th>cons.price.idx</th><th>cons.conf.idx</th><th>euribor3m</th><th>nr.employed</th></tr>\n", "</thead>\n", "<tbody>\n", "<tr><th>0</th><td>26</td><td>student</td><td>single</td><td>high.school</td><td>no</td><td>no</td><td>no</td><td>telephone</td><td>jun</td><td>mon</td><td colspan=\"2\">9011</td><td>999</td><td>0</td><td>nonexistent</td><td>1.4</td><td>94.465</td><td>-41.8</td><td>4.961</td><td>5228.1</td></tr>\n", "<tr><th>1</th><td>46</td><td>admin.</td><td>married</td><td>university.degree</td><td>no</td><td>yes</td><td>no</td><td>cellular</td><td>aug</td><td>tue</td><td colspan=\"2\">2082</td><td>999</td><td>0</td><td>nonexistent</td><td>1.4</td><td>93.444</td><td>-36.1</td><td>4.963</td><td>5228.1</td></tr>\n", "<tr><th>2</th><td>49</td><td>blue-collar</td><td>married</td><td>basic.4y</td><td>unknown</td><td>yes</td><td>yes</td><td>telephone</td><td>jun</td><td>tue</td><td colspan=\"2\">1315</td><td>999</td><td>0</td><td>nonexistent</td><td>1.4</td><td>94.465</td><td>-41.8</td><td>4.864</td><td>5228.1</td></tr>\n", "<tr><th>3</th><td>31</td><td>technician</td><td>married</td><td>university.degree</td><td>no</td><td>no</td><td>no</td><td>cellular</td><td>jul</td><td>tue</td><td colspan=\"2\">4041</td><td>999</td><td>0</td><td>nonexistent</td><td>-2.9</td><td>92.469</td><td>-33.6</td><td>1.044</td><td>5076.2</td></tr>\n", "<tr><th>4</th><td>42</td><td>housemaid</td><td>married</td><td>university.degree</td><td>no</td><td>yes</td><td>no</td><td>telephone</td><td>nov</td><td>mon</td><td colspan=\"2\">851</td><td>999</td><td>0</td><td>nonexistent</td><td>-0.1</td><td>93.200</td><td>-42.0</td><td>4.191</td><td>5195.8</td></tr>\n", "</tbody>\n", "</table>" ], "text/plain": [ "   age          job  marital  ...  cons.conf.idx  euribor3m  nr.employed\n", "0   26      student   single  ...          -41.8      4.961       5228.1\n", "1   46       admin.  married  ...          -36.1      4.963       5228.1\n", "2   49  blue-collar  married  ...          -41.8      4.864       5228.1\n", "3   31   technician  married  ...          -33.6      1.044       5076.2\n", "4   42    housemaid  married  ...          -42.0      4.191       5195.8\n", "\n", "[5 rows x 20 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(\"../../data/bank_train.csv\")\n", "labels = pd.read_csv(\"../../data/bank_train_target.csv\", header=None)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The table above shows that many of the features are not numeric, so the data cannot be fed directly to most machine learning models. The categorical features must first be converted to numeric ones." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Start by looking at the education feature:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAckAAAD8CAYAAAAc/1/bAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAAHttJREFUeJzt3XmYXVWd7vHvmzAECCZMchktUAQShIAFQgRaI07AxYEoKLYgerkgLWK3Q3iwEWlbcbjaRiaDMigINgiaRhSVSQSBVAmZCdCAglcZVGZFCG//sVfBoTg7VZUazqnU+3me89Taa6299m+fc5JfrbV3nSPbRERExIuNa3UAERER7SpJMiIiokaSZERERI0kyYiIiBpJkhERETWSJCMiImokSUZERNRIkoyIiKiRJBkREVFjtVYHEIOz4YYbuqOjo9VhRESMKt3d3Q/Z3qivfkmSo1xHRwddXV2tDiMiYlSR9Nv+9Mtya0RERI0kyYiIiBpJkhERETWSJCMiImokSUZERNRIkoyIiKiRJBkREVEjSTIiIqLGmE+Skm4oPzskLSrl10m6rJQPkDSrlN8uacoQHnuapH2HaryIiBhaYz5J2p7eR/tc2yeXzbcDA0qSklb0qUbTgCTJiIg2NeaTpKTH+2g/TNIpkqYDBwBflnSrpJeXx08ldUu6TtJ2ZZ9zJJ0h6SbgS5J2k/RrSbdIukHStpLWAE4CDirjHSRpHUlnSbq59H3bsD8BERFRK5/d2k+2b5A0F7jM9sUAkq4EjrR9h6TXAKcBM8oumwPTbS+X9BJgL9vPSNoH+LztAyWdAHTa/qcy3ueBq2wfLmkycLOkX9h+YoRPNyIiSJJcaZImAtOBiyT1VK/Z0OUi28tLeRJwrqRtAAOr1wz7JuAASR8v2xOALYGlvY59BHAEwJZbbjnIM4mIiDpJkitvHPCw7Wk17Y2zv38Drrb9DkkdwDU1+wg40PayFR3Y9hxgDkBnZ6cHEHNERAzAmL8mOUCPAesC2H4UuFvSuwBU2almv0nA70v5sGbjFVcAH1GZmkraeehCj4iIgUqSHJgLgU+Um2peDhwCfFDSfGAxUHejzZeAL0i6hRfO3q8GpvTcuEM141wdWCBpcdmOiIgWkZ3VutGss7PT+dLliIiBkdRtu7OvfplJRkRE1EiSjIiIqJEkGRERUSNJMiIiokaSZERERI0kyYiIiBpJkhERETWSJCMiImokSUZERNRIkoyIiKiRJBkREVEjSTIiIqJGkmRERESNJMmIiIgaSZIRERE1Vuu7S7Szv//+ce6bdV2rwxiQzU/eq9UhRET0S2aSERERNZIkIyIiaiRJ9kHSYZJOaXUcEREx8pIkIyIiaoy5JCmpQ9Kihu2PSzpR0jWSvijpZkm3S3rR3SWS9pP0a0kbSjpH0mxJN0i6S9LM0keSvixpkaSFkg4q9adKOqCUL5V0VikfLunfS1xLJZ0pabGkn0laa2SelYiIaGbMJck+rGZ7N+BY4DONDZLeAcwC9rX9UKneBNgT2B84udS9E5gG7ATsA3xZ0ibAdUBP4t0MmFLKewG/LOVtgFNtTwUeBg4c0rOLiIgBSZJ8oUvKz26go6F+BvApYD/bf2mo/6HtZ20vATYudXsCF9hebvt+4FpgV0qSlDQFWALcX5LnHsANZd+7bd9aE8NzJB0hqUtS15+ffHjlzzYiIlZoLCbJZ3jheU9oKD9Vfi7nhX9D+t/AusAre431VENZKzqo7d8Dk4G3UM0crwPeDTxu+7Em4/WOoXGsObY7bXeuv/bkFR02IiIGYSwmyfuBl0raQNKaVEulffkt1dLndyRN7aPvdcBBksZL2gjYG7i5tN1ItZTbkyQ/Xn5GREQbGnNJ0vbTwElUievnwG393O824BDgIkkvX0HXS4EFwHzgKuCTtv9Y2q6juu55J/AbYH2SJCMi2pZstzqGGIQdN9nOlx96ZqvDGJB8LF1EtJq
kbtudffUbczPJiIiI/kqSjIiIqJFvARnl1thsYpYvIyKGSWaSERERNZIkIyIiaiRJRkRE1EiSjIiIqJEkGRERUSNJMiIiokaSZERERI0kyYiIiBpJkhERETWSJCMiImokSUZERNRIkoyIiKiRJBkREVEj3wIyyt1/1538v4P2b3UYI+Zfvn9Zq0OIiDEkM8mIiIgaSZIRERE1xnySlNQhadEgxzhA0qwB9Jekf5d0u6Slko4ZzPEjImJ45JrkELA9F5g7gF0OA7YAtrP9rKSXDktgERExKGN+JlmsJun8Mqu7WNLakk6QNE/SIklzJAlA0jGSlkhaIOnCUneYpFNKeWNJl0qaXx7TmxzvKOAk288C2H5A0jhJd0jaqIwzTtKdPdsRETHykiQr2wKn2d4eeBT4MHCK7V1t7wCsBfTcQjoL2Nn2jsCRTcaaDVxreydgF2Bxkz4vBw6S1CXpJ5K2KQnzPOCQ0mcfYL7tB3vvLOmIsm/XE0/9faVPOiIiVixJsnKv7etL+TxgT+D1km6StBCYAUwt7QuA8yW9D3imyVgzgNMBbC+3/UiTPmsCf7PdCZwJnFXqzwLeX8qHA2c3C9b2HNudtjvXWXONgZxnREQMQJJkxU22TwNm2n4VVSKbUNr2A06lmiXOk7Qy13XvAy4p5UuBHQFs3wvcL2kGsBvwk5UYOyIihkiSZGVLSXuU8nuBX5XyQ5ImAjOhuk4IbGH7auBTwCRgYq+xrqS65oik8ZImNTneD4HXl/I/ALc3tH2LajZ7ke3lgzqriIgYlCTJyjLgaElLgfWolkvPBBYBVwDzSr/xwHllCfYWYLbth3uN9VGqpdqFQDcwBUDS5ZI2LX1OBg4sfb4AfKhh/7lUibfpUmtERIwc2b1XGqOVJHUCX7O9V3/6b7H+ZB/7xj2HOar2kY+li4ihIKm73BeyQvk7yTZSPpDgKJ6/wzUiIlooM8lRrrOz011dXa0OIyJiVOnvTDLXJCMiImokSUZERNRIkoyIiKiRJBkREVEjSTIiIqJGkmRERESNJMmIiIgaSZIRERE1kiQjIiJqJElGRETUSJKMiIiokSQZERFRI0kyIiKiRr4qa5R74LePceqRV7U6jBiAo8+Y0eoQIqKfMpOMiIiokSQZERFRY8wnSUkdkhYNcowDJM1aif1mS3p8MMeOiIjhk2uSQ8D2XGDuQPaR1AmsNzwRRUTEUBjzM8liNUnnS1oq6WJJa0s6QdI8SYskzZEkAEnHSFoiaYGkC0vdYZJOKeWNJV0qaX55TO99MEnjgS8Dn2yoW1fS3ZJWL9svadyOiIiRlyRZ2RY4zfb2wKPAh4FTbO9qewdgLWD/0ncWsLPtHYEjm4w1G7jW9k7ALsDiJn3+CZhr+w89FbYfA64B9itVBwOX2H66986SjpDUJanr8b89PPCzjYiIfkmSrNxr+/pSPg/YE3i9pJskLQRmAFNL+wLgfEnvA55pMtYM4HQA28ttP9LYKGlT4F3AN5rs+y3gA6X8AeDsZsHanmO703bnxAmT+3uOERExQEmSFTfZPg2YaftVwJnAhNK2H3Aq1SxxnqSBXtfdGXgFcKeke4C1Jd0JUBJ1h6TXAeNtD+qGooiIGJwkycqWkvYo5fcCvyrlhyRNBGYCSBoHbGH7auBTwCRgYq+xrgSOKv3HS5rU2Gj7x7b/l+0O2x3Ak7Zf0dDlO8D3qJlFRkTEyEmSrCwDjpa0lOqO09OpZo+LgCuAeaXfeOC8sgR7CzDbdu+Lgh+lWqpdCHQDUwAkXV6WWvtyfonhgsGdUkREDNaY/xMQ2/cA2zVp+nR59LZnkzHOAc4p5fuBtzXps2/N8XvPRPcELm6SfCMiYoSN+STZTiR9A3gr0DShRkTEyJLd+56VGE06Ozvd1dXV6jAiIkYVSd22O/vql2uSERERNZIkIyIiaiRJRkRE1EiSjIiIqJEkGRERUSNJMiIiokaSZERERI0kyYiIiBpJkhERETWSJCMiImo
kSUZERNRIkoyIiKiRJBkREVEjX5U1yv1t0WKWbrd9q8OINrb9bUtbHULEqJWZZERERI0kyYiIiBrDniQlbSfpVkm3SHr5EIx3gKRZQxFbr3EfH+oxIyJidBuSa5KSxtteXtP8duBi258bimPZngvMHYqxWkGSANl+ttWxRETEivU5k5TUIek2SedLWirpYklrS7pH0hcl/QZ4l6Rpkm6UtEDSpZLWk7QvcCxwlKSry3jvk3RzmV1+U9L48jhH0iJJCyV9rPQ9RtKSMuaFpe4wSac0xHZVab9S0pal/hxJsyXdIOkuSTNL/cTS7zflOG/rx/m/pfSfL+nKUre+pB+W494oacdSf6Kkjzfsu6jE2CFpmaTvAIuALWrO9+WSfiqpW9J1krbr9ysZERFDrr8zyW2BD9q+XtJZwIdL/Z9s7wIgaQHwEdvXSjoJ+IztYyWdATxu+yuStgcOAl5r+2lJpwGHAIuBzWzvUMaaXMafBWxl+6mGukbfAM61fa6kw4HZVDNXgE2APYHtqGaeFwN/A95h+1FJGwI3Sppr281OWtJGwJnA3rbvlrR+afoscIvtt0uaAXwHmNbHc7gNcKjtGyW9uuZ85wBH2r5D0muA04AZfYwbERHDpL/XJO+1fX0pn0eVfAC+DyBpEjDZ9rWl/lxg7ybjvAF4NTBP0q1le2vgLmBrSd+Q9Bbg0dJ/AXC+pPcBzzQZbw/ge6X83Ya4AH5o+1nbS4CNS52Az5eE/gtgs4a2ZnYHfmn7bgDbfy71e5bjYfsqYANJL1nBOAC/tX1jKb/ofCVNBKYDF5Xn5ptUif5FJB0hqUtS15+XN3taIiJiKPR3Jtl7ptWz/cQAjyeqmd9xL2qQdgLeDBwJvBs4HNiPKtn+b+B4Sa8awLGe6nVcqGatGwGvLjPZe4AJAzyHFXmGF/7i0Tj2c8+V7b80Od9jgYdt9zUjxfYcqlknO0xYq+ksOCIiBq+/M8ktJe1Ryu8FftXYaPsR4C+S9ipV/whcy4tdCcyU9FJ47trey8rS5zjbPwA+DewiaRywhe2rgU8Bk4CJvca7ATi4lA8BruvjPCYBD5QE+XrgZX30vxHYW9JWPfGW+uvK8ZD0OuAh248C9wA9y8+7AFs1G7TZ+Zb975b0rtJHJZFGRESL9HcmuQw4ulyPXAKcDnykV59DgTMkrU21nPiB3oPYXiLp08DPShJ8Gjga+CtwdqkDOA4YD5xXlnIFzLb9sKTGIT9S9vsE8GCzY/ZyPvBfkhYCXcBtzTpJutX2NNsPSjoCuKTE9gDwRuBE4KyybPtkOXeAHwDvl7QYuAm4vSaOzZqcL1SJ9/TyHK0OXAjM7+OcIiJimKjmnpXnO0gdwGU9N5lEe9lhwlq+qKOj1WFEG8vH0kW8mKRu25199csn7kRERNToc7nV9j1AZpFtasIOU9m+q6vVYURErJIyk4yIiKiRJBkREVEjSTIiIqJGkmRERESNJMmIiIgaSZIRERE1kiQjIiJqJElGRETUSJKMiIiokSQZERFRI0kyIiKiRpJkREREjSTJiIiIGv390uVoU4v/tJhXnfuqVocRY9DCQxe2OoSIYZeZZERERI0kyYiIiBpjPklK6pC0aJBjHCBp1gD6z5D0G0mLJJ0rKcveERFtaMwnyaFge67tk/vTV9I44FzgYNs7AL8FDh3O+CIiYuUkSVZWk3S+pKWSLpa0tqQTJM0rs705kgQg6RhJSyQtkHRhqTtM0imlvLGkSyXNL4/pvY61AfB327eX7Z8DB0oaJ+kOSRuVccZJurNnOyIiRl6SZGVb4DTb2wOPAh8GTrG9a5ntrQXsX/rOAna2vSNwZJOxZgPX2t4J2AVY3Kv9Iaqk3Fm2ZwJb2H4WOA84pNTvA8y3/eCQnGFERAxYkmTlXtvXl/J5wJ7A6yXdJGkhMAOYWtoXAOdLeh/wTJOxZgCnA9hebvuRxkbbBg4GvibpZuAxYHlpPgt4fykfDpzdLFhJR0jqktS1/LH
lzbpERMQQSJKsuMn2acBM268CzgQmlLb9gFOpZonzVuamG9u/tr2X7d2AXwK3l/p7gfslzQB2A35Ss/8c2522O8evO36gh4+IiH5KkqxsKWmPUn4v8KtSfkjSRKol0Z6bbrawfTXwKWASMLHXWFcCR5X+4yVN6n0wSS8tP9cs45zR0PwtqtnsRbYzTYyIaKEkycoy4GhJS4H1qJZLzwQWAVcA80q/8cB5ZQn2FmC27Yd7jfVRqqXahUA3MAVA0uWSNi19PlGOtQD4L9tXNew/lyrxNl1qjYiIkaPqElm0i3JDz9ds79Wf/mtttZZfceIrhjmqiBfLx9LFaCap23ZnX/3yR+xtpHwgwVE8f4drRES0UJZb24jtk22/zPav+u4dERHDLTPJUW7qBlPpOrSr1WFERKySMpOMiIiokSQZERFRI0kyIiKiRpJkREREjSTJiIiIGkmSERERNZIkIyIiaiRJRkRE1EiSjIiIqJEkGRERUSNJMiIiokaSZERERI18wPlo9/9vgRMntTqKiBgJJz7S6gjGnMwkIyIiaiRJRkRE1Bj1SVJSh6RFTepPkrRPH/ueKOnjwx3LSo51jqSZQzFWRESsnFX2mqTtE1odQ0REjG6jfiZZjJd0pqTFkn4maa3GmZikfSXdJqlb0mxJlzXsO0XSNZLuknRMs8ElnSxpiaQFkr5S6jaWdKmk+eUxvS6W0n+apBvLGJdKWm9F9RER0XqrSpLcBjjV9lTgYeDAngZJE4BvAm+1/Wpgo177bge8GdgN+Iyk1RsbJW0AvAOYantH4HOlaTZwre2dgF2AxX3E8h3gU2WMhcBn+qiPiIgWW1WS5N22by3lbqCjoW074C7bd5ftC3rt+2PbT9l+CHgA2LhX+yPA34BvS3on8GSpnwGcDmB7ue2ee7NfFIukScBk29eW+nOBvevq+zpZSUdI6pLU9eCT7qt7RESspFUlST7VUF7OwK61rnBf289QzTIvBvYHfjqMsfSL7Tm2O213brS2hnr4iIgoVpUkuSLLgK0ldZTtgways6SJwCTblwMfA3YqTVcCR5U+48ussKkyy/yLpL1K1T9SLdU2rR9IfBERMXxW2btbe9j+q6QPAz+V9AQwrz/7Sboc+BBg4Efl2qaAfy5dPgrMkfRBqhnjUcAfVjDkocAZktYG7gI+0Ed9RES0mOxV/5qWpIm2H5ck4FTgDttfa3VcQ6Fz0/HuOmJiq8OIiJGQj6UbMpK6bXf21W8sLLcC/B9Jt1LdgTqJ6m7XiIiIFVrll1sByqxxlZg5RkTEyBkTSXKVtunOcGJXq6OIiFgljZXl1oiIiAFLkoyIiKiRJBkREVEjSTIiIqJGkmRERESNJMmIiIgaSZIRERE1kiQjIiJqJElGRETUSJKMiIiokSQZERFRI0kyIiKiRj7gfJRb+PtH6Jj141aHERExou45eb8ROU5mkhERETWSJCMiImokSUZERNRoSZKU1Clp9jAf44bys0PSewc51nmS3j40kUVExGjRkiRpu8v2MYMdR1LtjUe2p5diBzCoJDlUVhRvRES0nyFJkmW2tqhh++OSTpR0jaQvSrpZ0u2S9irtr5N0maRxku6RNLlh3zskbSxpI0k/kDSvPF5b2k+U9F1J1wPflTS1jH+rpAWStin9Hi9DngzsVdo/JumXkqY1HO9XknbqdT7jJJ0m6TZJPwc2bGjbVdK1krol/UTSxqV+93L8WyV9RdKtpf5Dkn4o6WrgilI3q8S8QNIJDWMf2nAup0nKcnhERAuNxH/Cq9neDTgW+Exjg+1ngR8B7wCQ9Brgt7bvB74OfM32rsCBwLcadp0C7GP7PcCRwNdtTwM6gft6HX8WcJ3taba/BnwbOKwc75XABNvze+0zE9iqHOcDwPTSf80S14G2Xw2cB/xb2eds4EMljt52Bt5p+w2S9gW2BF4DTAOmS5ouaYfyPEwvY6wGHNzsCZV0hKQuSV3Ln3ykWZeIiBgCI7H8d0n52U2
19Nnb94ETqJLMwWUbYB9giqSefi+RNLGU59r+ayn/Gjhe0ubAJbbv6COei4B/lfQJ4HDgnCZ99gYuKEn8PknXlPrtganAL0pc40v7hsAatm8u/b5X4u/xM9t/KeU3AW8FbinbE4FXApOBXYGuMvZawL3NTsD2HGAOwJqbbOM+zjciIlbSUCXJZ3jhrHRCQ/mp8nN5zfF+DbxC0kbA24HPlfpxwO62/9bYuSSQJ3q2bX9P0k3AfsDlkv6v7avqArX9ZFlCfRvwbuDVfZ/e84cHFtjeq1dMG9b07/FEQ1nA52x/u9cYHwPOsv2vA4gnIiKG0VAtt94PvFTSBmVJcv/+7mjbwKXAV4Gltv9Umn4GfKSnX+N1xEaStgbusj2baul2x15dHgPW7VX3LWA2MK9hhtfol8BB5drkZsA/lPolwGaSdivHXkPSVNsPAU9L6iz9mi6TFlcAH5S0Thlj85JkfwG8uyfhludyyxWMExERw2xIkqTtp4GTgJuBnwO3DXCI7wPv4/mlVoBjgM5yc8sSqmuPzbwbWFRulNkB+E6v9gXAcknzy2wN293Ao1RLvEB1PVTSGWXzYuB3VEnxbKrZLraforpe+VVJC6iWTF9T9jkcOFvSLVQz6aYXC21fXsa/UdJC4D+BibYXAp+lWspdQPVLwsY15xwRESNA1URubJG0KXANsF257jgUY060/XgpHw+sb/tfhmLsFVlzk228yaH/MdyHiYhoK4P97FZJ3bY7++o35v7EQNL7gZuA44cqQRYHlD/dWATsAXxhCMeOiIgWGJMzyVVJZ2enu7q6Wh1GRMSokplkRETEICVJRkRE1EiSjIiIqJEkGRERUSNJMiIiokaSZERERI38CcgoJ+kxYFmr46ixIfBQq4OokdhWXjvHl9hWTjvHBsMT38tsb9RXp3wJ8Oi3rD9/69MKkroS28C1c2zQ3vEltpXTzrFBa+PLcmtERESNJMmIiIgaSZKj35xWB7ACiW3ltHNs0N7xJbaV086xQQvjy407ERERNTKTjIiIqJEkOUpJeoukZZLulDRrhI55lqQHyteB9dStL+nnku4oP9cr9ZI0u8S3QNIuDfscWvrfIenQIYptC0lXS1oiabGkj7ZZfBMk3Vy+/HuxpM+W+q0k3VTi+L6kNUr9mmX7ztLe0TDWcaV+maQ3D0V8Zdzxkm6RdFk7xSbpHkkLy1fRdZW6dnldJ0u6WNJtkpZK2qONYtu2PGc9j0clHdtG8X2s/FtYJOmC8m+kLd5zL2A7j1H2AMYD/w1sDawBzAemjMBx9wZ2ARY11H0JmFXKs4AvlvK+wE8AAbsDN5X69YG7ys/1Snm9IYhtE2CXUl4XuB2Y0kbxCZhYyqtTfafp7sB/AgeX+jOAo0r5w8AZpXww8P1SnlJe7zWBrcr7YPwQvb7/DHwPuKxst0VswD3Ahr3q2uV1PRf4UCmvAUxul9h6xTke+CPwsnaID9gMuBtYq+G9dli7vOdeEOtQDpbHyDyovtT5iobt44DjRujYHbwwSS4DNinlTaj+bhPgm8B7evcD3gN8s6H+Bf2GMM4fAW9sx/iAtYHfAK+h+gPp1Xq/rsAVwB6lvFrpp96vdWO/Qca0OXAlMAO4rByrXWK7hxcnyZa/rsAkqv/o1W6xNYn1TcD17RIfVZK8lyrxrlbec29ul/dc4yPLraNTzxusx32lrhU2tv2HUv4jsHEp18U47LGXpZidqWZrbRNfWc68FXgA+DnVb70P236mybGei6O0PwJsMIzx/QfwSeDZsr1BG8Vm4GeSuiUdUera4XXdCngQOLssU39L0jptEltvBwMXlHLL47P9e+ArwO+AP1C9h7ppn/fcc5IkY8i4+lWupbdLS5oI/AA41vajjW2tjs/2ctvTqGZtuwHbtSqWRpL2Bx6w3d3qWGrsaXsX4K3A0ZL2bmxs4eu6GtXlh9Nt7ww8QbV82Q6xPadc1zsAuKh3W6viK9dB30b1i8amwDrAW0Y6jv5
Ikhydfg9s0bC9ealrhfslbQJQfj5Q6utiHLbYJa1OlSDPt31Ju8XXw/bDwNVUy0mTJfV8PGTjsZ6Lo7RPAv40TPG9FjhA0j3AhVRLrl9vk9h6Zh3YfgC4lOoXjHZ4Xe8D7rN9U9m+mCpptkNsjd4K/Mb2/WW7HeLbB7jb9oO2nwYuoXoftsV7rlGS5Og0D9im3Am2BtVSytwWxTIX6Lnb7VCqa4E99e8vd8ztDjxSlniuAN4kab3y2+SbSt2gSBLwbWCp7a+2YXwbSZpcymtRXS9dSpUsZ9bE1xP3TOCq8lv/XODgcrffVsA2wM2Dic32cbY3t91B9V66yvYh7RCbpHUkrdtTpno9FtEGr6vtPwL3Stq2VL0BWNIOsfXyHp5fau2Jo9Xx/Q7YXdLa5d9uz3PX8vfciwzlBc48Ru5BdSfa7VTXtY4foWNeQHX94Gmq36I/SHVd4ErgDuAXwPqlr4BTS3wLgc6GcQ4H7iyPDwxRbHtSLRstAG4tj33bKL4dgVtKfIuAE0r91lT/qO+kWg5bs9RPKNt3lvatG8Y6vsS9DHjrEL/Gr+P5u1tbHluJYX55LO55r7fR6zoN6Cqv6w+p7v5si9jKuOtQzbgmNdS1RXzAZ4Hbyr+H71Ldodry91zvRz5xJyIiokaWWyMiImokSUZERNRIkoyIiKiRJBkREVEjSTIiIqJGkmRERESNJMmIiIgaSZIRERE1/gcth/LdJnrwmQAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df[\"education\"].value_counts().plot.barh()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "将类别型数据转换成为数值型数据最直接的解决方法就是将此特征的每个值映射到一个唯一的数字。例如,可以将 university.degree 映射到 0 ,将 basic.9y 映射到 1,依此类推。 这里可以使用 `sklearn.preprocessing.LabelEncoder` 来执行此映射。" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "label_encoder = LabelEncoder()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "该类的 `fit` 方法会查找所有一列特征中的所有类别并构建类别和数字之间的映射,用 `transform` 方法将类别转换为数字。" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{0: 'basic.4y',\n", " 1: 'basic.6y',\n", " 2: 'basic.9y',\n", " 3: 'high.school',\n", " 4: 'illiterate',\n", " 5: 'professional.course',\n", " 6: 'university.degree',\n", " 7: 'unknown'}" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAW4AAAD8CAYAAABXe05zAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAADz5JREFUeJzt3X+s3XV9x/Hna7cUpJCWH5UgJRYygqJuQG6YRGc2nAhq8B+SlewHOpdmm1tkMzEQk0X/c8tmdIlRG3/MbIo/EDbCVGSCcS5b9RaKtJRqxSrthFYNP40i+N4f53vxcrm399tyvr3nQ5+P5KTf8z1fvudFv6ev+7mf8z3fk6pCktSOX1vuAJKkg2NxS1JjLG5JaozFLUmNsbglqTEWtyQ1xuKWpMZY3JLUGItbkhqzYoidnnzyybV+/fohdi1Jz0lbtmz5UVWt7bPtIMW9fv16ZmZmhti1JD0nJfl+322dKpGkxljcktQYi1uSGmNxS1JjLG5JaozFLUmNsbglqTEWtyQ1pndxJ5lKckeSm4YMJEk6sIMZcb8N2DFUEElSP72KO8k64PXAR4aNI0laSt8R9/uAdwC/HDCLJKmHJYs7yRuAfVW1ZYntNiaZSTKzf//+sQWUJD1dnxH3K4DLkuwGPg1clORf529UVZuqarqqpteu7XVlQknSIViyuKvqmqpaV1XrgQ3ArVX1h4MnkyQtyPO4JakxB/VFClX1VeCrgySRJPXiiFuSGmNxS1JjLG5JaozFLUmNsbglqTEWtyQ1xuKWpMZY3JLUGItbkhpjcUtSYyxuSWqMxS1JjbG4JakxFrckNcbilqTGHNT1uPt6fO+j7Ln6v4bY9WDWvee3lzuCJPXiiFuSGmNxS1JjlizuJGcn2Trn9nCSqw5HOEnSMy05x11VO4FzAZJMAXuBGwbOJUlaxMFOlbwa+G5VfX+IMJKkpR1scW8Arh0iiCSpn97FnWQlcBnwuUUe35hkJsnMT37
64LjySZLmOZgR96XA7VX1wEIPVtWmqpququkTj10znnSSpGc4mOK+AqdJJGnZ9SruJKuA1wDXDxtHkrSUXh95r6rHgJMGziJJ6sFPTkpSYyxuSWrMIFcHXHnacV5tT5IG4ohbkhpjcUtSYyxuSWqMxS1JjbG4JakxFrckNcbilqTGWNyS1BiLW5IaY3FLUmMsbklqjMUtSY2xuCWpMYNcHfCBe3fxj7//hiF2PXHe/pmbljuCpCOMI25JaozFLUmNWbK4k3wsyb4k2w5HIEnSgfUZcf8zcMnAOSRJPS1Z3FX1NeAnhyGLJKmHsc1xJ9mYZCbJzGM/f3xcu5UkzTO24q6qTVU1XVXTq45eOa7dSpLm8awSSWqMxS1JjelzOuC1wP8AZyfZk+Qtw8eSJC1myY+8V9UVhyOIJKkfp0okqTGDXGTqlDN/3YsvSdJAHHFLUmMsbklqjMUtSY2xuCWpMRa3JDXG4pakxljcktQYi1uSGmNxS1JjLG5JaozFLUmNsbglqTEWtyQ1ZpCrA+77/iN84M9uHWLXGshbP3TRckeQ1JMjbklqjMUtSY3pVdxJLkmyM8muJFcPHUqStLg+XxY8BXwAuBQ4B7giyTlDB5MkLazPiPsCYFdV3VtVjwOfBt44bCxJ0mL6FPdpwH1z7u/p1j1Nko1JZpLMPPqzB8eVT5I0z9jenKyqTVU1XVXTxx2zZly7lSTN06e49wKnz7m/rlsnSVoGfYr7m8BZSc5IshLYANw4bCxJ0mKW/ORkVT2R5C+Bm4Ep4GNVtX3wZJKkBfX6yHtVfQH4wsBZJEk9+MlJSWrMIBeZev4Lj/eiRZI0EEfcktQYi1uSGmNxS1JjLG5JaozFLUmNsbglqTEWtyQ1xuKWpMZY3JLUGItbkhpjcUtSYyxuSWqMxS1JjRnk6oA/27adHS968RC71nPEi+/ZsdwRpGY54pakxljcktSYXlMlSXYDjwBPAk9U1fSQoSRJizuYOe7fraofDZZEktSLUyWS1Ji+xV3Al5NsSbJxyECSpAPrO1Xyyqram+T5wC1J7qmqr83doCv0jQCnrhjkLENJEj1H3FW1t/tzH3ADcMEC22yqqumqmj5xyuKWpKEsWdxJViU5fnYZuBjYNnQwSdLC+gyNTwFuSDK7/aeq6kuDppIkLWrJ4q6qe4HfPAxZJEk9eDqgJDVmkHcRj3npS3jxzMwQu5akI54jbklqjMUtSY2xuCWpMRa3JDXG4pakxljcktQYi1uSGmNxS1JjLG5JaozFLUmNsbglqTEWtyQ1xuKWpMYMcnXA7T/ezss+8bIhdi0d0F1X3rXcEaTBOeKWpMZY3JLUmD5fFnx6ktuS3J1ke5K3HY5gkqSF9ZnjfgJ4e1Xd3n3b+5Ykt1TV3QNnkyQtYMkRd1X9sKpu75YfAXYApw0dTJK0sIOa406yHjgP2DxEGEnS0noXd5LjgM8DV1XVwws8vjHJTJKZJx95cpwZJUlz9CruJEcxKu1PVtX1C21TVZuqarqqpqeOnxpnRknSHH3OKgnwUWBHVb13+EiSpAPpM+J+BfBHwEVJtna31w2cS5K0iCVPB6yqrwM5DFkkST34yUlJaozFLUmNGeTqgC856SXMXDkzxK4l6YjniFuSGmNxS1JjLG5JaozFLUmNsbglqTEWtyQ1xuKWpMZY3JLUGItbkhpjcUtSYyxuSWqMxS1JjRnkIlP83x3wrtWD7FrShHnXQ8ud4IjjiFuSGmNxS1Jj+nxZ8DFJvpHkziTbk7z7cASTJC2szxz3z4GLqurRJEcBX0/yxar634GzSZIW0OfLggt4tLt7VHerIUNJkhbXa447yVSSrcA+4Jaq2jxsLEnSYnoVd1U9WVXnAuuAC5K8dP42STYmmUkys/+nDsglaSgHdVZJVT0I3AZcssBjm6pquqqm1x6bceWTJM3T56yStUnWdMvPA14D3DN0MEnSwvqcVXIq8IkkU4yK/rNVddOwsSRJi+lzVsm3gPM
OQxZJUg9+clKSGmNxS1Jjhrk64AvOg3fNDLJrSTrSOeKWpMZY3JLUGItbkhpjcUtSYyxuSWqMxS1JjbG4JakxFrckNcbilqTGWNyS1BiLW5IaY3FLUmMGucjUXXsfYv3V/zHEriVpIu1+z+sP23M54pakxljcktQYi1uSGtOruJOsSXJdknuS7Ehy4dDBJEkL6/vm5PuBL1XV5UlWAscOmEmSdABLFneS1cCrgDcBVNXjwOPDxpIkLabPVMkZwH7g40nuSPKRJKvmb5RkY5KZJDNP/vShsQeVJI30Ke4VwPnAB6vqPOAx4Or5G1XVpqqarqrpqWNXjzmmJGlWn+LeA+ypqs3d/esYFbkkaRksWdxVdT9wX5Kzu1WvBu4eNJUkaVF9zyr5K+CT3Rkl9wJvHi6SJOlAehV3VW0FpgfOIknqwU9OSlJjBrk64MtOW83MYbxSliQdSRxxS1JjLG5JaozFLUmNsbglqTEWtyQ1xuKWpMakqsa/0+QRYOfYdzweJwM/Wu4QizDboZnkbDDZ+cx2aIbI9sKqWttnw0HO4wZ2VtVEftIyyYzZDp7ZDt0k5zPboVnubE6VSFJjLG5JasxQxb1poP2Og9kOjdkO3STnM9uhWdZsg7w5KUkajlMlktSYsRZ3kkuS7EyyK8kzvpdyCEk+lmRfkm1z1p2Y5JYk3+n+PKFbnyT/1OX7VpLz5/w3V3bbfyfJlWPKdnqS25LcnWR7krdNSr4kxyT5RpI7u2zv7tafkWRzl+Ez3ZdnkOTo7v6u7vH1c/Z1Tbd+Z5LXPttsc/Y71X1B9U0TmG13kruSbE0y061b9uPa7XNNkuuS3JNkR5ILJyFbkrO7v6/Z28NJrpqEbHP2+9fdv4dtSa7t/p1MzOvuKVU1lhswBXwXOBNYCdwJnDOu/R/geV/F6Dswt81Z9/fA1d3y1cDfdcuvA74IBHg5sLlbfyKjb/Y5ETihWz5hDNlOBc7vlo8Hvg2cMwn5uuc4rls+CtjcPedngQ3d+g8Bf94t/wXwoW55A/CZbvmc7lgfDZzRvQamxnRs/wb4FHBTd3+Ssu0GTp63btmPa7ffTwB/2i2vBNZMSrY5GaeA+4EXTko24DTge8Dz5rze3jRJr7unso7xQFwI3Dzn/jXANeMMe4DnXs/Ti3sncGq3fCqj88oBPgxcMX874Argw3PWP227Meb8d+A1k5YPOBa4HfgtRh8qWDH/mAI3Axd2yyu67TL/OM/d7llmWgd8BbgIuKl7ronI1u1rN88s7mU/rsBqRuWTScs2L8/FwH9PUjZGxX0fox8IK7rX3Wsn6XU3exvnVMns//SsPd265XBKVf2wW74fOKVbXizj4Nm7X6POYzSynYh83VTEVmAfcAujkcGDVfXEAs/zVIbu8YeAk4bKBrwPeAfwy+7+SROUDaCALyfZkmRjt24SjusZwH7g490000eSrJqQbHNtAK7tliciW1XtBf4B+AHwQ0avoy1M1usOOALenKzRj7xlPXUmyXHA54GrqurhuY8tZ76qerKqzmU0ur0AeNFy5JgvyRuAfVW1ZbmzHMArq+p84FLgrUleNffBZTyuKxhNHX6wqs4DHmM0/TAJ2QDo5ogvAz43/7HlzNbNrb+R0Q+/FwCrgEuWI8tSxlnce4HT59xf161bDg8kORWg+3Nft36xjINlT3IUo9L+ZFVdP2n5AKrqQeA2Rr8GrkkyeymEuc/zVIbu8dXAjwfK9grgsiS7gU8zmi55/4RkA54anVFV+4AbGP3gm4TjugfYU1Wbu/vXMSryScg261Lg9qp6oLs/Kdl+D/heVe2vql8A1zN6LU7M627WOIv7m8BZ3TuwKxn9KnTjGPd/MG4EZt9pvpLR3PLs+j/u3q1+OfBQ9yvazcDFSU7ofupe3K17VpIE+Ciwo6reO0n5kqxNsqZbfh6jufcdjAr88kWyzWa+HLi1Gx3dCGzo3mE/AzgL+MazyVZV11TVuqpaz+h1dGtV/cEkZAN
IsirJ8bPLjI7HNibguFbV/cB9Sc7uVr0auHsSss1xBb+aJpnNMAnZfgC8PMmx3b/d2b+7iXjdPc04J8wZvQv8bUZzpe8c574P8JzXMpqP+gWj0cZbGM0zfQX4DvCfwIndtgE+0OW7C5ies58/AXZ1tzePKdsrGf3a9y1ga3d73STkA34DuKPLtg342279mYxeZLsY/Sp7dLf+mO7+ru7xM+fs651d5p3ApWM+vr/Dr84qmYhsXY47u9v22df6JBzXbp/nAjPdsf03RmdeTEq2VYxGpavnrJuIbN1+3w3c0/2b+BdGZ4ZMxOtu7s1PTkpSY57zb05K0nONxS1JjbG4JakxFrckNcbilqTGWNyS1BiLW5IaY3FLUmP+H3jwbo8iXYE/AAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "mapped_education = pd.Series(label_encoder.fit_transform(df[\"education\"]))\n", "mapped_education.value_counts().plot.barh()\n", "dict(enumerate(label_encoder.classes_))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "从上图可以看出,转换之后,会把类别型的特征都替换成了数值型特征。" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
[table: first five rows of df after label-encoding education; columns: age, job, marital, education, default, housing, loan, contact, month, day_of_week, duration, campaign, pdays, previous, poutcome, emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m, nr.employed]
" ], "text/plain": [ " age job marital ... cons.conf.idx euribor3m nr.employed\n", "0 26 student single ... -41.8 4.961 5228.1\n", "1 46 admin. married ... -36.1 4.963 5228.1\n", "2 49 blue-collar married ... -41.8 4.864 5228.1\n", "3 31 technician married ... -33.6 1.044 5076.2\n", "4 42 housemaid married ... -42.0 4.191 5195.8\n", "\n", "[5 rows x 20 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"education\"] = mapped_education\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "用同样的方法转换数据集的其他类别型的特征。" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
[table: first five rows of df after label-encoding every categorical column; columns: age, job, marital, education, default, housing, loan, contact, month, day_of_week, duration, campaign, pdays, previous, poutcome, emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m, nr.employed]
" ], "text/plain": [ " age job marital ... cons.conf.idx euribor3m nr.employed\n", "0 26 8 2 ... -41.8 4.961 5228.1\n", "1 46 0 1 ... -36.1 4.963 5228.1\n", "2 49 1 1 ... -41.8 4.864 5228.1\n", "3 31 9 1 ... -33.6 1.044 5076.2\n", "4 42 3 1 ... -42.0 4.191 5195.8\n", "\n", "[5 rows x 20 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "categorical_columns = df.columns[df.dtypes == \"object\"].union([\"education\"])\n", "for column in categorical_columns:\n", " df[column] = label_encoder.fit_transform(df[column])\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "这种方法存在一个问题,那就是会引入了一些可能不存在任何意义的相对排序。例如,在 job 这个特征的值中隐含地引入了代数,这可以从客户端 #1 的工作中减去客户端 #2 的工作:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-1.0" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[1].job - df.loc[2].job" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "这个操作有意义吗?显然没有, 现在使用转换后的特征来训练逻辑回归模型。" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 0.89 1.00 0.94 6128\n", " 1 0.62 0.01 0.02 771\n", "\n", " micro avg 0.89 0.89 0.89 6899\n", " macro avg 0.75 0.50 0.48 6899\n", "weighted avg 0.86 0.89 0.84 6899\n", "\n" ] } ], "source": [ "def logistic_regression_accuracy_on(dataframe, labels):\n", " features = dataframe.values\n", " labels = np.array(labels)\n", " train_features, test_features, train_labels, test_labels = train_test_split(\n", " features, labels.ravel()\n", " )\n", "\n", " logit = LogisticRegression(max_iter=1000, solver=\"lbfgs\")\n", " logit.fit(train_features, train_labels)\n", " return classification_report(test_labels, logit.predict(test_features))\n", "\n", "\n", "print(logistic_regression_accuracy_on(df[categorical_columns], labels))" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "可以看到 1 类的召回率为 0 或接近于 0,这意味着模型几乎把数据都分给了 0 类。为了避免这个问题,这里将使用另一种转换方法:独热编码。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 独热编码" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "独热编码又称为 One-Hot 编码,是用只含 0 和 1 来表示类别型特征的方法。假设某项特征含有三个类别值。独热编码会创建三个向量来表示这三个类别值,例如:[1,0,0],[0,1,0],[0,0,1]。来看一个例子。" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
[table: one-hot example, one row with columns 0 to 9, all 0 except a 1 in column 6]
" ], "text/plain": [ " 0 1 2 3 4 5 6 7 8 9\n", "0 0 0 0 0 0 0 1 0 0 0" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "one_hot_example = pd.DataFrame([{i: 0 for i in range(10)}])\n", "one_hot_example.loc[0, 6] = 1\n", "one_hot_example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "在使用 One-Hot 编码时,可以直接调用 `sklearn.preprocessing.OneHotEncoder` 接口。 默认情况下,One-Hot 将数据转换为稀疏矩阵以节省内存空间,因为大多数值都是零。 但是,在本实验这个特定的例子中,因为数据量比较少,所以没有遇到内存爆满的问题,因此这里使用「稠密」矩阵表示。" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "onehot_encoder = OneHotEncoder(sparse=False, categories=\"auto\")" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
[table: first five rows of the one-hot encoded categorical columns; 53 columns labelled 0 to 52, each value 0.0 or 1.0]
" ], "text/plain": [ " 0 1 2 3 4 5 6 ... 46 47 48 49 50 51 52\n", "0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0\n", "1 1.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0\n", "2 0.0 1.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0\n", "3 1.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0\n", "4 0.0 1.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 1.0 0.0\n", "\n", "[5 rows x 53 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "encoded_categorical_columns = pd.DataFrame(\n", " onehot_encoder.fit_transform(df[categorical_columns])\n", ")\n", "encoded_categorical_columns.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "在进行 One-Hot 编码之后,得到 53 列数据,分别对应于原数据集类别特征的唯一值。" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 0.90 0.99 0.94 6099\n", " 1 0.61 0.17 0.26 800\n", "\n", " micro avg 0.89 0.89 0.89 6899\n", " macro avg 0.76 0.58 0.60 6899\n", "weighted avg 0.87 0.89 0.86 6899\n", "\n" ] } ], "source": [ "print(logistic_regression_accuracy_on(encoded_categorical_columns, labels))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "由上面的结果可知, 1 类的召回率得到了改善。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 哈希技巧" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "在实际的工程应用中,真实数据可能是不稳定的,也就是说我们无法保证一些类别特征不会出现新的值。 此问题可能会导致训练好的模型无法使用。 除此之外,类别编码需要对整个数据集进行分析,并在内存中构建映射,这使得处理大型数据集变得尤为困难。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "有一种基于哈希的类别编码方法,并且被称为哈希技巧。哈希函数将类别型特征编码为不同的特征值,例如:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "university.degree → -5370095693728667446\n", "high.school → -7042998680499890429\n", "illiterate → -7750457402342120656\n" ] } ], "source": [ "for s in (\"university.degree\", 
\"high.school\", \"illiterate\"):\n", "    print(s, \"→\", hash(s))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In general we do not want negative or very large values from the hash function, so we restrict the hash values to a fixed range." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "university.degree → 4\n", "high.school → 21\n", "illiterate → 19\n" ] } ], "source": [ "hash_space = 25\n", "for s in (\"university.degree\", \"high.school\", \"illiterate\"):\n", "    print(s, \"→\", hash(s) % hash_space)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hash encoding can also produce vectors similar to those of one-hot encoding. Consider the following example:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "job=student → 14\n", "marital=single → 21\n", "day_of_week=mon → 1\n" ] }, { "data": { "text/html": [ "
[table: hashing example, one row with 25 columns labelled 0 to 24, all 0.0 except 1.0 in columns 1, 14 and 21]
" ], "text/plain": [ " 0 1 2 3 4 5 6 ... 18 19 20 21 22 23 24\n", "0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0\n", "\n", "[1 rows x 25 columns]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hashing_example = pd.DataFrame([{i: 0.0 for i in range(hash_space)}])\n", "for s in (\"job=student\", \"marital=single\", \"day_of_week=mon\"):\n", " print(s, \"→\", hash(s) % hash_space)\n", " hashing_example.loc[0, hash(s) % hash_space] = 1\n", "hashing_example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "这里需要指出的是,哈希编码不仅需要散列特征值,也需要散列「特征名称 + 特征值」对。 因为这样可以区分不同特征的相同值。" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "assert hash(\"no\") == hash(\"no\")\n", "assert hash(\"housing=no\") != hash(\"loan=no\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "使用哈希编码时是否可能发生冲突? 当然,这是可能的。不过只要哈希空间足够大,这个问题可以避免。 但一般情况下,即使发生冲突,回归或分类指标也不会受到太大影响。 在这种情况下,哈希冲突可作为正则化的一种形式。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "你可能在说:WTF,哈希似乎违反直觉。但事实上,有时这是唯一可行的处理类别数据的方法。 而且,这种技术已被证明是有效的。等你处理了足够多的数据之后,你可能自己意识到这一点。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 实验总结" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "在本次实验中,我们主要讲述了随机梯度下降算法的原理以及它的优势。为解决大数据训练算法的问题,我们讲述了在线学习方法。为将类别型数据转化为数值型数据,讲述了 One-Hot 编码和哈希技巧。虽然在线学习在 scikit-learn 中实现的方法都很不错,但是在线学习还有一些其他的方法,你可以去了解 [ Vowpal Wabbit](https://github.com/JohnLangford/vowpal_wabbit/wiki)。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " 相关链接\n", "- [ 深度学习](http://www.deeplearningbook.org/)\n", "- [ VW 快速学习方法](http://fastml.com/blog/categories/vw/)\n", "- [ 了解实验楼《楼+ 机器学习和数据挖掘课程》](https://www.shiyanlou.com/louplus/)" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", 
"version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }