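{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "***\n", "***\n", "# Computational Communication and Machine Learning\n", "\n", "***\n", "***\n", "\n", "Wang Chengjun\n", "\n", "wangchengjun@nju.edu.cn\n", "\n", "Computational Communication website: http://computational-communication.com" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "![](./img/machine.jpg)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## 1. Supervised Learning\n", "\n", "How it works:\n", "- The algorithm has a target or outcome variable (the dependent variable).\n", "- This variable is predicted from a known set of predictor variables (the independent variables).\n", "- Using these variables, we learn a function that maps input values to the desired output values.\n", "- Training continues until the model reaches the desired accuracy on the training data.\n", "- Examples of supervised learning: regression, decision trees, random forests, k-nearest neighbors, logistic regression." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## 2. Unsupervised Learning\n", "\n", "How it works:\n", "- Here there is no target or outcome variable to predict or estimate.\n", "- The algorithm is used to cluster observations into groups.\n", "- This kind of analysis is widely used for customer segmentation, dividing customers into groups that receive different interventions.\n", "- Examples of unsupervised learning: association rule algorithms and k-means." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## 3. Reinforcement Learning\n", "\n", "How it works:\n", "- The algorithm trains a machine to make decisions.\n", "- The machine is placed in an environment where it trains itself through repeated trial and error.\n", "- The machine learns from past experience and tries to use the knowledge it has mastered best to make accurate business decisions.\n", "- Examples of reinforcement learning: Markov decision processes; AlphaGo is a well-known application.\n", "\n", "> Chess. 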
Here, the agent decides upon a series of moves depending on the state of the board (the environment), and the\n", "reward can be defined as win or lose at the end of the game:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Linear regression\n", "- Logistic regression\n", "- Decision trees\n", "- SVM\n", "- Naive Bayes\n", "---\n", "- K-nearest neighbors\n", "- K-means\n", "- Random forests\n", "- Dimensionality reduction\n", "- Gradient Boosting and AdaBoost\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "> # Linear Regression with sklearn\n", "***\n", "\n", "Wang Chengjun\n", "\n", "wangchengjun@nju.edu.cn\n", "\n", "Computational Communication website: http://computational-communication.com" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Linear Regression\n", "- Typically used to estimate the value of a continuous variable (house prices, number of calls, total sales, etc.).\n", "- It models the relationship between the independent variable X and the dependent variable Y by fitting a best-fit line.\n", "- This best-fit line is called the regression line and is written as the linear equation $Y = \\beta X + C$.\n", "- The coefficient $\\beta$ and the intercept $C$ can be estimated by ordinary least squares." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2019-04-22T08:22:22.109042Z", "start_time": "2019-04-22T08:22:20.811040Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "%matplotlib inline\n", "import sklearn\n", "from sklearn import datasets\n", "from sklearn import linear_model\n", "import matplotlib.pyplot as plt\n", "from sklearn.metrics import classification_report\n", "from sklearn.preprocessing import scale" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2019-04-22T08:22:24.400103Z", "start_time": "2019-04-22T08:22:24.390296Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# Boston housing data\n", "# NOTE: load_boston was removed in scikit-learn 1.2, so this cell needs an older release\n", "boston = datasets.load_boston()\n", "y = boston.target  # outcome variable\n", "X = boston.data  # 13 predictor variables" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": 
"2019-04-22T08:22:25.362696Z", "start_time": "2019-04-22T08:22:25.356162Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',\n", " 'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='|t| [0.025 0.975]\n", "-----------------------------------------------------------------------------------\n", "Intercept 36.4595 5.103 7.144 0.000 26.432 46.487\n", "boston.data[0] -0.1080 0.033 -3.287 0.001 -0.173 -0.043\n", "boston.data[1] 0.0464 0.014 3.382 0.001 0.019 0.073\n", "boston.data[2] 0.0206 0.061 0.334 0.738 -0.100 0.141\n", "boston.data[3] 2.6867 0.862 3.118 0.002 0.994 4.380\n", "boston.data[4] -17.7666 3.820 -4.651 0.000 -25.272 -10.262\n", "boston.data[5] 3.8099 0.418 9.116 0.000 2.989 4.631\n", "boston.data[6] 0.0007 0.013 0.052 0.958 -0.025 0.027\n", "boston.data[7] -1.4756 0.199 -7.398 0.000 -1.867 -1.084\n", "boston.data[8] 0.3060 0.066 4.613 0.000 0.176 0.436\n", "boston.data[9] -0.0123 0.004 -3.280 0.001 -0.020 -0.005\n", "boston.data[10] -0.9527 0.131 -7.283 0.000 -1.210 -0.696\n", "boston.data[11] 0.0093 0.003 3.467 0.001 0.004 0.015\n", "boston.data[12] -0.5248 0.051 -10.347 0.000 -0.624 -0.425\n", "==============================================================================\n", "Omnibus: 178.041 Durbin-Watson: 1.078\n", "Prob(Omnibus): 0.000 Jarque-Bera (JB): 783.126\n", "Skew: 1.521 Prob(JB): 8.84e-171\n", "Kurtosis: 8.281 Cond. No. 1.51e+04\n", "==============================================================================\n", "\n", "Warnings:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "[2] The condition number is large, 1.51e+04. 
This might indicate that there are\n", "strong multicollinearity or other numerical problems.\n" ] } ], "source": [ "import numpy as np\n", "import statsmodels.api as sm\n", "import statsmodels.formula.api as smf\n", "\n", "# Fit an OLS regression of the target on all 13 features (no transformation applied)\n", "results = smf.ols('boston.target ~ boston.data', data=boston).fit()\n", "\n", "print(results.summary())" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2019-04-22T08:22:29.198868Z", "start_time": "2019-04-22T08:22:29.179869Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# Fit the same model with scikit-learn on the full data set\n", "regr = linear_model.LinearRegression()\n", "lm = regr.fit(boston.data, y)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2019-04-22T08:22:30.210025Z", "start_time": "2019-04-22T08:22:30.203639Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "(36.45948838508965,\n", " array([-1.08011358e-01, 4.64204584e-02, 2.05586264e-02, 2.68673382e+00,\n", " -1.77666112e+01, 3.80986521e+00, 6.92224640e-04, -1.47556685e+00,\n", " 3.06049479e-01, -1.23345939e-02, -9.52747232e-01, 9.31168327e-03,\n", " -5.24758378e-01]),\n", " 0.7406426641094095)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Intercept, coefficients, and in-sample R^2 (scored on the data used for fitting)\n", "lm.intercept_, lm.coef_, lm.score(boston.data, y)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2019-04-22T08:22:31.110418Z", "start_time": "2019-04-22T08:22:31.107129Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "predicted = regr.predict(boston.data)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2019-04-22T08:22:32.479326Z", "start_time": "2019-04-22T08:22:31.916490Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig, ax = plt.subplots()\n", "ax.scatter(y, predicted)\n", "# Dashed diagonal: where predicted equals measured\n", "ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)\n", "ax.set_xlabel('$Measured$', fontsize = 20)\n", "ax.set_ylabel('$Predicted$', fontsize = 20)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Training and Test Sets" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2019-04-22T08:22:36.365683Z", "start_time": "2019-04-22T08:22:36.360788Z" } }, "outputs": [ { "data": { "text/plain": [ "array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02,\n", " 4.9800e+00],\n", " [2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,\n", " 9.1400e+00],\n", " [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02,\n", " 4.0300e+00],\n", " ...,\n", " [6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,\n", " 5.6400e+00],\n", " [1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9345e+02,\n", " 6.4800e+00],\n", " [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,\n", " 7.8800e+00]])" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "boston.data" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2019-04-22T08:22:48.265456Z", "start_time": "2019-04-22T08:22:48.261247Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "# Hold out 20% of the rows as a test set; random_state makes the split reproducible\n", "Xs_train, Xs_test, y_train, y_test = train_test_split(\n", "    boston.data, boston.target,\n", "    test_size=0.2,\n", "    random_state=42)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2019-04-22T08:22:51.873960Z", "start_time": "2019-04-22T08:22:51.869286Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "regr = 
linear_model.LinearRegression()\n", "lm = regr.fit(Xs_train, y_train)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2019-04-22T08:22:52.561738Z", "start_time": "2019-04-22T08:22:52.555669Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "(30.24675099392396,\n", " array([-1.13055924e-01, 3.01104641e-02, 4.03807204e-02, 2.78443820e+00,\n", " -1.72026334e+01, 4.43883520e+00, -6.29636221e-03, -1.44786537e+00,\n", " 2.62429736e-01, -1.06467863e-02, -9.15456240e-01, 1.23513347e-02,\n", " -5.08571424e-01]),\n", " 0.7508856358979673)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lm.intercept_, lm.coef_, lm.score(Xs_train, y_train)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2019-04-22T08:22:53.518402Z", "start_time": "2019-04-22T08:22:53.515220Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "predicted = regr.predict(Xs_test)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2019-04-22T08:22:54.585839Z", "start_time": "2019-04-22T08:22:54.380438Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig, ax = plt.subplots()\n", "ax.scatter(y_test, predicted)\n", "# Dashed diagonal: where predicted equals measured\n", "ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)\n", "ax.set_xlabel('$Measured$', fontsize = 20)\n", "ax.set_ylabel('$Predicted$', fontsize = 20)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Cross-Validation" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# k-fold cross-validation\n", "\n", "In k-fold CV, the training set is split into k smaller sets (folds). The following procedure is repeated for each of the k folds:\n", "- A model is trained using k-1 of the folds as training data;\n", "- the resulting model is validated on the remaining fold (i.e., it is used as a test set to compute a performance measure such as accuracy, or $R^2$ for regression).\n", "\n", "The reported score is the average over the k folds." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2019-04-22T08:23:06.421218Z", "start_time": "2019-04-22T08:23:06.407755Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "-1.5841985220997412" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.model_selection import cross_val_score\n", "\n", "regr = linear_model.LinearRegression()\n", "# For a regressor the default score is R^2, which can be negative on held-out folds\n", "scores = cross_val_score(regr, boston.data, boston.target, cv=3)\n", "scores.mean()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2019-04-22T08:24:03.323654Z", "start_time": "2019-04-22T08:24:01.612164Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Mean cross-validated R^2 for each number of folds k = 3, ..., 49\n", "scores = [cross_val_score(regr, boston.data, boston.target, cv=k).mean()\n", "          for k in range(3, 50)]\n", "plt.plot(range(3, 50), scores, 'r-o')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2019-04-22T08:24:34.174960Z", "start_time": "2019-04-22T08:24:34.155764Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "0.45059442471362826" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Standardize each feature to zero mean and unit variance.\n", "# OLS predictions are unchanged by this rescaling, so the score differs\n", "# from the cv=3 result because the folds differ, not because of scaling.\n", "data_X_scale = scale(boston.data)\n", "scores = cross_val_score(regr, data_X_scale, boston.target, cv=7)\n", "scores.mean()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Using Tianya BBS Data" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "ExecuteTime": { "end_time": "2019-04-22T08:24:46.198546Z", "start_time": "2019-04-22T08:24:46.171912Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "  <thead>\n", "    <tr><th></th><th>title</th><th>link</th><th>author</th><th>author_page</th><th>click</th><th>reply</th><th>time</th></tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>【民间语文第161期】宁波px启示:船进港湾人应上岸</td><td>/post-free-2849477-1.shtml</td><td>贾也</td><td>http://www.tianya.cn/50499450</td><td>194675</td><td>2703</td><td>2012-10-29 07:59</td></tr>\n",
"    <tr><th>1</th><td>宁波镇海PX项目引发群体上访 当地政府发布说明(转载)</td><td>/post-free-2839539-1.shtml</td><td>无上卫士ABC</td><td>http://www.tianya.cn/74341835</td><td>88244</td><td>1041</td><td>2012-10-24 12:41</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " title link author \\\n", "0 【民间语文第161期】宁波px启示:船进港湾人应上岸 /post-free-2849477-1.shtml 贾也 \n", "1 宁波镇海PX项目引发群体上访 当地政府发布说明(转载) /post-free-2839539-1.shtml 无上卫士ABC \n", "\n", " author_page click reply time \n", "0 http://www.tianya.cn/50499450 194675 2703 2012-10-29 07:59 \n", "1 http://www.tianya.cn/74341835 88244 1041 2012-10-24 12:41 " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv('../data/tianya_bbs_threads_list.txt', sep = \"\\t\", header=None)\n", "df=df.rename(columns = {0:'title', 1:'link', 2:'author',3:'author_page', 4:'click', 5:'reply', 6:'time'})\n", "df[:2]" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "ExecuteTime": { "end_time": "2019-04-22T08:24:47.185301Z", "start_time": "2019-04-22T08:24:47.169337Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# This function is meant to show readers that drawing\n", "# different random samples can give quite different results.\n", "def randomSplit(dataX, dataY, num):\n", "    import random\n", "    dataX_train = []\n", "    dataX_test = []\n", "    dataY_train = []\n", "    dataY_test = []\n", "    # Randomly pick `num` row positions for the test set\n", "    test_index = set(random.sample(range(len(dataX)), num))\n", "    for k in range(len(dataX)):\n", "        if k in test_index:\n", "            # Wrap each x in a list so X is 2-D, as scikit-learn expects\n", "            dataX_test.append([dataX[k]])\n", "            dataY_test.append(dataY[k])\n", "        else:\n", "            dataX_train.append([dataX[k]])\n", "            dataY_train.append(dataY[k])\n", "    return dataX_train, dataX_test, dataY_train, dataY_test" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "ExecuteTime": { "end_time": "2019-04-22T08:24:48.122580Z", "start_time": "2019-04-22T08:24:48.081523Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Variance score: 0.42\n" ] } ], "source": [ "import numpy as np\n", "\n", "# Use only one feature\n", "data_X = df.reply\n", "# Split the data into training/testing sets\n", "data_X_train, data_X_test, data_y_train, data_y_test = 
randomSplit(np.log(df.click+1),  # X: log(clicks + 1)\n", "    np.log(df.reply+1), 20)  # y: log(replies + 1); 20 test rows\n", "# Create linear regression object\n", "regr = linear_model.LinearRegression()\n", "# Train the model using the training sets\n", "regr.fit(data_X_train, data_y_train)\n", "# score() returns R^2 (the coefficient of determination): 1 is perfect prediction\n", "print('Variance score: %.2f' % regr.score(data_X_test, data_y_test))" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "ExecuteTime": { "end_time": "2019-04-22T08:24:49.133689Z", "start_time": "2019-04-22T08:24:49.129343Z" } }, "outputs": [ { "data": { "text/plain": [ "[[12.179091917198399], [11.387872315966666], [11.323941765302724]]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_X_train[:3]\n" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "ExecuteTime": { "end_time": "2019-04-22T08:24:50.276495Z", "start_time": "2019-04-22T08:24:50.273286Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "y_true, y_pred = data_y_test, regr.predict(data_X_test)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "ExecuteTime": { "end_time": "2019-04-22T08:24:51.151351Z", "start_time": "2019-04-22T08:24:50.992991Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAW4AAAD8CAYAAABXe05zAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4xLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvDW2N/gAADT1JREFUeJzt3UGIo/d5x/HfT7MbEjWBwM4cjNfzvimUQJtDzAqXYujBEFhMaHvoIUH1KUVgCDi0UGp0ykHXkLMgpinzkhBwDsWkBEM3BEPiROPaIfamJQ3W1CWwE0JIFkFLsk8PO7vZXc+MXs3onVeP9P2AYEd6X73P/r3+8vJKIzkiBADIo9P2AACAxRBuAEiGcANAMoQbAJIh3ACQDOEGgGQINwAkQ7gBIBnCDQDJXGriSbe3t6MsyyaeGgDW0v7+/i8iYqfOto2EuyxLTSaTJp4aANaS7WndbblUAgDJEG4ASIZwA0AyhBsAkiHcAJAM4QaAZAg3AJxDVVUqy1KdTkdlWaqqqsaP2cj7uAFgE1RVpcFgoNlsJkmaTqcaDAaSpH6/39hxOeMGgDMaDof3o33PbDbTcDhs9LiEGwDO6ODgYKH7l4VwA8AZ7e7uLnT/stQOt+0t2/9u+5UmBwKALEajkbrd7kP3dbtdjUajRo+7yBn3C5JuNjUIAGTT7/c1Ho9VFIVsqygKjcfjRl+YlCRHxPyN7KuSvippJOnvIuLTp23f6/WCTwcEgPps70dEr862dc+4vyzpHyTdOfNUAIClmBtu25+WdCsi9udsN7A9sT05PDxc2oAAgIfVOeN+WtJf2H5X0tclPWN779GNImIcEb2I6O3s1PoSBwDAGcwNd0S8GBFXI6KU9BlJ/xYRf9P4ZACAY/E+bgBIZqHPKomI70j6TiOTAABq4YwbAJIh3ACQDOEGgGQIN9CwNj5oH+uNL1IAGtTWB+1jvXHGDTSorQ/ax3oj3ECD2vqgfaw3wg00qK0P2sd6I9xAg9r6oH2sN8INNKitD9rHeqv1RQqL4osUAGAxTXyRAgBgRRBuAEiGcANAMoQbAJIh3ACQDOEGgGQINwAkQ7gBIBnCDQDJEG4ASIZwA0AyhBsAkiHcAJAM4QaAZAg3ACRDuAEgGcINAMkQbgBIhnADQDKEGwCSIdwAkAzhBoBkCDcAJEO4ASAZwg0AyRBuAEiGcANAMoQbAJKZG27bH7T9A9tv2X7b9hcvYjAAwPEu1djmfyU9ExG3bV+W9Jrtf42I7zc8GwDgGHPDHREh6fbRj5ePbtHkUACAk9W6xm17y/abkm5JejUiXm92LADASWqFOyJ+FxGflHRV0lO2P/HoNrYHtie2J4eHh8ueEwBwZKF3lUTEryTdkHT9mMfGEdGLiN7Ozs6y5gMAPKLOu0p2bH/06M8fkvQpST9pejAAwPHqvKvkMUlftb2lu6H/RkS80uxYAICT1HlXyY8kPXkBswAAauA3JwEgGcINAMkQbgBIhnADQDKEGwCSIdwAkAzhBoBkCDcAJEO4ASAZwg0AyRBuAEiGcANAMoQbG6mqKpVlqU6no7IsVVVV2yMBtdX5WFdgrVRVpcFgoNlsJkmaTqcaDAaSpH6/3+ZoQC2ccWPjDIfD+9G+ZzabaTgctjQRsBjCjY1zcHCw0P3AqiHc2Di7u7sL3Q+sGsKNjTMajdTtdh+6r9vtajQatTQRsBjCjY3T7/c1Ho9VFIVsqygKjcdjXphEGo6IpT9pr9eLyWSy9OcFgHVlez8ienW25YwbAJIh3ACQDOEGgGQINwAkQ7gBIBnCDQDJEG4ASIZwA0AyhBsAkiHcAJAM4QaAZAg3ACRDuAEgGcINAMkQbgBIhnADQDKEGwCSIdwAkMzccNt+wvYN2+/Yftv2CxcxGADgeJdqbPNbSX8fEW/Y/oikfduvRsQ7Dc8GADjG3DPuiPh5RLxx9OffSLop6fGmBwMAHG+ha9y2S0lPSnq9iWE
AAPPVDrftD0t6WdIXIuLXxzw+sD2xPTk8PFzmjACAB9QKt+3LuhvtKiK+edw2ETGOiF5E9HZ2dpY5IwDgAXXeVWJJX5F0MyK+1PxIAIDT1DnjflrSc5Kesf3m0e3ZhucCAJxg7tsBI+I1Sb6AWQAANfCbkwCQDOEGgGQI94aoqkplWarT6agsS1VV1fj+5z0mgBNExNJv165dC6yOvb296Ha7Ien+rdvtxt7eXmP7n/eYwKaRNImajfXd7Zer1+vFZDJZ+vPibMqy1HQ6fd/9RVHo3XffbWT/8x4T2DS29yOiV2tbwr3+Op2OjvvvbFt37txpZP/zHhPYNIuEm2vcG2B3d3eh+5ex/3mPCeBkhHsDjEYjdbvdh+7rdrsajUaN7X/eYwI4Rd2L4YvceHFy9ezt7UVRFGE7iqJY+EXCs+x/3mMCm0S8OAkAuXCNGwDWGOEGgGQINwAkQ7gBIBnCDQDJEG4ASIZwA0AyhBsAkiHcAJAM4QaAZAg3ACRDuJPh68AAXGp7ANRXVZUGg4Fms5kkaTqdajAYSJL6/X6bowG4QJxxJzIcDu9H+57ZbKbhcNjSRADaQLgTOTg4WOh+AOuJcCfC14EBkAh3KnwdGACJcKfS7/c1Ho9VFIVsqygKjcdjXpgENgxfXQYAK4CvLgOANUa4ASAZwg0AyRBuAEiGcANAMoQbAJIh3ACQDOEGgGQINwAkQ7gBIJm54bb9ku1btn98EQMBAE5X54z7nyRdb3gOAEBNc8MdEd+V9MsLmAUAUAPXuAEgmaWF2/bA9sT25PDwcFlPCwB4xNLCHRHjiOhFRG9nZ2dZTwsAeASXSgAgmTpvB/yapO9J+rjt92x/rvmxAAAnuTRvg4j47EUMAgCoh0slAJAM4QaAZAg3ACRDuAEgGcINAMkQbgBIhnBvgKqqVJalOp2OyrJUVVVtjwTgHOa+jxu5VVWlwWCg2WwmSZpOpxoMBpKkfr/f5mgAzogz7jU3HA7vR/ue2Wym4XDY0kQAzotwr7mDg4OF7gew+gj3mtvd3V3ofgCrj3CvudFopG63+9B93W5Xo9GopYkAnBfhXnP9fl/j8VhFUci2iqLQeDzmhUkgMUfE0p+01+vFZDJZ+vMCwLqyvR8RvTrbcsYNAMkQbgBIhnADQDKEGwCSIdwAkAzhBoBkCDcAJEO4ASAZwg0AyRBuAEiGcANAMoQbAJIh3ACQDOEGgGQINwAkQ7gBIBnCDQDJEG4ASIZwA0AyhBsAkiHcAJAM4QaAZAg3ACRDuAEgGcINAMnUCrft67b/w/ZPbf9jE4NUVaWyLNXpdFSWpaqqWujxZRzjtO23t7e1vb39vn2XMVfTf486+7Sxvpug6TVhzTdURJx6k7Ql6b8k/aGkD0h6S9Ifn7bPtWvXYhF7e3vR7XZD0v1bt9uNvb29Wo8v4xh1tn903+eff/7ccy3qLGuxiuu7CZpeE9Z8vUiaxJwe37vVCfefSfr2Az+/KOnF0/ZZNNxFURwbx6Ioaj2+jGPU3f7B29bW1rnnWtRZ1mIV13cTNL0mrPl6WSTcvrv9yWz/taTrEfG3Rz8/J+lPI+Lzj2w3kDSQpN3d3WvT6fTU531Qp9PRcXPY1p07d+Y+voxj1N2+jkXmWtRZ1mIV13cTNL0mrPl6sb0fEb062y7txcmIGEdELyJ6Ozs7C+27u7t76v3zHl/GMc7y3FtbW+eea1FnmXcV13cTNL0mrPkGm3dKrgu4VLKK12C5xs017vPiGjcWoSVf474k6WeSPqbfvzj5J6fts2i4I+7+IyyKImxHURTv+8c37/FlHOO07a9cuRJXrlx5377LmKvpv0edfdpY303Q9Jqw5utjkXDPvcYtSbaflfRl3X2HyUsRMTpt+16vF5PJZO7zAgDuWuQa96U6G0XEtyR961xTAQCWgt+cBIBkCDcAJEO4ASAZwg0AyRBuAEim1tsBF35S+1BS/d95X23bkn7R9hArjjWqh3WqZ1PXqYiIWr9
23ki414ntSd33Vm4q1qge1qke1mk+LpUAQDKEGwCSIdzzjdseIAHWqB7WqR7WaQ6ucQNAMpxxA0AyhPsEF/EFydnZfsn2Lds/bnuWVWb7Cds3bL9j+23bL7Q906qx/UHbP7D91tEafbHtmVYZl0qOYXtL0n9K+pSk9yT9UNJnI+KdVgdbMbb/XNJtSf8cEZ9oe55VZfsxSY9FxBu2PyJpX9Jf8e/p92xb0h9ExG3blyW9JumFiPh+y6OtJM64j/eUpJ9GxM8i4v8kfV3SX7Y808qJiO9K+mXbc6y6iPh5RLxx9OffSLop6fF2p1otR98lcPvox8tHN84qT0C4j/e4pP9+4Of3xP9oWALbpaQnJb3e7iSrx/aW7Tcl3ZL0akSwRicg3MAFsf1hSS9L+kJE/LrteVZNRPwuIj4p6aqkp2xz+e0EhPt4/yPpiQd+vnp0H3AmR9dtX5ZURcQ3255nlUXEryTdkHS97VlWFeE+3g8l/ZHtj9n+gKTPSPqXlmdCUkcvvH1F0s2I+FLb86wi2zu2P3r05w/p7hsDftLuVKuLcB8jIn4r6fOSvq27LyR9IyLebneq1WP7a5K+J+njtt+z/bm2Z1pRT0t6TtIztt88uj3b9lAr5jFJN2z/SHdPnF6NiFdanmll8XZAAEiGM24ASIZwA0AyhBsAkiHcAJAM4QaAZAg3ACRDuAEgGcINAMn8P7Lcj2jEg96EAAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.scatter(y_pred, y_true, color='black')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "ExecuteTime": { "end_time": "2019-04-22T08:24:52.301659Z", "start_time": "2019-04-22T08:24:52.130224Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAW4AAAD8CAYAAABXe05zAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4xLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvDW2N/gAAH0hJREFUeJzt3Xmck9W9x/HPGZAlKKACbkiiVGQRRZiLO3UvLtXe2lbtXHu11lGLG6AIjnUfFfdal3YqIu1ErbhUq7hdUVyLDgIC4oLLUFALVAFl2GR+948zYyYgkGSSefIk3/frlZdwnuTJLxG+nDnPec5xZoaIiIRHSdAFiIhIehTcIiIho+AWEQkZBbeISMgouEVEQkbBLSISMgpuEZGQUXCLiISMgltEJGRa5+KkXbp0sVgslotTi4gUpGnTpi0xs66pPDcnwR2LxaipqcnFqUVECpJzrjbV52qoREQkZBTcIiIho+AWEQkZBbeISMgouEVEQkbBLSISMgpuEZFmisfjxGIxSkpKiMVixOPxnL5fTuZxi4gUi3g8Tnl5OXV1dQDU1tZSXl4OQFlZWU7eUz1uEZFmqKio+C60G9XV1VFRUZGz91Rwi4g0w/z589NqzwYFt4hIM/To0SOt9mxIObidc62cc9Odc0/mrBoRkZCprKwkEokktUUiESorK3P2nun0uM8H5uaqEBGRMCorK6OqqopoNIpzjmg0SlVVVc4uTAI4M9v8k5zrDkwAKoERZnbspp5fWlpqWh1QRCR1zrlpZlaaynNT7XHfBowC6jfxpuXOuRrnXM3ixYtTPK2IiKRrs8HtnDsWWGRm0zb1PDOrMrNSMyvt2jWltcBFRCQDqfS4DwCOc859CjwIHOqcq85pVSIislGbDW4zG2Nm3c0sBpwETDaz/8l5ZSIi8r00j1tEJGTSWqvEzF4CXspJJSIikhL1uEVEQkbBLZJjLb3kpxQ+LesqkkNBLPkphU89bpEcCmLJTyl8Cm6RHApiyU8pfApukRwKYslPKXwKbpEcCmLJTyl8Cm6RHApiyU8Jzrp1LfM+KS3rmi4t6yoixeRf/4JRo2DrreGuuzI7RzrLumo6oIhIhlauhBtvhOuv978uKYEzz4S99srt+2qoREQkTWYwcSL07g2XX+5DG6C+Hp56Kvfvrx63iEgaZs6E88+HKVOS2wcMgNtvh4MOyn0N6nGLiKRgyRI46ywYODA5tLt0gaoqqKlpmdAG9bhFRDZp7Vp/wfGKK2Dp0kR769Zw7rlw2WXQuXPL1qTgFhHZiOeegwsugLlzk9t/9CO49Vbo0yeYujRUIiKynnnz4PjjfUA3De0f/AD+8Q94+ungQhsU3CIi3/n6a7j4YujbF554ItG+1VZ+2t+cOXDsseBccDWChkpERKivh7/8BcaMgS++SLQ7B6edBpWVsP32wdW3PgW3iBS1f/4TzjsP3noruX2//fz0vtKU7mVsWRoqEZGi9NlncMopPqCbhvaOO0J1Nbz2Wn6GNqjHLSJFZtUquOUWuPZaWLEi0d62LVx4IYweDVtuGVx9qVBwi0hRMIO//x1GjoRPPkk+dsIJ/uLjLrsEU1u6F
NwiUvBmz/bzsV94Ibl9jz3g97+HQw8Npq5MaYxbRArWl1/COef41fqahvY228Cdd8L06eELbVCPW0QK0Lffwp/+5G9H//LLRHurVnD22XDllT68w0rBLSIFZfJkv3rf7NnJ7YcdBrfd5odHwk5DJSJSED75xF9kPOyw5NDeZRd47DF4/vnCCG1Qj1tEQu6bb+C66+Dmm2H16kR7hw5QUQHDh0O7dsHVlwsKbhEJJTOIx/3aIp99lnzslFP8dmI77hhMbbmm4BaR0HnrLT+O/cYbye2DB/vpffvuG0xdLUVj3CISGl984Rd9Gjw4ObS33x7uu8+3FXpog3rcIhICq1f7nvQ11/ilVxu1aePHsCsq/NKrxULBLSJ5ywyefBJGjPCbGzR13HH+guQPfhBMbUHabHA759oBLwNtG57/sJldnuvCRKS4zZ3re9PPPpvc3qePn4995JHB1JUPUhnjXg0camZ7AQOAoc65IhhFEpEgLF3q1xXp3z85tDt39sMlM2cWd2hDCj1uMzPgm4bfbtHwsFwWJSLFZ906uOceuPRSWLIk0V5SAuXlcNVV0LVrcPXlk5TGuJ1zrYBpwA+AO81sak6rEpGi8vLLfheamTOT23/4Q9/L3muvYOrKVylNBzSzdWY2AOgODHbObXDjqHOu3DlX45yrWbx4cbbrFJECVFsLJ57oA7ppaEejMHEivPiiQvv7pDWP28yWAi8CQ7/nWJWZlZpZaVf9PCMim1BXB5dfDr17w0MPJdrbt/dDInPnws9+Fvxu6vkqlVklXYG1ZrbUOdceOAIYm/PKRKTgmMHf/gajRsG//pV87OSTYexY2HnnYGoLk1TGuHcAJjSMc5cAD5nZk7ktS0QKzdtv+9vUX301uX3gQD+OfeCBwdQVRqnMKnkH2LsFahGRArRokb+zcdw43+Nu1K2b37D31FP9BgeSOt05KSI5sWYN3HGH321m+fJEe+vWvuf9u99Bp07B1RdmCm4Rybqnn/Z3Pb7/fnL70UfDLbfA7rsHU1ehUHCLSNZ88IFfV+Spp5Lbe/WCW2/1wS3Np2VdRaTZli2DCy/0W4M1De2OHf1CULNmKbSzST1uEclYfT2MHw+XXOIvQjZyDk4/3S/Dut12wdVXqBTcIpKR117zFxmnTUtuP+AAP71v0KBg6ioGGioRkbQsWABlZX7eddPQ7t4d7r8fXnlFoZ1r6nGLSEpWrvTj1ddd529Zb9Sunb8TctQov7O65J6CW0Q2yQwefdRffPz00+RjP/853HADxGJBVFa8FNwislHvvOPHsV96Kbl9zz39OPbBBwdRlWiMW0Q2sGQJ/Pa3sPfeyaG97bZw991+3RGFdnAU3FKU4vE4sViMkpISYrEY8Xg86JLywtq1cPvtsNtuPqDr6317q1a+5/3hh3DWWVpbJGgaKpGiE4/HKS8vp67hClttbS3l5eUAlJWVBVlaoJ5/3u/1+O67ye1HHOE35+3bN5i6ZEPqcUvRqaio+C60G9XV1VFRURFQRcH66CP4yU/8BrxNQ7tnT3j8cb9hr0I7vyi4pejMnz8/rfZC9fXXMGaMD+XHH0+0b7ml39Bgzhw47jjtQpOPFNxSdHr06JFWe6Gpr4e//MWv0Hf99X751UannuoXiho1Ctq2DaxE2QwFtxSdyspKIpFIUlskEqGysjKgilrO1Kmw337wv/8Ln3+eaN9nH39s/HjYYYfg6pPUKLil6JSVlVFVVUU0GsU5RzQapaqqqqAvTH72mQ/rffeFN99MtO+wg+99v/46DB4cXH2SHmdN9xLKktLSUqupqcn6eUUkPatW+Rkh11wDK1Yk2tu08XdCjhnjx7QleM65aWZWmspzNR1QpACZwRNP+E0NPv44+dhPfuLXHNl112Bqk+ZTcIsUmDlz/Hzs//u/5PZ+/Xzv+/DDg6lLskdj3CIF4ssv4bzzYK+9kkN7663hD3+AGTMU2oVCPW6RkPv2W/jzn/2u6f/5T6K9pATOPtvvs
r7ttsHVJ9mn4BYJsRdf9MMi77yT3H7IIX71vv79g6lLcktDJSIh9Mkn8LOfwaGHJod2LAaPPAIvvKDQLmTqcYuEyIoV/m7HG2+E1asT7ZGI37B35Ei/I40UNgW3SAiYwQMP+FvRFy5MPlZW5tcW2WmnYGqTlqfgFslzNTV+LezXX09uLy3149j77x9MXRIcjXGL5KkvvoDTT/e3ojcN7e22g3vv9WuLKLSLk3rcInlmzRq/C81VV/mlVxttsYWfQXLppdCxY3D1SfAU3CJ5wgwmTYLhw/0WYU0de6y/Tb1Xr2Bqk/yi4BbJA++95wP7mWeS23v3hltvhaFDg6lL8pPGuEUCtHSpXwiqf//k0O7UyQf2O+8otGVD6nGLBGDdOn+BsaICFi9OtDsHZ5wBV18N3boFV5/kt80Gt3NuZ+AvwHaAAVVm9vtcFyZSqF55xU/vmz49uf2gg/z0vr33DqYuCY9Uhkq+BUaaWV9gX2CYc057Poukaf58OOkkGDIkObR33hkefBCmTFFoS2o22+M2s8+Bzxt+/bVzbi6wE/BujmsTKQh1df4W9bFjYeXKRHu7dnDxxf5uyPW2wBTZpLTGuJ1zMWBvYGouihEpJGYwcSJcdJHvbTd14olwww1QJBvLS5alHNzOuS2BR4ALzGz59xwvB8oBeuhPoxS56dP9OPYrryS3Dxjgx7GHDAmmLikMKU0HdM5tgQ/tuJk9+n3PMbMqMys1s9KuXbtms0aR0Fi8GM48EwYNSg7tLl2gqsqvO6LQluZKZVaJA8YBc83sltyXJBI+a9fCnXfCFVfAsmWJ9tat4dxz4bLLoHPnwMqTApPKUMkBwCnALOfcjIa2S8xsUu7KEgmPZ5/1a4i8915y+9Ch/iaa3r2DqUsKVyqzSl4FXAvUIhIqH37o73p88snk9t1284F9zDHB1CWFT7e8i6Rp+XI/ha9fv+TQ3moruOkmmD1boS25peAuEvF4nFgsRklJCbFYjHg8nvPXN/c98019PYwf71fou/FGP64N/jb1X//a98BHjoQ2bYKtU4qAmWX9MWjQIJP8UV1dbZFIxPBLFhhgkUjEqqurc/b65r5nvnn9dbPSUjM/Ozvx2H9/s7feCro6KQRAjaWYsc4/P7tKS0utpqYm6+eVzMRiMWprazdoj0ajfPrppzl5fXPfM18sXAijR0N1dXL7Tjv5G2hOPtn3uEWayzk3zcxKU3qugrvwlZSU8H3/n51z1NfX5+T1zX3PoK1a5TcuuPZaf8t6o7Zt/Z2Qo0dDhw7B1SeFJ53g1hh3EdjYnayp3uGayeub+55BMYNHH4U+ffwWYU1D+4QT/JS/q69WaEuwFNxFoLKyksh6qxhFIhEqKytz9vrmvmcQZs2Cww/3Ad10NKd/f5g8GR5+GGKxoKoTaSLVwfB0Hro4mX+qq6stGo2ac86i0WjaFwkzeX1z37OlLFliNmyYWUlJ8oXHbbc1u/tus7Vrg65QigG6OCmyed9+C3/8o78d/auvEu2tWsFvf+tvX99mm8DKkyKTzhi3ti6TovTCC/429dmzk9sPPxxuu83fXCOSrzTGLUXl44/hpz/1Ad00tHfdFf7+d3juOYW25D/1uKUofPMNXHedn+K3enWivUMHP3tk+HA/1U8kDBTcUtDq6yEe91uEff558rFf/cqH+Y47BlObSKYU3FKw3nzT70Lzz38mtw8eDLffDvvsE0xdIs2lMW4pOJ9/Dqed5oO5aWhvvz1MmABvvKHQlnBTj1sKxurVfj/Hq6/2Y9qN2rTx62ZfcolfelUk7BTcEnpmfl3sESNg3rzkY8cf7y9I9uwZTG0iuaDgllCbO9fPx37uueT2vn39fOwjjgimLpFc0hh3yBTa5gSZ+uorH9j9+yeHdufO/sLjjBkKbSlc6nGHSDwep7y8nLqGJetqa2spLy8HoKysLMjSWsy6dXDPPX7u9ZIlifaSEjjzTLjqKujSJbj6RFqC1ioJkULZnCBTU6b46
X0zZya3H3ywvyi5556BlCWSFVqPu0DNnz8/rfZCUVsLv/iFD+imoR2N+qVWJ09WaEtxUXCHSFg3J8hUXR1cfjn07g0TJybaIxE/5W/uXL92trYOk2Kj4A6RMG5OkAkzePBB2H13P2a9alXi2C9/Ce+/78e427cPrkaRICm4Q6SsrIyqqiqi0SjOOaLRKFVVVQV1YfLtt2HIEL8J74IFifaBA+HVV/26I927B1efSD7QxUnJC4sWQUUFjBvne9yNunXzC0GdeqqfOSJSqLSRgoTGmjVwxx1w5ZWwfHmivXVrP4Pkd7+DTp2Cq08kHym4JTBPP+3XwX7//eT2Y47xt6nvvnswdYnkOwW3tLgPPvDrijz1VHJ7r17+NvWjjgqmLpGw0KihtJhly+Cii2CPPZJDu2NH38OeNUuhLZIK9bgl5+rr4b77YMwYfxGykXNw+ulQWekvQopIahTcklOvvw7nnQfTpiW3H3igv0194MBg6hIJMw2VSE4sWABlZXDAAcmh3b07PPAAvPyyQlskU+pxS1atXAk33QTXX+9vWW/Urh2MGuUfHToEV59IIdhscDvn7gWOBRaZ2R65L0nCyAweeQQuvNAvCtXUz38ON9wAsVggpYkUnFSGSu4Dhua4DgmxmTPhkEN8QDcN7b32gpdegoceUmiLZNNmg9vMXga+bIFaJGSWLIGzz/Zj1VOmJNq33Rb++Ec/tv3DHwZXn0ih0hi3pG3tWrj7br/k6tKlifZWreCcc3z71lsHV59IoctacDvnyoFyKNz1oQWef97v9fjuu8ntRx4Jt97qN+kVkdzK2nRAM6sys1IzK+3atWu2Tit5Yt48OP54H9BNQ7tnT3jiCXjmGYW2SEvRPG7ZpK+/htGjoV8/H9CNttwSxo6FOXPgxz/WLjQiLWmzwe2cewB4A9jdObfAOXd67suSoNXXw4QJfuGnsWP98quNTj0VPvzQz8lu2zawEkWK1mbHuM3s5JYoRPLH1Kn+NvU330xu33dfuP12+K//CqYuEfE0VCLf+ewz+NWvfEA3De0dd4S//hVee02hLZIPNB1QWLXKr4N9zTWwYkWivU0bfyfkmDF+TFtE8oOCu4iZweOPw8iR8PHHycf++7/9miO77hpMbSKycQruIjVnjt/T8YUXktv79fPLrR52WDB1icjmaYy7yHz5JZx7rl9HpGlob72137R3xgyFtki+U3AXgXg8TjTaE+eG0bXrV9xxB6xb54+VlMCwYX5637Bhfnd1Eclv+mta4OLxOKefXs3q1Y8Ce1Ffnzh2yCF+WKR//8DKE5EMKLgL2CefQHn51qxe/fT6R+ja9XpeeOFPuuNRJIQ0VFKAVqyASy+FPn2gru7opkeACqAPS5b8WaEtElLqcRcQM7j/frj4Yli4cP2jfwVGA58B0KNHtIWrE5FsUXAXiJoaP73v9deT23fddQkLF/6c1atf+q4tEolQWVnZsgWKSNZoqCTkvvgCfv1rGDw4ObS32w7Gj4cPP+zCuHG/IRqN4pwjGo1SVVVFWVlZcEWLSLM4M8v6SUtLS62mpibr55WENWv8jJCrr/ZLrzbaYgsYPhwqKqBjx+DqE5H0OOemmVlpKs/VUEnImMFTT8GIEX7udVM//jHcfDPstlswtYlIy1Bwh8h77/ne9DPPJLf36eO3DfvRj4KpS0Ralsa4Q2DpUt/D7t8/ObQ7dfKr+s2cqdAWKSbqceexdevg3nv9ePXixYl25+CMM/wyrNreU6T4KLjz1Cuv+Ol906cntx90kN+FZsCAYOoSkeBpqCTPzJ8PJ50EQ4Ykh/bOO8Pf/gZTpii0RYqdetx5oq4ObrzRb8y7cmWivX17fyfkRRdBJBJcfSKSPxTcATODiRN9MM+fn3zsxBPhhhugR49gahOR/KTgDtCMGX4c++WXk9sHDPDj2AcdFExdIpLfNMYdgMWL4ayzYNCg5NDu0gWqqvy6IwptEdkY9bhb0Nq1cOedcMUVsGxZor11a
7+d2GWXQefOgZUnIiGh4G4hzz4LF1zg735sauhQf9dj797B1CUi4aOhkhz78EM47jgf0E1De7fd4MknYdIkhbaIpEfBnSPLl/tpfP36wT/+kWjfais/7W/2bDjmGLQLjYikTUMlWVZfDxMmwJgx8O9/J9qdg9NOg2uv9Wtli4hkSsGdRW+8Aeed52eFNLX//n7t7NKUVtoVEdk0DZVkwcKFcMopPqCbhvZOO0E8Dq++qtAWkexRj7sZVq2CW27xwx8rViTa27b1d0KOHg0dOgRXn4gUJgV3Bszgscdg5Ej49NPkYyecADfdBLFYEJWJSDFQcKdp1iw/H3vy5OT2/v39OPYhhwRTl4gUD41xp+g//4Fhw/w6Ik1De5tt4K674O23Fdoi0jJSCm7n3FDn3PvOuXnOudG5KCQejxOLxSgpKSEWixGPx9M63tzzb+w10WhPnDuHLl2+5K67/HQ/gFat/G3q1147kbFjY7Rpk1ldmcj0s+Ty+83WOQpNrr8TfedFysw2+QBaAR8BuwJtgJlA3029ZtCgQZaO6upqi0QiBnz3iEQiVl1dndLx5p5/Y69p2/Zog1nmR7UTj5KSyXbddf9odl2ZyPSz5PL7zdY5Ck2uvxN954UFqLHN5HHjI5Xg3g94tsnvxwBjNvWadIM7Go0m/eFrfESj0ZSON/f86/voI7P27Z/ZILBhnsHx3722uXVlIpP3zPX3m61zFJpcfyf6zgtLOsHt/PM3zjn3M2Comf2m4fenAPuY2TnrPa8cKAfo0aPHoNra2k2et6mSkhK+rw7nHPX19Zs93tzzN/rmGz+17+abYc2aps/8BrgGuBVY891rgWbVlYlMvotcf7+Z1lXocv2d6DsvLM65aWaW0h0fWbs4aWZVZlZqZqVd09x6vMdGtnhpbN/c8eaev74e/vpX6NULrrtu/dCeAPQCxtIY2o2vbW5dmcjkPXP9/WbrHIUm19+JvvMitrkuOS0wVBLkGPfUqWb77rv+kIhZz56LrW3bId/7o2jjazXG3by6Cp3GuCUdZHmMuzXwMbALiYuT/Tb1mnSD28z/IYxGo+acs2g0usEfvs0dT/f8f/jDw3bqqRsG9vbbm02YYLZuXeI1gLVq1eq78cOm793cujKRyXvm+vvN1jkKTa6/E33nhSOd4N7sGDeAc+5o4Db8DJN7zaxyU88vLS21mvVXWsoTq1fDbbfBNdf4Me1GbdrAiBFwySV+6VURkZaUzhh3SndOmtkkYFKzqgqYmd+4YMQImDcv+djxx/sLkj17BlObiEg6iuKW97lzYfhwv31YU337+t73EUcEU5eISCYK+pb3r77y64r0758c2p07+3VFZsxQaItI+BRkj3vdOrjnHrj0UliyJNFeUgJnnglXXQVdugRXn4hIcxRccE+ZAuefDzNnJrcffLDvZe+5ZyBliYhkTcEMldTWwi9+4QO6aWhHo/Dww35FP4W2iBSC0Pe4V6yAG27wj1WrEu2RiN+wd+RIaN8+uPpERLIttMFtBg8+CKNGwYIFycd++UsYOxa6dw+mNhGRXAplcL/9tt9N/bXXktsHDoTbb4cDDgimLhGRlhCqMe5Fi+CMM/yO6U1Du1s3GDcO3npLoS0ihS8UPe41a+COO+DKK2H58kT7Flv4GSSXXgqdOgVXn4hIS8r74J40yd/1+MEHye3HHAO33OKXYhURKSZ5G9zvv+/XFZm03gopvXr529SPOiqYukREgpZ3Y9zLlvkpfHvskRzaHTv6haBmzVJoi0hxy5se97p1MH68X1Z18eJEu3Pwm9/4ZVi7dQuuPhGRfJE3wX3BBf4CZFMHHuhvUx84MJiaRETyUd4MlZx1FrRq5X+9887+5pqXX1Zoi4isL2963P36JW5PHzXK37IuIiIbypvgBn+buoiIbFreDJWIiEhqFNwiIiGj4BYRCRkFt4hIyCi4RURCRsEtIhIyCm4RkZBxZpb9kzq3GKjN8mm7AEuyfM4gFMrnAH2Wf
FQonwOK77NEzaxrKifLSXDngnOuxsxKg66juQrlc4A+Sz4qlM8B+iyboqESEZGQUXCLiIRMmIK7KugCsqRQPgfos+SjQvkcoM+yUaEZ4xYRES9MPW4RESHPg9s5184596ZzbqZzbo5z7sqga2ou51wr59x059yTQdfSHM65T51zs5xzM5xzNUHXkynnXGfn3MPOufecc3Odc/sFXVMmnHO7N/y/aHwsd85dEHRdmXLODW/4Oz/bOfeAc65d0DVlwjl3fsNnmJPN/x95PVTinHNABzP7xjm3BfAqcL6Z/TPg0jLmnBsBlAIdzezYoOvJlHPuU6DUzEI9z9Y5NwF4xczucc61ASJmtjTouprDOdcKWAjsY2bZvp8i55xzO+H/rvc1s5XOuYeASWZ2X7CVpcc5twfwIDAYWAM8A5xlZvOae+687nGb903Db7doeOTvvzSb4ZzrDhwD3BN0LQLOuU7AEGAcgJmtCXtoNzgM+CiMod1Ea6C9c641EAE+C7ieTPQBpppZnZl9C0wBfpqNE+d1cMN3QwszgEXA82Y2NeiamuE2YBRQH3QhWWDAc865ac658qCLydAuwGJgfMPw1T3OuQ5BF5UFJwEPBF1EpsxsIXATMB/4HFhmZs8FW1VGZgMHOee2dc5FgKOBnbNx4rwPbjNbZ2YDgO7A4IYfP0LHOXcssMjMpgVdS5YcaGYDgaOAYc65IUEXlIHWwEDgbjPbG1gBjA62pOZpGO45DpgYdC2Zcs5tDRyP/4d1R6CDc+5/gq0qfWY2FxgLPIcfJpkBrMvGufM+uBs1/Aj7IjA06FoydABwXMPY8IPAoc656mBLylxDrwgzWwQ8hh/HC5sFwIImP8U9jA/yMDsKeNvM/h10Ic1wOPCJmS02s7XAo8D+AdeUETMbZ2aDzGwI8BXwQTbOm9fB7Zzr6pzr3PDr9sARwHvBVpUZMxtjZt3NLIb/UXaymYWuFwHgnOvgnNuq8dfAkfgfC0PFzL4A/uWc272h6TDg3QBLyoaTCfEwSYP5wL7OuUjDBIXDgLkB15QR51y3hv/2wI9v35+N8+bVLu/fYwdgQsNV8hLgITML9TS6ArEd8Jj/O0Vr4H4zeybYkjJ2LhBvGGL4GDgt4Hoy1vCP6BHAmUHX0hxmNtU59zDwNvAtMJ3w3kX5iHNuW2AtMCxbF7/zejqgiIhsKK+HSkREZEMKbhGRkFFwi4iEjIJbRCRkFNwiIiGj4BYRCRkFt4hIyCi4RURC5v8BFawCipP+jrAAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plot outputs\n", "plt.scatter(data_X_test, data_y_test, color='black')\n", "plt.plot(data_X_test, regr.predict(data_X_test), color='blue', linewidth=3)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "ExecuteTime": { "end_time": "2019-04-22T08:24:53.326537Z", "start_time": "2019-04-22T08:24:53.321437Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "('Coefficients: \\n', array([0.68623605]))" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The coefficients\n", "'Coefficients: \\n', regr.coef_" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "ExecuteTime": { "end_time": "2019-04-22T08:24:55.007412Z", "start_time": "2019-04-22T08:24:55.002637Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'Residual sum of squares: 0.98'" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The mean square error\n", "\"Residual sum of squares: %.2f\" % np.mean((regr.predict(data_X_test) - data_y_test) ** 2)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "ExecuteTime": { "end_time": "2019-04-22T08:24:55.875656Z", "start_time": "2019-04-22T08:24:55.846855Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/datalab/Applications/anaconda/lib/python3.5/site-packages/ipykernel/__main__.py:1: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access\n", " if __name__ == '__main__':\n", "/Users/datalab/Applications/anaconda/lib/python3.5/site-packages/ipykernel/__main__.py:2: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see 
https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access\n", " from ipykernel import kernelapp as app\n" ] } ], "source": [ "df.click_log = [[np.log(df.click[i]+1)] for i in range(len(df))]\n", "df.reply_log = [[np.log(df.reply[i]+1)] for i in range(len(df))]" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "ExecuteTime": { "end_time": "2019-04-22T08:25:13.823742Z", "start_time": "2019-04-22T08:25:13.811227Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "'Variance score: 0.62'" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.model_selection import train_test_split\n", "Xs_train, Xs_test, y_train, y_test = train_test_split(df.click_log, df.reply_log,test_size=0.2, random_state=0)\n", "\n", "# Create linear regression object\n", "regr = linear_model.LinearRegression()\n", "# Train the model using the training sets\n", "regr.fit(Xs_train, y_train)\n", "# Explained variance score: 1 is perfect prediction\n", "'Variance score: %.2f' % regr.score(Xs_test, y_test)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "ExecuteTime": { "end_time": "2019-04-22T08:25:18.210290Z", "start_time": "2019-04-22T08:25:18.010690Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plot outputs\n", "plt.scatter(Xs_test, y_test, color='black')\n", "plt.plot(Xs_test, regr.predict(Xs_test), color='blue', linewidth=3)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "ExecuteTime": { "end_time": "2019-04-22T08:25:26.241798Z", "start_time": "2019-04-22T08:25:26.227633Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "-0.6837007391943056" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.model_selection import cross_val_score\n", "\n", "regr = linear_model.LinearRegression()\n", "scores = cross_val_score(regr, df.click_log, \\\n", " df.reply_log, cv = 3)\n", "scores.mean() " ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "ExecuteTime": { "end_time": "2019-04-22T08:25:30.245410Z", "start_time": "2019-04-22T08:25:30.227128Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "-0.7188149722820985" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "regr = linear_model.LinearRegression()\n", "scores = cross_val_score(regr, df.click_log, \n", " df.reply_log, cv =5)\n", "scores.mean() " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "> # 使用sklearn做logistic回归\n", "***\n", "\n", "王成军\n", "\n", "wangchengjun@nju.edu.cn\n", "\n", "计算传播网 http://computational-communication.com" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- logistic回归是一个分类算法而不是一个回归算法。\n", "- 可根据已知的一系列因变量估计离散数值(比方说二进制数值 0 或 1 ,是或否,真或假)。\n", "- 简单来说,它通过将数据拟合进一个逻辑函数(logistic function)来预估一个事件出现的概率。\n", "- 因此,它也被叫做逻辑回归。因为它预估的是概率,所以它的输出值大小在 0 和 1 之间(正如所预计的一样)。" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "$$odds= \\frac{p}{1-p} = 
\\frac{probability\\: of\\: event\\: occurrence} {probability\\: of\\: event\\: not\\: occurring}$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "$$ln(odds)= ln(\\frac{p}{1-p})$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "$$logit(p) = ln(\\frac{p}{1-p}) = b_0+b_1X_1+b_2X_2+b_3X_3+\\dots+b_kX_k$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![](./img/logistic.jpg)" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "ExecuteTime": { "end_time": "2018-04-29T07:46:50.277195Z", "start_time": "2018-04-29T07:46:50.272229Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# Label each post: 1 if its title contains '转载' (repost), otherwise 0\n", "repost = []\n", "for i in df.title:\n", "    if u'转载' in i:\n", "        repost.append(1)\n", "    else:\n", "        repost.append(0)" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "ExecuteTime": { "end_time": "2018-04-29T07:47:06.292994Z", "start_time": "2018-04-29T07:47:06.270715Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "[[194675, 2703], [88244, 1041], [82779, 625]]" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Two features per post: click count and reply count\n", "data_X = [[df.click[i], df.reply[i]] for i in range(len(df))]\n", "data_X[:3]" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "ExecuteTime": { "end_time": "2018-04-29T07:47:45.269303Z", "start_time": "2018-04-29T07:47:45.259792Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "0.61241970021413272" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.linear_model import LogisticRegression\n", "df['repost'] = repost\n", "model = LogisticRegression()\n", "model.fit(data_X,df.repost)\n", "# Mean accuracy on the data the model was fitted on\n", "model.score(data_X,df.repost)" ] }, { "cell_type": "code", "execution_count": 53, 
"metadata": { "ExecuteTime": { "end_time": "2018-04-29T07:47:59.648431Z", "start_time": "2018-04-29T07:47:59.633936Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "def randomSplitLogistic(dataX, dataY, num):\n", " dataX_train = []\n", " dataX_test = []\n", " dataY_train = []\n", " dataY_test = []\n", " import random\n", " test_index = random.sample(range(len(df)), num)\n", " for k in range(len(dataX)):\n", " if k in test_index:\n", " dataX_test.append(dataX[k])\n", " dataY_test.append(dataY[k])\n", " else:\n", " dataX_train.append(dataX[k])\n", " dataY_train.append(dataY[k])\n", " return dataX_train, dataX_test, dataY_train, dataY_test, " ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "ExecuteTime": { "end_time": "2018-04-29T07:48:27.726443Z", "start_time": "2018-04-29T07:48:27.710922Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "'Variance score: 0.45'" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Split the data into training/testing sets\n", "data_X_train, data_X_test, data_y_train, data_y_test = randomSplitLogistic(data_X, df.repost, 20)\n", "# Create logistic regression object\n", "log_regr = LogisticRegression()\n", "# Train the model using the training sets\n", "log_regr.fit(data_X_train, data_y_train)\n", "# Explained variance score: 1 is perfect prediction\n", "'Variance score: %.2f' % log_regr.score(data_X_test, data_y_test)" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "ExecuteTime": { "end_time": "2018-04-29T07:48:56.873331Z", "start_time": "2018-04-29T07:48:56.870219Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "y_true, y_pred = data_y_test, log_regr.predict(data_X_test)\n" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "ExecuteTime": { "end_time": "2018-04-29T07:39:12.344043Z", "start_time": "2018-04-29T07:39:12.338223Z" }, "slideshow": { 
"slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "([1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],\n", " array([0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]))" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_true, y_pred" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "ExecuteTime": { "end_time": "2018-04-29T07:39:13.175680Z", "start_time": "2018-04-29T07:39:13.171386Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 0.50 0.17 0.25 6\n", " 1 0.72 0.93 0.81 14\n", "\n", "avg / total 0.66 0.70 0.64 20\n", "\n" ] } ], "source": [ "print(classification_report(y_true, y_pred))" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "ExecuteTime": { "end_time": "2018-04-29T07:51:43.039620Z", "start_time": "2018-04-29T07:51:43.034812Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "from sklearn.cross_validation import train_test_split\n", "Xs_train, Xs_test, y_train, y_test = train_test_split(data_X, df.repost, test_size=0.2, random_state=42)" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "ExecuteTime": { "end_time": "2018-04-29T07:51:47.690742Z", "start_time": "2018-04-29T07:51:47.683127Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "'Variance score: 0.60'" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create logistic regression object\n", "log_regr = LogisticRegression()\n", "# Train the model using the training sets\n", "log_regr.fit(Xs_train, y_train)\n", "# Explained variance score: 1 is perfect prediction\n", "'Variance score: %.2f' % log_regr.score(Xs_test, y_test)" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "ExecuteTime": { "end_time": 
"2018-04-29T07:51:55.780061Z", "start_time": "2018-04-29T07:51:55.771924Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Logistic score for test set: 0.595745\n", "Logistic score for training set: 0.613941\n", " precision recall f1-score support\n", "\n", " 0 1.00 0.03 0.05 39\n", " 1 0.59 1.00 0.74 55\n", "\n", "avg / total 0.76 0.60 0.46 94\n", "\n" ] } ], "source": [ "print('Logistic score for test set: %f' % log_regr.score(Xs_test, y_test))\n", "print('Logistic score for training set: %f' % log_regr.score(Xs_train, y_train))\n", "y_true, y_pred = y_test, log_regr.predict(Xs_test)\n", "print(classification_report(y_true, y_pred))" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "ExecuteTime": { "end_time": "2018-04-29T07:52:53.880925Z", "start_time": "2018-04-29T07:52:53.866672Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "0.53333333333333333" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "logre = LogisticRegression()\n", "scores = cross_val_score(logre, data_X, df.repost, cv = 3)\n", "scores.mean() " ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "ExecuteTime": { "end_time": "2018-04-29T07:53:26.825100Z", "start_time": "2018-04-29T07:53:26.810871Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "0.62948717948717947" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "logre = LogisticRegression()\n", "data_X_scale = scale(data_X)\n", "# The importance of preprocessing in data science and the machine learning pipeline I: \n", "scores = cross_val_score(logre, data_X_scale, df.repost, cv = 3)\n", "scores.mean() " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "> # 使用sklearn实现贝叶斯预测\n", "***\n", "\n", "王成军\n", "\n", 
"wangchengjun@nju.edu.cn\n", "\n", "计算传播网 http://computational-communication.com" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Naive Bayes algorithm\n", "\n", "It is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. \n", "\n", "In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. \n", "\n", "why it is known as ‘Naive’? For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, all of these properties independently contribute to the probability that this fruit is an apple." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "贝叶斯定理为使用$p(c)$, $p(x)$, $p(x|c)$ 计算后验概率$P(c|x)$提供了方法:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "$$\n", "p(c|x) = \\frac{p(x|c) p(c)}{p(x)}\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).\n", "- P(c) is the prior probability of class.\n", "- P(x|c) is the likelihood which is the probability of predictor given class.\n", "- P(x) is the prior probability of predictor." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![](./img/Bayes_41.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Step 1: Convert the data set into a frequency table.\n", "\n", "Step 2: Create a likelihood table by finding the probabilities, for example:\n", "- p(Overcast) = 0.29, p(rainy) = 0.36, p(sunny) = 0.36\n", "- p(playing) = 0.64, p(rest) = 0.36\n", "\n", "Step 3: Use the Naive Bayes equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of the prediction." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Problem: Players will play if the weather is sunny. Is this statement correct?\n", "\n", "We can solve it using the posterior probability method discussed above.\n", "\n", "$P(Yes | Sunny) = \\frac{P( Sunny | Yes) * P(Yes) } {P (Sunny)}$\n", "\n", "Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes) = 9/14 = 0.64\n", "\n", "Now, $P (Yes | Sunny) = \\frac{0.33 * 0.64}{0.36} = 0.60$, which is the higher probability.\n", "\n", "$P(No | Sunny) = \\frac{P( Sunny | No) * P(No) } {P (Sunny)}$\n", "\n", "Here we have P(Sunny | No) = 2/5 = 0.4, P(Sunny) = 5/14 = 0.36, P(No) = 5/14 = 0.36\n", "\n", "Now, $P (No | Sunny) = \\frac{0.4 * 0.36}{0.36} = 0.4$, which is the lower probability.\n", "\n", "Since 0.60 > 0.40, the prediction is that players will play when it is sunny.\n" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "'ABCMeta BaseDiscreteNB BaseEstimator BaseNB BernoulliNB ClassifierMixin GaussianNB LabelBinarizer MultinomialNB __all__ __builtins__ __doc__ __file__ __name__ __package__ _check_partial_fit_first_call abstractmethod binarize check_X_y check_array check_is_fitted in1d issparse label_binarize logsumexp np safe_sparse_dot six'" ] }, "execution_count": 17, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "from sklearn import naive_bayes\n", "' '.join(dir(naive_bayes)) " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- naive_bayes.GaussianNB\tGaussian Naive Bayes (GaussianNB)\n", "- naive_bayes.MultinomialNB([alpha, ...])\tNaive Bayes classifier for multinomial models\n", "- naive_bayes.BernoulliNB([alpha, binarize, ...])\tNaive Bayes classifier for multivariate Bernoulli models." ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "ExecuteTime": { "end_time": "2018-04-29T08:02:37.644606Z", "start_time": "2018-04-29T08:02:37.635952Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "#Import Library of Gaussian Naive Bayes model\n", "from sklearn.naive_bayes import GaussianNB\n", "import numpy as np\n", "\n", "#assigning predictor and target variables\n", "x= np.array([[-3,7],[1,5], [1,2], [-2,0], [2,3], [-4,0], [-1,1], [1,1], [-2,2], [2,7], [-4,1], [-2,7]])\n", "Y = np.array([3, 3, 3, 3, 4, 3, 3, 4, 3, 4, 4, 4])" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "ExecuteTime": { "end_time": "2018-04-29T08:02:52.828101Z", "start_time": "2018-04-29T08:02:52.818463Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "array([4, 3])" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Create a Gaussian Classifier\n", "model = GaussianNB()\n", "\n", "# Train the model using the training sets \n", "model.fit(x[:8], Y[:8])\n", "\n", "#Predict Output \n", "predicted= model.predict([[1,2],[3,4]])\n", "predicted" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# cross-validation \n", " \n", "k-fold CV, the training set is split into k smaller sets (other approaches are described below, but generally follow the same principles). 
The following procedure is followed for each of the k “folds”:\n", "- A model is trained using k-1 of the folds as training data;\n", "- the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy)." ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "ExecuteTime": { "end_time": "2018-04-29T08:04:04.297675Z", "start_time": "2018-04-29T08:04:04.273413Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "array([41, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0])" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_X_train, data_X_test, data_y_train, data_y_test = randomSplit(df.click, df.reply, 20)\n", "# Train the model using the training sets \n", "model.fit(data_X_train, data_y_train)\n", "\n", "#Predict Output \n", "predicted= model.predict(data_X_test)\n", "predicted" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "ExecuteTime": { "end_time": "2018-04-29T08:04:34.184513Z", "start_time": "2018-04-29T08:04:34.178511Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "0.65000000000000002" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.score(data_X_test, data_y_test)" ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "ExecuteTime": { "end_time": "2018-04-29T08:05:04.297453Z", "start_time": "2018-04-29T08:05:04.249311Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/datalab/Applications/anaconda/lib/python3.5/site-packages/sklearn/cross_validation.py:516: Warning: The least populated class in y has only 1 members, which is too few. 
The minimum number of labels for any class cannot be less than n_folds=7.\n", " % (min_labels, self.n_folds)), Warning)\n" ] }, { "data": { "text/plain": [ "0.53413410073295453" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# sklearn.cross_validation was removed from scikit-learn; cross_val_score lives in sklearn.model_selection\n", "from sklearn.model_selection import cross_val_score\n", "\n", "model = GaussianNB()\n", "scores = cross_val_score(model, [[c] for c in df.click],\\\n", " df.reply, cv = 7)\n", "scores.mean() " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "> # Decision trees with sklearn\n", "***\n", "\n", "王成军\n", "\n", "wangchengjun@nju.edu.cn\n", "\n", "计算传播网 http://computational-communication.com" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Decision tree\n", "- This supervised learning algorithm is mostly used for classification problems.\n", "- It works for both categorical and continuous dependent variables.\n", "- The algorithm splits the population into two or more homogeneous groups.\n", "- The splits are made on the most significant attributes (independent variables), so that the groups are as distinct from one another as possible.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![](./img/tree.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![](./img/playtree.jpg)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## In the figure above, the population is split into four groups according to several attributes, to decide whether they will play.\n", "### Several criteria can be used to split the population into groups, such as Gini, information gain, chi-square, and entropy." ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "ExecuteTime": { "end_time": "2018-04-29T08:10:20.871345Z", "start_time": "2018-04-29T08:10:20.855125Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "from sklearn import tree\n", "model = tree.DecisionTreeClassifier(criterion='gini')" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "ExecuteTime": { "end_time": "2018-04-29T08:10:49.988277Z", "start_time": "2018-04-29T08:10:49.973060Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { 
"text/plain": [ "0.91275167785234901" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_X_train, data_X_test, data_y_train, data_y_test = randomSplitLogistic(data_X, df.repost, 20)\n", "model.fit(data_X_train,data_y_train)\n", "model.score(data_X_train,data_y_train)" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "ExecuteTime": { "end_time": "2018-04-29T08:11:12.730866Z", "start_time": "2018-04-29T08:11:12.725782Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "array([0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0])" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Predict\n", "model.predict(data_X_test)" ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "ExecuteTime": { "end_time": "2018-04-29T08:11:28.411441Z", "start_time": "2018-04-29T08:11:28.397481Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0.33461538461538459" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# crossvalidation\n", "scores = cross_val_score(model, data_X, df.repost, cv = 3)\n", "scores.mean() " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "> # 使用sklearn实现SVM支持向量机\n", "***\n", "\n", "王成军\n", "\n", "wangchengjun@nju.edu.cn\n", "\n", "计算传播网 http://computational-communication.com" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![](./img/svm.jpg)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- 将每个数据在N维空间中用点标出(N是你所有的特征总数),每个特征的值是一个坐标的值。\n", " - 举个例子,如果我们只有身高和头发长度两个特征,我们会在二维空间中标出这两个变量,每个点有两个坐标(这些坐标叫做支持向量)。" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![](./img/xyplot.png)" ] }, { "cell_type": 
"markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- 现在,我们会找到将两组不同数据分开的一条直线。\n", " - 两个分组中距离最近的两个点到这条线的距离同时最优化。" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![](./img/sumintro.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## 上面示例中的黑线将数据分类优化成两个小组\n", "- 两组中距离最近的点(图中A、B点)到达黑线的距离满足最优条件。\n", " - 这条直线就是我们的分割线。接下来,测试数据落到直线的哪一边,我们就将它分到哪一类去。" ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "ExecuteTime": { "end_time": "2018-04-29T08:17:29.788250Z", "start_time": "2018-04-29T08:17:29.785022Z" } }, "outputs": [], "source": [ "from sklearn import svm\n", "# Create SVM classification object \n", "model=svm.SVC() " ] }, { "cell_type": "code", "execution_count": 72, "metadata": { "ExecuteTime": { "end_time": "2018-04-29T08:17:31.035310Z", "start_time": "2018-04-29T08:17:31.030713Z" } }, "outputs": [ { "data": { "text/plain": [ "'LinearSVC LinearSVR NuSVC NuSVR OneClassSVM SVC SVR __all__ __builtins__ __cached__ __doc__ __file__ __loader__ __name__ __package__ __path__ __spec__ base bounds classes l1_min_c liblinear libsvm libsvm_sparse'" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "' '.join(dir(svm))" ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "ExecuteTime": { "end_time": "2018-04-29T08:17:41.872379Z", "start_time": "2018-04-29T08:17:41.849759Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "0.90380313199105144" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_X_train, data_X_test, data_y_train, data_y_test = randomSplitLogistic(data_X, df.repost, 20)\n", "model.fit(data_X_train,data_y_train)\n", "model.score(data_X_train,data_y_train)" ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "ExecuteTime": { "end_time": 
"2018-04-29T08:17:47.661313Z", "start_time": "2018-04-29T08:17:47.655841Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1])" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Predict\n", "model.predict(data_X_test)" ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "ExecuteTime": { "end_time": "2018-04-29T08:18:00.419986Z", "start_time": "2018-04-29T08:17:58.671257Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# crossvalidation\n", "scores = []\n", "cvs = [3, 5, 10, 25, 50, 75, 100]\n", "for i in cvs:\n", " score = cross_val_score(model, data_X, df.repost,\n", " cv = i)\n", " scores.append(score.mean() ) # Try to tune cv\n", " " ] }, { "cell_type": "code", "execution_count": 76, "metadata": { "ExecuteTime": { "end_time": "2018-04-29T08:18:05.493658Z", "start_time": "2018-04-29T08:18:05.359658Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.plot(cvs, scores, 'b-o')\n", "plt.xlabel('$cv$', fontsize = 20)\n", "plt.ylabel('$Score$', fontsize = 20)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "\n", "> # 泰坦尼克号数据分析\n", "\n", "王成军\n", "\n", "wangchengjun@nju.edu.cn\n", "\n", "计算传播网 http://computational-communication.com" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "ExecuteTime": { "end_time": "2018-05-29T07:31:28.492497Z", "start_time": "2018-05-29T07:31:28.488728Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import numpy as np\n", "from sklearn import tree\n", "import warnings \n", "warnings.filterwarnings(\"ignore\") \n" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "ExecuteTime": { "end_time": "2018-06-06T07:02:49.855926Z", "start_time": "2018-06-06T07:02:49.705773Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import pandas as pd\n", "train = pd.read_csv('../data/tatanic_train.csv', \n", " sep = \",\")" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "ExecuteTime": { "end_time": "2018-06-06T07:02:52.803564Z", "start_time": "2018-06-06T07:02:52.759733Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead>\n", "<tr style=\"text-align: right;\"><th></th><th>Unnamed: 0</th><th>PassengerId</th><th>Survived</th><th>Pclass</th><th>Name</th><th>Sex</th><th>Age</th><th>SibSp</th><th>Parch</th><th>Ticket</th><th>Fare</th><th>Cabin</th><th>Embarked</th></tr>\n", "</thead>\n", "<tbody>\n", "<tr><th>0</th><td>0</td><td>1</td><td>0</td><td>3</td><td>Braund, Mr. Owen Harris</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>A/5 21171</td><td>7.2500</td><td>NaN</td><td>S</td></tr>\n", "<tr><th>1</th><td>1</td><td>2</td><td>1</td><td>1</td><td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>PC 17599</td><td>71.2833</td><td>C85</td><td>C</td></tr>\n", "<tr><th>2</th><td>2</td><td>3</td><td>1</td><td>3</td><td>Heikkinen, Miss. Laina</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>STON/O2. 3101282</td><td>7.9250</td><td>NaN</td><td>S</td></tr>\n", "<tr><th>3</th><td>3</td><td>4</td><td>1</td><td>1</td><td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>113803</td><td>53.1000</td><td>C123</td><td>S</td></tr>\n", "<tr><th>4</th><td>4</td><td>5</td><td>0</td><td>3</td><td>Allen, Mr. William Henry</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>373450</td><td>8.0500</td><td>NaN</td><td>S</td></tr>\n", "</tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " Unnamed: 0 PassengerId Survived Pclass \\\n", "0 0 1 0 3 \n", "1 1 2 1 1 \n", "2 2 3 1 3 \n", "3 3 4 1 1 \n", "4 4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "0 0 A/5 21171 7.2500 NaN S \n", "1 0 PC 17599 71.2833 C85 C \n", "2 0 STON/O2. 3101282 7.9250 NaN S \n", "3 0 113803 53.1000 C123 S \n", "4 0 373450 8.0500 NaN S " ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train.head() " ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "ExecuteTime": { "end_time": "2018-05-29T07:28:58.070575Z", "start_time": "2018-05-29T07:28:57.897862Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "train[\"Age\"] = train[\"Age\"].fillna(train[\"Age\"].median())\n", "train[\"Fare\"] = train[\"Fare\"].fillna(train[\"Fare\"].median())\n", "#Convert the male and female groups to integer form\n", "train[\"Sex\"][train[\"Sex\"] == \"male\"] = 0\n", "train[\"Sex\"][train[\"Sex\"] == \"female\"] = 1\n", "#Impute the Embarked variable\n", "train[\"Embarked\"] = train[\"Embarked\"].fillna('S')\n", "#Convert the Embarked classes to integer form\n", "train[\"Embarked\"][train[\"Embarked\"] == \"S\"] = 0\n", "train[\"Embarked\"][train[\"Embarked\"] == \"C\"] = 1\n", "train[\"Embarked\"][train[\"Embarked\"] == \"Q\"] = 2" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "ExecuteTime": { "end_time": "2018-05-29T07:28:08.358884Z", "start_time": "2018-05-29T07:28:08.346226Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 0.12294397 0.31274009 0.23680307 0.32751287]\n", "0.977553310887\n" ] } ], 
"source": [ "#Create the target and features numpy arrays: target, features_one\n", "target = train['Survived'].values\n", "features_one = train[[\"Pclass\", \"Sex\", \"Age\", \"Fare\"]].values\n", "\n", "#Fit your first decision tree: my_tree_one\n", "my_tree_one = tree.DecisionTreeClassifier()\n", "my_tree_one = my_tree_one.fit(features_one, target)\n", "#Look at the importance of the included features and print the score\n", "print(my_tree_one.feature_importances_)\n", "print(my_tree_one.score(features_one, target))" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "ExecuteTime": { "end_time": "2018-05-29T07:28:15.915998Z", "start_time": "2018-05-29T07:28:15.705994Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "test = pd.read_csv('../data/tatanic_test.csv', sep = \",\")\n", "# Impute the missing value with the median\n", "test.Fare[152] = test.Fare.median()\n", "test[\"Age\"] = test[\"Age\"].fillna(test[\"Age\"].median())\n", "#Convert the male and female groups to integer form\n", "test[\"Sex\"][test[\"Sex\"] == \"male\"] = 0\n", "test[\"Sex\"][test[\"Sex\"] == \"female\"] = 1\n", "\n", "#Impute the Embarked variable\n", "test[\"Embarked\"] = test[\"Embarked\"].fillna('S')\n", "#Convert the Embarked classes to integer form\n", "test[\"Embarked\"][test[\"Embarked\"] == \"S\"] = 0\n", "test[\"Embarked\"][test[\"Embarked\"] == \"C\"] = 1\n", "test[\"Embarked\"][test[\"Embarked\"] == \"Q\"] = 2\n", "\n", "# Extract the features from the test set: Pclass, Sex, Age, and Fare.\n", "test_features = test[[\"Pclass\",\"Sex\", \"Age\", \"Fare\"]].values\n", "\n", "# Make your prediction using the test set\n", "my_prediction = my_tree_one.predict(test_features)\n", "\n", "# Create a data frame with two columns: PassengerId & Survived. 
Survived contains your predictions\n", "PassengerId = np.array(test['PassengerId']).astype(int)\n", "my_solution = pd.DataFrame(my_prediction, index=PassengerId, columns=[\"Survived\"])\n" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "ExecuteTime": { "end_time": "2018-05-29T07:28:18.081288Z", "start_time": "2018-05-29T07:28:18.074414Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "  <thead>\n", "    <tr><th></th><th>Survived</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>892</th><td>0</td></tr>\n", "    <tr><th>893</th><td>0</td></tr>\n", "    <tr><th>894</th><td>1</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " Survived\n", "892 0\n", "893 0\n", "894 1" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_solution[:3]" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "ExecuteTime": { "end_time": "2018-05-29T07:25:44.488717Z", "start_time": "2018-05-29T07:25:44.484381Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "(418, 1)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check that your data frame has 418 entries\n", "my_solution.shape" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# Write your solution to a csv file with the name my_solution.csv \n", "my_solution.to_csv(\"../data/tatanic_solution_one.csv\", \n", " index_label = [\"PassengerId\"])" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "ExecuteTime": { "end_time": "2018-05-29T07:28:26.996353Z", "start_time": "2018-05-29T07:28:26.982601Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.905723905724\n" ] } ], "source": [ "# Create a new array with the added features: features_two\n", "features_two = train[[\"Pclass\",\"Age\",\"Sex\",\"Fare\",\\\n", " \"SibSp\", \"Parch\", \"Embarked\"]].values\n", "\n", "#Control overfitting by setting \"max_depth\" to 10 and \"min_samples_split\" to 5 : my_tree_two\n", "max_depth = 10\n", "min_samples_split = 5\n", "my_tree_two = tree.DecisionTreeClassifier(max_depth = max_depth, \n", " min_samples_split = min_samples_split, \n", " random_state = 1)\n", "my_tree_two = my_tree_two.fit(features_two, target)\n", "\n", "#Print the score of the new decison tree\n", "print(my_tree_two.score(features_two, target))" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "ExecuteTime": { "end_time": "2018-05-29T07:28:28.033226Z", 
"start_time": "2018-05-29T07:28:28.018293Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.979797979798\n" ] } ], "source": [ "# create a new train set with the new variable\n", "train_two = train\n", "train_two['family_size'] = train.SibSp + train.Parch + 1\n", "\n", "# Create a new decision tree my_tree_three\n", "features_three = train[[\"Pclass\", \"Sex\", \"Age\", \\\n", " \"Fare\", \"SibSp\", \"Parch\", \"family_size\"]].values\n", "\n", "my_tree_three = tree.DecisionTreeClassifier()\n", "my_tree_three = my_tree_three.fit(features_three, target)\n", "\n", "# Print the score of this decision tree\n", "print(my_tree_three.score(features_three, target))\n" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "ExecuteTime": { "end_time": "2018-05-29T07:28:32.678968Z", "start_time": "2018-05-29T07:28:32.465958Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.939393939394\n", "418\n", "[0 0 0]\n" ] } ], "source": [ "#Import the `RandomForestClassifier`\n", "from sklearn.ensemble import RandomForestClassifier\n", "\n", "#We want the Pclass, Age, Sex, Fare,SibSp, Parch, and Embarked variables\n", "features_forest = train[[\"Pclass\", \"Age\", \"Sex\", \"Fare\", \"SibSp\", \"Parch\", \"Embarked\"]].values\n", "\n", "#Building the Forest: my_forest\n", "n_estimators = 100\n", "forest = RandomForestClassifier(max_depth = 10, min_samples_split=2, \n", " n_estimators = n_estimators, random_state = 1)\n", "my_forest = forest.fit(features_forest, target)\n", "\n", "#Print the score of the random forest\n", "print(my_forest.score(features_forest, target))\n", "\n", "#Compute predictions and print the length of the prediction vector:test_features, pred_forest\n", "test_features = test[[\"Pclass\", \"Age\", \"Sex\", \"Fare\", \"SibSp\", \"Parch\", \"Embarked\"]].values\n", "pred_forest = my_forest.predict(test_features)\n", 
"print(len(test_features))\n", "print(pred_forest[:3])" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "ExecuteTime": { "end_time": "2018-05-29T07:26:25.602062Z", "start_time": "2018-05-29T07:26:25.572689Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 0.14130255 0.17906027 0.41616727 0.17938711 0.05039699 0.01923751\n", " 0.0144483 ]\n", "[ 0.10384741 0.20139027 0.31989322 0.24602858 0.05272693 0.04159232\n", " 0.03452128]\n", "0.905723905724\n", "0.939393939394\n" ] } ], "source": [ "#Request and print the `.feature_importances_` attribute\n", "print(my_tree_two.feature_importances_)\n", "print(my_forest.feature_importances_)\n", "\n", "#Compute and print the mean accuracy score for both models\n", "print(my_tree_two.score(features_two, target))\n", "print(my_forest.score(features_two, target))" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "slideshow": { "slide_type": "slide" } }, "source": [ "# 阅读材料\n", "机器学习算法的要点(附 Python 和 R 代码)http://blog.csdn.net/a6225301/article/details/50479672\n", "\n", "The \"Python Machine Learning\" book code repository and info resource https://github.com/rasbt/python-machine-learning-book\n", "\n", "An Introduction to Statistical Learning (James, Witten, Hastie, Tibshirani, 2013) : Python code https://github.com/JWarmenhoven/ISLR-python\n", "\n", "BuildingMachineLearningSystemsWithPython https://github.com/luispedro/BuildingMachineLearningSystemsWithPython" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# 作业\n", "https://www.datacamp.com/community/tutorials/the-importance-of-preprocessing-in-data-science-and-the-machine-learning-pipeline-i-centering-scaling-and-k-nearest-neighbours" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python [conda env:anaconda]", "language": "python", "name": "conda-env-anaconda-py" }, "language_info": { 
"codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.4" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 0, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": false, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": { "height": "780px", "left": "1279px", "top": "168.667px", "width": "341px" }, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 1 }