{ "cells": [ { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from numpy import *\n", "import matplotlib.pyplot as plt\n", "# Use the SimHei font so that CJK text renders in plots\n", "plt.rcParams['font.sans-serif'] = ['SimHei']\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def loadDataSet(fileName):\n", "    \"\"\"loadDataSet(parse each line and convert it to floats)\n", "    Desc: Reads a tab-delimited file and stores each line as a list of floats.\n", "    Args:\n", "        fileName: name of the data file\n", "    Returns:\n", "        dataMat: the dataset as a list of rows, each row a list of floats\n", "    \"\"\"\n", "    # Assume the last column is the target value\n", "    dataMat = []\n", "    with open(fileName) as fr:\n", "        for line in fr:\n", "            curLine = line.strip().split('\\t')\n", "            # Convert every element to float\n", "            # For details on map(), see https://my.oschina.net/zyzzy/blog/115096\n", "            fltLine = list(map(float, curLine))\n", "            dataMat.append(fltLine)\n", "    return dataMat" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def binSplitDataSet(dataSet, feature, value):\n", "    \"\"\"binSplitDataSet(binary-split the dataset on column `feature` at `value`)\n", "    Description: Given a feature and a value, splits the dataset into two subsets by array filtering and returns them.\n", "    Args:\n", "        dataSet: the dataset (NumPy matrix)\n", "        feature: index of the column to split on\n", "        value: the value that column is compared against\n", "    Returns:\n", "        mat0: rows whose feature is greater than value (left branch)\n", "        mat1: rows whose feature is less than or equal to value (right branch)\n", "    \"\"\"\n", "    # # Debugging aids\n", "    # print('dataSet[:, feature]=', dataSet[:, feature])\n", "    # print('nonzero(dataSet[:, feature] > value)[0]=', nonzero(dataSet[:, feature] > value)[0])\n", "    # print('nonzero(dataSet[:, feature] <= value)[0]=', nonzero(dataSet[:, feature] <= value)[0])\n", "\n", "    # dataSet[:, feature] takes the value in column `feature` (0-based) from every row\n", "    # nonzero(dataSet[:, feature] > value)[0] gives the indices of the rows where the condition is True\n", "    mat0 = dataSet[nonzero(dataSet[:, feature] > value)[0], :]\n", "    mat1 = dataSet[nonzero(dataSet[:, feature] <= value)[0], :]\n", "    return mat0, mat1\n" ] 
}, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Returns the value used for each leaf: the mean of the target column\n", "# My understanding: regLeaf is the function that produces a leaf node; it takes the mean, i.e. it represents the group by its center point\n", "def regLeaf(dataSet):\n", "    return mean(dataSet[:, -1])" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Total squared error = variance * number of samples\n", "# My understanding: this measures the spread of the group, so that splitting the tree places nearby values in the same group\n", "def regErr(dataSet):\n", "    # shape(dataSet)[0] is the number of rows\n", "    return var(dataSet[:, -1]) * shape(dataSet)[0]\n" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def createTree(dataSet, leafType=regLeaf, errType=regErr, ops=(1, 4)):\n", "    \"\"\"createTree(build the regression tree)\n", "    Description: Recursive function. For a regression tree the leaf model is a constant; for a model tree it is a linear equation.\n", "    Args:\n", "        dataSet: the loaded training dataset\n", "        leafType: function that builds a leaf node\n", "        errType: error function\n", "        ops=(1, 4): [minimum error reduction, minimum number of samples in a split]\n", "    Returns:\n", "        retTree: the finished tree (a nested dict), or a leaf value\n", "    \"\"\"\n", "    # Choose the best split: feature index and best split value\n", "    feat, val = chooseBestSplit(dataSet, leafType, errType, ops)\n", "    # If the splitting hit a stop condition, return val (the leaf value)\n", "    if feat is None:\n", "        return val\n", "    retTree = {}\n", "    retTree['spInd'] = feat\n", "    retTree['spVal'] = val\n", "    # Rows greater than the split value go left, the rest go right\n", "    lSet, rSet = binSplitDataSet(dataSet, feat, val)\n", "    # Recurse to build the left and right subtrees\n", "    retTree['left'] = createTree(lSet, leafType, errType, ops)\n", "    retTree['right'] = createTree(rSet, leafType, errType, ops)\n", "    return retTree" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[ 1. 0. 0. 0.]\n", " [ 0. 1. 0. 0.]\n", " [ 0. 0. 1. 0.]\n", " [ 0. 0. 0. 
1.]]\n" ] } ], "source": [ "testMat = mat(eye(4))\n", "print(testMat)" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "matrix([[ 0., 1., 0., 0.]])" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mat0, mat1 = binSplitDataSet(testMat,1,0.5)\n", "mat0" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "matrix([[ 1., 0., 0., 0.],\n", " [ 0., 0., 1., 0.],\n", " [ 0., 0., 0., 1.]])" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mat1" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# 1. Find the best way to split the dataset\n", "# 2. Generate the corresponding leaf node when no split is worthwhile\n", "def chooseBestSplit(dataSet, leafType=regLeaf, errType=regErr, ops=(1, 4)):\n", "    \"\"\"chooseBestSplit(find the best split of the dataset, or generate a leaf node)\n", "\n", "    Args:\n", "        dataSet: the loaded training dataset\n", "        leafType: function that builds a leaf node\n", "        errType: error function (total squared error)\n", "        ops: [minimum error reduction, minimum number of samples in a split]\n", "    Returns:\n", "        bestIndex: index of the feature to split on (None for a leaf)\n", "        bestValue: the best split value (the leaf value for a leaf)\n", "    \"\"\"\n", "\n", "    # ops=(1, 4) is very important: it sets the thresholds at which splitting stops. This is called prepruning; in effect it controls when the function stops.\n", "    # It guards against overfitting: splitting stops when the error reduction is below tolS, or when a subset after the split has fewer than tolN rows.\n", "    # Minimum error reduction: if a split lowers the error by less than this, do not split\n", "    tolS = ops[0]\n", "    # Minimum subset size: if a subset would be smaller than this, do not split\n", "    tolN = ops[1]\n", "    # If the target column holds a single distinct value, return a leaf\n", "    # .T transposes the column into a row\n", "    # .tolist()[0] converts it to a nested list and takes its only row\n", "    if len(set(dataSet[:, -1].T.tolist()[0])) == 1: # the set has size 1: every target value is the same, so there is nothing to split\n", "        # exit cond 1\n", "        return None, leafType(dataSet)\n", "    # Number of rows and columns\n", "    m, n = shape(dataSet)\n", "    # Total squared error before any split\n", "    # the choice of the best feature is driven by Reduction in RSS error from mean\n", "    S = errType(dataSet)\n", "    # inf is positive infinity\n", "    bestS, bestIndex, bestValue = inf, 0, 0\n", "    # Loop over the values of every feature column\n", "    for featIndex in range(n-1): # for each feature\n", "        # The line below transposes the whole column into a row and converts it to a nested list; [0] takes that single row, so set() sees the distinct values\n", "        for splitVal in set((dataSet[:,featIndex].T.A.tolist())[0]):\n", "            # Binary-split the dataset on each distinct value of this column\n", "            mat0, mat1 = binSplitDataSet(dataSet, featIndex, splitVal)\n", "            # Skip the split if either side has fewer than tolN rows\n", "            if (shape(mat0)[0] < tolN) or (shape(mat1)[0] < tolN):\n", "                continue\n", "            newS = errType(mat0) + errType(mat1)\n", "            # If the error after this split is lower than bestS, record the split point and the new lowest error\n", "            if newS < bestS:\n", "                bestIndex = featIndex\n", "                bestValue = splitVal\n", "                bestS = newS\n", "    # Check whether the error reduction is large enough\n", "    # if the decrease (S-bestS) is less than a threshold don't do the split\n", "    if (S - bestS) < tolS:\n", "        return None, leafType(dataSet)\n", "    mat0, mat1 = binSplitDataSet(dataSet, bestIndex, bestValue)\n", "    # Final check on the best split: is either subset smaller than tolN?\n", "    if (shape(mat0)[0] < tolN) or (shape(mat1)[0] < tolN): # the best split leaves a subset that is too small: do not split, make a leaf\n", "        return None, leafType(dataSet)\n", "    return bestIndex, bestValue" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'left': 1.0180967672413792,\n", " 'right': -0.044650285714285719,\n", " 'spInd': 0,\n", " 'spVal': 0.48813}" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myDat = loadDataSet('ex00.txt')\n", "myMat = mat(myDat)\n", "#print (myMat)\n", "createTree(myMat)" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAYAAAAD6CAYAAACoCZCsAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAHg5JREFUeJzt3V+IXNd9B/Dvb1fa2psmxh4r1LjsrEsgLQnJg7ahNE6T\nuLFJ1bRQ/GKyDm2CUbymEOqkhWDyKALFLwbXVvQQKNoNfQmhjeNgYifQuiR2Vw82wa1p1WiFIBTJ\ngrhBSRrk04c7E42u7rn3d84959xz53w/cNFo9s7MubPS73f+XzHGgIiIyrMydAGIiGgYTABERIVi\nAiAiKhQTABFRoZgAiIgKxQRARFQoJgAiokIxARARFYoJgIioUIeGLkCb22+/3Wxubg5dDCKiUTlz\n5swlY8yRrvOyTgCbm5vY398fuhhERKMiIgea89gFRERUKCYAIqJCMQEQERWKCYCIqFBMAEREhWIC\nIBqzvT1gcxNYWan+3NsbukQ0IllPAyWiFnt7wPHjwJUr1d8PDqq/A8D29nDlotFgC4BorB577Frw\nn7typXqeSIEJgGiszp93e56ohgmAaKw2NtyeJ6phAiAaqxMngPX1659bX6+eJ1JgAiAaq+1t4NQp\nYDoFRKo/T53iADCpcRYQ0ZhtbzPgkze2AIiICsUEQERUKCYAIqJCBU0AInJYRL7Z8vOPi8gFEXlx\ndrw75OcTEZFesAQgIjcDOAPg3o5TnzbG3D07Xg/1+USD4748NDLBEoAx5mfGmPcBuNBx6v0i8rKI\nfF1EJNTnEw1qvi/PwQFgzLV9eZgEKGOpxwDOAviSMeYDAO4A8OH6CSJyXET2RWT/4sWLiYtH5In7\n8tAIpU4AlwE8P3t8DsA76ycYY04ZY7aMMVtHjnTe1J4oD9yXh0YodQJ4FMADIrIC4L0Afpj484ni\n4L48NELREoCI3CUij9eefhLApwG8BOAbxpjXYn0+UVLcl4dGKPhWEMaYd83+/BGAL9R+9mMAHwn9\nmUSDm2/H8NhjVbfPxkYV/LlNA2WMC8GIQtneBs6dA956q/pzrMGf01mLwc3giOga3mayKGwBEKWW\ncw2b01mLwhYAUUq517A5nbUobAEQpZR7DZvTWYvCBECU0sGB2/OpcTprUZgAiFJaXXV7PjXeZrIo\nTABEKV296vY8oBs0DjmwvCzTWakTEwBRStOp2/OaXUZL3Yk059lUI8EEQJSSax+7ZtBYO7C8TAGz\n1KQXGBMAUUqufeyaaZmac7QB0ydJ9Eksvq/NfTbVWBhjsj2OHj1qiIo2nRpThezrj+k0/Dm7u8as\nr1//8/X16nkbn9eEeK1I8/WIdL+2AAD2jSLGDh7k2w4mACqeJkhqzmkKlvWAqUkSdT6vGfq1BdAm\nAHYBEeVM02XUdc7eXvV8k8UFXj6rgPusHO7z2mPH3J6nRlIlizxtbW2Z/f39oYtBNG6bm80LzUSA\n06evJQrbedNpNR3U5b3bXjP0awsgImeMMVtd57EFQDSU0LNybO9nq1Ebc31LwmcVcJ+Vw31eyz2L\nwtD0Ew11cAyAllZbv/3ubtWXLVL92XdA1aW/3PezXV/T97WxxwD6XFMGwEFgoozZAthk4jczpi0g\n9pltkyvbNe3s9A/cqb+vCMmGCYAoZ7ZpjLZjMrkWJCaT6lgMGF3TIoeopcdWL9fOTpjAnXKGUaRk\nwwRAlJpLoLQFGZ9jfb1KCDGClm+AGiJphArcKdcYREo2TABEKbkGStv5tkCuaSHE6LbwCVBDdTmF\nCtwpWwCRkg0TAFFKvoGyXktuCp6aQ8S91q053ydADbVIK9TnpkxgbAEwAdASCFmTW5y5ox0riBXk\nfALUUNs0NF3T4cM3jpdo3ytFFxbHAJgAaAnE6IPXjhPEHOj0CVB9arV9A+/i6ycTY9bWggfX4DgL\niAmARmx3t6pp1gPe2lq//8xttX+fWq3mvZtq6T5dS74DxyFrw
wXvF6RNANwKgqgv27YEkwlw6VL4\n9w2x3UHsrRT29qqtmc+fr/YbOnGiecvrxfNWVprvjOZbppWVKuTXiVR3O1ti3AqCKBXb9gOXL/d7\n35g3aI9983fNbSXr9yiw3RbTd3uHxY3uNM9rLdONdTTNhKEOdgHRKMTsaog5GDn0Ai/tGIfv9+iy\n3YZ2BfFIVlWDYwBEkbXN1skwKPQWOmFoZjj1/R59p9raPnck4wpMAEQxNQWReUDLabuEUGLUfLta\nALbvsW8i6tPyGMmdyJgAiGIaSU0wmFAL3eo/twVU2/u21d5XV6uunC7atRVNQX0kv3dtAuAgMJGP\nUPvRj2VA0fV6NTeh394GHn64+W5lP/1p83fRdDP4uatXgaefBh55pP1atIPAKys3/l5iDZ7P/x2I\nAIcOVX+m+PegyRJDHWwBULZC1ARDrlyNzfV6u7anrg/ANi2ka+pi0tTeV1fbr8Vnu43Fsmi7oFzO\ns5XHs5sN7AIiisilT9wWCDR90bkMJruOAbQFapdN8OoJRtt/r7ke2yyg1VV7YnHZSkL7fWnGQhwx\nARDFpplK2FbL19Y+c+lfDrHdtS242o56P7ym9t7VAujSlbwWr9s1ubsMLLsktBomAKKUQm/v3BYE\nfcq2WI7JJH6rwvZ9uF57U8BcnH7bdGgGgtt01cjnN+eZ/26afu7yu+z6vPlOrw6YAIhSCnmDF00Q\n1Iq1T1HT5zS1furP2b4n3/sZ7Oxca1VoZwFprsUnWdWDtktC6/o8x38DgyQAAIcBfLPl5zcBeAbA\nKwBOA9VeRLaDCYBGw/UWj7YgGHr3yrbE5NKn3cZ1PES7OrfPTqB9B9B3d927qzSHLUF1tWocW4HJ\nEwCAmwG8CuDnLec8BODk7PEzAO5re08mABqNvi2AeQ0v9WrbpkDtWgbXGUIxtqCIsVBtZydMYnep\nyQdaZzBYFxCA/2r52dcA3D97/CiAL7e9FxMAjYa22yBkLV8TSDWJaTG4+ATStvdOJfQCLdtK77e9\nrV8C6KrJB0pkuSaA5wB8bPb4IQBfaXsvJgAala5mfFstXxPM6zc7qfft22rzTWMAtqDkE0jbpk2m\nEnqLBpexCtvn+iakAC2kXBPA3kIL4PMATjSccxzAPoD9jY0N5wsnGozv5nCaWp+2hWEbZGybmTKZ\nXDvXJ5DGagGEmHbq2wJo+x40CXxnZ9BdQ3NNAJ+Z1/oBfGveGrAdbAHQaPTZHE4TvLRjDG2Benf3\nxu4noGoh+Mxfdym/K9eukNCrqkOt9A491qE0eAIAcBeAx2s/+7XZ4O+rnAVES6VPwNDUukPdHL5r\nxa1PH3TKnUK1m8/1HWsJcU0lJoCQBxMAjUafPuhQLQBNgNKU03U8omk/n76Lzfr26Q9dg4+RFB0w\nARCl1CfgaLqPms5ZW3Pv4mgb3NQGO1t3S8g1DF2D6SESXUwxusUcMAEQpdS3xqcZQA7RpWBLJE0z\nhWy1eJc1D4srgF3K3TXo3fXdDhyAh05ATABEqdX7oBdr59p7zmp3xQxVzum0fYaQ75bM9ffwSWja\nabW2axzy3r1sATABUKE0UzZtc/Zt58esOXYF9HrQcmkB2NYINM2nt3Vp+damUwzC2j6DYwBMAFQo\n7Q6gLoE1Zs2xK6BrtmS2jQG4tBRsSXLo7hybriDPWUBMAFSYtlp8V2Btq4nHDB5dLRbb4jLNauYQ\neyQNWZtuC+J9E1PEBMEEQDQE1wFSzWsXV+rGYlst3DfQ2oK3tpU0T5JD1Ka7Ek+fgd7ISY0JgGgI\n2gFS2xjA0IuPfPcp8nlP360tUumq4fdpAUTu1mICIBqCtgXQtqVBLouPbK2Cxa0j+ujao0izAjlm\nq6Crht/n+448TZQJgGgImpptrFptyFpl13WE6paylbnrZjUpxgU036dvEmILgAmAlpRmUVcMIWuV\nmpbMkGVOMTMoZpLJZAxgB
UQU1vY2cO5c9d/69GlgOgVEqj9Pnap+HsPGhtvzbc6f71cWLd8y28rn\nWu69PWBzE1hZqf7c27v2s+3t6vcV4/cX871daLLEUAdbAEQObLVK7SrkRV0tgFBdQL414VCbvQ25\nWjgisAuIqEBNu3T6BLm2MYC1tbBBMtReQa7BO9cFZgEwARCVxBZE24Kcy1488y0dQs62iTG91MXQ\nO4ZGxARAVIq22nDbuoSU3R+hWiYhsQXAQWCi0XvsMeDKleufu3Klet42mLq6an9NaHt7wPHjwMFB\nFWIPDoCTJ9N9vs2JE8D6+vXPra9Xzw+hbUA6EiYAorFrmxFjC3JXr7q9Vx9NCcqYdJ9vk8tMHKA5\nSR4/Hj0JMAEQjV3bVEpbkJtO3d6rD5egHvLz6zXqRx65sYY9n7L71lvVn0MEf6C9FReTpp9oqINj\nAERGN1ibw43cbWx97b6L5DSDv773ZBhK4AFpcBCYaAloA7XPjJidnWuze1ZXq7/HEHJ9gvb70O7J\nlMuAb+ABaSYAomUQa6ZK6kVQoTZu034f2l1Zc5nyGfj3oU0AHAMgylmoLQ/qUvc5h+pr134f2rGE\nGGMeGvXxCWCQAWkmAKKchdzfZ1GsxBKb9vtomv1UN9SUT9uMHyD5gDQTAFHOYs1Vj5VYYtN+H02z\nn3Z28pjyOdSMnyaafqKhDo4BEJk4Nz4Z80ZoA95sPYgEW1CAYwBESyLGXPWcFkG5ymXufpu2Vb0Z\ntb6YAIiWhetWAmMIpGPUtao3oy0omACIlsFAWwlQg64+/oxaX1J1F+Vpa2vL7O/vD10MovxtblZB\nv246rWr3lM7KSvNeRyJVaysBETljjNnqOo8tAKJlMNZpncsooz7+LkwARMtgREFn6WXUx9+FCYBo\nGaQOOgPsXT8aGfXxd2ECIFoGKYNOigHnsSeYkcyw4iAwEbmJPeA8TzCLM2nW17OtReeIg8BEFEfs\nAeectkpYckwAROQm9oBzU+ui7XnyxgRARG5iDzivrro9T96CJAARuUlEnhGRV0TktIhIwzkfF5EL\nIvLi7Hh3iM8mosRiDzjbblhve568hWoBPAjggjHm/QBuBXCv5bynjTF3z47XA302EaUWc5aL7Yb1\ntufJW6gEcA+A78wefxfARy3n3S8iL4vI15taCUREY1pINXahEsAEwE9mj98EcFvDOWcBfMkY8wEA\ndwD4cNMbichxEdkXkf2LFy8GKh4RjcaIFlKN3aFA73MJwC2zx7fM/l53GcDzs8fnALyz6Y2MMacA\nnAKqdQCBykdEY7K9zYCfQKgWwAsA7ps9vgfA9xrOeRTAAyKyAuC9AH4Y6LOJiMhDqASwB+BOEXkV\nVU3/rIg8XjvnSQCfBvASgG8YY14L9NlEROQhSBeQMeYXAD5Re/oLtXN+DOAjIT6PiIj6K3sh2Ng3\nnCIi6iHUIPD41Decmu9oCHDwiYiKUG4LgBtOEVHhyk0AvIUeERWu3ATAW+gRUeHKTQBcbk5EhSs3\nAXC5OREVrtwEAIzmvp2kwCm9RM7KnQZKy4NTeom8lN0CoDCGrn1zSi+RFyaA1IYOlqHNa98HB4Ax\n12rfKa+LU3qJvDABpJRDsAwth9o3p/QSeWECSCmHYBlaDrVvTukl8sIEkFIOwTK0HGrfnNJL5IUJ\nIKUcgqWPtnGLXGrfnNJL5IwJIKVcgqWLrnEL1r6JRkuMyfe2u1tbW2Z/f3/oYoS1t1f1+Z8/X9X8\nT5zIO1hublZBv246rWraRJQdETljjNnqOo8tgFC00zvH1lWxjOMWRASACSCMMU7v1CasUOMWy7b+\ngWgJMAGEEGJ6Z8oA6ZKwQoxbNH3egw8Ct9/OREA0JGNMtsfRo0fNKIgYU4W26w8R3et3d41ZX7/+\ntevr1fMxTKfN5Z1O7eWbTqvrmU7dy2X7vNjXSVQoAPtGEWPZAgjBpZukqabv0oII0VJw7df
3HbeY\nl7VpEHnuyhXgc5/TvR8RBcUEEIK2m8TW9WILkPWAHGqsIcV6hMWydnnjDXYFEQ2g7AQQqt9dOxfe\nVtNfXW1+33pADrWVRIr1CE1l7TqfiJIqdx1AfQ95oAqCMRcxraxUNfcm6+vdZbG9XqTqnnERez1C\n27U28bkGImrEdQBdhtiYzdbFMpkAN998/XP1v7e93qfrJvZ6BFuZViz/5HLfDoNoCZWbAIZY4NTU\n9bK2Brz5ZtUPvuiNN27s3x/TVhK2sn72s+O5BqIlV24CGGJjtqaxgre/HfjlL5vPr7dIFl8PVGMH\n83NyG0S1jYs89RT3DiLKhWau6FBH1HUAoebe950jb1tD0LSWYP5Z8+c5n56IGqD4dQBdM3xC7GIZ\nYlpmV4tj/vP6tMr6AOvYbyxjwy0kiOLRZImhDu8WQKqVta4rarVlbSpz22pa15XHY5F6hTTRkkDR\nLQDbDJ/QK05tA8YHB/raalO/PnBji0QzOB16IdfQNe+YM7VyuD6ioWmyxFCHVwtgd7e9ltxWe3Tt\nz++qlYesrab8rFxq3n33WLLJ5fqIIoGyBTB4kG87nBNAW3fK/JhM9K+dBwVbYtB8nkt3kOu1zQOk\nz+Bz/b0Xr28yiXstts+tX0OILrYmsd6XKBNlJgBNPzng9trJpL22uDgzx7e2qm159J1xZHvPriQW\nY4xBUwsPWVNf/O5KGUOhYpWZALqmVLYlAO1rbbVF31rl0N0R2qQZooa8GIRXV3WfESLpaZMcWwC0\nJLQJYLkGgTWDoJOJ/2sX1Qdljx1rPs/2/NwQW1Is0q587rtatz5l9upVXXlCbFmh2Zju8GGuRqbi\nBEkAInKTiDwjIq+IyGkREZ9zemvafmDR2hrwxBP6166v6xPGs882n2d7fm7oe+627U/UtUbCZSaN\ndnfQGCuxNd/lO97B1chUHk0zoesA8BCAk7PHzwC4z+ec+uE9C2jeZTCZVIe2+6Cpu0HbReM7YyXG\ngKRLt4lvF5TroLmmiy1W11eJayioaEg5BgDgawDunz1+FMCXfc6pH9ncEtKWGELMnAk9BuDzfj79\n7K6D5rbvZ3U17KC27fpSzdYiykDqBPAcgI/NHj8E4Cs+58x+dhzAPoD9jY2NqF+St6aAsrZmzOHD\n3YFXk0z6BEKXFkWfz3UdNO+aTRUb91GigqROAHsLtfvPAzjhc079yKYFUNdW+20LqClm/Gi7ovqW\nxWX20PzzY0xj9ZFLOYgiSZ0APjOv0QP41rym73pO/cg2AeTc36/timoL4NrxEpeuHnaxECWjTQCh\npoHuAbhTRF4FcBnAWRF5vOOcFwJ9dnq+9xIIPeOnaTfS+o1lgOYpnG2fqdnV1Lab6hNPdN/wJad9\neHIqC1Fqmiwx1JFtC8Cl+8Rn8ZOWphtmMmkul+a1vuVq62IZeuFbvZy5lIUoIBS5EjglTT+yZvZJ\nn4CjGYi1BXFN2WL02+e0D09OZSEKiAkgB7YAE2rqY9/57V37GMWYuRNrh8+YZeGgMY2MNgEs11YQ\nMfn0Fdv62d96q9/WBnNdK5+B9nGJ+TYLu7vN/fZA+G0qhrgXs42mLCHu+kaUK02WGOrIpgXg21ec\noouhz/0P6u+jXb3bVFt32dG0aQ2Fy4pt189se33X75XdRDRCYBfQgr6BIvedPtvWJcR436YdO12u\ns75dh2YBXdN7hPhuu/5t5NRlRaTEBDAXIlD0CQIp+o9jJRrt+/apJfu+NlXNnC0AGiEmgLkQ/4HH\nEARiJRrN+9q6n7oGoG2LxjTJNUbNvM9mgEQZYQKYCxEoSg0C2qmutu+4bQrq2po9+A/RAmj6Hc+v\ny3VXWaKBMQHMhQoUpU0F7Nv9A9i/o67pqynHAEKWiSgTTABzpdbe+9ImzrbFaLbvuGsBW6pZQC5l\nyq3Lj6iFNgEs/zoA2541vPtTO+2+RW3z923rBdpeM53
qfzea20Vq129o1iGkuksbUSLLnwCAMPeV\nLY12wVbbfXRtAfPEier2nE0ODsJtyuayiKvvojqiESojAZA72z2S6wF/e7v9vslNNfDtbeCrX7W/\nLtRq26b7ENtWMi+2FIGqtbio6dqJxk7TTzTUkc1K4NDGMqBcX7BlmwljG2fZ2Ym/jqBN7us3iCIB\nB4EzNcZBaU2ZmwJm34Hkvqttx7B+gygCbQJgF1BqLt0SudCUuWmcpe9Asq0LSUvbjUVUKCaA1ELf\nFSwF3zK7DCQ3Bepjx/rtxMkZYEStmAB89KmV5rQdspZvmV0GkpsC9bPP9m8tcQYYkZ2mn2ioI8sx\ngL59+Ms6BtD2Wt/BVO7ESeQFHASOJMTA4hhnmAxRZg7iEnnRJgB2AXWpd/ccHDSf59KHP8ZuiSHK\nfOyY2/NE5OTQ0AXI2nwl6bwf+uCg6qM25sZzc+7DH6tnn3V7noicsAXQpmn6ozH9V4n2GUSO8T65\nGuOMKaIRYQJoYws0xtw4YwXQBeNQNxkv4WblY5wxRTQiTABtugLN6dNVfzigD8ahFoL5vs+YWg1c\nyEUUl2akeKhj8FlATdMfm6ZCusxWCTW10ed9bHe92tlx++yUxjhjimhg4CygAOo7RNbNa9wufdWh\nujV83sc2pnHyZNqWgEsrJOSe/0R0PU2WGOoYvAWwqK3G7dICCLUQzOd92u56lWpufeiFcGNcWEcU\nGbgQLLC2IO8ahEJ1a7i+T9t9b1Otrg29uIuLxYhuoE0AUp2bp62tLbO/vz90MSr1NQFANSA531xs\nb+9ad9DGRjVQmdsCr7094FOfal7HMJ1eG9COaWWl+fNFqm6eod+PaAmIyBljzFbXeRwD0OraWXIM\nq3u3t4GHHx72blehp3ZyqiiRNyYArTHU8DWeeqqavjrUFsmhp3ZyqiiRP00/0VBHNmMAHGgMK/TU\nTk4VJboOOAYQkG0TuFT95kREDjgGEFLJe9Jwjj3R0mIC0Ch1oLGE/YaICsYEoFHqQOMYb2BPRGpM\nABql3ly85K4vogIESQAicpOIPCMir4jIaZH6RPNfnfdxEbkgIi/OjneH+PwkxjDPP7RSu76IChGq\nBfAggAvGmPcDuBXAvS3nPm2MuXt2vB7o8ymGUru+iAoRKgHcA+A7s8ffBfDRlnPvF5GXReTrtpYC\nZaLUri+iQoRKABMAP5k9fhPAbZbzzgL4kjHmAwDuAPDh+gkiclxE9kVk/+LFi4GKR95K7PoiKkSo\nBHAJwC2zx7fM/t7kMoDnZ4/PAXhn/QRjzCljzJYxZuvIkSOBikdERHWhEsALAO6bPb4HwPcs5z0K\n4AERWQHwXgA/DPT5tAy46IwoqVAJYA/AnSLyKqpa/gsicpeIPF4770kAnwbwEoBvGGNeC/T5NHZc\ndEaUHPcCojxwvyWiYLgXEI0LF50RJccEQHngojOi5JgAKA9cdEaUHBMA5YGLzoiSOzR0AYh+ZXub\nAZ8oIbYAiIgKxQRARFQoJgAiokIxARARFYoJgIioUFlvBSEiFwE07A+gcjvsu5IuqxKvGSjzunnN\nZfC95qkxpnM75awTQB8isq/ZC2OZlHjNQJnXzWsuQ+xrZhcQEVGhmACIiAq1zAng1NAFGECJ1wyU\ned285jJEvealHQMgIqJ2y9wCICKiFqNOACJyk4g8IyKviMhpERGfc8ZEec0iIn8vIj8QkX8SkVFv\n+ufyOxSRvxKR51OWLwbtNYvI34jIv4jIt0VkLXU5Q1L+236biPyjiPyriPztEOWMQUQOi8g3W34e\nJY6NOgEAeBDABWPM+wHcCuBez3PGRHM9HwRwyBjzewDeAeC+hOWLQfU7FJEpgL9IWK6YOq9ZRH4L\nwHuMMR8C8G0Av5m2iMFpfs/bAH5gjPkggPeIyO+kLGAMInIzgDNoj01R4tjYE8A9AL4ze/xdAB/1\nPGdMNNfzPwCemD3
+vxSFikz7O3wCwBeTlCg+zTX/IYBbReSfAXwIwI8SlS0WzTX/AsD6rAZ8E5bg\n37cx5mfGmPcBuNByWpQ4NvYEMAHwk9njNwHc5nnOmHRejzHmP40xL4vInwFYA/BcwvLF0HnNIvJJ\nAK8AeC1huWLS/Ls9AuCiMeYPUNX+705Utlg01/w1AH8E4N8B/Icx5myisg0tShwbewK4BOCW2eNb\n0LxkWnPOmKiuR0T+FMDnAPyJMeZqorLFornmT6CqEf8DgKMi8peJyhaL5prfBPD67PF/A7gzQbli\n0lzzFwGcNMb8NoDbROT3UxVuYFHi2NgTwAu41r99D4DveZ4zJp3XIyK/AeCvAfyxMeZ/E5Ytls5r\nNsZ80hhzN4AHAJwxxjyZsHwxaP7dngHwu7PH70KVBMZMc81vB/Dz2eNfAPj1BOXKQZQ4NvYEsAfg\nThF5FcBlAGdF5PGOc15IXMbQNNf85wDuAPCciLwoIp9JXcjANNe8bDqv2RjzfQCXROTfALxujHl5\ngHKGpPk9/x2AHRH5PoCbMf7/zzcQkbtSxTEuBCMiKtTYWwBEROSJCYCIqFBMAEREhWICICIqFBMA\nEVGhmACIiArFBEBEVKj/B9B9U72DTjmgAAAAAElFTkSuQmCC\n", "text/plain": [ "<matplotlib.figure.Figure at 0x109266cc0>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt \n", "myDat=loadDataSet('ex00.txt') \n", "myMat=mat(myDat) \n", "createTree(myMat) \n", "plt.plot(myMat[:,0],myMat[:,1],'ro') \n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAWwAAAD6CAYAAACF131TAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAHGhJREFUeJzt3V+IJWl5x/Hf0z3dOu2o2akZUAxzJsGLiKCB6YBEDbii\nkPEP8SIie6YZt5FlZxEGEwzIXM9FQggM6LjMhes6fYKQq6iMWfybRFFMz8V6YSKisYe9kZ4edHec\njdPOvLmoru1zquuteqtOVZ2q098PHPr0OdVVb7Gzz3nP8z7v+5pzTgCA7luYdQMAAGEI2ADQEwRs\nAOgJAjYA9AQBGwB6goANAD1BwAaAniBgA0BPELABoCeO1HmyEydOuNOnT9d5SgCYezdv3rztnDtZ\ndFytAfv06dPa3Nys85QAMPfMbCvkOFIiANATBGwA6AkCNgD0BAEbAHqCgA0APUHABtC+0Ug6fVpa\nWIh/jkazblEv1FrWBwCFRiPpiSeke/fi37e24t8laTicXbt6gB42gHZdurQfrBP37sWvIxcBG0C7\nbt0q9zpeQcAG0K5Tp8q9jlcQsAG06/JlaWVl8rWVlfh15CJgA2jXcChduyYNBpJZ/PPaNQYcAxCw\nAbRvOJR++Uvp4cP4Z1GwpgxQEmV9ALqOMsBX0MMG0G2+MsBz5w5db5uADaDb8sr9kt521aDds1QL\nARtAtxWV+1WddJOkWra2JOemD/4tIGAD6LasMsC0KpNuejjjkoANoNvGywB9qky6qWvGZYtpFQI2\ngNkrCnpJGeDGRn2TbuqYcdlyWoWADWC2ygS9uibdjEbS3bsHXy8b/FtOqxCwAcxW2aBXdtJNIunF\nm0lra9LOzuT7UeQP/r5vAC0vZEXABjBbbQS98V68FPfks1y6dDAo530DaHkhKwI2gNk6fjz79bJB\nLy8PntWLT9vZyQ7Ked8AWl7IioANYHZGI+nFFw++fuRIuaCX1QteW4vTH6dP7/esy0iCct43gJYX\nsiJgA2ier/d76ZK0u3vw+N//PvwcyXnSveAk7bG1FQfTKm7d8vf0nYvbIVXLqVfhnKvtcebMGQcA\nEzY2nFtZcS4OcfFjZSV+3Wzy9fHHYBB2Duf858h7JNceDJyLIn8bsq7ta0dFkjZdQIwN6mGb2afM\n7JvNfWwAmFt5OeC8PPV4KsJ3josX4+eLi+XaFEXS9etxyPWlXpJcdNHEnXv3pPPnW5nSXhiwzWwg\n6eONtwTAfMrLAV++7E9XjAdz3zl2duJA+eBBuTYdOxYH4iT3XVTil5QS+tr64IH0+OONB+2QHvYV\nSZ9ptBUA5lde6dtwKD355MFAmK60yOuJnz8fB9gykg8AX/VIEtCz2uyzu7vf429IbsA2s8ckPS/p\nJznHPGFmm2a2ub29XXf7APRdUenb1atxeiKv0iKvYuTBA+nXvy7XpiTwlq0BL1qIKt1Tr1lRD/uD\nkt4r6cuSzpjZJ9MHOOeuOedWnXOrJ0+ebKKNAPospPStaPbicJjfi85LieT13qtMfDl61P9ew3ID\ntnPuMefcuyR9TNJN59xn22kWgLmRTD5JSuSSgbyyrlwpXmY1qyf/5JP+D4syE198+e5xZVMzJVGH\nDaA5da5ml/TU82T15K9eney9S/v13JcuxTnwkIkvRbMll5fjD5UGmfPNqa9gdXXVbW5u1nY+AD3n\nm2U4GOwHz7JOnMju5UaRdPt2/t+mN/SV4h51yOzEhQX/GiSDQfVvDpLM7KZzbrXoOHrYAJrTxMJO\nV65IS0uTry0thfVup1kO1ZfXXliIP5TOnYs/TNjAAEAvNbGa3XAoPfPMZBrjmWfCerfTfID4KkQe\nPtx/vrPTaD02ARtAc5paza7qmtjTfICkq118syt3d9nAAEAPhZT0ldkTcdr9E6t+gCTXXVuLf79+\nPb+UsKENDFj8CcDsFC3qVPXYomsOBvHiT8niTlXauLAQtnBVA
AUu/kSVCIDZKVNF0kTFSYgq62lv\nbJSqGKFKBEB3lN0TcWtruv0Tq6ZOsv6ubHojihpbE/tII2cFgES69jmZPCPFg32+3uv4RJu8Y9MD\nhnnXywukvr87ftxf9/3yywdrupucPBOSNwl9kMMGcMBgUH1zgKJjs3LYeder0s4o8l+3bD7cQ4E5\nbAI2gGb5dpUxi98fD3p5u8Okj/UFyKLrVWlnTYHZJzRgM+gIoFltDyxWPcesBjXFoCPQXdPWEvdN\nmdrnOibaVD1HU5N86hTSDQ99kBIBCtRVS9w3ZVIKdaQfqp6j4dSHj0iJAB00w6/dvVXXetodFpoS\noawPaFMTq9fNs6olenOKHDbQpiZWr5tn0yyHOocI2ECb+jCw1SV8I5lAwAbaFLJ6HfbxjWQCARto\nW9W1nA8jvpFMIGAD82Ie67v5RjKBgA3MgzK7k/ctsPON5BUEbKANTQfJ0GqKMoEdncPEGaBp6Vpi\nKc7D1vnVfmEhDsBpZpObxDJxp5NYSwToirpqifN66aHVFJTJ9RoBG2haHUGyKJURWk1BmVyvEbCB\nptURJH299IsX4+eh1RTTlMll9fD7NoDZdyErRIU+WK0PyFB1hb6Qhf2l5ncN993D8rJzS0uHb+XB\nBogdZ4AOSYKk5NziopvY9sp3fMjWWemtrzY24i2txre3qiOA+rbPqrIVFw4IDdhUiQBtKVMt4qvm\nyJJUgoxG0uOPS7u7/mOjKN4ktmx1iq8KJa89CEaVCNA1ZapFygxIJrnwS5fyg7UU7/69vl6ca07n\npo8fD29PmWNRCgEbaEuZapEyA5Jnz+afP+3+/Ti4+wYMsypSXnpJWloKbxMaURiwzeyImf2LmX3f\nzL7QRqOAuVSmWiSrmsPnxo3882dJygLHg/Ljj0snTkjnzh38JnD/vvS6101WofjcuRPeDpQS0sP+\nK0nPO+feKemNZvanDbcJmE9lSurGy/SKJD3rMivYLS4eDMq7u3HKxOfOnck1PaIo+zhSIo0JCdj/\nJumfzOyIpD+Q9GKzTQLmULIv4b17cbCUileeSxY9KgraSc96OJQuXAhrz4MHYcdlXafIzg412Q0p\nDNjOubvOuXuSvi/pV865X4y/b2ZPmNmmmW1ub2831U6gv8ZzwlIcLJOedUi1Rl56JN1Dv3pV2tjI\nD/JR5O8d+2R9E8hLfbCoVDOK6v4kRZJeJWlR0r9Leo/vWOqwgQy+GuYy9cpl67iTv8mq5Y4i517z\nmnJ11VnXCanNpiY7iALrsENSIn8r6a+dcw8k3ZN0tKHPDmA+1bGWSJIecU76/e/jn8nqer6p4Uke\nPN2b3tmRfvvb4muurMS9dd8a1CEDoywqVauQgP05Setm9gNJO5Kea7ZJwJzxDcKFDs751vBIKjry\n1rYeDqVjx8LbmuTXk0HJpPwvS8jAKItK1SukGx76ICUCZBifKp5OTRS5cOHgWiJLS/E6HqFpiKK1\nSMbXAblwIX/dk/F1SKIofiTPWVekMtWYEgEwDd/gXFG98mgkPf30wSnhu7txXbRPOg3h6+VG0cHV\n/W7cyJ6Nef58fNza2n6PfmcnfiTPzeJzsvdiYwjYQNOqLq966VL4+h155/XVf1+5cnCvRF/OOSkD\nzGvP/ftx+oW9FxtDwAaaVnUN6qoDdslU9UQ615yXn54258wgY6MI2EDTQjcXSKsaPJOp6uk2JB8c\nSW85a5CyzJT4LAwyNoqADbQhKctL0gVS8U4tvuCZVHL4+Hq5IasFpj9ciq6V1WY0hoANtK1of8ZE\nVs88ioqnlft6uaH14OMfLs8+G97jjiLy1g0jYANtK7MudrpnXlRZkpcb9wXyvHrwrA+NCxf8g5ho\nFAEbaJtvJ5mQHWbycsRFufHLl7PXtH7ppfw1P9IfGlevVsvJY2psEQa07ciR7LTG4mI87TxPmW3G\nspw4kb2E6mCwn1tH69giD
OgqXw46ZMnTqhUnCV9KhXK8Xjgy6wYAh85gkJ3+CNmsQIqDc9X0w6lT\n2demHK8X6GEDbas6kabv18bUCNhA26ZNa/T12pgag44AMGMMOgLAnCFgA0BPELABoCcI2Dh8srbc\nAnqAgI35kxeQQxdeAjqIgI35UhSQyyy8BHQMARvzpSgg5y28RJoEHUfAxnwpWvM5b0F+0iToOAI2\n5kvRhrchCyxJ/jQJA5aYIQI25otvrYyzZ+MAW0a6t86AJWaMgI35kl4rI4rin5//vD9/bZb9erq3\nzoAlZoyAjfmT7JBy/br08svSb3/rP3YwkB59NPu9s2cnfw/dExFoCAEb/ROaR87qEafduiV997vZ\n7924Mfl7UX4caBgBG/2SlUd+/PF466t0AA/p+TrnH4hM/z1rSWPGCNjol6xe8+5uvE9hEsDPnYvz\n0gtT/vNOdhNPevRra9LRo/t58WQtaYnKEbSCgI1+KZMvDi3hy5Pu0e/sxHnx69f3N61dX5/s8a+v\nE7TRCDYwQL+cPu2v9vBZXKwWvM38eyAmu4z7diGPIun27fLXxKFU2wYGFnvWzH5oZl8xMzbuRXl1\nTTjJyiMXefgwDqBlnTpVXBmSFazzXgemEJISeaekI865d0h6naT3N9skzJ06J5yk66xD8tRVqjiS\nwUQqQ9AhIQH7V5Ku7D2/32BbMK+qTjjx9cqTOuuHD6UvfUlaXvafIwm8d+6Et3dhYX9j2qLKEF/P\nvUqPHihQGLCdcz9zzv3IzD4iaVnSc+Pvm9kTZrZpZpvb29tNtRN9VpRWGI3iXLBZ/DhxQnrqqbBe\n+XAofeELkzMb01Ucw2G5HvHDh5Pnz9tl/KMfzT6H73VgCkGDjmb2YUl/I+lDzrmXfMcx6IhMVQYK\nfZLBvrJGo7gsL3SQPfQ6vnur2k4cSnUOOr5B0qclfSAvWANeVQYKfapOAx8OpSef9K8bUvU6TFdH\ni0Jy2OclvVHSc2b2PTNbb7hNmDdJWiFvLepQ0wz2Xb0a108PBvHvecE79DoMSqJFITnsv3fOvdk5\n9669xxfaaBjm0LQTWaadBj4axQOdt27FHx6+9EiZ6zBdHS1ipiOal5T1TWt8sK9qG5JBzLwPjzLX\nKRqUBGrETEc0r45Bx2kH8ULb4LvOeO/81Km4B01QRk1qG3QEpjbtAFySYphmtmRIG3ypDHaaQUcQ\nsNG8KgNwi4sHV8QrEzTTwT1Zea/oOsPhwb+9eJGdZtANzrnaHmfOnHHAARsbzq2sOBeH2uLHykr8\nN+MGg+xjB4OD14qig8ctLTm3vFx8nbJtBWogadMFxFh62GheemAuj2/QzpfS2Nra7wknsyOzFl7a\n3ZVe+9riwcGQXWoSdZQpAiUw6Ij2VZkdGDJoaJY/k9Fsctp5loWF8NmQUrljAQ8GHdFdVWqXQ2ZL\nFgXPkFx6mXx7MgEHaAkBG+2rUrtcJq2SJXQyS9YHw/KytLRU7XxAjUiJoJ98KZKstEgUSVeuhNdN\nZ9VcS9RhozGhKRECNvopqY0eHyBcWZHOn5du3CCwoldCAzbbfaGfkiBMrxeHCDnsWaprn8PDanzn\nmV/+kmCNuUcPe1bSX+mTmXsSgQdAJnrYs1J1n8Ou4tsC0Dh62LMyTzuV8G0BaAU97FmZp51K5u3b\nAtBRBOxZmWankq6lH+bp2wLQYQTsWam6U0kX12aep28LQIcxcaZvqiyc1DTfJBa2ygKCsPhTX5RN\nb3Qx/cC+hkArCNizVCW9MW36oan8N5NYgMYRsGepSnVFlcHKJEibSWtr3cp/AwhGwJ6lkPRGukcs\nlUs/jPfipYMr2VF+B/QGE2dm6dSp7AHEhYX9Xm96QsramvTkk+EDjCFbXlF+B/QCPexZ8u2i8uBB\nHKizdut2Tnr66fA0Rkgw9u0oDqBTCNizlFRXZG3meu9e9mayUhy0Q9MYs66F7tokH6DHCNi
zNhwW\nbwybJTSNEbIX4p075a8vFQfjLk7yAXqMgN0Fvl7wsWPl/yZtvEZ62nONCwnGWSkdBjmBygjYXeAr\n1XvVq7KPNyu3AWxSI72xcfA6S0vS3bvlUxZFJYmjkT+lwyAnUElQwDazJTP7atONObR8MwV9qQrn\nqk1MSa4TRfuv7e7GgbVsysIXdLe24sB/8aL/b2edVwd6qjBgm9lRSTclva/55hxiWTMFfYEtL70R\n4uWX/e+Fpizygu7Wlr93LZX7dgDgFYUB2zn3snPubZJeaKE9GDfNEqw+ddVlhwxmZokipq0DFU2d\nwzazJ8xs08w2t7e362gTxh09uv88iqZfVCkkGIekLLLSK0VWVqQrV8KPBzBh6oDtnLvmnFt1zq2e\nPHmyjjZB2q/CGE8t5KUyQoVMkrl7N3zwMa9NUcQKfkCNqBLpqia23RqNpJdeKj5uZ6d48HE0ks6f\n96dXkt40K/gBtSFgd1XZda9DZhReuiTdv3/w9YWMfwbpD4fx8584Ia2vx1PofehNA7ULDtjOuTc3\n2ZBDLSvYlln3OnRGoS/Y+2ZaJsenz7+zkx34E4MBwRpoAD3sWfMF27NnwytEQtMnvg+BrB62tJ/v\nDqksKWojgKkRsGfNF2xv3Ahf9zo0feIrExyvRAk5j8/iIqkQoEEE7FnLC7ah226Fpk98Myp9vedk\npmXozMRnnyVYAw0iYM+aLxgePx6+LGmZCTZlZlQmr4dMkmFCDNA4AvasZQXDpaW4/C50WdJpdy0v\nCvjpFf/MDh7LhBigec652h5nzpxxqGBjw7nBwDmz+GcUOReH6snHYNBeGzY2/MeNty+K/McCCCJp\n0wXEWHPpTVmnsLq66jY3N2s736G1sHBws1wp7tlW2eygLklFSzrnHUVxD5uUCFCJmd10zq0WHUdK\npIvK1GC3yVfeFzIzEsDUCNhd1MQqfXXIK+9jJxmgcQTsLpp2ELEpRT18dpIBGkXA7qrQGuw2FZX3\nzTplA8w5AjbC5a2B3YWUDTDnCNgoZziUbt+ON/TtWsoGmHNHZt0A9NRwSIAGWkYPGwB6goANAD1B\nwAaAniBgA0BPELABoCcI2ADQEwRsAOgJAjYA9MR8BOzRKHw7rbquc+JE/Gj6mgCwp/8zHUcjaX1d\nun8//n1rK/5dqncmXnrx/p2d/feSLbzqviYAjJl9D3va3vHFi/vBOnH/fvx6nXyL9ydYDxpAw2bb\nw073Wqv0VMd7uiGvVxWy1jPrQQNo0Gx72Fm91qKearpHXpeinn7IWs8LC/HftZVTB3C4hOzUG/oo\nvWu6Wfbu4GaTxyU7eifvZf1N+hFFxdfPO+/KyuRu4Bsb8WtF111edm5pKf9cADBGgbumz7aHHbLZ\nbJI22dqKfw/Z5X1pSfroR+PerZl05Ej8c7y3W3TedE8/vW1XFMU96LT796Xd3YPnOneOyhIA0wmJ\n6qGP0j3srF5rujea9ICLHoNB3EseDJy7cMHfG07OH3LedE9/vN2h7cp70PMG4MJ72OZCeqyBVldX\n3ebmZrk/Go3inuytW3HP+vLlyQHHhYXiXvVgEO97mDh9er/nnCWKpDt3yp83ae/4QOm0oijewQXA\noWVmN51zq0XH5aZEzOzVZvY1M3vezK6bmdXXxD3JZrPXr8e/r61NpguKBvuy9hLMC9ZSXEFy/Hj+\nMUtL0t27B9MXReV9y8vZqZK8tpAaARCgKLKck/SCc+7tkh6R9L5GWjGeT3Zuv7xvNMreqTv53Mja\nS3A02n+/SN4O4Lu7cTBN2rO2Jj31VH7pnpn07nfHOfMyqN8GEKAoYD8q6Rt7z78t6T2NtCKvvC89\n2DcYxL1x5+Keebpe+9KlsIHJO3fCBhETzklPP53fM3dO+u53D07kKUL9NoAARQE7kvSbvecvSjoQ\nrczsCTPbNLPN7e3taq3wBaytrbjHnKRNHj7MDtIh50o
7dWryvMeOxT/zOBcH+jwPHoRdP90WAChQ\nFLBvS3r93vPX7/0+wTl3zTm36pxbPXnyZLVW5AWs9fVyOd6Q4JfkvccnuBTlvRNFvffFRf97R47E\nufGstgBAgaKA/S1J7997/qik7zTSisuX/XnnsuuCZOW8l5fjlEeSUrl2LX59fX0/b16HlZU4956V\nG48i6YtflJ55ZjK9k87BA4BPXs2fpFdJ+pqkH0u6LsVlgL5H6TrsyULE/EdWvXJSD53UXyfH+F4f\nF0XT11GP12uXvT4A7FFv6rATRbXT6ZrorHrolZXwHmtdFYpZtdoAUEItdditunz5YH53XHowscrC\nUXUj/wygRd0J2MOh9IlP+N9PDyb6qkGKqkSSgcYQeWV+i4v7HxBMfAHQgu4E7NFIevbZ7PeyerIh\nC0dlXWN8wae0hYXJwclHHsk+zmy/fG98kg8ANKg7Ads35XtxMTsvnVUNUpSiyJtWPhhIX/pSvK5H\nUu/tq7lO5/3ZbQZAC7qzp6MvleGbiJIE8LyFo0KvYZY9cHjqVHh9NrMVATSsOz3svFSGL+VQZgZk\n3jV8r589e7CaxFddwmxFAA3rTsDOSnEk6ko5lEmjJDn18fSHmfSWt2Sf++zZ6dsHADm6E7CTRZ58\n6kg5ZC0k5avbzsp3Oyf99KfZ575xY/r2AUCO7kycSfgm0LQ9QSVk44RxZsWLRwFAhv5NnElUqf5o\ngi8n7VvciRw2gIZ1L2CXSVs0yffBkbW4EzMeAbSgewFbKl/90VQbsj44rl7txgcKgEOnezlsADhk\n+pvDDjG+8cD4BrkAMMe6M9MxVHpZ1WQtD4m0BIC51r8edheWVQWAGehfwK66rCoA9Fz/AnaVZVUB\nYA70L2B3ZWINALSsfwG7KxNrAKBl/asSkeLgTIAGcMj0r4cNAIcUARsAeoKADQA9QcAGgJ4gYANA\nT9S6Wp+ZbUsK3GZ8wglJt2trSD8cxnuWDud9c8+HwzT3PHDOnSw6qNaAXZWZbYYsLThPDuM9S4fz\nvrnnw6GNeyYlAgA9QcAGgJ7oSsC+NusGzMBhvGfpcN4393w4NH7PnchhAwCKdaWHDQAo0FrANrNX\nm9nXzOx5M7tuZlblmD4JvGczs2fN7Idm9hUz6+eCXHvK/Dc0s0+Z2TfbbF8TQu/ZzP7OzP7TzL5u\nZsttt7Nugf++X2Nm/2pm3zezf5hFO5tgZktm9tWc9xuJZW32sM9JesE593ZJj0h6X8Vj+iTkft4p\n6Yhz7h2SXifp/S22rwlB/w3NbCDp4y22q0mF92xmfyzprc65d0v6uqQ/bLeJjQj5bz2U9EPn3Dsl\nvdXM3tJmA5tgZkcl3VR+fGoklrUZsB+V9I2959+W9J6Kx/RJyP38StKVvef322hUw0L/G16R9JlW\nWtS8kHt+r6RHzOw/JL1b0v+21LYmhdz37ySt7PUwX605+DfunHvZOfc2SS/kHNZILGszYEeSfrP3\n/EVJxyse0yeF9+Oc+5lz7kdm9hFJy5Kea7F9TSi8ZzN7TNLzkn7SYruaFPLv9qSkbefcXyjuXb+r\npbY1KeS+/1nSX0r6b0n/45z7eUttm7VGYlmbAfu2pNfvPX+9sqdwhhzTJ0H3Y2YflnRR0oeccw9a\naltTQu75g4p7nF+WdMbMPtlS25oScs8vSvrp3vNfSHpTC+1qWsh9f0bS0865P5F03Mz+vK3GzVgj\nsazNgP0t7ednH5X0nYrH9Enh/ZjZGyR9WtIHnHMvtdi2phTes3PuMefcuyR9TNJN59xnW2xfE0L+\n3d6U9Gd7z9+sOGj3Xch9v1bS/+09/52kYy20qwsaiWVtBuyRpDeZ2Y8l3ZH0czP7x4JjvtVi+5oQ\ncs/nJb1R0nNm9j0zW2+7kTULued5U3jPzrkfSLptZv8l6afOuR/NoJ11C/lv/TlJF8zsB5KOqv//\nTx9gZn/UVixj4gw
A9AQTZwCgJwjYANATBGwA6AkCNgD0BAEbAHqCgA0APUHABoCe+H8CkAvme3HW\negAAAABJRU5ErkJggg==\n", "text/plain": [ "<matplotlib.figure.Figure at 0x1092769b0>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt \n", "myDat1=loadDataSet('ex0.txt') \n", "myMat1=mat(myDat1) \n", "createTree(myMat1) \n", "plt.plot(myMat1[:,1],myMat1[:,2],'ro') \n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3. Tree Pruning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.1 Prepruning" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'left': {'left': {'left': {'left': 105.24862350000001,\n", " 'right': 112.42895575000001,\n", " 'spInd': 0,\n", " 'spVal': 0.958512},\n", " 'right': {'left': {'left': {'left': {'left': 87.310387500000004,\n", " 'right': {'left': {'left': 96.452866999999998,\n", " 'right': {'left': 104.82540899999999,\n", " 'right': {'left': 95.181792999999999,\n", " 'right': 102.25234449999999,\n", " 'spInd': 0,\n", " 'spVal': 0.872883},\n", " 'spInd': 0,\n", " 'spVal': 0.892999},\n", " 'spInd': 0,\n", " 'spVal': 0.910975},\n", " 'right': 95.275843166666661,\n", " 'spInd': 0,\n", " 'spVal': 0.85497},\n", " 'spInd': 0,\n", " 'spVal': 0.944221},\n", " 'right': {'left': 81.110151999999999,\n", " 'right': 88.784498800000009,\n", " 'spInd': 0,\n", " 'spVal': 0.811602},\n", " 'spInd': 0,\n", " 'spVal': 0.833026},\n", " 'right': 102.35780185714285,\n", " 'spInd': 0,\n", " 'spVal': 0.790312},\n", " 'right': 78.085643250000004,\n", " 'spInd': 0,\n", " 'spVal': 0.759504},\n", " 'spInd': 0,\n", " 'spVal': 0.952833},\n", " 'right': {'left': {'left': {'left': 114.554706,\n", " 'right': {'left': 104.82495374999999,\n", " 'right': 108.92921799999999,\n", " 'spInd': 0,\n", " 'spVal': 0.698472},\n", " 'spInd': 0,\n", " 'spVal': 0.706961},\n", " 'right': 114.15162428571431,\n", " 'spInd': 0,\n", " 'spVal': 0.666452},\n", " 'right': {'left': 
93.673449714285724,\n", " 'right': {'left': 123.2101316,\n", " 'right': {'left': 97.200180249999988,\n", " 'right': {'left': {'left': 109.38961049999999,\n", " 'right': 110.979946,\n", " 'spInd': 0,\n", " 'spVal': 0.543843},\n", " 'right': 101.73699325000001,\n", " 'spInd': 0,\n", " 'spVal': 0.51915},\n", " 'spInd': 0,\n", " 'spVal': 0.553797},\n", " 'spInd': 0,\n", " 'spVal': 0.582311},\n", " 'spInd': 0,\n", " 'spVal': 0.613004},\n", " 'spInd': 0,\n", " 'spVal': 0.640515},\n", " 'spInd': 0,\n", " 'spVal': 0.729397},\n", " 'right': {'left': {'left': 12.50675925,\n", " 'right': 3.4331330000000007,\n", " 'spInd': 0,\n", " 'spVal': 0.467383},\n", " 'right': {'left': {'left': {'left': -12.558604833333334,\n", " 'right': {'left': 14.38417875,\n", " 'right': {'left': -0.89235549999999952,\n", " 'right': 3.6584772500000016,\n", " 'spInd': 0,\n", " 'spVal': 0.385021},\n", " 'spInd': 0,\n", " 'spVal': 0.412516},\n", " 'spInd': 0,\n", " 'spVal': 0.437652},\n", " 'right': {'left': {'left': -15.085111749999999,\n", " 'right': -22.693879600000002,\n", " 'spInd': 0,\n", " 'spVal': 0.350725},\n", " 'right': {'left': 15.059290750000001,\n", " 'right': {'left': -19.994155200000002,\n", " 'right': {'left': {'left': {'left': {'left': {'left': 0.40377471428571476,\n", " 'right': -13.070501,\n", " 'spInd': 0,\n", " 'spVal': 0.25807},\n", " 'right': 6.770429,\n", " 'spInd': 0,\n", " 'spVal': 0.228473},\n", " 'right': -11.822278500000001,\n", " 'spInd': 0,\n", " 'spVal': 0.217214},\n", " 'right': 3.4496025000000001,\n", " 'spInd': 0,\n", " 'spVal': 0.202161},\n", " 'right': {'left': -12.107972500000001,\n", " 'right': -6.2479000000000013,\n", " 'spInd': 0,\n", " 'spVal': 0.156067},\n", " 'spInd': 0,\n", " 'spVal': 0.166765},\n", " 'spInd': 0,\n", " 'spVal': 0.297107},\n", " 'spInd': 0,\n", " 'spVal': 0.324274},\n", " 'spInd': 0,\n", " 'spVal': 0.335182},\n", " 'spInd': 0,\n", " 'spVal': 0.373501},\n", " 'right': {'left': 6.5098432857142843,\n", " 'right': {'left': 
-2.5443927142857148,\n", " 'right': 4.0916259999999998,\n", " 'spInd': 0,\n", " 'spVal': 0.044737},\n", " 'spInd': 0,\n", " 'spVal': 0.084661},\n", " 'spInd': 0,\n", " 'spVal': 0.126833},\n", " 'spInd': 0,\n", " 'spVal': 0.457563},\n", " 'spInd': 0,\n", " 'spVal': 0.499171}" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myDat2 = loadDataSet('ex2.txt')\n", "myMat2 = mat(myDat2)\n", "createTree(myMat2)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'left': 101.35815937735848,\n", " 'right': -2.6377193297872341,\n", " 'spInd': 0,\n", " 'spVal': 0.499171}" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "createTree(myMat2, ops=(10000, 4))" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Post-pruning\n", "# Check whether a node is a dict\n", "def isTree(obj):\n", "    \"\"\"\n", "    Desc:\n", "        Test whether the input is a tree, i.e. whether it is a dict\n", "    Args:\n", "        obj -- input variable\n", "    Returns:\n", "        A boolean: True if obj is a dict, otherwise False\n", "    \"\"\"\n", "    return (type(obj).__name__ == 'dict')" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Compute the mean of the left and right branches\n", "def getMean(tree):\n", "    \"\"\"\n", "    Desc:\n", "        Walk the tree from the top down to the leaves; whenever two leaves are found, take their average.\n", "        This collapses tree in place and returns its mean value.\n", "    Args:\n", "        tree -- input tree\n", "    Returns:\n", "        The mean value of the tree node\n", "    \"\"\"\n", "    if isTree(tree['right']):\n", "        tree['right'] = getMean(tree['right'])\n", "    if isTree(tree['left']):\n", "        tree['left'] = getMean(tree['left'])\n", "    return (tree['left']+tree['right'])/2.0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2 Post-pruning" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Check whether branches should be merged\n", "def prune(tree, testData):\n", "    \"\"\"\n", "    Desc:\n", "        Walk down to the leaves and use the test set to decide whether merging them lowers the test error\n", "    
Args:\n", " tree -- 待剪枝的树\n", " testData -- 剪枝所需要的测试数据 testData \n", " Returns:\n", " tree -- 剪枝完成的树\n", " \"\"\"\n", " # 判断是否测试数据集没有数据,如果没有,就直接返回tree本身的均值\n", " if shape(testData)[0] == 0:\n", " return getMean(tree)\n", "\n", " # 判断分枝是否是dict字典,如果是就将测试数据集进行切分\n", " if (isTree(tree['right']) or isTree(tree['left'])):\n", " lSet, rSet = binSplitDataSet(testData, tree['spInd'], tree['spVal'])\n", " # 如果是左边分枝是字典,就传入左边的数据集和左边的分枝,进行递归\n", " if isTree(tree['left']):\n", " tree['left'] = prune(tree['left'], lSet)\n", " # 如果是右边分枝是字典,就传入左边的数据集和左边的分枝,进行递归\n", " if isTree(tree['right']):\n", " tree['right'] = prune(tree['right'], rSet)\n", "\n", " # 上面的一系列操作本质上就是将测试数据集按照训练完成的树拆分好,对应的值放到对应的节点\n", "\n", " # 如果左右两边同时都不是dict字典,也就是左右两边都是叶节点,而不是子树了,那么分割测试数据集。\n", " # 1. 如果正确 \n", " # * 那么计算一下总方差 和 该结果集的本身不分枝的总方差比较\n", " # * 如果 合并的总方差 < 不合并的总方差,那么就进行合并\n", " # 注意返回的结果: 如果可以合并,原来的dict就变为了 数值\n", " if not isTree(tree['left']) and not isTree(tree['right']):\n", " lSet, rSet = binSplitDataSet(testData, tree['spInd'], tree['spVal'])\n", " # power(x, y)表示x的y次方;这时tree['left']和tree['right']都是具体数值\n", " errorNoMerge = sum(power(lSet[:, -1] - tree['left'], 2)) + sum(power(rSet[:, -1] - tree['right'], 2))\n", " treeMean = (tree['left'] + tree['right'])/2.0\n", " errorMerge = sum(power(testData[:, -1] - treeMean, 2))\n", " # 如果 合并的总方差 < 不合并的总方差,那么就进行合并\n", " if errorMerge < errorNoMerge:\n", " print (\"merging\")\n", " return treeMean\n", " # 两个return可以简化成一个\n", " else:\n", " return tree\n", " else:\n", " return tree\n" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "merging\n", "merging\n", "merging\n", "merging\n", "merging\n", "merging\n", "merging\n", "merging\n", "merging\n", "merging\n", "merging\n", "merging\n", "merging\n", "merging\n", "merging\n", "merging\n", "merging\n", "merging\n", "merging\n", "merging\n", "merging\n", "merging\n", "merging\n", "merging\n", "merging\n", "merging\n", 
"merging\n", "merging\n", "merging\n", "merging\n", "merging\n", "merging\n", "merging\n", "merging\n", "merging\n", "merging\n", "merging\n", "merging\n", "merging\n", "merging\n", "merging\n", "merging\n", "merging\n", "merging\n" ] }, { "data": { "text/plain": [ "{'left': {'left': {'left': {'left': 92.523991499999994,\n", " 'right': {'left': {'left': {'left': 112.386764,\n", " 'right': 123.559747,\n", " 'spInd': 0,\n", " 'spVal': 0.960398},\n", " 'right': 135.83701300000001,\n", " 'spInd': 0,\n", " 'spVal': 0.958512},\n", " 'right': 111.2013225,\n", " 'spInd': 0,\n", " 'spVal': 0.956951},\n", " 'spInd': 0,\n", " 'spVal': 0.965969},\n", " 'right': {'left': {'left': {'left': {'left': {'left': {'left': {'left': {'left': {'left': {'left': {'left': 96.41885225,\n", " 'right': 69.318648999999994,\n", " 'spInd': 0,\n", " 'spVal': 0.948822},\n", " 'right': {'left': {'left': 110.03503850000001,\n", " 'right': {'left': 65.548417999999998,\n", " 'right': {'left': 115.75399400000001,\n", " 'right': {'left': {'left': 94.396114499999996,\n", " 'right': 85.005351000000005,\n", " 'spInd': 0,\n", " 'spVal': 0.912161},\n", " 'right': {'left': {'left': 106.814667,\n", " 'right': 118.513475,\n", " 'spInd': 0,\n", " 'spVal': 0.908629},\n", " 'right': {'left': 87.300624999999997,\n", " 'right': {'left': {'left': 100.133819,\n", " 'right': 108.09493399999999,\n", " 'spInd': 0,\n", " 'spVal': 0.900699},\n", " 'right': {'left': 82.436685999999995,\n", " 'right': {'left': 98.544549499999988,\n", " 'right': 106.16859550000001,\n", " 'spInd': 0,\n", " 'spVal': 0.872199},\n", " 'spInd': 0,\n", " 'spVal': 0.888426},\n", " 'spInd': 0,\n", " 'spVal': 0.892999},\n", " 'spInd': 0,\n", " 'spVal': 0.901421},\n", " 'spInd': 0,\n", " 'spVal': 0.901444},\n", " 'spInd': 0,\n", " 'spVal': 0.910975},\n", " 'spInd': 0,\n", " 'spVal': 0.925782},\n", " 'spInd': 0,\n", " 'spVal': 0.934853},\n", " 'spInd': 0,\n", " 'spVal': 0.936524},\n", " 'right': {'left': {'left': 89.20993,\n", " 'right': 
76.240983999999997,\n", " 'spInd': 0,\n", " 'spVal': 0.847219},\n", " 'right': 95.893130999999997,\n", " 'spInd': 0,\n", " 'spVal': 0.84294},\n", " 'spInd': 0,\n", " 'spVal': 0.85497},\n", " 'spInd': 0,\n", " 'spVal': 0.944221},\n", " 'right': 60.552307999999996,\n", " 'spInd': 0,\n", " 'spVal': 0.841625},\n", " 'right': 124.87935300000001,\n", " 'spInd': 0,\n", " 'spVal': 0.841547},\n", " 'right': {'left': 76.723834999999994,\n", " 'right': {'left': 59.342323,\n", " 'right': 70.054507999999998,\n", " 'spInd': 0,\n", " 'spVal': 0.819722},\n", " 'spInd': 0,\n", " 'spVal': 0.823848},\n", " 'spInd': 0,\n", " 'spVal': 0.833026},\n", " 'right': {'left': 118.319942,\n", " 'right': {'left': 99.841379000000003,\n", " 'right': 112.981216,\n", " 'spInd': 0,\n", " 'spVal': 0.811363},\n", " 'spInd': 0,\n", " 'spVal': 0.811602},\n", " 'spInd': 0,\n", " 'spVal': 0.815215},\n", " 'right': 73.494399250000001,\n", " 'spInd': 0,\n", " 'spVal': 0.806158},\n", " 'right': {'left': 114.4008695,\n", " 'right': 102.26514075,\n", " 'spInd': 0,\n", " 'spVal': 0.786865},\n", " 'spInd': 0,\n", " 'spVal': 0.790312},\n", " 'right': 64.041940999999994,\n", " 'spInd': 0,\n", " 'spVal': 0.769043},\n", " 'right': 115.199195,\n", " 'spInd': 0,\n", " 'spVal': 0.763328},\n", " 'right': 78.085643250000004,\n", " 'spInd': 0,\n", " 'spVal': 0.759504},\n", " 'spInd': 0,\n", " 'spVal': 0.952833},\n", " 'right': {'left': {'left': {'left': {'left': {'left': {'left': {'left': 110.90282999999999,\n", " 'right': {'left': 103.345308,\n", " 'right': 108.55391899999999,\n", " 'spInd': 0,\n", " 'spVal': 0.710234},\n", " 'spInd': 0,\n", " 'spVal': 0.716211},\n", " 'right': 135.41676699999999,\n", " 'spInd': 0,\n", " 'spVal': 0.70889},\n", " 'right': {'left': {'left': {'left': {'left': 106.18042699999999,\n", " 'right': 105.062147,\n", " 'spInd': 0,\n", " 'spVal': 0.70639},\n", " 'right': 115.58660500000001,\n", " 'spInd': 0,\n", " 'spVal': 0.699873},\n", " 'right': 92.470635999999999,\n", " 'spInd': 0,\n", " 
'spVal': 0.69892},\n", " 'right': {'left': 120.521925,\n", " 'right': {'left': 101.91115275,\n", " 'right': 112.78136649999999,\n", " 'spInd': 0,\n", " 'spVal': 0.666452},\n", " 'spInd': 0,\n", " 'spVal': 0.689099},\n", " 'spInd': 0,\n", " 'spVal': 0.698472},\n", " 'spInd': 0,\n", " 'spVal': 0.706961},\n", " 'right': {'left': 121.98060700000001,\n", " 'right': {'left': 115.687524,\n", " 'right': 112.715799,\n", " 'spInd': 0,\n", " 'spVal': 0.652462},\n", " 'spInd': 0,\n", " 'spVal': 0.661073},\n", " 'spInd': 0,\n", " 'spVal': 0.665329},\n", " 'right': 82.500765999999999,\n", " 'spInd': 0,\n", " 'spVal': 0.642707},\n", " 'right': 140.61394100000001,\n", " 'spInd': 0,\n", " 'spVal': 0.642373},\n", " 'right': {'left': {'left': {'left': {'left': 82.713621000000003,\n", " 'right': {'left': 91.656616999999997,\n", " 'right': 93.645292999999995,\n", " 'spInd': 0,\n", " 'spVal': 0.632691},\n", " 'spInd': 0,\n", " 'spVal': 0.637999},\n", " 'right': {'left': 117.62834599999999,\n", " 'right': 105.970743,\n", " 'spInd': 0,\n", " 'spVal': 0.624827},\n", " 'spInd': 0,\n", " 'spVal': 0.628061},\n", " 'right': 82.04976400000001,\n", " 'spInd': 0,\n", " 'spVal': 0.623909},\n", " 'right': {'left': 168.180746,\n", " 'right': {'left': {'left': {'left': {'left': {'left': {'left': 93.521395999999996,\n", " 'right': {'left': 130.37852899999999,\n", " 'right': {'left': 111.9849935,\n", " 'right': {'left': 82.589327999999995,\n", " 'right': {'left': 114.872056,\n", " 'right': 108.43539199999999,\n", " 'spInd': 0,\n", " 'spVal': 0.569327},\n", " 'spInd': 0,\n", " 'spVal': 0.571214},\n", " 'spInd': 0,\n", " 'spVal': 0.582311},\n", " 'spInd': 0,\n", " 'spVal': 0.589806},\n", " 'spInd': 0,\n", " 'spVal': 0.599142},\n", " 'right': 82.903944999999993,\n", " 'spInd': 0,\n", " 'spVal': 0.560301},\n", " 'right': 129.06244849999999,\n", " 'spInd': 0,\n", " 'spVal': 0.553797},\n", " 'right': {'left': 83.114502000000002,\n", " 'right': {'left': 97.340526499999996,\n", " 'right': 
90.995536000000001,\n", " 'spInd': 0,\n", " 'spVal': 0.537834},\n", " 'spInd': 0,\n", " 'spVal': 0.546601},\n", " 'spInd': 0,\n", " 'spVal': 0.548539},\n", " 'right': {'left': {'left': 129.76674299999999,\n", " 'right': 124.795495,\n", " 'spInd': 0,\n", " 'spVal': 0.531944},\n", " 'right': 116.17616200000001,\n", " 'spInd': 0,\n", " 'spVal': 0.51915},\n", " 'spInd': 0,\n", " 'spVal': 0.533511},\n", " 'right': {'left': 101.075609,\n", " 'right': {'left': 93.292828999999998,\n", " 'right': 96.403373000000002,\n", " 'spInd': 0,\n", " 'spVal': 0.508542},\n", " 'spInd': 0,\n", " 'spVal': 0.508548},\n", " 'spInd': 0,\n", " 'spVal': 0.513332},\n", " 'spInd': 0,\n", " 'spVal': 0.606417},\n", " 'spInd': 0,\n", " 'spVal': 0.613004},\n", " 'spInd': 0,\n", " 'spVal': 0.640515},\n", " 'spInd': 0,\n", " 'spVal': 0.729397},\n", " 'right': {'left': {'left': {'left': {'left': {'left': 8.5367700000000006,\n", " 'right': 27.729263,\n", " 'spInd': 0,\n", " 'spVal': 0.487381},\n", " 'right': 5.224234,\n", " 'spInd': 0,\n", " 'spVal': 0.483803},\n", " 'right': {'left': -9.7129250000000003,\n", " 'right': -23.777531,\n", " 'spInd': 0,\n", " 'spVal': 0.46568},\n", " 'spInd': 0,\n", " 'spVal': 0.467383},\n", " 'right': {'left': 30.051931,\n", " 'right': 17.171057000000001,\n", " 'spInd': 0,\n", " 'spVal': 0.463241},\n", " 'spInd': 0,\n", " 'spVal': 0.465561},\n", " 'right': {'left': -34.044555000000003,\n", " 'right': {'left': {'left': {'left': {'left': {'left': -4.1911744999999998,\n", " 'right': {'left': {'left': {'left': {'left': 19.745224,\n", " 'right': 15.224266,\n", " 'spInd': 0,\n", " 'spVal': 0.428582},\n", " 'right': -21.594268,\n", " 'spInd': 0,\n", " 'spVal': 0.426711},\n", " 'right': 44.161493,\n", " 'spInd': 0,\n", " 'spVal': 0.418943},\n", " 'right': {'left': -26.419288999999999,\n", " 'right': 0.63593000000000011,\n", " 'spInd': 0,\n", " 'spVal': 0.403228},\n", " 'spInd': 0,\n", " 'spVal': 0.412516},\n", " 'spInd': 0,\n", " 'spVal': 0.437652},\n", " 'right': 23.197474,\n", 
" 'spInd': 0,\n", " 'spVal': 0.388789},\n", " 'right': {'left': {'left': {'left': -29.007783,\n", " 'right': {'left': {'left': 13.583555,\n", " 'right': 5.2411960000000004,\n", " 'spInd': 0,\n", " 'spVal': 0.377383},\n", " 'right': -8.2282969999999995,\n", " 'spInd': 0,\n", " 'spVal': 0.373501},\n", " 'spInd': 0,\n", " 'spVal': 0.378965},\n", " 'right': {'left': -32.124495000000003,\n", " 'right': {'left': -9.9938275000000001,\n", " 'right': -26.851234812500003,\n", " 'spInd': 0,\n", " 'spVal': 0.350725},\n", " 'spInd': 0,\n", " 'spVal': 0.35679},\n", " 'spInd': 0,\n", " 'spVal': 0.370042},\n", " 'right': {'left': 22.286959625000001,\n", " 'right': {'left': {'left': -20.397333499999998,\n", " 'right': -49.939515999999998,\n", " 'spInd': 0,\n", " 'spVal': 0.310956},\n", " 'right': {'left': {'left': {'left': {'left': {'left': {'left': {'left': {'left': {'left': {'left': 8.8147249999999993,\n", " 'right': {'left': -18.051317999999998,\n", " 'right': {'left': -1.7983769999999999,\n", " 'right': {'left': -14.988279,\n", " 'right': -14.391613,\n", " 'spInd': 0,\n", " 'spVal': 0.290749},\n", " 'spInd': 0,\n", " 'spVal': 0.295993},\n", " 'spInd': 0,\n", " 'spVal': 0.297107},\n", " 'spInd': 0,\n", " 'spVal': 0.300318},\n", " 'right': {'left': 35.623745999999997,\n", " 'right': {'left': -9.4575560000000003,\n", " 'right': {'left': 5.2805790000000004,\n", " 'right': 2.5579230000000002,\n", " 'spInd': 0,\n", " 'spVal': 0.264639},\n", " 'spInd': 0,\n", " 'spVal': 0.264926},\n", " 'spInd': 0,\n", " 'spVal': 0.273863},\n", " 'spInd': 0,\n", " 'spVal': 0.284794},\n", " 'right': {'left': {'left': -9.601409499999999,\n", " 'right': -30.812912000000001,\n", " 'spInd': 0,\n", " 'spVal': 0.228751},\n", " 'right': -2.266273,\n", " 'spInd': 0,\n", " 'spVal': 0.228628},\n", " 'spInd': 0,\n", " 'spVal': 0.25807},\n", " 'right': 6.0992389999999999,\n", " 'spInd': 0,\n", " 'spVal': 0.228473},\n", " 'right': {'left': -16.427370249999999,\n", " 'right': -2.6781804999999999,\n", " 'spInd': 
0,\n", " 'spVal': 0.202161},\n", " 'spInd': 0,\n", " 'spVal': 0.211633},\n", " 'right': 9.5773855000000001,\n", " 'spInd': 0,\n", " 'spVal': 0.193282},\n", " 'right': {'left': {'left': {'left': -14.740059,\n", " 'right': -6.5125060000000001,\n", " 'spInd': 0,\n", " 'spVal': 0.166431},\n", " 'right': -27.405211000000001,\n", " 'spInd': 0,\n", " 'spVal': 0.164134},\n", " 'right': 0.225886,\n", " 'spInd': 0,\n", " 'spVal': 0.156273},\n", " 'spInd': 0,\n", " 'spVal': 0.166765},\n", " 'right': {'left': 7.5573490000000003,\n", " 'right': 7.3367839999999998,\n", " 'spInd': 0,\n", " 'spVal': 0.13988},\n", " 'spInd': 0,\n", " 'spVal': 0.156067},\n", " 'right': -29.087463,\n", " 'spInd': 0,\n", " 'spVal': 0.138619},\n", " 'right': 22.478290999999999,\n", " 'spInd': 0,\n", " 'spVal': 0.131833},\n", " 'spInd': 0,\n", " 'spVal': 0.309133},\n", " 'spInd': 0,\n", " 'spVal': 0.324274},\n", " 'spInd': 0,\n", " 'spVal': 0.335182},\n", " 'spInd': 0,\n", " 'spVal': 0.382037},\n", " 'right': -39.524461000000002,\n", " 'spInd': 0,\n", " 'spVal': 0.130626},\n", " 'right': {'left': 22.891674999999999,\n", " 'right': {'left': {'left': 6.1965159999999999,\n", " 'right': {'left': -16.106164,\n", " 'right': {'left': -1.2931950000000001,\n", " 'right': -10.137104000000001,\n", " 'spInd': 0,\n", " 'spVal': 0.085873},\n", " 'spInd': 0,\n", " 'spVal': 0.10796},\n", " 'spInd': 0,\n", " 'spVal': 0.108801},\n", " 'right': {'left': 37.820658999999999,\n", " 'right': {'left': -24.132225999999999,\n", " 'right': {'left': 15.824970500000001,\n", " 'right': {'left': -15.160836,\n", " 'right': {'left': {'left': {'left': 6.6955669999999996,\n", " 'right': -3.131497,\n", " 'spInd': 0,\n", " 'spVal': 0.055862},\n", " 'right': -13.731698,\n", " 'spInd': 0,\n", " 'spVal': 0.053764},\n", " 'right': 4.0916259999999998,\n", " 'spInd': 0,\n", " 'spVal': 0.044737},\n", " 'spInd': 0,\n", " 'spVal': 0.061219},\n", " 'spInd': 0,\n", " 'spVal': 0.068373},\n", " 'spInd': 0,\n", " 'spVal': 0.080061},\n", " 'spInd': 
0,\n", " 'spVal': 0.084661},\n", " 'spInd': 0,\n", " 'spVal': 0.085111},\n", " 'spInd': 0,\n", " 'spVal': 0.124723},\n", " 'spInd': 0,\n", " 'spVal': 0.126833},\n", " 'spInd': 0,\n", " 'spVal': 0.455761},\n", " 'spInd': 0,\n", " 'spVal': 0.457563},\n", " 'spInd': 0,\n", " 'spVal': 0.499171}" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myTree = createTree(myMat2, ops=(0,1))\n", "myDataTest = loadDataSet('ex2test.txt')\n", "myMat2Test = mat(myDataTest)\n", "prune(myTree, myMat2Test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 4. Model trees" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "collapsed": true }, "outputs": [], "source": [ "\n", "# helper function used in two places\n", "def linearSolve(dataSet):\n", "    \"\"\"\n", "    Desc:\n", "        Format the data set into the target variable Y and the independent variables X, then run a simple linear regression to obtain ws\n", "    Args:\n", "        dataSet -- input data\n", "    Returns:\n", "        ws -- regression coefficients of the linear fit\n", "        X -- formatted independent variables X\n", "        Y -- formatted target variable Y\n", "    \"\"\"\n", "    m, n = shape(dataSet)\n", "    # Allocate matrices filled with ones\n", "    X = mat(ones((m, n)))\n", "    Y = mat(ones((m, 1)))\n", "    # Column 0 of X stays 1: it is the constant term that yields the intercept\n", "    X[:, 1: n] = dataSet[:, 0: n-1]\n", "    Y = dataSet[:, -1]\n", "\n", "    # X transposed times X\n", "    xTx = X.T * X\n", "    # If xTx has no inverse the solve below would fail, so raise early\n", "    if linalg.det(xTx) == 0.0:\n", "        raise NameError('This matrix is singular, cannot do inverse,\\ntry increasing the second value of ops')\n", "    # Ordinary least squares solution: w0*1 + w1*x1 = y\n", "    ws = xTx.I * (X.T * Y)\n", "    return ws, X, Y" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Get the model coefficients ws: f(x) = w0 + w1*feature1 + w2*feature2 ...\n", "# create linear model and return coefficients\n", "def modelLeaf(dataSet):\n", "    \"\"\"\n", "    Desc:\n", "        Build the leaf model once the data no longer needs to be split.\n", "    Args:\n", "        dataSet -- input data set\n", "    Returns:\n", "        The regression coefficients ws obtained by calling linearSolve\n", "    \"\"\"\n", "    ws, X, Y = linearSolve(dataSet)\n", "    return ws\n" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { 
"collapsed": true }, "outputs": [], "source": [ "# 计算线性模型的误差值\n", "def modelErr(dataSet):\n", " \"\"\"\n", " Desc:\n", " 在给定数据集上计算误差。\n", " Args:\n", " dataSet -- 输入数据集\n", " Returns:\n", " 调用 linearSolve 函数,返回 yHat 和 Y 之间的平方误差。\n", " \"\"\"\n", " ws, X, Y = linearSolve(dataSet)\n", " yHat = X * ws\n", " # print corrcoef(yHat, Y, rowvar=0)\n", " return sum(power(Y - yHat, 2))" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'left': matrix([[ 1.69855694e-03],\n", " [ 1.19647739e+01]]), 'right': matrix([[ 3.46877936],\n", " [ 1.18521743]]), 'spInd': 0, 'spVal': 0.285477}" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myMat2 = mat(loadDataSet('exp2.txt'))\n", "createTree(myMat2, modelLeaf, modelErr, (1,10))" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXEAAAD6CAYAAABXh3cLAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAFwtJREFUeJzt3V9sXGeZx/Hf42lMM0mp6NgSqKjjRdUKVIm9qJFY2iJR\n/kiEwoqrXdXJhuYiSi0ki5VAqnLti13tjSVI0ly0hPggbhcQqAsFiQW1C85FueCPWMCueoMcR6It\nLk3rPHvx+pDJ5MycPz5nZs6Z70eyMrHPxO8h4ee3z3ne9zV3FwCgnmbGPQAAQHGEOADUGCEOADVG\niANAjRHiAFBjhDgA1BghDgA1RogDQI0R4gBQY3dU/Q3m5uZ8YWGh6m8DAI1y5cqVq+4+n3Zd5SG+\nsLCgjY2Nqr8NADSKmW1luY5yCgDUGCEOADVGiANAjRHiAFBjmULczA6Z2Xf2X5uZXTKzF83s22ZW\n+cNRAECy1BA3s8OSrkj65P6nHpJ0h7t/WNI7JX2quuEBQA1EkbSwIM3MhF+jaGTfOnUW7e5vSPqg\nmf3f/qf+JGlt//X1qgYGALWwvCxduCDFp6RtbUmnT4fXS0uVf/vcNXF3/527/9zMPi9pVtJz/deY\n2Wkz2zCzje3t7TLGCQCTJ4puDfDY7q508uRIZuaFHmya2eckrUj6rLvv9X/d3S+6+6K7L87Ppy44\nAoB6Onv29gCP7e2Fr8Uz84qCPHeIm9m7JX1Z0mfc/bXyhwQANfHyy9mu290NgV+BIjPxk5LeI+k5\nM/upmZ0qeUwAUA/33JP92qyBn1PmEHf3+/d//Xd3v9/dH97/eKaSkQHApIoiaW5O2tnJ/p777qtk\nKPR4A0AeURRq3Lu7+d63ulrJcFixCQB5nD2bP8A7ncraDQlxAMgjb2273ZbW1tKvK4gQB4As4lWZ\ng1oKk7Ra0sWLlS76oSYOAGmK1MHb7coDXGImDgDphtXBO53wIYWZtyR1uyMJcImZOACkG1QHN5Ou\nXh3tWPowEweANIN6vCvq/c6DEAeANKurocbdq92urPc7D0IcAKRb9wS
fmwsf8S6EUqhxd7uhhDLC\nmncaauIA0N990rucPt6F8OJFaXNzLMMbhpk4AKStwqxwF8KDIsQBIMsqzIp2ITwoQhwAsnSZTEAn\nShJCHACSuk96mU1EJ0oSQhxAs2U5iX5pKTy4jFde9jKTzpyZiE6UJIQ4gOaKu062tm6ed3niRAjm\n/kBfWgqrL9fXb20lvHxZOndubLeQxjzPjlwFLC4u+sbGRqXfAwASLSyE4B5kRJtUFWFmV9x9Me06\nZuIAmiuto2SCWwezIsQBNEtvDXwmQ8RNaOtgVqzYBNAc/Ssv9/bS35PnxPoJxEwcQHMUOf+y5ghx\nAM1RpDRy7Vr54xghQhxAveWtgfeb0JWYWVETB1BfRWrgvSZkT/CDYCYOoL4OUgPvdCa2RzwPZuIA\n6mtYDXxmRrpx4/bPt1rSpUu1D+8YM3EA9TWsnn3HHdKhQ7d+rt1uVIBLhDiAukjayOrYscHXX78u\nzc5O5JFqZaKcAmDy9T/A3NqSTp2S3n57+Pv+8hfp6acbF9y9mIkDmHxJDzCvX0+ueSe9t8EIcQCT\nb9hOhGlqvjdKGkIcwGSLolDTLqrmi3nSEOIAJtvZs+FAhyJmZ2u/mCcNIQ5gsvR3oaSVUgYtte90\npGeeafRDTYnuFACTJKkLxWz4TNy9+Ey9AZiJA5gcSV0o7sNr4g2veachxAFMjkGdJO7JJ9E3YAOr\ng8oU4mZ2yMy+s//6TjP7rpm9ZGaXzQ7y2BjAVOuvfw86ZafbTT6JvoErMPNKrYmb2WFJ/yvp7/c/\ndVzSK+7+mJl9V9InJf13dUME0EhRFFZdXr8efj/oAWbvbHtpaepDu1/qTNzd33D3D0p6Zf9Tj0r6\nwf7rH0n6WEVjA9BkKys3A3wQZtupitTEO5L+vP/6VUm3/fePmZ02sw0z29je3j7I+AA01c7O8K+b\nhRk4AT5UkRC/Kunu/dd37//+Fu5+0d0X3X1xfn7+IOMDMK3cG7/vSRmKhPjzkj61//pRST8ubzgA\npkZSt0m/hu97UoYiIR5JutfMfinpmkKoA8BgSXuBr62l74ky5T3gWWResenu9+//+qakxyobEYBm\nSVqFeeKE9Oij4fSdt95Kfl9cE8dQLLsHUK2VleRVmM8P+Y94M+nMGR5qZkCIA6hOFKV3oSS5fJkA\nz4hl9wCqU6S7pNslwHMgxAGUJ+82sv3YCyU3QhxAOaJIeuKJENzu2QOcvVAOhJo4gHKsrAzuNBmk\n25U2NysZzrRgJg6gHHkfYFI6KQUhDmD0KJ2UhhAHkF/SCsyjR7O91yyUUAjwUlATB5BP0grM06el\nvb1s72cpfakIcQD5JJ2D2f/7QaiDl45yCoDsoih/73eMOnglmIkDyCYuo+T15JPSuXPljweSmIkD\nyCqpjJKm0yHAK0aIA8gm7YCG/r3B2+2wZzgqRYgDyGZYV0m7HbaOZQn9yFETBzDcJz4xfO/vTifM\nuAnssWAmDuB28WIes8EB3u1K6+vS1asE+BgxEwdwUxSFjayKHOSAsSDEAQT9KzHTxCs1JWbiY0Q5\nBUBQpIVwd7fY6T0oDSEOIEhrISz7fSgFIQ4gKLoxFRtajRUhDky7uBNla+v2BTu9Op3QD96LDa3G\njhAHpln8MDPe1Mo9+bp49eXFiyzomTB0pwDTbNDDzE4nHPLw8suhXLK6ejOsCe2JQogD02rYtrI7\nO2ERDyYe5RRg2kSRNDcnHT8++BqzcB0mHiEOTIPeZfTHj6evyHSn/7smKKcATZd3JWaM/u9aYCYO\nNF2RlZgS/d81QYgDTVfkTEz6v2uDEAeaLIqGL+BJ0unQ/10jhDjQZGfPDl7AEzty5ObiHfYHrx0e\nbAJNlqWU8vrr1Y8DlWEmDjRZqzX8693uaMaByhDiQFPEveAzM+HXKJL29gZfz8PLRiDEgbrpD+vl\n5ZsrMLe2Qg08PnWn00n+M1otHl4
2RKGauJkdkfRNSXOSfubuXyl1VACS9S/c2dqSzp9PvnZ3Vzp8\nOMy4e/vE220CvEGKzsSXJL3o7g9JesDMPlDimAAMknfhzrVrbB/bcEW7U96U1DYzk3SnpOvlDQnA\nQHmXwt93XwhsQruxis7Evynp05J+Lek37v773i+a2Wkz2zCzje3t7YOOEUAsz1J4HlxOhaIh/pSk\nC+7+fkn3mNlHer/o7hfdfdHdF+fn5w88SAD7jh3Ldh2rLqdG0XLKXZL+uv/6TUlHyxkOgERRJK2s\npG8hK4UA50CHqVF0Jv41SU+a2QuSDkt6vrwhAbhF3JGSJcDjszAxNQqFuLtvuvtD7v6P7v7P7j5k\nRQGAVEkLdWJZO1Lo/Z5KLPYBxq33xPl4oc4TT4QFPGbZ9j8xky5dIsCnECEOjFvSTPutt7KVT2Jn\nzhDgU4oQB8btIMegdTph+9hz58obD2qFEAfGrcgxaOvrofTC3t9TjxAHxm11NXSVZNXtEtz4G0Ic\nGLelpZv7m6RhFSb6EOLAJFhakjY3hwc5m1chASEOTJKk0kq7HWrgm5sEOG5DiAPjkHSww8KCdOJE\n2AO802HrWGTCQcnAqKUd7LCzE2bfly8T3kjFTBwYpSiSTp5MX0a/uxsWAQEpCHFgVJaXQ7lk2OHF\nvQ6yCAhTg3IKMApRNPgszEGKLALC1GEmDlQpfoB5/Hi+99EPjowIcaBMvV0nc3NhN8IsuxBKoRtF\noiMFuVBOAcrS33WSZxfCbjfMvAlu5ESIA2XJenhDr6NHpddeq2Y8mAqUU4Cy5O0mmZ2VLlyoZiyY\nGoQ4UJY83STdrvTMM5RPcGCEOFCW1dWbDyeHabXYBwWlIcSBMrmnX5N1sQ+QASEO5JV0Mn3cmZJF\nln3DgYzoTgHySNq86vTpsPNgls6UQ4dYxINSMRMH8khqI9zdzdYT3ulIzz5LLRylYiYO5JG3jbDb\nDQ8xgYowEwfyGNRG2Okkn8hD6QQVI8SBLOKHmVtbt7cRttvS2trNw445kQcjRDkFSNP/MLO3jbB/\nzxNCGyPGTBxIM2hPFDM2rcLYEeLAMFE0eCtZd45Qw9gR4sAgURT2Ax+GI9QwZoQ4MMjKivTWW8Ov\n4Qg1jBkhDvSLonAqT9oCHloIMQHoTgFiURRm31lWX7ZatBBiIjATB+KZ9/Hj2QJ8dla6dIkAx0Rg\nJo7ptLwcZtJ5t4WdmeEwB0wUZuKYPsvL0vnz+QO83Za+8Q0CHBOFEEfz9e//XeRcy6NHqYFjIhUu\np5jZVyR9VtLrkv7J3a+XNiqgLEn7fxfR6RDgmEiFZuJm9j5JD7j7I5K+L+m9pY4KKMugJfN5sagH\nE6poOeXjkt5lZj+R9IikP5Y3JKBEZYUvi3owoYqG+LykbXf/qMIs/OHeL5rZaTPbMLON7e3tg44R\nyKe3Bj5T4J940lazLOrBhCoa4q9K+u3+6z9Iurf3i+5+0d0X3X1xfn7+IOMDsuvt997aChtUJXWg\nzM6Gsy6TtNvSmTPsC47aKPpg84qkf9t/fb9CkAPj0/8As1+rJd24Ecoi8az67NkQ9q1WCPv+vcGB\nGigU4u7+gpldNbNfSPq1u/+85HEB+aQ9wLxxI3z0IqzRAIX7xN39SXf/kLv/a5kDAgpJe4A5MxNm\n60DDsNgHzZDWPbK3F8otBDkahhBHM6yu3n7afL/dXU7iQeMQ4miGpSXp5MnwkHIYFu2gYQhxNEMU\nhe1h0za1YtEOGoYQR331Luo5eTJ9eT2LdtBA7CeOeooPMY7PwBw2Aze72R9OWyEahhBHPWU5xFgK\nC3g2NysfDjAulFNQT1mOUaN8gilAiKN52PMEU4RyCuqp00mejXc60tWrox8PMCbMxFEvcUdKUoDP\nzkprayMfEjBOhDjqI96psPeItXjv726XU+gxlSinoD6Sdip0pwMFU42ZOOpj0JJ5ltJjihHimGxZ\
njlpjKT2mGCGOydIb2nNzYVXmsKPW6AXHlCPEMT69gb2wIC0vS6dO3QztnZ3kVZmtFr3gwD4ebGI8\n+s/E3NqSzp/P9t6ko9aAKcVMHOORdibmMNTAgb8hxDEa/aWT3l7vPKiBA7egnILqJZVO8mI7WSAR\nIY7qHaR0IrEfCjAE5RRU7yCLcQ4dYj8UYAhCHNUr+iCy05GefZbyCTAEIY7qHTuW7bre/u/19VBC\nIcCBoaiJo3rf+162606fls6dq3YsQMMwE0f1stbEs4Y9gL8hxFG9rDVxdiMEciPEUZ14gU/WvnBW\nYgK5URNH+aJIWlnJdiJ9jJWYQCHMxFGOeNZtJp04kS3A2Y0QODBm4ji4/mX17tnex26EwIExE8fB\nrawUW1ZPDRw4MEIcxfSWT/LUvmPUwIFSUE5BflEUjk1LOnVnkKNHpXe8Q7p2jd0IgRIR4shvZSV7\ngHe7BDZQIUIc+WUpn7B9LDAShWviZvYlM/thmYNBQ7TbbB8LjEihEDezrqQvlDuUKRNF0txceDBo\nFl5HUTXfp/dYtIN+j7T3t1r0fAMjVHQmvibpqTIHMlXiB4O9ZYmdHenUqXKDPO7f3toKvdtbW+H3\nRb9H/OcNc+MGAQ6MUO4QN7PHJb0k6VflD2dKnD2b/GDw+nXp5MnygjzpWLTd3fD5vKIojC2tH5ze\nb2CkiszEH5P0cUnfkvSgmX2x/wIzO21mG2a2sb29fdAxNs+w3fr29g42W87yffIeVBzPwPf2hl9n\nRu83MGK5Q9zdH3f3hyX9i6Qr7v7VhGsuuvuiuy/Oz8+XMc7miKJQnx5m2Gw5T4172Kw4Tx0+y0HH\nZtKZM5RSgBFjxeZB5QnVKAp177QZrZQ8i06qcZ84cTOQ5+ZuHcfqavjaIEl1+KT7Sdvnu9ORLl/m\nVB5gDMyzblZU0OLiom9sbFT6Pcamf+MnKbTXDerOmJvLvkS91QoPCe+5J/z+2rUQrFl+AOTV7Uqb\nm9LysnThwq0bWLXb0uHDw8dd8b8hYBqZ2RV3X0y7bjpm4mW32cUGPTg8eTLMgO+4I/waf888e4zs\n7YVw3NkJH+7VBLgUZvR33SWdP397IMf3N2hG3+1WMyYAmTQ/xJeXQ8lhWJtdlpDPU2aIwzb+Nf6e\nWQwrf1Tp9dcHf+3atVDv7h8bm1gB4+fulX48+OCDntv6unu3624Wfl1fL/beTsc9RPftH61WuHZ9\n3b3dvvVr7fat3zPpGjP3I0cG//lN+uh2D/73AiAXSRueIWMnL8STArP3o9MZHB5p7+3/aLcHB30c\nUsN+EEjuMzPjD9kqP8wIa2AMsob45D3YzHOwbv8OeXnem8XsbFiAM2rtdrFDFopotYbX2p98kq4T\nYAzq+2AzrZ2tV399O89708zMjCfAjxwJ3S3dbqhBdzqD+8pbrRCy7fatn48fqPZeM8jeXvhhlYQA\nBybe5IV43mXbu7thf+uFhVAAKMs4zn5staSnnw7/ZbG5GcZw9erg+7pxI4Rsb+h3u9LXvx6+5i69\n/Xa4ZlAXSbcrPfNM+GER63Sk9XUCHKiByQvx1dXbZ5ZpdnbKLaNkldZJ0moN/3qnc2v4XrqU3F8+\n6Adb/Pne0N/cTP4zkv53jbtLlpZu/rBwD69ZeQnUwuSF+NJSmFn2zgwnUbcb2u4G/cBpt0OpZ9jX\n19bSw1caHsBZxf+79v7QYMtYoPYmL8Rjb7wxuu81aEbd6SSXGdxD6PaWMqSbM+84INO+njVAywrg\nLDN2ALUyed0pUvYuk06n2EnrUghD9xCIx46FUkbW5fMAULH6dqdIw7tM4pno+nqo3Q4quxw5cnsJ\nIp5xd7thw6akGTWlBgA1MpkHJd93X/JMPN6oqdfaWjglp/eQhUOHQpeHFPY3efnl8GcOO3V9aYnQ\nBlA7kxniq6vJuwMmPciLg3dQWBPMABpsMkM8LZiTriesAUyhy
QxxiWAGgAwm88EmACATQhwAaowQ\nB4AaI8QBoMYIcQCoscqX3ZvZtqQiWwzOSbpa8nDqYBrvm3ueDtxzPl13n0+7qPIQL8rMNrLsG9A0\n03jf3PN04J6rQTkFAGqMEAeAGpvkEL847gGMyTTeN/c8HbjnCkxsTRwAkG6SZ+IAgBRjDXEzu9PM\nvmtmL5nZZbPbz0nLck2dZLxnM7NLZvaimX3bzCZ3o7IM8vwdmtmXzOyHoxxfVbLet5l9xcz+x8y+\nb2azox5nmTL++z5iZv9lZj8zs/8YxzirYGaHzOw7Q75eSZaNeyZ+XNIr7v4Pkt4l6ZMFr6mTLPfz\nkKQ73P3Dkt4p6VMjHF8VMv0dmllX0hdGOK6qpd63mb1P0gPu/oik70t672iHWLosf9dLkl5094ck\nPWBmHxjlAKtgZoclXdHwfKoky8Yd4o9K+sH+6x9J+ljBa+oky/38SdLa/uvroxhUxbL+Ha5Jemok\nIxqNLPf9cUnvMrOfSHpE0h9HNLaqZLnnNyW192eid6oB/8bd/Q13/6CkV4ZcVkmWjTvEO5L+vP/6\nVUn3FLymTlLvx91/5+4/N7PPS5qV9NwIx1eF1Hs2s8clvSTpVyMcV9Wy/Nudl7Tt7h9VmIU/PKKx\nVSXLPX9T0qcl/VrSb9z99yMa27hVkmXjDvGrku7ef323kpenZrmmTjLdj5l9TtKKpM+6+96IxlaV\nLPf8mMKs9FuSHjSzL45obFXKct+vSvrt/us/SLp3BOOqUpZ7fkrSBXd/v6R7zOwjoxrcmFWSZeMO\n8ed1s977qKQfF7ymTlLvx8zeLenLkj7j7q+NcGxVSb1nd3/c3R+W9C+Srrj7V0c4vqpk+bd7RdKH\n9l/frxDkdZblnu+S9Nf9129KOjqCcU2CSrJs3CEeSbrXzH4p6Zqk35vZf6Zc8/yIx1i2LPd8UtJ7\nJD1nZj81s1OjHmTJstxzE6Xet7u/IOmqmf1C0m/d/edjGGeZsvxdf03Sk2b2gqTDqv//p29jZn83\nqixjsQ8A1Ni4Z+IAgAMgxAGgxghxAKgxQhwAaowQB4AaI8QBoMYIcQCosf8HSnqTRAzVHjoAAAAA\nSUVORK5CYII=\n", "text/plain": [ "<matplotlib.figure.Figure at 0x109adbef0>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.plot(myMat2[:,0],myMat2[:,1],'ro') \n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 案例 树回归和标准回归的比较" ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# 回归树测试案例\n", "# 为了和 modelTreeEval() 保持一致,保留两个输入参数\n", "def regTreeEval(model, inDat):\n", " \"\"\"\n", " Desc:\n", " 对 回归树 进行预测\n", " Args:\n", " model -- 指定模型,可选值为 回归树模型 或者 模型树模型,这里为回归树\n", " inDat -- 输入的测试数据\n", " Returns:\n", " float(model) -- 将输入的模型数据转换为 浮点数 返回\n", " \"\"\"\n", " return float(model)\n" ] }, { "cell_type": "code", "execution_count": 72, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# 模型树测试案例\n", "# 对输入数据进行格式化处理,在原数据矩阵上增加第0列,元素的值都是1,\n", "# 
i.e. add a bias term, the same trick as in the earlier simple linear regression\n", "def modelTreeEval(model, inDat):\n", "    \"\"\"\n", "    Desc:\n", "        Predict with a model tree\n", "    Args:\n", "        model -- input model, either a regression-tree leaf or a model-tree leaf; here it is a model-tree leaf, i.e. the regression coefficients\n", "        inDat -- input test data\n", "    Returns:\n", "        float(X * model) -- the test data multiplied by the regression coefficients, returned as a float\n", "    \"\"\"\n", "    n = shape(inDat)[1]\n", "    X = mat(ones((1, n+1)))\n", "    X[:, 1: n+1] = inDat\n", "    # print X, model\n", "    return float(X * model)\n" ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Compute a prediction\n", "# Given a tree, this function returns a prediction for a single data point.\n", "# modelEval is a reference to the function that predicts at a leaf; it encodes the tree type so the right model is called at the leaves.\n", "# The function walks the tree from the top until it hits a leaf; once there, it calls\n", "# modelEval() on the input data. modelEval defaults to regTreeEval().\n", "def treeForeCast(tree, inData, modelEval=regTreeEval):\n", "    \"\"\"\n", "    Desc:\n", "        Predict with a tree of the given type, either a regression tree or a model tree\n", "    Args:\n", "        tree -- the trained tree\n", "        inData -- input test data, a single row\n", "        modelEval -- leaf evaluation function, either regTreeEval (regression tree) or modelTreeEval (model tree); defaults to the regression tree\n", "    Returns:\n", "        The predicted value\n", "    \"\"\"\n", "    if not isTree(tree):\n", "        return modelEval(tree, inData)\n", "    # The book uses inData[tree['spInd']], which only works when inData has a single column and raises otherwise\n", "    # binSplitDataSet returns the rows with feature > value first, so the left subtree holds the larger values\n", "    if inData[0, tree['spInd']] > tree['spVal']:\n", "        # This if-else could be dropped, keeping only the if branch\n", "        if isTree(tree['left']):\n", "            return treeForeCast(tree['left'], inData, modelEval)\n", "        else:\n", "            return modelEval(tree['left'], inData)\n", "    else:\n", "        # Same as above: this if-else could be dropped, keeping only the if branch\n", "        if isTree(tree['right']):\n", "            return treeForeCast(tree['right'], inData, modelEval)\n", "        else:\n", "            return modelEval(tree['right'], inData)" ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Predict for a whole test set\n", "def createForeCast(tree, testData, modelEval=regTreeEval):\n", "    \"\"\"\n", "    Desc:\n", "        Call treeForeCast on every row to predict with a tree of the given type, either a regression tree or a model tree\n", "    Args:\n", "        tree -- the trained tree\n", "        testData -- input test data\n", "        modelEval -- leaf evaluation function, either regTreeEval (regression tree) or 
modelTreeEval (model tree); defaults to the regression tree\n", "    Returns:\n", "        A matrix of predicted values\n", "    \"\"\"\n", "    m = len(testData)\n", "    yHat = mat(zeros((m, 1)))\n", "    # print yHat\n", "    for i in range(m):\n", "        yHat[i, 0] = treeForeCast(tree, mat(testData[i]), modelEval)\n", "        # print \"yHat==>\", yHat[i, 0]\n", "    return yHat" ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Regression tree vs. model tree vs. linear regression\n", "trainMat = mat(loadDataSet('bikeSpeedVsIq_train.txt'))\n", "testMat = mat(loadDataSet('bikeSpeedVsIq_test.txt'))" ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'spInd': 0, 'spVal': 10.0, 'left': {'spInd': 0, 'spVal': 17.0, 'left': {'spInd': 0, 'spVal': 20.0, 'left': 168.34161286956524, 'right': 157.04840788461539}, 'right': {'spInd': 0, 'spVal': 14.0, 'left': 141.06067981481482, 'right': 122.90893026923078}}, 'right': {'spInd': 0, 'spVal': 7.0, 'left': 94.706657812499998, 'right': {'spInd': 0, 'spVal': 5.0, 'left': 69.02117757692308, 'right': 50.946836650000002}}}\n", "--------------\n", "\n", "regTree: -0.877546145437\n" ] } ], "source": [ "# Regression tree\n", "myTree1 = createTree(trainMat, ops=(1, 20))\n", "print (myTree1)\n", "yHat1 = createForeCast(myTree1, testMat[:, 0])\n", "print (\"--------------\\n\")\n", "# print yHat1\n", "# print \"ssss==>\", testMat[:, 1]\n", "# corrcoef returns the Pearson product-moment correlation coefficients\n", "print (\"regTree:\", corrcoef(yHat1, testMat[:, 1],rowvar=0)[0, 1])" ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'spInd': 0, 'spVal': 4.0, 'left': {'spInd': 0, 'spVal': 12.0, 'left': {'spInd': 0, 'spVal': 16.0, 'left': {'spInd': 0, 'spVal': 20.0, 'left': matrix([[ 47.58621512],\n", " [ 5.51066299]]), 'right': matrix([[ 37.54851927],\n", " [ 6.23298637]])}, 'right': matrix([[ 43.41251481],\n", " [ 6.37966738]])}, 'right': {'spInd': 0, 'spVal': 9.0, 'left': matrix([[ 
-2.87684083],\n", " [ 10.20804482]]), 'right': {'spInd': 0, 'spVal': 6.0, 'left': matrix([[-11.84548851],\n", " [ 12.12382261]]), 'right': matrix([[-17.21714265],\n", " [ 13.72153115]])}}}, 'right': matrix([[ 68.87014372],\n", " [-11.78556471]])}\n", "--------------\n", "\n", "modelTree: -0.962132224496\n" ] } ], "source": [ "# Model tree\n", "myTree2 = createTree(trainMat, modelLeaf, modelErr, ops=(1, 20))\n", "yHat2 = createForeCast(myTree2, testMat[:, 0], modelTreeEval)\n", "print (myTree2)\n", "print (\"--------------\\n\")\n", "print (\"modelTree:\", corrcoef(yHat2, testMat[:, 1],rowvar=0)[0, 1])" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[ 37.58916794]\n", " [ 6.18978355]]\n", "--------------\n", "\n", "lr: 0.943468423567\n" ] } ], "source": [ "# Linear regression\n", "ws, X, Y = linearSolve(trainMat)\n", "print (ws)\n", "m = len(testMat[:, 0])\n", "yHat3 = mat(zeros((m, 1)))\n", "for i in range(shape(testMat)[0]):\n", "    yHat3[i] = testMat[i, 0]*ws[1, 0] + ws[0, 0]\n", "print (\"--------------\\n\")\n", "print (\"lr:\", corrcoef(yHat3, testMat[:, 1],rowvar=0)[0, 1])" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 2 }