{ "cells": [ { "cell_type": "markdown", "id": "dcc9ef1a", "metadata": {}, "source": [ "**实验六:决策树**" ] }, { "cell_type": "markdown", "id": "ba67a4f6", "metadata": {}, "source": [ "**第一部分:函数介绍**" ] }, { "cell_type": "code", "execution_count": 1, "id": "6ac93628", "metadata": {}, "outputs": [], "source": [ "from collections import Counter\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import math" ] }, { "cell_type": "markdown", "id": "9cfffb4e", "metadata": {}, "source": [ "1)Counter类的使用" ] }, { "cell_type": "code", "execution_count": 2, "id": "85c3787c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Counter({np.int64(133): 2, np.int64(132): 1})\n", "0\n", "2\n", "dict_values([2, 1])\n", "[np.int64(133), np.int64(132)]\n" ] } ], "source": [ "x=np.array([[0,133,1],[0,132,0],[0,133,0]])\n", "#使用Counter类对数组第2列进行遍历\n", "counter = Counter(x[:,1])\n", "#第2列中有1个132和2个133,输出该counter对象可以统计这列的数值情况,便于之后的统计\n", "print(counter)\n", "#因为第2列中没有为0的值,所以返回0\n", "print(counter[0])\n", "#因为第2列中有2个133,所以返回2\n", "print(counter[133])\n", "#一般的字典操作方法都能在该类中使用,例如可以通过values函数返回该列的非重复值的个数,方便对某列的非重复值的个数进行查看\n", "print(counter.values())\n", "#可以输出所有非重复值\n", "print(list(counter))" ] }, { "cell_type": "markdown", "id": "dfc27d84", "metadata": {}, "source": [ "2)使用numpy中的unique实现非重复值的提取" ] }, { "cell_type": "code", "execution_count": 3, "id": "1a954859", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 0 1 132 133] [132 133]\n" ] } ], "source": [ "x=np.array([[0,133,1],[0,132,0],[0,133,0]])\n", "a=np.unique(x[:])\n", "a1=np.unique(x[:,1])\n", "print(a,a1)" ] }, { "cell_type": "markdown", "id": "95b28b2f", "metadata": {}, "source": [ "**第二部分:实验任务 Part I--编写函数**" ] }, { "cell_type": "markdown", "id": "7cbe46fb", "metadata": {}, "source": [ "该数据集(train_titanic.csv)为分类数据集,为泰坦尼克号的部分乘客信息以及最后是否生还。包括了四个属性值以及一个标记属性(即为Survived类型,代表是否生还),属性信息分别为Sex(性别),sibsp(堂兄弟妹个数),Parch(父母与小孩的个数),Pclass(乘客等级)。 \n", "其中该数据集无缺失值和异常值,且所有连续变量已自动转换为离散变量,标记(Survived类型)也自动转变为离散变量0和1,所以你无需进行数据预处理,可以直接使用该数据集。" ] }, { "cell_type": "code", "execution_count": 4, "id": "1dc6ccdb", "metadata": {}, "outputs": [], "source": [ "from collections import Counter\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import math" ] }, { "cell_type": "markdown", "id": "9570e618", "metadata": {}, "source": [ "1) 使用pandas库将训练数据集'train_titanic.csv'载入到Dataframe对象中" ] }, { "cell_type": "code", "execution_count": 5, "id": "a1b1263e", "metadata": {}, "outputs": [], "source": [ "#Your code here\n", "#主要使用函数 pd.read_csv 和 np.array\n", "dataTrain = pd.read_csv(\"train_titanic.csv\")\n", "dfTrain = pd.DataFrame(dataTrain)\n", "# 特征和标签分开,保持标签为列向量形状\n", "X = dfTrain.iloc[:, :-1].to_numpy()\n", "y = dfTrain.iloc[:, -1:].to_numpy()\n" ] }, { "cell_type": "markdown", "id": "09f3bc41", "metadata": {}, "source": [ "2) 编写函数,给定任何标记数组计算其信息熵 \n", " 输入:标记数组 \n", " 输出:该数组对应的信息熵 \n", " 计算信息熵公式:\n", "某数组包含K个不同的取值,样本为第k(k=1,2,...,K)个值的数量所占比例为p_k,则其信息熵为$$Ent=-\\sum_{k=1}^K p_k log_2 p_k$$\n", " \n", " \n", "例: \n", " 输入:[[0],[1]] \n", " 输出:(-$\\frac{1}{2}$ $log_2$ $\\frac{1}{2}$)+(-$\\frac{1}{2}$ $log_2$ $\\frac{1}{2}$)=1" ] }, { "cell_type": "code", "execution_count": 6, "id": "eaa569aa", "metadata": {}, "outputs": [], "source": [ "#Your code here\n", "#主要使用函数 Counter 和 math.log2\n", "\n", "def entropy(label):\n", " labels = np.asarray(label).reshape(-1)\n", " total = labels.shape[0]\n", " counter = Counter(labels)\n", " probs = np.array(list(counter.values())) / total\n", " ent = -np.sum(probs * np.log2(probs))\n", " return ent\n", "\n" ] }, { "cell_type": "code", "execution_count": 7, "id": "b0253a63", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = np.array([[1],[0]])\n", "x.shape[1]" ] }, { "cell_type": "markdown", "id": "3d77ab84", "metadata": {}, "source": [ "3) 编写函数,函数功能为将所给的数据集按照指定维度dimension进行划分为若干个不同的数据集 \n", " 【输入】:属性集合,标记集合,维度索引 \n", " 【输出】:划分后所得到的子树属性集合,子树标记集合\n", "\n", "例子: \n", " 【输入】:feature(属性值集合):[[0,0,0],[0,0,1],[1,0,2]] \n", " label(标记集合):[[0],[1],[2]] \n", " 划分维度:0 \n", " \n", "【输出】:[[0,0,0],[0,0,1]]和[[1,0,2]] \n", " [[0],[1]]和[[2]] \n", " tips:即将feature按其第0维度进行划分,由于feature的0维有0和1两个不同的数值,所以feature划分为[[0,0,0],[0,0,1]]和[[1,0,2]],label划分为[[0],[1]]和[[2]]" ] }, { "cell_type": "code", "execution_count": 8, "id": "8f1460ce", "metadata": {}, "outputs": [], "source": [ "#Your code here\n", "#主要使用函数 np.unique 和 append\n", "\n", "def split(feature, label, dimension):\n", " fea1 = np.unique(feature[:, dimension])\n", " split_feature = []\n", " split_label = []\n", " for fea in fea1:\n", " Index = feature[:, dimension] == fea\n", " split_feature.append(feature[Index])\n", " split_label.append(label[Index])\n", " return split_feature,split_label" ] }, { "cell_type": "markdown", "id": "47b0230f", "metadata": {}, "source": [ "4) 编写函数,函数功能为进行【一次】决策树的结点划分,遍历找出该属性集合中信息增益(使用ID3算法中的公式计算)【最大】的属性 \n", " 输入:属性集合,标记集合 \n", " 输出:该次划分的最佳信息增益,最佳划分维度 \n", " 计算信息增益公式: \n", " 某数据集D有若干属性值以及对应的标记值,其总样本大小为|D|,这里取其中一个属性类型feature,该特征包含V个不同的取值,属性值为第v(v=1,2,...,V)个值的数量为|$D^v$|$(\\sum_{v=1}^V |D^v|=|D|)$,则该属性值对应的信息增益为为$$Gain(D,feature)=Ent(D)-\\sum_{v=1}^V \\frac{|D^v|}{|D|} Ent(D^v)$$\n", "所以该函数中你需要计算出每个属性值的信息增益并输出,然后返回最大的信息增益并返回该特征的维数以及最大的信息增益值" ] }, { "cell_type": "code", "execution_count": 9, "id": "242439cb", "metadata": {}, "outputs": [], "source": [ "#Your code here\n", "#调用前面编写的函数 entropy 和 split 求出每个属性的信息增益\n", "\n", "def one_split_ID3(feature, label):\n", " entro = entropy(label)\n", " gain = []\n", " for i in range(feature.shape[1]):\n", " fea, lab = split(feature, label, i)\n", " ent = 0\n", " for I,k in enumerate(lab):\n", " ent += fea[I].shape[0]/feature.shape[0] * entropy(k)\n", " ent = entro - ent\n", " gain.append(ent)\n", " best_entropy = max(gain)\n", " best_dimension = gain.index(best_entropy)\n", " return best_entropy, best_dimension\n" ] }, { "cell_type": "markdown", "id": "75f8593e", "metadata": {}, "source": [ "5) 编写函数,函数功能为进行【一次】决策树的结点划分,遍历找出该属性集合中信息增益率(使用C4.5算法中的公式计算)【最大】的属性 \n", " 输入:属性集合,标记集合 \n", " 输出:该次划分的信息增益率,最佳维度 \n", " 计算信息增益率公式: \n", " 某数据集D有若干属性值以及对应的标记值,其总样本大小为|D|,这里取其中一个属性类型feature,该属性包含V个不同的取值,属性值为第v(v=1,2,...,V)个值的数量为|$D^v$|$(\\sum_{v=1}^V|D^v|=|D|)$,该属性本身的信息熵为Ent(feature),则该属性值对应的信息增益率为$$Gain\\_ratio(D,feature)=\\frac{Gain(D,feature)}{Ent(feature)}$$\n", "所以该函数中你需要输出每个特征的信息增益率,之后返回最大的信息增益率对应的属性维数以及最大的信息增益率" ] }, { "cell_type": "code", "execution_count": 10, "id": "91d51379", "metadata": {}, "outputs": [], "source": [ "#Your code here\n", "#跟 one_split_ID3 类似,调用前面编写的函数 entropy 和 split,唯一区别是求每个属性的信息增益率\n", "\n", "def one_split_C4_5(feature, label):\n", " base = entropy(label)\n", " best_ratio = float('-inf')\n", " best_dimension = None\n", " total = feature.shape[0]\n", "\n", " for dim in range(feature.shape[1]):\n", " fea, lab = split(feature=feature, label=label, dimension=dim)\n", " cond_ent = 0.0\n", " i = 0.0\n", " for sf, sl in zip(fea, lab):\n", " p = sf.shape[0] / total\n", " cond_ent += p * entropy(sl)\n", " if p > 0:\n", " i -= p * np.log2(p)\n", " gain = base - cond_ent\n", " ratio = gain / i if i > 0 else 0\n", " if ratio > best_ratio:\n", " best_ratio = ratio\n", " best_dimension = dim\n", "\n", " return best_ratio, best_dimension\n" ] }, { "cell_type": "markdown", "id": "ee40f45a", "metadata": {}, "source": [ "6) 编写函数,进行【一次】决策树的结点划分,遍历找出该属性集合中基尼系数(使用CART算法中的公式计算)最小的属性以及最佳的划分值 \n", " 输入:属性集合,标记集合 \n", " 输出:该次划分的最佳的基尼系数,最佳维度,最佳划分值 \n", " 计算基尼系数公式: \n", " 某数据集D有若干属性值以及对应的标记值,其总样本大小为|D|,该集合中样本标记类别总共有K类,第k类样本所占比例为$p_k$(k=1,2,..,K)则该数据集对应的基尼系数为$$Gini(D)=1-\\sum_{k=1}^K{p_k}^2$$ \n", " 而取其中一个属性类型feature,选定该类型的一个值value,根据feature这个属性是否为value将数据集分为两个子集$D_1$和$D_2$,则该属性feature对应的基尼系数为$$Gini\\_index(D,feature)=\\frac{|D_1|}{|D|}Gini(D_1)+\\frac{|D_2|}{|D|}Gini(D_2)$$\n", "所以该函数中你需要首先找出每列中的非重复值,之后根据该列的非重复值进行分类,计算出该列中以该值进行分类的基尼系数,计算出每个属性的每个非重复值的基尼系数,之后找到最小的基尼系数对应的属性维数以及对应的分类值" ] }, { "cell_type": "code", "execution_count": 11, "id": "d073721e", "metadata": {}, "outputs": [], "source": [ "#Your code here\n", "#主要使用函数 Counter 和 np.unique\n", "\n", "def one_split_CART(feature, label):\n", " # 计算基尼指数的辅助函数,兼容一维或列向量标签\n", " def gini(sub_label):\n", " labels = np.asarray(sub_label).reshape(-1)\n", " total = labels.shape[0]\n", " counter = Counter(labels)\n", " return 1 - sum((cnt / total) ** 2 for cnt in counter.values())\n", "\n", " best_gini = float('inf')\n", " best_dimension = None\n", " best_value = None\n", " total = feature.shape[0]\n", "\n", " for dim in range(feature.shape[1]):\n", " for val in np.unique(feature[:, dim]):\n", " left_idx = feature[:, dim] <= val\n", " right_idx = ~left_idx\n", " if left_idx.sum() == 0 or right_idx.sum() == 0:\n", " continue\n", " g_left = gini(label[left_idx])\n", " g_right = gini(label[right_idx])\n", " g_split = (left_idx.sum() / total) * g_left + (right_idx.sum() / total) * g_right\n", "\n", " if g_split < best_gini:\n", " best_gini = g_split\n", " best_dimension = dim\n", " best_value = val\n", "\n", " return best_gini, best_dimension, best_value\n" ] }, { "cell_type": "markdown", "id": "b4bdc9f0", "metadata": {}, "source": [ "7) 应用之前你在第4、5、6个部分编写的三个函数,在训练数据集'train_titanic.csv'上依次使用这些函数进行【一次】结点划分,并输出对应的最佳属性维数以及相应的信息增益值/信息增益率/(基尼系数和分类值)" ] }, { "cell_type": "code", "execution_count": 12, "id": "97f82f6d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(np.float64(0.10750711887455178),\n", " np.float64(0.11339165967945304),\n", " np.float64(0.2964915724641573),\n", " 0,\n", " 0,\n", " 0)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Your code here\n", "#练习调用编写的三个函数\n", "import numpy as np\n", "import pandas as pd\n", "file = pd.read_csv(\"train_titanic.csv\")\n", "df = pd.DataFrame(file)\n", "feature = np.array(df)[:, :-1]\n", "label = np.array(df)[:, -1]\n", "e1, d1 = one_split_ID3(feature, label)\n", "e2, d2 = one_split_C4_5(feature, label)\n", "e3, d3, v3 = one_split_CART(feature, label)\n", "e1, e2, e3, d1, d2, d3" ] }, { "cell_type": "markdown", "id": "26817408", "metadata": {}, "source": [ "**第三部分:实验任务 Part II--构建决策树**" ] }, { "cell_type": "markdown", "id": "dde3ec3f", "metadata": {}, "source": [ "用【ID3】算法实现一棵完整的决策树。完成DTree类中的TreeGenerate、train函数以完成决策树的构建。并完成DTree类中的predict函数来用构建好的决策树来对测试数据集进行预测并输出预测准确率。" ] }, { "cell_type": "code", "execution_count": 13, "id": "f21d651c", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import math" ] }, { "cell_type": "code", "execution_count": 14, "id": "03078af6", "metadata": {}, "outputs": [], "source": [ "# 树结点类\n", "class Node:\n", " def __init__(self, isLeaf=True, label=-1, index=-1):\n", " self.isLeaf = isLeaf # isLeaf表示该结点是否是叶结点\n", " self.label = label # label表示该叶结点的label(当结点为叶结点时有用)\n", " self.index = index # index表示该分支结点的划分属性的序号(当结点为分支结点时有用)\n", " self.children = {} # children表示该结点的所有孩子结点,dict类型,方便进行决策树的搜索\n", " \n", " def addNode(self, val, node):\n", " self.children[val] = node #为当前结点增加一个划分属性的值为val的孩子结点" ] }, { "cell_type": "code", "execution_count": 15, "id": "c40bab6e", "metadata": {}, "outputs": [], "source": [ "# 决策树类\n", "class DTree:\n", " def __init__(self):\n", " self.tree_root = None #决策树的根结点\n", " self.possible_value = {} # 用于存储每个属性可能的取值\n", " \n", " \n", " '''\n", " TreeGenerate函数用于递归构建决策树,伪代码参照课件中的“Algorithm 1 决策树学习基本算法”\n", " '''\n", " def TreeGenerate(self, D, A):\n", " \n", " # 生成结点 node\n", " node = Node()\n", " \n", " \n", " \n", " # if D中样本全属于同一类别C then\n", " # 将node标记为C类叶结点并返回\n", " # end if\n", " label = D[:, -1]\n", " if len(np.unique(label)) == 1:\n", " node.isLeaf = True\n", " node.label = label[0]\n", " return node\n", " \n", " \n", " # if A = Ø OR D中样本在A上取值相同 then\n", " # 将node标记叶结点,其类别标记为D中样本数最多的类并返回\n", " # end if\n", " majority = Counter(label).most_common(1)[0][0]\n", " if len(A) == 0:\n", " node.isLeaf = True\n", " node.label = majority\n", " return node\n", " is_same = True\n", " for every in A:\n", " if len(np.unique(D[:, every])) > 1:\n", " is_same = False\n", " break\n", " if is_same:\n", " node.isLeaf = True\n", " node.label = majority\n", " return node\n", " \n", " # 从A中选择最优划分属性a_star\n", " # (选择信息增益最大的属性,用到上面实现的one_split_ID3函数)\n", " A_list = list(A)\n", " feature = D[:, A_list]\n", " _, best_dimension = one_split_ID3(feature, label)\n", " a_star = A_list[best_dimension]\n", " node.isLeaf = False\n", " node.index = a_star\n", " node.label = majority\n", " \n", " \n", " \n", " # for a_star 的每一个值a_star_v do\n", " # 为node 生成每一个分支;令D_v表示D中在a_star上取值为a_star_v的样本子集\n", " # if D_v 为空 then\n", " # 将分支结点标记为叶结点,其类别标记为D中样本最多的类\n", " # else\n", " # 以TreeGenerate(D_v,A-{a_star}) 为分支结点\n", " # end if\n", " # end for\n", " for a_star_v in self.possible_value[a_star]:\n", " D_v = D[D[:, a_star] == a_star_v]\n", " if D_v.shape[0] == 0:\n", " node.addNode(a_star_v, Node(isLeaf=True, label=majority))\n", " else:\n", " node.addNode(a_star_v, self.TreeGenerate(D_v, A - {a_star}))\n", " \n", " return node\n", " \n", " \n", " \n", " \n", " '''\n", " train函数可以做一些数据预处理(比如Dataframe到numpy矩阵的转换,提取属性集等),并调用TreeGenerate函数来递归地生成决策树\n", " '''\n", " def train(self, D):\n", "# D = np.array(D) # 将Dataframe对象转换为numpy矩阵(也可以不转,自行决定做法)\n", "# A = set(range(D.shape[1] - 1)) # 属性集A\n", " \n", "# #记下每个属性可能的取值\n", "# for every in A:\n", "# self.possible_value[every] = np.unique(D[:, every])\n", " \n", "# self.tree_root = self.TreeGenerate(D, A) # 递归地生成决策树,并将决策树的根结点赋值给self.tree_root\n", " \n", " D = np.array(D)\n", " A = set(range(D.shape[1] - 1))\n", " for every in A:\n", " self.possible_value[every] = np.unique(D[:, every])\n", " self.tree_root = self.TreeGenerate(D, A)\n", " \n", " \n", " \n", " \n", " \n", " \n", " '''\n", " predict函数对测试集D进行预测, 并输出预测准确率(预测正确的个数 / 总数据数量)\n", " '''\n", " def predict(self, D):\n", "# D = np.array(D) # 将Dataframe对象转换为numpy矩阵(也可以不转,自行决定做法)\n", " \n", "# #对于D中的每一行数据d,从当前结点x=self.tree_root开始,当当前结点x为分支结点时,\n", "# #则搜索x的划分属性为该行数据相应的属性值的孩子结点(即x=x.children[d[x.index]]),不断重复,\n", "# #直至搜索到叶结点,该叶结点的label就是数据d的预测label\n", " \n", " D = np.array(D)\n", " right = 0\n", " for d in D:\n", " x = self.tree_root\n", " while not x.isLeaf:\n", " if d[x.index] in x.children:\n", " x = x.children[d[x.index]]\n", " else:\n", " break\n", " if x.label == d[-1]:\n", " right += 1\n", " return right / D.shape[0]\n" ] }, { "cell_type": "code", "execution_count": 16, "id": "7e1f721b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.8415841584158416" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_frame = pd.read_csv('train_titanic.csv')\n", "dt = DTree()\n", "\n", "# 构建决策树\n", "dt.train(train_frame)\n", "\n", "# 利用构建好的决策树对测试数据集进行预测,输出预测准确率(预测正确的个数 / 总数据数量)\n", "test_frame = pd.read_csv('test_titanic.csv')\n", "dt.predict(test_frame)" ] } ], "metadata": { "kernelspec": { "display_name": "base", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.5" } }, "nbformat": 4, "nbformat_minor": 5 }