{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "随机森林简介\n", "==========" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 基本原理\n", "\n", "随机森林的概念比较简单,其实就是并行训练多颗决策树,在预测时,分类问题用少数服从多数的方式投票,回归问题用平均值。\n", "\n", "很多颗树就组成森林,而前缀随机,是说每颗树的样本是从原始数据中有放回的随机抽样,而且在树的生长中,在每个节点并不会遍历所有特征,而只会随机抽一定量的特征。所以说,随机有两处,一处是在样本集的抽取,一处是树分割的特征遍历。\n", "\n", "为什么要用随机呢?\n", "\n", "这是为了削弱树与树之间的相关性,如果每颗树都相似,那么随机森林的意义便不复存在。只有每颗树的着眼点不同,是相异的,结果才能更好地反映出统筹兼顾的特点。\n", "\n", "所以随机森林树越多,结果越趋于平均,越不会过拟合。而GBDT随着树的颗数增加,就会越趋向于过拟合。\n", "\n", "另外,因为随机森林用的是有放回随机抽样,对于$N$个样本的训练集,则样本不会被抽取到的概率是:\n", "\\begin{equation}\n", " P = (1 - \\frac{1}{N})^N = 1 + \\frac{1}{-N})^{-N \\cdot -1} \\to e^{-1} \\approx \\frac{1}{3}\n", "\\end{equation}\n", "\n", "也就是说,对于每颗树,都有将近1/3的训练样本没有使用,这些样本就可以当作测试集。即,在训练的同时跑测试,通过测试指标获取到当前训练的信息,这个称为Out-of-bag Estimator." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 工程实现\n", "\n", "因为是单颗树的扩增,工程实现没有太多讲的。sklearn比较正统,独立实现了决策树,再此基础上实现了随机森林,而spark的决策树就是1颗树的随机森林。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }