{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "\n", "# In-Depth: Decision Trees and Random Forests\n", "\n", "\n", "\n", "![image.png](images/author.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "\n", "\n", "*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*\n", "\n", "*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Previously we looked at:\n", "\n", "- a simple generative classifier (naive Bayes; see **In Depth: Naive Bayes Classification**) \n", "- a powerful discriminative classifier (support vector machines; see **In-Depth: Support Vector Machines**).\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "\n", "\n", "\n", "## *Random Forests*\n", "- Another powerful, non-parametric algorithm \n", "- Random forests are an example of an **ensemble method**, \n", " - meaning that they rely on aggregating the results of an ensemble of simpler estimators.\n", "\n", "The sum can be greater than the parts: \n", "- a majority vote among a number of estimators can end up being better than any of the individual estimators doing the voting!\n", "\n", "We will see examples of this in the following sections." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Motivating Random Forests: Decision Trees" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2021-05-21T08:48:09.851888Z", "start_time": "2021-05-21T08:48:08.134416Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "%matplotlib inline\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns; sns.set()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Random forests are an example of an *ensemble learner* built on decision trees.\n", "- For this reason we'll start by discussing decision trees.\n", "\n", "Decision trees are extremely intuitive ways to classify or label objects: \n", "- you simply ask a series of questions designed to zero in on the classification.\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "For example, if you wanted to build a decision tree to classify an animal you come across while on a hike, you might construct the one shown here:\n", "\n", "![](./img/figures/05.08-decision-tree.png)\n", "\n", "[figure source in Appendix](06.00-Figure-Code.ipynb#Decision-Tree-Example)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![image.png](images/tree2.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![image.png](images/tree3.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The binary splitting makes this extremely efficient: in a well-constructed tree, \n", "- each question will cut the number of options by approximately half, \n", "- very quickly narrowing the options even among a large number of classes.\n", "\n", "The trick comes in deciding which 
questions to ask at each step.\n", "\n", "The questions typically take the form of axis-aligned splits in the data: \n", "- each node in the tree splits the data into two groups using a cutoff value within one of the features.\n", "\n", "Let's now look at an example of this." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Case Study: Play Ball Today? ⛹\n", "- A dataset of 14 days of weather records (features: outlook, temperature, humidity, windy), each labeled with whether a game was played that day (play).\n", "- Given the weather for a new day (sunny, cool, high, TRUE), predict whether a game will be played." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2021-05-21T08:52:57.042254Z", "start_time": "2021-05-21T08:52:56.634924Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/opt/anaconda3/lib/python3.9/site-packages/openpyxl/worksheet/_reader.py:312: UserWarning: Unknown extension is not supported and will be removed\n", " warn(msg)\n" ] }, { "data": { "text/html": [ "
\n", "| | Outlook | temperature | humidity | windy | play | y |\n",
"|---|---|---|---|---|---|---|\n",
"| 0 | Sunny | hot | high | False | no | 0 |\n",
"| 1 | Sunny | hot | high | True | no | 0 |\n",
"| 2 | Overcast | hot | high | False | yes | 1 |\n",
"| 3 | Rainy | mild | high | False | yes | 1 |\n",
"| 4 | Rainy | cool | normal | False | yes | 1 |\n",
"| 5 | Rainy | cool | normal | True | no | 0 |\n",
"| 6 | Overcast | cool | normal | True | yes | 1 |\n",
"| 7 | Sunny | mild | high | False | no | 0 |\n",
"| 8 | Sunny | cool | normal | False | yes | 1 |\n",
"| 9 | Rainy | mild | normal | False | yes | 1 |\n",
"| 10 | Sunny | mild | normal | True | yes | 1 |\n",
"| 11 | Overcast | mild | high | True | yes | 1 |\n",
"| 12 | Overcast | hot | normal | False | yes | 1 |\n",
"| 13 | Rainy | mild | high | True | no | 0 |\n", "