{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Overview\n", "\n", "Chapter 2 Overview of Supervised Learning\n", "\n", "本章节是整本书内容的概览,主要是 supervised learning。定义了一些相关术语,介绍了简单的 learning 模型、问题、和困难。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Terminology\n", "\n", "learning 的过程,就是用 输入 去预测、估计 输出。以下定义会在整本书通用:\n", "\n", "- input -- output\n", "\n", "- preditors - responses\n", "\n", "- independent variables - dependent variables\n", "\n", "input variables 的分类:\n", "\n", "- quantitative variables --- regression;\n", "\n", "- qualitative variables,又称为 categorical/discrete variables, factors。---- classification。\n", "\n", "Symbols:\n", "- 输入 $\\mathbf{X}$\n", "\n", "- 输出 $\\mathbf{Y}$ \n", "\n", "- 预测 $\\hat{\\mathbf{Y}}$\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 两个基本的预测方法\n", "\n", "### Least Squares\n", "\n", "基于线性模型的最小化方差,Residual sum of squares\n", "\\begin{align}\n", "\\textrm{RSS}(\\beta) = (\\mathbf{y} -\\mathbf{X}\\beta)^T(\\mathbf{y} -\\mathbf{X}\\beta )\n", "\\end{align}\n", "\n", "### Nearest-Neighbor Methods\n", "\n", "$k$-nearest neighbor (KNN) 拟合\n", "\\begin{align}\n", "\\hat{Y}(x) = \\frac{1}{k} \\sum_{x_i \\in N_k(x)} y_i\n", "\\end{align}\n", "where $N_k(x)$ 是 $k$ 个最邻近 $x$ 的变量。\n", "\n", "### Least Squares vs. Nearest-Neighbor Methods\n", "\n", "这两个方法正好是两个极端,是后续 learning 的基础:\n", "\n", "- Least Squares, 全局优化,得到平滑的曲线,低 variance 和高 bias。\n", "\n", "- Nearest-Neighbor Methods,局部优化,通常得到许多不规整的区域,高 variance 和低 bias。当 $k \\rightarrow \\infty$ 时,基本就与 least square 相似了。\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Statistical Decision Theory\n", "\n", "我们需要一个 *loss function* $L(Y,f(X))$ 来衡量预测的好坏,例如 *squared error loss*: $L(Y,f(X)) = (Y-f(X))^2$。不同 loss function 的选择,会影响预测结果。\n", "\n", "- regression: $L_2$ or $L_1: E|Y-f(X)|$ loss function\n", "\n", "- classifier: $L(G,\\hat{G})$ 是预测正确的奖励,然后基于概率求的最大值。最常用的如 Bayes classifier。\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Curse of Dimensionality\n", "\n", "高维度预测的困难。对于一个 $p$ 维的变量,其空间是以 $p$ 的指数增长的,并且向外扩散增长,导致三个问题:\n", "\n", "1. local 预测算法失灵。一个 local 的预测,需要大部分范围的变量。比如,每个变量维度都取 $\\tfrac{1}{2}$, 那么得到的是 $\\left(\\tfrac{1}{2}\\right)^p$。 反之,一个local的预测,比如只需要 $\\left(\\tfrac{1}{2}\\right)^p$ 的数据,但是需要保证这些数据在每个维度占有 $\\tfrac{1}{2}$ 的量。也就是说,这些变量已经不那么 local 了(因为 local 的好处是只取一小部分变量)。\n", "\n", "1. 边缘变量增加预测难度。数据越外围,量越大,这导致很多数据会堆积在最外层。而在 prediction 过程中,边缘点的 training 难度会高很多。令 $d(p,N)$ 为变量到中心的最短距离的 中值 (median),简易计算方法,所有点都在 $d(p,N)$ 外面的概率为 $\\tfrac{1}{2}$,即 $\\left(1 - \\frac{d^p}{1^p}\\right)^N =\\tfrac{1}{2}$。从而可以得到\n", "\\begin{align}\n", "d(p,N) = \\left( 1 - \\left(\\frac{1}{2}\\right)^{1/N}\\right)^{1/p}.\n", "\\end{align}\n", "\n", "1. 变量 density 降低。变量的 sampling density 是和 $N^{1/p}$ 成正比的。通常 density 越高,learning 越准。维度 $p$ 的增加,会导致 training data 的需求数量以 $p$ 的级数增加。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.0" } }, "nbformat": 4, "nbformat_minor": 2 }