{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#5 Machine Learning Basics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "* [손고리즘] middle learning - 파이썬을 이용한 기계학습 알고리즘 기초 / 딥러닝 파트 5장 [2]\n",
    "* 김무성"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Contents"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 5.1 Learning Algorithms\n",
    "* 5.2 Example: Linear Regression\n",
    "* 5.3 Generalization, Capacity, Overfitting and Under-fitting\n",
    "* 5.4 Hyperparameters and Validation Sets\n",
    "* 5.5 Estimators, Bias and Variance\n",
    "* 5.6 Maximum Likelihood Estimation\n",
    "* 5.7 Bayesian Statistics\n",
    "* 5.8 Supervised Learning Algorithms\n",
    "* 5.9 Unsupervised Learning Algorithms\n",
    "* 5.10 Weakly Supervised Learning\n",
    "* 5.11 Building a Machine Learning Algorithm\n",
    "* 5.12 The Curse of Dimensionality and Statistical Lim-itations of Local Generalization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font color=\"red\">Deep learning is a specific kind of machine learning. In order to understand deeplearning well, one must have a solid understanding of the basic principles of ma-chine learning</font>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5.1 Learning Algorithms"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 5.1.1 The Task, T\n",
    "* 5.1.2 The Performance Measure, P\n",
    "* 5.1.3 The Experience, E"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### A machine learning algorithm is an algorithm that is able to learn from data. Butwhat do we mean by learning?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* A popular definition of learning in the context ofcomputer programs is \n",
    "    - “A computer program is said to learn \n",
    "        - from experience E \n",
    "        - with respect to some class of tasks T \n",
    "        - and performance measure P , \n",
    "    - if its performance at tasks in T , as measured by P , improves with experience E”"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.1.1 The Task, T"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Classification\n",
    "* Classification with missing inputs\n",
    "* Regression \n",
    "* Transcription\n",
    "* Translation\n",
    "* Structured output\n",
    "* Anomaly detection\n",
    "* Synthesis and sampling\n",
    "* Imputation of missing values\n",
    "* Denoising\n",
    "* Density or probability function estimation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Classification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap1.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap2.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://ipython-books.github.io/images/ml.png\" width=600 />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Classification with missing inputs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Classification becomes more challenging if the computer program is not guaranteed that every measurement in its input vector will always be provided. \n",
    "* In order to solve the classification task, the learning algorithm only has to define a single function mappingfrom a vector input to a categorical output. \n",
    "* When some of the inputsmay be missing, rather than providing a single classification function, the learning algorithm must learn a set of functions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap3.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://ipython-books.github.io/images/ml.png\" width=600 />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://www.bindichen.co.uk/uploads/regression%20ng.png\" width=600 />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Transcription"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* In this type of task, the machine learning system is asked toobserve a relatively unstructured representation of some kind of data andtranscribe it into discrete, textual form."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"https://playingwithcode.files.wordpress.com/2013/03/calcpad_screenshot_2.png\" width=600 />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Translation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* In a translation task, the input already consists of a sequenceof symbols in some language, and the computer program must convert thisinto a sequence of symbols in another language."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://amas.ie/wp-content/uploads/google-translate-2.jpg\" width=600 />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Structured output"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Structured output tasks involve any task where the output is a vector con-taining important relationships between the different elements."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://nlp.stanford.edu/projects/fig1.png\" width=600 />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Anomaly detection"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* In this type of task, the computer program sifts througha set of events or objects, and flags some of them as being unusual or atypi-cal."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://www.merchantrms.com/images/how-it-works.gif\" width=600 />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Synthesis and sampling"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* In this type of task, the machine learning algorithmis asked to generate new examples that are similar to those in the trainingdata."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://i867.photobucket.com/albums/ab236/littlebigbucket/SteppingError.png\" width=600 >"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"https://developer.apple.com/library/mac/documentation/UserExperience/Conceptual/SpeechSynthesisProgrammingGuide/art/speechsynth_speechprocess.gif\" width=600 />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Imputation of missing values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* The algorithm must provide a prediction of the values of themissing entries."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### a new example"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap4.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### missing value"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap5.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://www.adapticon.com/images/impByIntpl.gif\" width=600 />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Denoising"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### a corrupted example"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap6.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### a clean example"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap7.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* The learner must predict the cleanexample x from its corrupted version"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap8.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* or more generally predict the con-ditional probability distribution"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap9.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://people.tuebingen.mpg.de/burger/neural_denoising/images/denoising.png\" width=600 />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Density or probability function estimation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* P_model(x) can be interpreted as a probability density func-tion (if x is continuous) or a probability function (if x is discrete) on thespace that the examples were drawn from. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap12.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap11.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*  if we have performed density estimation toobtain a probability distribution p(x), we can use that distribution to solvethe missing value imputation task."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap13.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* In practice, density estimationdoes not always allow us to solve all of these related tasks, because in manycases the required operations on p(x) are computationally intractable"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap14.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://sebastianraschka.com/Images/2014_parzen_density_estimation/parzen_goal.png\" width=600 />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://www.intechopen.com/source/html/42874/media/image4.png\" width=600 />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.1.2 The Performance Measure, P"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* accuracy\n",
    "* error rate\n",
    "    - 0-1 loss\n",
    "* probability    \n",
    "* test set & training set"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In order to evaluate the abilities of a machine learning algorithm, we must designa quantitative measure of its performance. Usually this performance measure Pis specific to the task T being carried out by the system"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### accuracy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Accuracy is just theproportion of examples for which the model produces the correct output."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://www.welaptega.com/wp-content/uploads/2014/09/testing-accuracy.jpg\" width=600 />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### error rate"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* We canalso obtain equivalent information by measuring the error rate, the proportion ofexamples for which the model produces an incorrect output."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 0-1 loss"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* We often refer to the error rate as the expected 0-1 loss. The 0-1 loss on a particular example is 0 if it is correctly classified and 1 if it is not"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://images.cnitblog.com/blog/61573/201411/081829455185940.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### probability"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* For tasks such as density estimation,we can measure the probability the model assigns to some examples."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### test set & training set"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* We therefore evaluate these performance mea-sures using a test set of data that is separate from the data used for training themachine learning system."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://www.holehouse.org/mlclass/10_Advice_for_applying_machine_learning_files/Image%20[1].png\" width=600 />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.1.3 The Experience, E"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Unsupervised learning algorithms\n",
    "* Supervised learning algorithms\n",
    "* reinforcement learing algorithms\n",
    "* dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Most of the learning algorithms in this book can be understood \n",
    "\n",
    "* as being allowed to experience an entire <font color=\"red\">dataset</font>. \n",
    "* A dataset is a \n",
    "    - collection of many objects called <font color=\"red\">examples</font>, \n",
    "        - with each example containing many <font color=\"red\">features</font> \n",
    "            - that have been objectively measured. \n",
    "        - Sometimes we will also call examples <font color=\"red\">data points</font>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Unsupervised learning algorithms "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://images.slideplayer.biz.tr/7/1917098/slides/slide_2.jpg\" width=600 />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* unsupervised"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap15.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap16.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* supervised"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap17.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap18.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Supervised learning algorithms "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* label or target"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://wiki.cs.princeton.edu/images/2/2a/SupervisedLearning.jpg\" width=600 />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Unsupervised learning and supervised learning are not formally defined terms.\n",
    "\n",
    "* The lines between them are often blurred. \n",
    "* Many machine learning technologies can be used to perform both tasks. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### For example, the chain rule of probability states that for a vector "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap19.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### the joint distribution can be decomposed as"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap20.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* This decomposition means that we can solve the ostensibly unsupervised problemof modeling p(x) by splitting it into n supervised learning problems."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://image.slidesharecdn.com/basicprobabilitystatistic-140531034844-phpapp02/95/basic-probability-statistics-4-638.jpg?cb=1401508165\" width=600 />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://image.slidesharecdn.com/lecturev2-150724073404-lva1-app6892/95/text-mining-from-bayes-rule-to-dependency-parsing-20-638.jpg?cb=1437726491\" width=600 />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### we can solve the supervised learning problem of learning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap22.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### by using tra-ditional unsupervised learning technologies to learn the joint distribution"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap23.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### and inferring"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap24.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### reinforcement learing algorithms"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Some machine learning algorithms do not just experience a fixed dataset.\n",
    "* For example, reinforcement learning algorithms interact with an environment, sothere is a feedback loop between the learning system and its experiences. \n",
    "* Such algorithms are beyond the scope of this book."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* features\n",
    "* design matrix\n",
    "* heterogeneous data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Most machine learning algorithms simply experience a dataset. A dataset canbe described in many ways. In all cases, a dataset is a collection of examples.Each example is a collection of observations called features collected from a dif-ferent time or place. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://image.slidesharecdn.com/t-235p-211yui-v2-140617144512-phpapp01/95/hivemail-scalable-machine-learning-library-for-apache-hive-21-638.jpg?cb=1403016416\" width=600 />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### design matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* One common way of describing a dataset is with a design matrix. \n",
    "* A design matrix is a matrix containing a different example in each row. \n",
    "* Each column of thematrix corresponds to a different feature"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://i.stack.imgur.com/VZtEr.jpg\" width=600 />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### design matrix example"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* For instance, the Iris dataset contains150 examples with four features for each example. \n",
    "* This means we can representthe dataset with a design matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap25.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap26.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* is the sepal lengthof plant i"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap27.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* is the sepal width of plant i"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### heterogeneous data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Different sections of this book describe how to handle differenttypes of heterogeneous data.\n",
    "* In cases like these, rather than describing the datasetas a matrix with m rows, we will describe it as a set containing m elements, e.g."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap28.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* This notation does not imply that any two example vectors"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap29.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "and"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap30.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* have the same size."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Often when working with a dataset containing a design matrix of feature observations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap31.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* we alsoprovide a vector of labels"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap32.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "with"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap33.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* providing the label for example i"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5.2 Example: Linear Regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Linear_regression.svg/400px-Linear_regression.svg.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://ordination.okstate.edu/plane.jpg\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* input"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap34.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* output"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap35.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* linear regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap36.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* parameters"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap37.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* ith feature"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap38.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* ith weight"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap41.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### performance measure"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### test set"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* design matrix of input"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap43.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* regression target vector"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap44.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* predictionso of model on the test set"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap45.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* mean squared error"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://image.slidesharecdn.com/forecastingtechniques-091206232900-phpapp02/95/forecasting-techniques-22-728.jpg\" width=600 />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap46.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Intuitively, one can see that this error measure decreases to 0 when"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap47.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* We can also see that"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap48.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "so the error increases whenever the Euclidean distance between the predictionsand the targets increases."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* To make a machine learning algorithm, we need to design an algorithm thatwill improve the weights "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap39.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "in a way that reduces"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap49.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "when the algorithmis allowed to gain experience by observing a training set"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap50.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* To minimize"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap51.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "we can simply solve for where its gradient is 0:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://www.mathsmutt.co.uk/files/dcten_files/image002.gif\" width=600 />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://www.tlchm.bris.ac.uk/~paulmay/misc/1s/img00108.gif\" width=600 />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap52.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/eq5.1.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/fig5.1.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* It’s worth noting that the term linear regression is often used to refer to aslightly more sophisticated model with one additional parameter—an intercept term"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap53.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* In this model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap54.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "so the mapping from parameters to predictions is still a linear function but themapping from features to predictions is now an affine function."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Linear regression is of course an extremely simple and limited learning al-gorithm, but it provides an example of how a learning algorithm can work."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5.3 Generalization, Capacity, Overfitting and Underfitting"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 5.3.1 The No Free Lunch Theorem\n",
    "* 5.3.2 Regularization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The central challenge in machine learning is that we must perform well on <font color=\"red\">new,previously unseen inputs</font>—not just those on which our model was trained. ==> <font color=\"red\">generalization</font>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* generalization error\n",
    "* training error"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* training set\n",
    "* test set"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/0/0e/Traintest.svg/700px-Traintest.svg.png\" width=600 />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* In our linear regression example, we trained the model by minimizing thetraining error,"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap55.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* but we actually care about the test error,"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap56.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* <font color=\"red\">How can we affect performance on the test set when we only get to observethe training set?</font> \n",
    "    - The field of <font color=\"red\">statistical learning theory</font> provides some answers. \n",
    "    - If the training and the test set are <font color=\"red\">collected arbitrarily</font>, \n",
    "    - there is indeed little we can do. \n",
    "    - If we are allowed to make <font color=\"red\">some assumptions</font> about <font color=\"blue\">how the training and test set are collected</font>, then we can make some progress"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* some assumptions\n",
    "    - <font color=\"red\">i.i.d. assumptions</font>.\n",
    "        - independent\n",
    "        - identically distributed\n",
    "            - <font color=\"red\">data generating distribution</font>, or data generating process "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://artint.info/figures/ch07/plate.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* The factors determining how well a machinelearning algorithm will perform are its ability to:\n",
    "    1. Make the training error small.\n",
    "    2. Make the gap between training and test error small."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* These two factors correspond to the two central challenges in machine learning:\n",
    "    - underfitting and \n",
    "    - overfitting."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://sanghyukchun.github.io/images/post/59-1.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* capacity\n",
    "    - We <font color=\"red\">can control</font> whether a model is more likely to overfit or underfit <font color=\"red\">by altering its capacity</font>. \n",
    "        - Informally, <font color=\"blue\">a model’s capacity is its ability to fit a wide variety of functions</font>. \n",
    "        - Models with low capacity may struggle to fit the training set. \n",
    "        - Models with high capacity can overfit, i.e., memorize properties of the training set that do not serve them well on the test set."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* hypothesis space\n",
    "    - One way to <font color=\"red\">control the capacity</font> of a learning algorithm is <font color=\"red\">by choosing its hypothesis space</font>, the set of functions that the learning algorithm is allowed to choose as being the solution."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://www.svms.org/srm/Sewell2006.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* A polynomial of degree one gives us the linear regression model with whichwe are already familiar, with prediction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap57.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* By introducing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap58.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "as another feature provided to the linear regression model, wecan learn a model that is quadratic as a function of"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap60.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ":"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap59.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Note that this is still a linear function of the parameters, so we can still use thenormal equations to train the model in closed form.\n",
    "* We can continue to add more powers of x as additional features, for example to obtain a polynomial of degree 9:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap61.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Machine learning algorithms will generally perform best <font color=\"red\">when their capacityis appropriate in regard to the true complexity of the task</font> they need to performand the amount of training data they are provided with."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/fig5.2.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Occam’s razor "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://muslimsi.com/wp-content/uploads/2014/12/quote-occam-s-razor-no-more-things-should-be-presumed-to-exist-than-are-absolutely-necessary-i-e-the-william-of-occam-372636.jpg\" width=600 />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* We must remember that while simpler functions are more likely to generalize(to have a small gap between training and test error) <font color=\"red\">we must still choose a sufficiently complex hypothesis</font> to achieve low training error."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/fig5.3.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* non-parametric model\n",
    "    - To reach the most extreme case of arbitrarily high capacity, we introducethe concept of non-parametric models. So far, we have seen only parametricmodels, such as linear regression.\n",
    "    - Parametric models learn a function describedby a parameter vector whose size is finite and fixed before any data is observed.\n",
    "    - Non-parametric models have no such limitation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://img.docstoccdn.com/thumb/orig/113799484.png\" width=400 />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* k-NN"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://www.statsoft.com/portals/0/Support/KNNOverViewImageA.jpg\"  />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://scikit-learn.sourceforge.net/0.8/_images/plot_neighbors_11.png\" width=600 />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* nearest neighbor regression(k-NN Regression)\n",
    "    - Unlikely linear regression,which has a fixed-length vector of weights, the nearest neighbor regression modelsimply stores the X and y from the training set.\n",
    "    - When asked to classify a test point x, the model looks up the nearest entry in the training set and returns theassociated regression target."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap62.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap63.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap64.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://www.saedsayad.com/images/KNN_similarity.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://scikit-learn.sourceforge.net/0.6/_images/plot_neighbors_regression.png\" width=600 />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Bayes error\n",
    "    - The error incurred by an oraclemaking predictions from the true distribution p(x, y) is called the Bayes error.\n",
    "    - 참고 - http://newsight.tistory.com/127"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"https://ranabasheer.files.wordpress.com/2011/04/bayes_error.jpg\" width=600 />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"https://www.projectrhea.org/rhea/images/thumb/1/1d/Upper_bounds_Bayes_error_Pic1.jpg/700px-Upper_bounds_Bayes_error_Pic1.jpg\" width=600 /> "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/fig5.4.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* representational capacity & effective capacity\n",
    "    - It’s worth mentioning that capacity is not just determined by which model we use. \n",
    "    - The model specifies which family of functions the learning algorithm can choose from when varying the parameters in order to reduce a training objective.\n",
    "    - This is called the <font color=\"red\">representational capacity</font> of the model. \n",
    "    - In many cases, finding the best function within this family is a very difficult optimization problem. \n",
    "    - In practice, the learning algorithm does not actually find the best function, <font color=\"red\">just one that significantly reduces the training error</font>. \n",
    "    - These additional restrictions mean that the <font color=\"red\">model’s effective capacity may be less than its representational capacity</font>."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.3.1 The No Free Lunch Theorem"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* The no freelunch theorem for machine learning (Wolpert, 1996) states that, <font color=\"red\">averaged overall possible data generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points</font>. In other words,in some sense, <font color=\"blue\">no machine learning algorithm is universally any better than anyother</font>. The most sophisticated algorithm we can conceive of has the same averageperformance (over all possible tasks) as merely predicting that every point belongsto the same class."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Fortunately, these results hold only when we average over all possible datagenerating distributions. <font color=\"red\">If we make assumptions about the kinds of probability distributions we encounter in real-world applications</font>, then <font color=\"blue\">we can design learning algorithms that perform well on these distributions</font>."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.3.2 Regularization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://yosinski.com/mlss12/media/slides/MLSS-2012-Fukumizu-Kernel-Methods-for-Statistical-Learning_050.png\" width=600 />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### preference"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* The no free lunch theorem implies that we must design our machine learning algorithms to perform well on a specific task. \n",
    "* We do so by building a set of preferences into the learning algorithm. \n",
    "* When these preferences are aligned with the learning problems we ask the algorithm to solve, it performs better."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### hypotheis space of solutions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* So far, the only method of modifying a learning algorithm we have discussedis to increase or decrease the model’s capacity by adding or removing functionsfrom the hypothesis space of solutions the learning algorithm is able to choose."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### functions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* The behavior of our algorithm is strongly affected not just by how large wemake the set of functions allowed in its hypothesis space, but by the specificidentity of those functions.\n",
    "    - linear functions\n",
    "        - The learning algorithm we have studied so far, linearregression, has a hypothesis space consisting of the set of linear functions of itsinput.\n",
    "    - nonlinear functions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap70.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### weight decay"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://www.alglib.net/dataanalysis/i/art0_trainreg.gif\" width=600 />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* For example, we can modify the training criterion for linear regression to include weight decay. \n",
    "* To perform linear regression with weight decay, we minimizenot only the mean squared error on the training set, but instead a criterion"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap72.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "that expresses a preference for the weights to have smaller squared"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap73.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "norm."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Specifically,"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap74.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "where"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap75.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "is a value chosen ahead of time that controls the strength of our preferencefor smaller weights."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap76.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "we impose no preference, "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "and larger"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap77.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "forces the weights to become smaller."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Minimizing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap78.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "results in a choice of weights thatmake a tradeoff between fitting the training data and being small. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/fig5.5.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### regularization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* In our weight decay example, we expressed our preference for linear functions defined with smaller weights explicitly, via an extra term in the criterion we minimize. \n",
    "* There are many other ways of expressing preferences for different solutions, both implicitly and explicitly. \n",
    "* Together, these different approaches are known as regularization. \n",
    "* Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5.4 Hyperparameters and Validation Sets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 5.4.1 Cross-Validation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### hyperparameters"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Most machine learning algorithms have <font color=\"red\">several settings</font> that we can use to <font color=\"red\">control the behavior of the learning algorithm</font>. \n",
    "* These <font color=\"red\">settings are called hyperparameters</font>.\n",
    "* The <font color=\"red\">values of hyperparameters are not adapted by the learning algorithm itself</font>(though we can design a nested learning procedure where one learning algorithmlearns the best hyperparameters for another learning algorithm).\n",
    "* In the polynomial regression example we saw in Fig. 5.2, \n",
    "    - there is a single hyperparameter: the <font color=\"red\">degree of the polynomial</font>, which <font color=\"blue\">acts as a capacity hyperparameter</font>. \n",
    "    - The <font color=\"red\">λ value</font> used to control the strength of weight decay is another example of a hyperparameter."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/fig5.2.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://d1vn86fw4xmcz1.cloudfront.net/content/royptb/365/1555/3247/F2.large.jpg\" width= 600 />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://cdn2.hubspot.net/hub/426799/hubfs/images/Screen_Shot_2015-05-27_at_8.40.18_AM.png?t=1440617420708&width=400\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### validation set"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* More frequently, <font color=\"red\">we do not learn the hyperparameter</font> because it is not appropriate to learn that hyperparameter <font color=\"red\">on the training set</font>. \n",
    "* If learned on the training set, such hyperparameters would always choose the maximum possible model capacity, resulting in <font color=\"red\">overfitting</font> (referto Figure 5.3).\n",
    "* To solve this problem, <font color=\"red\">we need a validation set of examples</font> that the training algorithm <font color=\"red\">does not observe</font>.\n",
    "* Earlier we discussed how a held-out <font color=\"red\">test set</font>, composed of examples coming from the same distribution as the training set, can be used <font color=\"red\">to estimate</font> the generalization error of a learner, <font color=\"red\">after the learning process has completed</font>. \n",
    "* It is important that the <font color=\"red\">test examples are not used in any way to make choices about the model</font>, including its hyperparameters. \n",
    "* For this reason, <font color=\"red\">no example from the test set can be used in the validation set</font>.\n",
    "* For this reason, we always <font color=\"red\">construct the validation set from the training data</font>.\n",
    "* Specifically, we split the training data into two disjoint subsets. \n",
    "* Typically, one uses about 80% of the data for training and 20% for validation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://2.bp.blogspot.com/-Fxbj18t3qUk/TnJVmn6mu4I/AAAAAAAAAFE/JT2rx_RngcM/s1600/TrainingSplitting.png\" width=600 />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://images.cppblog.com/cppblog_com/sosi/WindowsLiveWriter/Crossvalidation_127F5/image_thumb.png\" width=600 />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.4.1 Cross-Validation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* One issue with the idea of splitting the dataset into train/test or train/validation/test subsets is that only a small fraction of examples are used to evaluate generaliza-tion. \n",
    "* train/test\n",
    "    - These procedures are based on the idea of repeating the training / testing computation on different randomly chosen subsets or splits of the original dataset.\n",
    "* train/validation/test\n",
    "    - If <font color=\"red\">model selection or hyperparameter optimization</font> is required, things get more computationally expensive: \n",
    "    - one can recurse the k-fold cross-validation idea, in-side the training set. \n",
    "    - So we can have an <font color=\"red\">outer loop</font> that estimates test error and provides a “training set” for a <font color=\"red\">hyperparameter-free learner</font>, calling it k times to“train”. \n",
    "    - That hyperparameter-free learner can then split its received training set by k-fold cross-validation into internal training/validation subsets (for example,splitting into k − 1 subsets is convenient, to reuse the same test blocks as the outer loop), \n",
    "        - call a <font color=\"red\">hyperparameter-specific learner</font> for each choice of hyperparameter value on each of the training partition of this <font color=\"red\">inner loop</font>, \n",
    "        - and compute the validation error by averaging across the k −1 validation sets \n",
    "            - the errors made by the k −1 hyperparameter-specific learners trained on each of the internal training subsets."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://www.intechopen.com/source/html/39037/media/image4.jpeg\" width=600 />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"https://chrisjmccormick.files.wordpress.com/2013/07/10_fold_cv.png\" width=600 />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5.5 Estimators, Bias and Variance"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 5.5.1 Point Estimation\n",
    "* 5.5.2 Bias\n",
    "* 5.5.3 Variance\n",
    "* 5.5.4 Trading off Bias and Variance and the Mean Squared Error\n",
    "* 5.5.5 Consistency"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.5.1 Point Estimation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap79.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap80.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap81.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/eq5.2.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap82.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Function Estimation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap83.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap85.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap86.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap87.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap88.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap89.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.5.2 Bias"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Example: Bernoulli Distribution \n",
    "* Example: Gaussian Distribution Estimator of the Mean    \n",
    "* Example: Gaussian Distribution Estimators of the Variance"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/eq5.3.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap90.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap91.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap92.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap93.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap94.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example: Bernoulli Distribution"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap95.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap96.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap97.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap98.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap99.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example: Gaussian Distribution Estimator of the Mean"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap100.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap101.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap102.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap103.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/eq5.4.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap104.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example: Gaussian Distribution Estimators of the Variance"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap105.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/eq5.5.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap106.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap107.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap108.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/eq5.6.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap109.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.5.3 Variance"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Example: Bernoulli Distribution\n",
    "* Example: Gaussian Distribution Estimators of the Variance"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/eq5.7.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/eq5.8.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example: Bernoulli Distribution"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap110.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap111.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap112.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example: Gaussian Distribution Estimators of the Variance"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap113.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap114.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/eq5.9.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/eq5.10.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap115.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap116.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap117.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap118.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap119.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.5.4 Trading off Bias and Variance and the Mean Squared Error"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Example: Gaussian Distribution Estimators of the Variance"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap120.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/fig5.6.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example: Gaussian Distribution Estimators of the Variance"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap121.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.5.5 Consistency"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap122.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap123.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap124.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap125.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap126.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap127.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cap128.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5.6 Maximum Likelihood Estimation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 5.6.1 Conditional Log-Likelihood and Mean Squared Error\n",
    "* 5.6.2 Properties of Maximum Likelihood"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.6.1 Conditional Log-Likelihood and Mean Squared Error"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.6.2 Properties of Maximum Likelihood"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5.7 Bayesian Statistics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 5.7.1 Maximum A Posteriori (MAP) Estimation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.7.1 Maximum A Posteriori (MAP) Estimation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5.8 Supervised Learning Algorithms"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 5.8.1 Probabilistic Supervised Learning\n",
    "* 5.8.2 Support Vector Machines\n",
    "* 5.8.3 Other Simple Supervised Learning Algorithms"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.8.1 Probabilistic Supervised Learning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.8.2 Support Vector Machines"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.8.3 Other Simple Supervised Learning Algorithms"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5.9 Unsupervised Learning Algorithms"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 5.9.1 Principal Components Analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.9.1 Principal Components Analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5.10 Weakly Supervised Learning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5.11 Building a Machine Learning Algorithm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5.12 The Curse of Dimensionality and Statistical Lim-itations of Local Generalization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 5.12.1 The Curse of Dimensionality\n",
    "* 5.12.2 Smoothness and Local Constancy A Priori Preference\n",
    "* 5.12.3 Manifold Learning and the Curse of Dimensionality"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.12.1 The Curse of Dimensionality"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.12.2 Smoothness and Local Constancy A Priori Preference"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.12.3 Manifold Learning and the Curse of Dimensionality"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 참고자료"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "* [1] 딥러닝 주교재 - http://www.iro.umontreal.ca/~bengioy/dlbook/\n",
    "* [2] 5 Machine Learning Basics - http://www.iro.umontreal.ca/~bengioy/dlbook/ml.html"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}