{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#5 Machine Learning Basics" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "* [손고리즘] middle learning - 파이썬을 이용한 기계학습 알고리즘 기초 / 딥러닝 파트 5장 [2]\n", "* 김무성" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Contents" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* 5.1 Learning Algorithms\n", "* 5.2 Example: Linear Regression\n", "* 5.3 Generalization, Capacity, Overfitting and Under-fitting\n", "* 5.4 Hyperparameters and Validation Sets\n", "* 5.5 Estimators, Bias and Variance\n", "* 5.6 Maximum Likelihood Estimation\n", "* 5.7 Bayesian Statistics\n", "* 5.8 Supervised Learning Algorithms\n", "* 5.9 Unsupervised Learning Algorithms\n", "* 5.10 Weakly Supervised Learning\n", "* 5.11 Building a Machine Learning Algorithm\n", "* 5.12 The Curse of Dimensionality and Statistical Lim-itations of Local Generalization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Deep learning is a specific kind of machine learning. In order to understand deeplearning well, one must have a solid understanding of the basic principles of ma-chine learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5.1 Learning Algorithms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* 5.1.1 The Task, T\n", "* 5.1.2 The Performance Measure, P\n", "* 5.1.3 The Experience, E" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### A machine learning algorithm is an algorithm that is able to learn from data. Butwhat do we mean by learning?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* A popular definition of learning in the context ofcomputer programs is \n", " - “A computer program is said to learn \n", " - from experience E \n", " - with respect to some class of tasks T \n", " - and performance measure P , \n", " - if its performance at tasks in T , as measured by P , improves with experience E”" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.1.1 The Task, T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Classification\n", "* Classification with missing inputs\n", "* Regression \n", "* Transcription\n", "* Translation\n", "* Structured output\n", "* Anomaly detection\n", "* Synthesis and sampling\n", "* Imputation of missing values\n", "* Denoising\n", "* Density or probability function estimation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Classification with missing inputs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Classification becomes more challenging if the computer program is not guaranteed that every measurement in its input vector will always be provided. \n", "* In order to solve the classification task, the learning algorithm only has to define a single function mappingfrom a vector input to a categorical output. \n", "* When some of the inputsmay be missing, rather than providing a single classification function, the learning algorithm must learn a set of functions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Transcription" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* In this type of task, the machine learning system is asked toobserve a relatively unstructured representation of some kind of data andtranscribe it into discrete, textual form." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Translation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* In a translation task, the input already consists of a sequenceof symbols in some language, and the computer program must convert thisinto a sequence of symbols in another language." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Structured output" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Structured output tasks involve any task where the output is a vector con-taining important relationships between the different elements." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Anomaly detection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* In this type of task, the computer program sifts througha set of events or objects, and flags some of them as being unusual or atypi-cal." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Synthesis and sampling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* In this type of task, the machine learning algorithmis asked to generate new examples that are similar to those in the trainingdata." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Imputation of missing values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* The algorithm must provide a prediction of the values of themissing entries." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### a new example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### missing value" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Denoising" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### a corrupted example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### a clean example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* The learner must predict the cleanexample x from its corrupted version" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* or more generally predict the con-ditional probability distribution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Density or probability function estimation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* P_model(x) can be interpreted as a probability density func-tion (if x is continuous) or a probability function (if x is discrete) on thespace that the examples were drawn from. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* if we have performed density estimation toobtain a probability distribution p(x), we can use that distribution to solvethe missing value imputation task." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* In practice, density estimationdoes not always allow us to solve all of these related tasks, because in manycases the required operations on p(x) are computationally intractable" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.1.2 The Performance Measure, P" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* accuracy\n", "* error rate\n", " - 0-1 loss\n", "* probability \n", "* test set & training set" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to evaluate the abilities of a machine learning algorithm, we must designa quantitative measure of its performance. Usually this performance measure Pis specific to the task T being carried out by the system" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### accuracy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Accuracy is just theproportion of examples for which the model produces the correct output." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### error rate" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* We canalso obtain equivalent information by measuring the error rate, the proportion ofexamples for which the model produces an incorrect output." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 0-1 loss" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* We often refer to the error rate as the expected 0-1 loss. The 0-1 loss on a particular example is 0 if it is correctly classified and 1 if it is not" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### probability" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* For tasks such as density estimation,we can measure the probability the model assigns to some examples." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### test set & training set" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* We therefore evaluate these performance mea-sures using a test set of data that is separate from the data used for training themachine learning system." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.1.3 The Experience, E" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Unsupervised learning algorithms\n", "* Supervised learning algorithms\n", "* reinforcement learing algorithms\n", "* dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Most of the learning algorithms in this book can be understood \n", "\n", "* as being allowed to experience an entire dataset. \n", "* A dataset is a \n", " - collection of many objects called examples, \n", " - with each example containing many features \n", " - that have been objectively measured. \n", " - Sometimes we will also call examples data points" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Unsupervised learning algorithms " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* unsupervised" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* supervised" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Supervised learning algorithms " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* label or target" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Unsupervised learning and supervised learning are not formally defined terms.\n", "\n", "* The lines between them are often blurred. \n", "* Many machine learning technologies can be used to perform both tasks. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### For example, the chain rule of probability states that for a vector " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### the joint distribution can be decomposed as" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* This decomposition means that we can solve the ostensibly unsupervised problemof modeling p(x) by splitting it into n supervised learning problems." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### we can solve the supervised learning problem of learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### by using tra-ditional unsupervised learning technologies to learn the joint distribution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### and inferring" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### reinforcement learing algorithms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Some machine learning algorithms do not just experience a fixed dataset.\n", "* For example, reinforcement learning algorithms interact with an environment, sothere is a feedback loop between the learning system and its experiences. \n", "* Such algorithms are beyond the scope of this book." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* features\n", "* design matrix\n", "* heterogeneous data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Most machine learning algorithms simply experience a dataset. A dataset canbe described in many ways. In all cases, a dataset is a collection of examples.Each example is a collection of observations called features collected from a dif-ferent time or place. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### design matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* One common way of describing a dataset is with a design matrix. \n", "* A design matrix is a matrix containing a different example in each row. \n", "* Each column of thematrix corresponds to a different feature" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### design matrix example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* For instance, the Iris dataset contains150 examples with four features for each example. \n", "* This means we can representthe dataset with a design matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* is the sepal lengthof plant i" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* is the sepal width of plant i" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### heterogeneous data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Different sections of this book describe how to handle differenttypes of heterogeneous data.\n", "* In cases like these, rather than describing the datasetas a matrix with m rows, we will describe it as a set containing m elements, e.g." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* This notation does not imply that any two example vectors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* have the same size." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Often when working with a dataset containing a design matrix of feature observations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* we alsoprovide a vector of labels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "with" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* providing the label for example i" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5.2 Example: Linear Regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* input" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* output" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* linear regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* parameters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* ith feature" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* ith weight" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### performance measure" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### test set" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* design matrix of input" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* regression target vector" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* predictionso of model on the test set" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* mean squared error" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Intuitively, one can see that this error measure decreases to 0 when" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* We can also see that" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "so the error increases whenever the Euclidean distance between the predictionsand the targets increases." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* To make a machine learning algorithm, we need to design an algorithm thatwill improve the weights " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "in a way that reduces" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "when the algorithmis allowed to gain experience by observing a training set" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* To minimize" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "we can simply solve for where its gradient is 0:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* It’s worth noting that the term linear regression is often used to refer to aslightly more sophisticated model with one additional parameter—an intercept term" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* In this model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "so the mapping from parameters to predictions is still a linear function but themapping from features to predictions is now an affine function." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Linear regression is of course an extremely simple and limited learning al-gorithm, but it provides an example of how a learning algorithm can work." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5.3 Generalization, Capacity, Overfitting and Underfitting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* 5.3.1 The No Free Lunch Theorem\n", "* 5.3.2 Regularization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The central challenge in machine learning is that we must perform well on new,previously unseen inputs—not just those on which our model was trained. ==> generalization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* generalization error\n", "* training error" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* training set\n", "* test set" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* In our linear regression example, we trained the model by minimizing thetraining error," ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* but we actually care about the test error," ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* How can we affect performance on the test set when we only get to observethe training set? \n", " - The field of statistical learning theory provides some answers. \n", " - If the training and the test set are collected arbitrarily, \n", " - there is indeed little we can do. \n", " - If we are allowed to make some assumptions about how the training and test set are collected, then we can make some progress" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* some assumptions\n", " - i.i.d. assumptions.\n", " - independent\n", " - identically distributed\n", " - data generating distribution, or data generating process " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* The factors determining how well a machinelearning algorithm will perform are its ability to:\n", " 1. Make the training error small.\n", " 2. Make the gap between training and test error small." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* These two factors correspond to the two central challenges in machine learning:\n", " - underfitting and \n", " - overfitting." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* capacity\n", " - We can control whether a model is more likely to overfit or underfit by altering its capacity. \n", " - Informally, a model’s capacity is its ability to fit a wide variety of functions. \n", " - Models with low capacity may struggle to fit the training set. \n", " - Models with high capacity can overfit, i.e., memorize properties of the training set that do not serve them well on the test set." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* hypothesis space\n", " - One way to control the capacity of a learning algorithm is by choosing its hypothesis space, the set of functions that the learning algorithm is allowed to choose as being the solution." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* A polynomial of degree one gives us the linear regression model with whichwe are already familiar, with prediction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* By introducing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "as another feature provided to the linear regression model, wecan learn a model that is quadratic as a function of" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ":" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Note that this is still a linear function of the parameters, so we can still use thenormal equations to train the model in closed form.\n", "* We can continue to add more powers of x as additional features, for example to obtain a polynomial of degree 9:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Machine learning algorithms will generally perform best when their capacityis appropriate in regard to the true complexity of the task they need to performand the amount of training data they are provided with." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Occam’s razor " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* We must remember that while simpler functions are more likely to generalize(to have a small gap between training and test error) we must still choose a sufficiently complex hypothesis to achieve low training error." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* non-parametric model\n", " - To reach the most extreme case of arbitrarily high capacity, we introducethe concept of non-parametric models. So far, we have seen only parametricmodels, such as linear regression.\n", " - Parametric models learn a function describedby a parameter vector whose size is finite and fixed before any data is observed.\n", " - Non-parametric models have no such limitation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* k-NN" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* nearest neighbor regression(k-NN Regression)\n", " - Unlikely linear regression,which has a fixed-length vector of weights, the nearest neighbor regression modelsimply stores the X and y from the training set.\n", " - When asked to classify a test point x, the model looks up the nearest entry in the training set and returns theassociated regression target." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Bayes error\n", " - The error incurred by an oraclemaking predictions from the true distribution p(x, y) is called the Bayes error.\n", " - 참고 - http://newsight.tistory.com/127" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* representational capacity & effective capacity\n", " - It’s worth mentioning that capacity is not just determined by which model we use. \n", " - The model specifies which family of functions the learning algorithm can choose from when varying the parameters in order to reduce a training objective.\n", " - This is called the representational capacity of the model. \n", " - In many cases, finding the best function within this family is a very difficult optimization problem. \n", " - In practice, the learning algorithm does not actually find the best function, just one that significantly reduces the training error. \n", " - These additional restrictions mean that the model’s effective capacity may be less than its representational capacity." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.3.1 The No Free Lunch Theorem" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* The no freelunch theorem for machine learning (Wolpert, 1996) states that, averaged overall possible data generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points. In other words,in some sense, no machine learning algorithm is universally any better than anyother. The most sophisticated algorithm we can conceive of has the same averageperformance (over all possible tasks) as merely predicting that every point belongsto the same class." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Fortunately, these results hold only when we average over all possible datagenerating distributions. If we make assumptions about the kinds of probability distributions we encounter in real-world applications, then we can design learning algorithms that perform well on these distributions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.3.2 Regularization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### preference" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* The no free lunch theorem implies that we must design our machine learning algorithms to perform well on a specific task. \n", "* We do so by building a set of preferences into the learning algorithm. \n", "* When these preferences are aligned with the learning problems we ask the algorithm to solve, it performs better." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### hypotheis space of solutions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* So far, the only method of modifying a learning algorithm we have discussedis to increase or decrease the model’s capacity by adding or removing functionsfrom the hypothesis space of solutions the learning algorithm is able to choose." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### functions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* The behavior of our algorithm is strongly affected not just by how large wemake the set of functions allowed in its hypothesis space, but by the specificidentity of those functions.\n", " - linear functions\n", " - The learning algorithm we have studied so far, linearregression, has a hypothesis space consisting of the set of linear functions of itsinput.\n", " - nonlinear functions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### weight decay" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* For example, we can modify the training criterion for linear regression to include weight decay. \n", "* To perform linear regression with weight decay, we minimizenot only the mean squared error on the training set, but instead a criterion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "that expresses a preference for the weights to have smaller squared" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "norm." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Specifically," ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "where" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "is a value chosen ahead of time that controls the strength of our preferencefor smaller weights." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "we impose no preference, " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and larger" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "forces the weights to become smaller." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Minimizing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "results in a choice of weights thatmake a tradeoff between fitting the training data and being small. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### regularization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* In our weight decay example, we expressed our preference for linear functions defined with smaller weights explicitly, via an extra term in the criterion we minimize. \n", "* There are many other ways of expressing preferences for different solutions, both implicitly and explicitly. \n", "* Together, these different approaches are known as regularization. \n", "* Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5.4 Hyperparameters and Validation Sets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* 5.4.1 Cross-Validation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### hyperparameters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Most machine learning algorithms have several settings that we can use to control the behavior of the learning algorithm. \n", "* These settings are called hyperparameters.\n", "* The values of hyperparameters are not adapted by the learning algorithm itself(though we can design a nested learning procedure where one learning algorithmlearns the best hyperparameters for another learning algorithm).\n", "* In the polynomial regression example we saw in Fig. 5.2, \n", " - there is a single hyperparameter: the degree of the polynomial, which acts as a capacity hyperparameter. \n", " - The λ value used to control the strength of weight decay is another example of a hyperparameter." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### validation set" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* More frequently, we do not learn the hyperparameter because it is not appropriate to learn that hyperparameter on the training set. \n", "* If learned on the training set, such hyperparameters would always choose the maximum possible model capacity, resulting in overfitting (referto Figure 5.3).\n", "* To solve this problem, we need a validation set of examples that the training algorithm does not observe.\n", "* Earlier we discussed how a held-out test set, composed of examples coming from the same distribution as the training set, can be used to estimate the generalization error of a learner, after the learning process has completed. \n", "* It is important that the test examples are not used in any way to make choices about the model, including its hyperparameters. \n", "* For this reason, no example from the test set can be used in the validation set.\n", "* For this reason, we always construct the validation set from the training data.\n", "* Specifically, we split the training data into two disjoint subsets. \n", "* Typically, one uses about 80% of the data for training and 20% for validation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.4.1 Cross-Validation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* One issue with the idea of splitting the dataset into train/test or train/validation/test subsets is that only a small fraction of examples are used to evaluate generaliza-tion. \n", "* train/test\n", " - These procedures are based on the idea of repeating the training / testing computation on different randomly chosen subsets or splits of the original dataset.\n", "* train/validation/test\n", " - If model selection or hyperparameter optimization is required, things get more computationally expensive: \n", " - one can recurse the k-fold cross-validation idea, in-side the training set. \n", " - So we can have an outer loop that estimates test error and provides a “training set” for a hyperparameter-free learner, calling it k times to“train”. \n", " - That hyperparameter-free learner can then split its received training set by k-fold cross-validation into internal training/validation subsets (for example,splitting into k − 1 subsets is convenient, to reuse the same test blocks as the outer loop), \n", " - call a hyperparameter-specific learner for each choice of hyperparameter value on each of the training partition of this inner loop, \n", " - and compute the validation error by averaging across the k −1 validation sets \n", " - the errors made by the k −1 hyperparameter-specific learners trained on each of the internal training subsets." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5.5 Estimators, Bias and Variance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* 5.5.1 Point Estimation\n", "* 5.5.2 Bias\n", "* 5.5.3 Variance\n", "* 5.5.4 Trading off Bias and Variance and the Mean Squared Error\n", "* 5.5.5 Consistency" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.5.1 Point Estimation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Function Estimation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.5.2 Bias" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Example: Bernoulli Distribution \n", "* Example: Gaussian Distribution Estimator of the Mean \n", "* Example: Gaussian Distribution Estimators of the Variance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example: Bernoulli Distribution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example: Gaussian Distribution Estimator of the Mean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example: Gaussian Distribution Estimators of the Variance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.5.3 Variance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Example: Bernoulli Distribution\n", "* Example: Gaussian Distribution Estimators of the Variance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example: Bernoulli Distribution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example: Gaussian Distribution Estimators of the Variance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.5.4 Trading off Bias and Variance and the Mean Squared Error" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Example: Gaussian Distribution Estimators of the Variance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example: Gaussian Distribution Estimators of the Variance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.5.5 Consistency" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5.6 Maximum Likelihood Estimation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* 5.6.1 Conditional Log-Likelihood and Mean Squared Error\n", "* 5.6.2 Properties of Maximum Likelihood" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.6.1 Conditional Log-Likelihood and Mean Squared Error" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.6.2 Properties of Maximum Likelihood" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5.7 Bayesian Statistics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* 5.7.1 Maximum A Posteriori (MAP) Estimation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.7.1 Maximum A Posteriori (MAP) Estimation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5.8 Supervised Learning Algorithms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* 5.8.1 Probabilistic Supervised Learning\n", "* 5.8.2 Support Vector Machines\n", "* 5.8.3 Other Simple Supervised Learning Algorithms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.8.1 Probabilistic Supervised Learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.8.2 Support Vector Machines" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.8.3 Other Simple Supervised Learning Algorithms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5.9 Unsupervised Learning Algorithms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* 5.9.1 Principal Components Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.9.1 Principal Components Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5.10 Weakly Supervised Learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5.11 Building a Machine Learning Algorithm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5.12 The Curse of Dimensionality and Statistical Lim-itations of Local Generalization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* 5.12.1 The Curse of Dimensionality\n", "* 5.12.2 Smoothness and Local Constancy A Priori Preference\n", "* 5.12.3 Manifold Learning and the Curse of Dimensionality" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.12.1 The Curse of Dimensionality" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.12.2 Smoothness and Local Constancy A Priori Preference" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.12.3 Manifold Learning and the Curse of Dimensionality" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 참고자료" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "* [1] 딥러닝 주교재 - http://www.iro.umontreal.ca/~bengioy/dlbook/\n", "* [2] 5 Machine Learning Basics - http://www.iro.umontreal.ca/~bengioy/dlbook/ml.html" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.10" } }, "nbformat": 4, "nbformat_minor": 0 }