{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3.1 Linear Regression"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Regression\n",
" - the task of predicting a real valued target $y$ given a data point $x$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.1.1 Basic Elements of Linear Regression\n",
"- Prediction can be expressed as a *linear* combination of the input features.\n",
"- Linear Model \n",
" - Example: estimating the price of a house\n",
"$$\\mathrm{price} = w_{\\mathrm{area}} \\cdot \\mathrm{area} + w_{\\mathrm{age}} \\cdot \\mathrm{age} + b$$\n",
" - General Form\n",
" - In the case of $d$ variables $$\\hat{y} = w_1 \\cdot x_1 + ... + w_d \\cdot x_d + b$$ $$\\hat{y} = \\mathbf{w}^\\top \\mathbf{x} + b$$ \n",
" - We'll try to find the weight vector $w$ and bias term $b$ that approximately associate data points $x_i$ with their corresponding labels $y_i$.\n",
" - For a collection of data points $\\mathbf{X}$ the predictions $\\hat{\\mathbf{y}}$ can be expressed via the matrix-vector product $${\\hat{\\mathbf{y}}} = \\mathbf{X} \\mathbf{w} + b$$\n",
"\n",
" - ***Model parameters***: $\\mathbf{w}$, $b$\n",
"\n",
"- Training Data\n",
" - ‘features’ or 'covariates'\n",
" - The two factors used to predict the label \n",
" - $n$: the number of samples that we collect. \n",
" - Each sample (indexed as $i$) is described by $x^{(i)} = [x_1^{(i)}, x_2^{(i)}]$, and the label is $y^{(i)}$.\n",
"\n",
"- Loss Function\n",
" - Square Loss for a data sample $$l^{(i)}(\\mathbf{w}, b) = \\frac{1}{2} \\left(\\hat{y}^{(i)} - y^{(i)}\\right)^2,$$\n",
"\n",
" - the smaller the error, the closer the predicted price is to the actual price\n",
" - To measure the quality of a model on the entire dataset, we can simply average the losses on the training set.$$L(\\mathbf{w}, b) =\\frac{1}{n}\\sum_{i=1}^n l^{(i)}(\\mathbf{w}, b) =\\frac{1}{n} \\sum_{i=1}^n \\frac{1}{2}\\left(\\mathbf{w}^\\top \\mathbf{x}^{(i)} + b - y^{(i)}\\right)^2.$$\n",
" - In model training, we want to find a set of model parameters, represented by $\\mathbf{w}^*$, $b^*$, that can minimize the average loss of training samples: $$\\mathbf{w}^*, b^* = \\operatorname*{argmin}_{\\mathbf{w}, b}\\ L(\\mathbf{w}, b).$$\n",
" \n",
"- Optimization Algorithm\n",
" - ***The mini-batch stochastic gradient descent***\n",
" - In each iteration, we randomly and uniformly sample a mini-batch $\\mathcal{B}$ consisting of a fixed number of training data samples.\n",
" - We then compute the derivative (gradient) of the average loss on the mini batch the with regard to the model parameters.\n",
" - This result is used to change the parameters in the direction of the minimum of the loss.$$ \\begin{aligned} \\mathbf{w} &\\leftarrow \\mathbf{w} - \\frac{\\eta}{|\\mathcal{B}|} \\sum_{i \\in \\mathcal{B}} \\partial_{\\mathbf{w}} l^{(i)}(\\mathbf{w}, b) = \\mathbf{w} - \\frac{\\eta}{|\\mathcal{B}|} \\sum_{i \\in \\mathcal{B}} \\mathbf{x}^{(i)} \\left(\\mathbf{w}^\\top \\mathbf{x}^{(i)} + b - y^{(i)}\\right) \\\\b &\\leftarrow b - \\frac{\\eta}{|\\mathcal{B}|} \\sum_{i \\in \\mathcal{B}} \\partial_b l^{(i)}(\\mathbf{w}, b) = b - \\frac{\\eta}{|\\mathcal{B}|} \\sum_{i \\in \\mathcal{B}} \\left(\\mathbf{w}^\\top \\mathbf{x}^{(i)} - y^{(i)}\\right). \\end{aligned} $$\n",
" - $|\\mathcal{B}|$: the number of samples (batch size) in each mini-batch\n",
" - $\\eta$: learning rate\n",
" \n",
" - hyper-parameters\n",
" - $|\\mathcal{B}|$, $\\eta$\n",
" - They are set somewhat manually and are typically not learned through model training. \n",
" \n",
"- Model prediction (or Model inference)"
]
},
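{
"cell_type": "markdown",
"metadata": {},
"source": [
"- A minimal sketch of mini-batch SGD on the squared loss, using `mxnet.nd` and `autograd` as in the rest of these notes. The toy data, the learning rate $\\eta = 0.03$, and the batch size $|\\mathcal{B}| = 10$ are assumptions chosen for illustration, not values from the text."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import random\n",
"from mxnet import autograd, nd\n",
"\n",
"# Toy data (assumed): 100 samples, 2 features, generated from known parameters\n",
"X = nd.random.normal(shape=(100, 2))\n",
"y = nd.dot(X, nd.array([2, -3.4])) + 4.2\n",
"\n",
"# Initialize the model parameters and allocate space for their gradients\n",
"w = nd.random.normal(scale=0.01, shape=(2,))\n",
"b = nd.zeros(1)\n",
"w.attach_grad()\n",
"b.attach_grad()\n",
"\n",
"def squared_loss(y_hat, y):\n",
"    return (y_hat - y) ** 2 / 2  # l^(i)(w, b) for each sample in the batch\n",
"\n",
"eta, batch_size = 0.03, 10  # assumed hyper-parameters\n",
"indices = list(range(100))\n",
"random.shuffle(indices)  # mini-batches are drawn in random order\n",
"for i in range(0, 100, batch_size):\n",
"    j = nd.array(indices[i:i + batch_size])\n",
"    with autograd.record():\n",
"        loss = squared_loss(nd.dot(X.take(j), w) + b, y.take(j))\n",
"    loss.backward()  # gradient of the summed mini-batch loss\n",
"    # SGD update: divide by batch_size to average the gradient over the batch\n",
"    w[:] = w - eta * w.grad / batch_size\n",
"    b[:] = b - eta * b.grad / batch_size"
]
},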
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.1.2 From Linear Regression to Deep Networks\n",
"- Neural Network Diagram\n",
" - a neural network diagram to represent the linear regression model \n",
"![](https://github.com/diveintodeeplearning/d2l-en/raw/master/img/singleneuron.svg?sanitize=true)\n",
"\n",
" - $d$: feature dimension (the number of inputs)\n",
" \n",
"- A Detour to Biology\n",
"![](https://github.com/diveintodeeplearning/d2l-en/raw/master/img/Neuron.svg?sanitize=true)\n",
"\n",
" - ***Dendrites***: input terminals\n",
" - ***Nucleus***: CPU\n",
" - ***Axon***: output wire\n",
" - Axon terminals (output terminals) are connected to other neurons via ***synapses***."
]
},
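{
"cell_type": "markdown",
"metadata": {},
"source": [
"- The diagram is just a single fully-connected layer with one output. As an illustrative sketch (not from the original notes), Gluon's `nn.Dense(1)` expresses exactly this model: the layer's weight and bias play the roles of $\\mathbf{w}$ and $b$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from mxnet import nd\n",
"from mxnet.gluon import nn\n",
"\n",
"# Linear regression as a single-layer (single-neuron) network:\n",
"# one fully-connected layer, one output, no activation function.\n",
"net = nn.Dense(1)\n",
"net.initialize()  # weight shape is inferred on the first forward pass\n",
"\n",
"X = nd.random.normal(shape=(3, 2))  # 3 samples with d = 2 features (assumed)\n",
"print(net(X))  # y_hat = X w + b\n",
"print(net.weight.data(), net.bias.data())"
]
},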
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Vectorzation for Speed\n",
" - Vectorizing code is a good way of getting order of mangitude speedups."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from mxnet import nd\n",
"from time import time\n",
"\n",
"a = nd.ones(shape=10000)\n",
"b = nd.ones(shape=10000)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- 1) add them one coordinate at a time using a for loop."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.8771929740905762\n"
]
}
],
"source": [
"start = time()\n",
"c = nd.zeros(shape=10000)\n",
"for i in range(10000):\n",
" c[i] = a[i] + b[i]\n",
"print(time() - start)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- 2) add the vectors directly:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.00016832351684570312\n"
]
}
],
"source": [
"start = time() \n",
"d = a + b\n",
"print(time() - start)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.1.3 The Normal Distribution and Squared Loss"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- ***Maximum Likelihood Principle***\n",
" - The notion of maximizing the likelihood of the data subject to the parameters\n",
" - its estimators are usually called ***Maximum Likelihood Estimators (MLE)***. \n",
" - minimize the Negative Log-Likelihood\n",
"- The maximum likelihood in a linear model with additive Gaussian noise is equivalent to linear regression with squared loss."
]
},
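{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Derivation sketch (assuming noise $\\epsilon \\sim \\mathcal{N}(0, \\sigma^2)$ with fixed $\\sigma$): if $y = \\mathbf{w}^\\top \\mathbf{x} + b + \\epsilon$, the likelihood of a label is $$p(y \\mid \\mathbf{x}) = \\frac{1}{\\sqrt{2 \\pi \\sigma^2}} \\exp\\left(-\\frac{(y - \\mathbf{w}^\\top \\mathbf{x} - b)^2}{2 \\sigma^2}\\right),$$ so the negative log-likelihood of $n$ independent samples is $$-\\log p(\\mathbf{y} \\mid \\mathbf{X}) = \\sum_{i=1}^n \\left( \\frac{1}{2} \\log(2 \\pi \\sigma^2) + \\frac{1}{2 \\sigma^2} \\left(y^{(i)} - \\mathbf{w}^\\top \\mathbf{x}^{(i)} - b\\right)^2 \\right).$$\n",
"  - The first term and the factor $\\frac{1}{\\sigma^2}$ do not depend on $\\mathbf{w}$ or $b$, so minimizing the negative log-likelihood is the same as minimizing the squared loss."
]
},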
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3.2 Linear regression implementation from scratch"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"from IPython import display\n",
"from matplotlib import pyplot as plt \n",
"from mxnet import autograd, nd \n",
"import random"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.2.1 Generating Data Sets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- The randomly generated batch example feature $\\mathbf{X}\\in \\mathbb{R}^{1000 \\times 2}$, \n",
"- The actual weight $\\mathbf{w} = [2, -3.4]^\\top$ and bias $b = 4.2$ of the linear regression model\n",
"- A random noise term $\\epsilon$\n",
" - It obeys a normal distribution with a mean of 0 and a standard deviation of 0.01 ($\\epsilon \\sim \\mathcal{N}(0, 0.01^2)$. $$\\mathbf{y}= \\mathbf{X} \\mathbf{w} + b + \\mathbf\\epsilon$$\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"num_inputs = 2\n",
"num_examples = 1000\n",
"true_w = nd.array([2, -3.4])\n",
"true_b = 4.2\n",
"\n",
"features = nd.random.normal(scale=1, shape=(num_examples, num_inputs)) # scale --> standard deviation\n",
"\n",
"labels = nd.dot(features, true_w) + true_b\n",
"\n",
"labels += nd.random.normal(scale=0.01, shape=labels.shape)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"[1.1630787 0.4838046]\n",
"\n",
"\n",
"[4.879625]\n",
"\n"
]
}
],
"source": [
"print(features[0])\n",
"print(labels[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- By generating a scatter plot using the second features and labels, we can clearly observe the linear correlation between the two.\n",
"- For future plotting, we only need to call `gluonbook.set_figsize()` to print the vector diagram and set its size."
]
},
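{
"cell_type": "markdown",
"metadata": {},
"source": [
"- A minimal sketch of what such a helper does, assuming `gluonbook.set_figsize()` simply switches matplotlib to SVG output and sets `figure.figsize` (the helper's exact body is not shown in these notes):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from IPython import display\n",
"from matplotlib import pyplot as plt\n",
"\n",
"def set_figsize(figsize=(3.5, 2.5)):\n",
"    display.set_matplotlib_formats('svg')  # render plots as vector graphics\n",
"    plt.rcParams['figure.figsize'] = figsize\n",
"\n",
"set_figsize()\n",
"# Second feature on the x-axis, labels on the y-axis; marker size 1\n",
"plt.scatter(features[:, 1].asnumpy(), labels.asnumpy(), 1)"
]
},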
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"