{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Linear Regression in one variable\n", "\n", "Linear regression will find a straight line that will try to best fit the data provided. It does so by learning the slope of the line, and the bais term (y-intercept) \n", "\n", "Given a table:\n", "\n", "|size of house(int sq. ft) (x)|price in $1000(y)|\n", "|-----------------------------|-----------------|\n", "| 450 | 100 |\n", "| 324 | 78 |\n", "| 844 | 123 |\n", "\n", "Our hypothesis (prediction) is:\n", "$$h_\\theta(x) = \\theta_0 + \\theta_1x$$\n", "Will give us an equation of line that will predict the price. The above equation is nothing but the equation of line. __When we say the machine learns, we are actually adjusting the parameters $\\theta_0$ and $\\theta_1$__. So for a new x (size of house) we will insert the value of x in the above equation and produce a value $\\hat y$ (our prediction)\n", "\n", "Our prediction $\\hat y$ will not always be accurate, and will have a certain error which we will define by an equation. We will also need this equation to minimise the error, this equation is called as __loss function__. One of the most used one is called __Mean Squared Error (MSE)__ which is nothing but the means of all the errors squared. \n", "\n", "### $$J(\\theta) = \\frac{1}{2m}\\sum_{i=0}^m{(h_\\theta(x_i) - y_i)^2}$$\n", "\n", "#### Where,\n", "1. $m$ = the number of training examples\n", "2. $x_i$ = the value of x at ith row\n", "3. $y_i$ = the actual value at the ith row\n", "4. $h_\\theta(x)$ = our prediction function that predicts $\\hat y$\n", "\n", "#### Why square the difference? \n", "1. The error will be positive\n", "2. If you take the absolute function (to cover point 1), the absolute function isn't differentiable at the origin. Hence, we square the error. \n", "\n", "\n", "####  Why $\\frac{1}{2m}$ instead of $\\frac{1}{m}$ ?\n", "As we will see later, when we differentiate the squared error, the $\\frac{1}{2}$ will cancel out. If we don't do that we'll be stuck with a $2$ in the equation which is useless. \n", "\n", "### Minimising the cost function \n", "\n", "Our objective function is\n", "### $$\\displaystyle \\operatorname*{argmin}_\\theta J(\\theta)$$\n", "\n", "Which simply means, find the value of $\\theta$ that minimises the error function $J(\\theta)$. In order to do that, we will differentiate our cost function. When we differentiate it, it will give us gradient, which is the direction in which the error will be reduced. Upon having the gradient, we will simply update our $\\theta$ values to reflect that step (a step in the direction of lower error) \n", "\n", "So, the update rule is the following equation\n", "### $$\\theta = \\theta - \\alpha \\frac{\\partial}{\\partial \\theta} J(\\theta)$$\n", "\n", "Where,\n", "\n", " $\\alpha$ = learning rate. Which is the rate at which we will travel to the direction of the lower error.\n", " \n", "This process is nothing but __Gradient Descent__. There are few version of gradient descent, few of them are:\n", "1. __Batch Gradient Descent__: Go through __all__ your input samples, compute the gradient once, and then update $\\theta$s.\n", "2. __Stochastic Gradient Descent__: Go through a __single__ sample, compute gradient, update $\\theta$s, repeat $m$ times\n", "3. __Mini Batch Gradient Descent__: Go through a __batch__ of $k$ samples, compute gradient, update $\\theta$s, repeat $\\frac{m}{k}$ times. 
\n", "\n", "\n", "### Differentiating the loss function:\n", "In the update rule:\n", "### $$\\theta = \\theta - \\alpha \\frac{\\partial}{\\partial \\theta} J(\\theta)$$\n", "\n", "The important part is calculating the derivative. Since we have two variables, we will have two derivatives, one for $\\theta_0$ and another for $\\theta_1$. \n", "\n", "So the first equation is: \n", "\n", "$\n", "\\begin{align}\n", "\\notag\n", "\\theta_0 &= \\theta_0 - \\frac{\\partial}{\\partial \\theta_0} J(\\theta) \\\\\n", "\\notag\n", "&= \\theta_0 - \\frac{\\partial}{\\partial \\theta_0}(\\frac{1}{2m}\\sum_{i=0}^m{(h_\\theta(x_i) - y_i)^2}) \\\\\n", "\\notag\n", "&= \\theta_0 - \\frac{2}{2m}\\sum_{i=0}^m{(h_\\theta(x_i) - y_i)}\\frac{\\partial}{\\partial \\theta_0}{(h_\\theta(x_i) - y_i)} \\\\\n", "\\notag\n", "&= \\theta_0 - \\frac{1}{m}\\sum_{i=0}^m{(h_\\theta(x_i) - y_i)}\\frac{\\partial}{\\partial \\theta_0}{(h_\\theta(x_i) - y_i)} \\\\\n", "\\notag\n", "&= \\theta_0 - \\frac{1}{m}\\sum_{i=0}^m{(h_\\theta(x_i) - y_i)}\\frac{\\partial}{\\partial \\theta_0}{(\\theta_0 + \\theta_1x- y_i)} \\\\\n", "\\notag\n", "&= \\theta_0 - \\frac{1}{m}\\sum_{i=0}^m{(h_\\theta(x_i) - y_i)}(1 + 0 - 0)\n", "\\end{align}\n", "$\n", "\n", "$$\\therefore \\theta_0= \\theta_0 - \\frac{1}{m}\\sum_{i=0}^m{(h_\\theta(x_i) - y_i)} $$\n", "\n", "\n", "#### Similarly, for $\\theta_1$ \n", "\n", "$\n", "\\begin{align}\n", "\\notag\n", "\\theta_1 &= \\theta_1 - \\frac{\\partial}{\\partial \\theta_1} J(\\theta) \\\\\n", "\\notag\n", "&= \\theta_1 - \\frac{\\partial}{\\partial \\theta_1}(\\frac{1}{2m}\\sum_{i=0}^m{(h_\\theta(x_i) - y_i)^2}) \\\\\n", "\\notag\n", "&= \\theta_1 - \\frac{2}{2m}\\sum_{i=0}^m{(h_\\theta(x_i) - y_i)}\\frac{\\partial}{\\partial \\theta_0}{(h_\\theta(x_i) - y_i)} \\\\\n", "\\notag\n", "&= \\theta_1 - \\frac{1}{m}\\sum_{i=0}^m{(h_\\theta(x_i) - y_i)}\\frac{\\partial}{\\partial \\theta_0}{(h_\\theta(x_i) - y_i)} \\\\\n", "\\notag\n", "&= \\theta_1 - \\frac{1}{m}\\sum_{i=0}^m{(h_\\theta(x_i) - y_i)}\\frac{\\partial}{\\partial \\theta_0}{(\\theta_0 + \\theta_1x- y_i)} \\\\\n", "\\notag\n", "&= \\theta_1 - \\frac{1}{m}\\sum_{i=0}^m{(h_\\theta(x_i) - y_i)}(x + 0 - 0) \n", "\\end{align}\n", "$\n", "\n", "$$\\therefore \\theta_1= \\theta_1 - \\frac{1}{m}\\sum_{i=0}^m{(h_\\theta(x_i) - y_i)}(x) $$\n", "\n", "We will implement __Batch Gradient Descent__ i.e. we'll update the gradients after 1 pass through the entire dataset. Our Algorithm hence becomes:\n", "\n", "### Repeat till convergence:\n", "#### 1.       $\\theta_0= \\theta_0 - \\frac{1}{m}\\sum_{i=0}^m{(h_\\theta(x_i) - y_i)} $\n", "#### 2.       $\\theta_1= \\theta_1 - \\frac{1}{m}\\sum_{i=0}^m{(h_\\theta(x_i) - y_i)}(x) $" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Extending to Multivariable regression \n", "\n", "The update rule stated above can be extended to as many $\\theta$s as possible using matrix notation. the key is to store the $\\theta$s in one vector and the input data in a matrix then take a dot product of them. 
\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 1 }