{ "cells": [ { "cell_type": "markdown", "id": "bd3a2516", "metadata": {}, "source": [ "# Regression\n", "\n", "\n", "\n", "- **[1]** (#) (a) Write down the generative model for Bayesian linear ordinary regression (i.e., write the likelihood and prior). \n", " (b) State the inference task for the weight parameter in the model. \n", " (c) Why do we call this problem linear? \n", " \n", "\n", "- **[2]** (##) Consider a linear regression problem \n", "$$\\begin{align*}\n", "p(y\\,|\\,\\mathbf{X},w,\\beta) &= \\mathcal{N}(y\\,|\\,\\mathbf{X} w,\\beta^{-1} \\mathbf{I}) \\\\\n", " &= \\prod_n \\mathcal{N}(y_n\\,|\\,w^T x_n,\\beta^{-1})\n", "\\end{align*}$$\n", "with $y, X$ and $w$ as defined in the notebook. \n", " (a) Work out the maximum likelihood solution for linear regression by solving\n", "$$\n", "\\nabla_{w} \\log p(y|X,w) = 0 \\,.\n", "$$ \n", " (b) Work out the MAP solution. How does it relate to the ML solution?\n", "\n", "\n", "- **[3]** (###) Show that the variance of the predictive distribution for linear regression decreases as more data becomes available.\n", "\n", "\n", "- **[4]** (#) Assume a given data set $D=\\{(x_1,y_1),(x_2,y_2),\\ldots,(x_N,y_N)\\}$ with $x \\in \\mathbb{R}^M$ and $y \\in \\mathbb{R}$. We propose a model given by the following data generating distribution and weight prior functions: \n", "$$\\begin{equation*} p(y_n|x_n,w)\\cdot p(w)\\,. \\end{equation*}$$\n", " (a) Write down Bayes rule for generating the posterior $p(w|D)$ from a prior and likelihood. \n", " (b) Work out how to compute a distribution for the predicted value $y_\\bullet$, given a new input $x_\\bullet$. \n", " \n", "\n", "- **[5]** (#) In the class we use the following prior for the weights:\n", "$$\\begin{equation*}\n", "p(w|\\alpha) = \\mathcal{N}\\left(w | 0, \\alpha^{-1} I \\right)\n", "\\end{equation*}$$\n", " (a) Give some considerations for choosing a Gaussian prior for the weights. \n", " (b) We could have chosen a prior with full (not diagonal) covariance matrix $p(w|\\alpha) = \\mathcal{N}\\left(w | 0, \\Sigma \\right)$. Would that be better? Give your thoughts on that issue. \n", " (c) Generally we choose $\\alpha$ as a small positive number. Give your thoughts on that choice as opposed to choosing a large positive value. How about choosing a negative value for $\\alpha$?\n", "\n", "- **[6]** Consider an IID data set $D=\\{(x_1,y_1),\\ldots,(x_N,y_N)\\}$. We will model this data set by a model $$y_n =\\theta^T f(x_n) + e_n\\,,$$ where $f(x_n)$ is an $M$-dimensional feature vector of input $x_n$; $y_n$ is a scalar output and $e_n \\sim \\mathcal{N}(0,\\sigma^2)$. \n", " (a) Rewrite the model in matrix form by lumping input features in a matrix $F=[f(x_1),\\ldots,f(x_N)]^T$, outputs and noise in the vectors $y=[y_1,\\ldots,y_N]^T$ and $e=[e_1,\\ldots,e_N]^T$, respectively. \n", "\n", " (b) Now derive an expression for the log-likelihood $\\log p(y|\\,F,\\theta,\\sigma^2)$. \n", "\n", " (c) Proof that the maximum likelihood estimate for the parameters is given by \n", "$$\\hat\\theta_{\\text{ml}} = (F^TF)^{-1}F^Ty$$ \n", "\n", " (d) What is the predicted output value $y_\\bullet$, given an observation $x_\\bullet$ and the maximum likelihood parameters $\\hat \\theta_{\\text{ml}}$. Work this expression out in terms of $F$, $y$ and $f(x_\\bullet)$. \n", "\n", " (e) Suppose that, before the data set $D$ was observed, we had reason to assume a prior distribution $p(\\theta)=\\mathcal{N}(0,\\sigma_0^2)$. 
{ "cell_type": "code", "execution_count": null, "id": "4aab317a", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Julia 1.5.2", "language": "julia", "name": "julia-1.5" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.5.2" } }, "nbformat": 4, "nbformat_minor": 5 }