{ "cells": [ { "cell_type": "markdown", "id": "bd3a2516", "metadata": {}, "source": [ "# Regression\n", "\n", "\n", "\n", "- **[1]** (#) (a) Write down the generative model for Bayesian linear ordinary regression (i.e., write the likelihood and prior). \n", " (b) State the inference task for the weight parameter in the model. \n", " (c) Why do we call this problem linear? \n", " \n", "\n", "- **[2]** (##) Consider a linear regression problem \n", "$$\\begin{align*}\n", "p(y\\,|\\,\\mathbf{X},w,\\beta) &= \\mathcal{N}(y\\,|\\,\\mathbf{X} w,\\beta^{-1} \\mathbf{I}) \\\\\n", " &= \\prod_n \\mathcal{N}(y_n\\,|\\,w^T x_n,\\beta^{-1})\n", "\\end{align*}$$\n", "with $y, X$ and $w$ as defined in the notebook. \n", " (a) Work out the maximum likelihood solution for linear regression by solving\n", "$$\n", "\\nabla_{w} \\log p(y|X,w) = 0 \\,.\n", "$$ \n", " (b) Work out the MAP solution. How does it relate to the ML solution?\n", "\n", "\n", "- **[3]** (###) Show that the variance of the predictive distribution for linear regression decreases as more data becomes available.\n", "\n", "\n", "- **[4]** (#) Assume a given data set $D=\\{(x_1,y_1),(x_2,y_2),\\ldots,(x_N,y_N)\\}$ with $x \\in \\mathbb{R}^M$ and $y \\in \\mathbb{R}$. We propose a model given by the following data generating distribution and weight prior functions: \n", "$$\\begin{equation*} p(y_n|x_n,w)\\cdot p(w)\\,. \\end{equation*}$$\n", " (a) Write down Bayes rule for generating the posterior $p(w|D)$ from a prior and likelihood. \n", " (b) Work out how to compute a distribution for the predicted value $y_\\bullet$, given a new input $x_\\bullet$. \n", " \n", "\n", "- **[5]** (#) In the class we use the following prior for the weights:\n", "$$\\begin{equation*}\n", "p(w|\\alpha) = \\mathcal{N}\\left(w | 0, \\alpha^{-1} I \\right)\n", "\\end{equation*}$$\n", " (a) Give some considerations for choosing a Gaussian prior for the weights. \n", " (b) We could have chosen a prior with full (not diagonal) covariance matrix $p(w|\\alpha) = \\mathcal{N}\\left(w | 0, \\Sigma \\right)$. Would that be better? Give your thoughts on that issue. \n", " (c) Generally we choose $\\alpha$ as a small positive number. Give your thoughts on that choice as opposed to choosing a large positive value. How about choosing a negative value for $\\alpha$?\n", "\n", "- **[6]** Consider an IID data set $D=\\{(x_1,y_1),\\ldots,(x_N,y_N)\\}$. We will model this data set by a model $$y_n =\\theta^T f(x_n) + e_n\\,,$$ where $f(x_n)$ is an $M$-dimensional feature vector of input $x_n$; $y_n$ is a scalar output and $e_n \\sim \\mathcal{N}(0,\\sigma^2)$. \n", " (a) Rewrite the model in matrix form by lumping input features in a matrix $F=[f(x_1),\\ldots,f(x_N)]^T$, outputs and noise in the vectors $y=[y_1,\\ldots,y_N]^T$ and $e=[e_1,\\ldots,e_N]^T$, respectively. \n", "\n", " (b) Now derive an expression for the log-likelihood $\\log p(y|\\,F,\\theta,\\sigma^2)$. \n", "\n", " (c) Proof that the maximum likelihood estimate for the parameters is given by \n", "$$\\hat\\theta_{\\text{ml}} = (F^TF)^{-1}F^Ty$$ \n", "\n", " (d) What is the predicted output value $y_\\bullet$, given an observation $x_\\bullet$ and the maximum likelihood parameters $\\hat \\theta_{\\text{ml}}$. Work this expression out in terms of $F$, $y$ and $f(x_\\bullet)$. \n", "\n", " (e) Suppose that, before the data set $D$ was observed, we had reason to assume a prior distribution $p(\\theta)=\\mathcal{N}(0,\\sigma_0^2)$. 
{ "cell_type": "code", "execution_count": null, "id": "4aab317a", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Julia 1.5.2", "language": "julia", "name": "julia-1.5" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.5.2" } }, "nbformat": 4, "nbformat_minor": 5 }