{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "CfDd0DKVtknP"
   },
   "source": [
    "# 3-b: Approximate Inference in Classification\n",
    "\n",
    "In classification taks, even for a mere Logistic Regression, we don't have access to a closed form of the posterior $p(\\pmb{w} \\vert \\mathcal{D})$. Unlike in Linear regression, the likelihood isn't conjugated to the Gaussian prior anymore. We ill need to approximate this posterior.\n",
    "\n",
    "During this session, we will explore and compare approximate inference approaches on 2D binary classification datasets. Studied approaches include Laplacian approximation, variational inference with mean-field approximation and Monte Carlo dropout.\n",
    "\n",
    "**Goal**: Take hand on approximate inference methods and understand how they works on linear and non-linear 2D datasets."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "clhZGFZ2wOxg"
   },
   "source": [
    "### All Imports and Useful Functions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "i7Cv7C3O7qaw"
   },
   "source": [
    "Here we are going to install and import everything we are going to need for this tutorial. \n",
    "\n",
    "**Note**: *You can double-click the title of the collapsed cells (as the ones below) to expand them and read their content.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "id": "p-kN2Wxbv2Ui"
   },
   "outputs": [],
   "source": [
    "#@title Import libs\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.datasets import make_blobs, make_moons\n",
    "from IPython import display\n",
    "%matplotlib inline\n",
    "\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "from torch.utils import data\n",
    "import torch.nn.functional as F\n",
    "from torch.autograd import grad\n",
    "import torch.distributions as dist"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "id": "CMDnycZQPMpL"
   },
   "outputs": [],
   "source": [
    "#@title Useful plot function \n",
    "def plot_decision_boundary(model, X, Y, epoch, accuracy, model_type='classic', \n",
    "                           nsamples=100, posterior=None, tloc=(-4,-7), \n",
    "                           nbh=2, cmap='RdBu'):    \n",
    "    \"\"\" Plot and show learning process in classification \"\"\"\n",
    "    h = 0.02*nbh\n",
    "    x_min, x_max = X[:,0].min() - 10*h, X[:,0].max() + 10*h\n",
    "    y_min, y_max = X[:,1].min() - 10*h, X[:,1].max() + 10*h\n",
    "    xx, yy = np.meshgrid(np.arange(x_min*2, x_max*2, h),\n",
    "                         np.arange(y_min*2, y_max*2, h))\n",
    "    \n",
    "    test_tensor = torch.from_numpy(np.c_[xx.ravel(), yy.ravel()]).type(torch.FloatTensor)\n",
    "    model.eval()\n",
    "    with torch.no_grad():\n",
    "        if model_type=='classic':\n",
    "            pred = torch.sigmoid(model(test_tensor))\n",
    "        elif model_type=='laplace':\n",
    "            #Save original mean weight\n",
    "            original_weight = model.state_dict()['fc.weight'].detach().clone()\n",
    "            outputs = torch.zeros(nsamples, test_tensor.shape[0], 1)\n",
    "            for i in range(nsamples):\n",
    "                state_dict = model.state_dict()\n",
    "                state_dict['fc.weight'] = torch.from_numpy(posterior[i].reshape(1,2))\n",
    "                model.load_state_dict(state_dict)\n",
    "                outputs[i] = torch.sigmoid(model(test_tensor))\n",
    "            \n",
    "            pred = outputs.mean(0).squeeze()\n",
    "            state_dict['fc.weight'] = original_weight\n",
    "            model.load_state_dict(state_dict)\n",
    "            \n",
    "        elif model_type=='vi':\n",
    "            outputs = torch.zeros(nsamples, test_tensor.shape[0], 1)\n",
    "            for i in range(nsamples):\n",
    "                outputs[i] = model(test_tensor)\n",
    "                \n",
    "            pred = outputs.mean(0).squeeze()\n",
    "            \n",
    "        elif model_type=='mcdropout':\n",
    "            model.eval()\n",
    "            model.training = True\n",
    "            outputs = torch.zeros(nsamples, test_tensor.shape[0], 1)\n",
    "            for i in range(nsamples):\n",
    "                outputs[i] = model(test_tensor)\n",
    "            \n",
    "            pred = outputs.mean(0).squeeze()\n",
    "    \n",
    "    Z = pred.reshape(xx.shape).detach().numpy()\n",
    "\n",
    "    plt.cla()\n",
    "    ax.set_title('Classification Analysis')\n",
    "    ax.contourf(xx, yy, Z, cmap=cmap, alpha=0.25)\n",
    "    ax.contour(xx, yy, Z, colors='k', linestyles=':', linewidths=0.7)\n",
    "    ax.scatter(X[:,0], X[:,1], c=Y, cmap='Paired_r', edgecolors='k');\n",
    "    ax.text(tloc[0], tloc[1], f'Epoch = {epoch+1}, Accuracy = {accuracy:.2%}', fontdict={'size': 12, 'fontweight': 'bold'})\n",
    "    display.display(plt.gcf())\n",
    "    display.clear_output(wait=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "aAk_gzoRtknr"
   },
   "source": [
    "## Part I: Bayesian Logistic Regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Q4aNuqE77jsS"
   },
   "source": [
    "In linear regression, model prediction is of the continuous form $f(\\pmb{x})=\\pmb{w}^T\\pmb{x}+b$.\n",
    "\n",
    "For classification, we wish to predict discrete class labels $\\mathcal{C}_k$ to a sample $\\pmb{x}$. \n",
    "For simplicity, let's consider here binary classification:\n",
    "$$f(\\pmb{x}) = \\sigma(\\pmb{w}^T\\pmb{x} + b)$$\n",
    "where $\\sigma(t)= \\frac{1}{1+e^t}$ is the sigmoid function.\n",
    "\n",
    "As in linear regression, we define a Gaussian prior: \n",
    "$$ p(\\pmb{w}) = \\mathcal{N}(\\pmb{w}; \\pmb{\\mu}_0, \\pmb{\\Sigma}_0^2) $$\n",
    "Unfortunately, the posterior distribution isn't tractable as the likelihood isn't conjugate to the prior anymore.\n",
    "\n",
    "We will explore in the following different methods to obtain an estimate of the posterior distribution and hence the predictive distribution."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ZCp4ytXhV6IU"
   },
   "source": [
    "### I.0 Dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "id": "m-QrdljAUG5E"
   },
   "outputs": [],
   "source": [
    "#@title Hyperparameters for model and approximate inference { form-width: \"30%\" }\n",
    "\n",
    "WEIGHT_DECAY = 5e-2 #@param\n",
    "NB_SAMPLES = 400 #@param\n",
    "TEXT_LOCATION = (-5,-7)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "GPI60gDOtkns"
   },
   "outputs": [],
   "source": [
    "# Load linear dataset\n",
    "X, y = make_blobs(n_samples=NB_SAMPLES, centers=[(-2,-2),(2,2)], cluster_std=0.80, n_features=2)\n",
    "X, y = torch.from_numpy(X), torch.from_numpy(y)\n",
    "X, y = X.type(torch.float), y.type(torch.float)\n",
    "torch_train_dataset = data.TensorDataset(X,y) # create your datset\n",
    "train_dataloader = data.DataLoader(torch_train_dataset, batch_size=len(torch_train_dataset))\n",
    "\n",
    "# Visualize dataset\n",
    "plt.scatter(X[:,0], X[:,1], c=y, cmap='Paired_r', edgecolors='k')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "sNGfQhFqtknu"
   },
   "source": [
    "### I.1 Maximum-A-Posteriori Estimate"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "t1CE5ewMVdt9"
   },
   "source": [
    "\n",
    "In this \"baseline\", we reduce our posterior distribution $p(\\pmb{w} | \\mathcal{D})$ to a point estimate $\\pmb{w}_{MAP}$. For a new sample $\\pmb{x^*}$, the predictive distribution can then be approximated by\n",
    "$$ p(\\mathbf{y} = 1|\\pmb{x^*},\\mathcal{D}) = \\int p(\\mathbf{y} =1 |\\pmb{x},\\pmb{w})p(\\pmb{w} | \\mathcal{D})d\\pmb{w} \\approx p(y =1 |\\pmb{x},\\pmb{w}_{\\textrm{MAP}}).$$\n",
    "This approximation is called the **plug-in approximation**.\n",
    "\n",
    "The point estimate corresponds to the Maximum-A-Posteriori minimum given by:\n",
    "$$ \\pmb{w}_{\\textrm{MAP}} = arg \\max_{\\pmb{w}} p(\\pmb{w} \\vert \\mathcal{D}) = arg \\max_{\\pmb{w}} p(\\mathcal{D} \\vert \\pmb{w})p(\\pmb{w}) = arg \\max_{\\pmb{w}} \\prod_{n=1}^N p(y_n \\vert \\pmb{x}_n, \\pmb{w})p(\\pmb{w}) $$\n",
    "Looking for the maximum solution of previous equation is equivalent to the minimum solution of $- \\log p(\\pmb{w} \\vert \\mathcal{D})$. In case of a Gaussian prior, it can further be derived as:\n",
    "$$ \\pmb{w}_{\\textrm{MAP}} = arg \\min_{\\pmb{w}} \\sum_{n=1}^N \\big ( -y_n \\log \\sigma(\\pmb{w}^T \\pmb{x}_n + b) - (1-y_n) \\log (1 - \\sigma(\\pmb{w}^T \\pmb{x}_n + b)) + \\frac{1}{2 \\sigma_0^2} \\vert \\vert \\pmb{w} \\vert \\vert_2^2 \\big ) $$\n",
    "\n",
    "Note that:\n",
    "- This actually correspond to the minimum given by the standard **cross-entropy** loss in classification with a weight decay regularization\n",
    "- Unlike in linear regression, $\\pmb{w}_{MAP}$ **cannot be computed analytically**\n",
    "- But we can use optimization methods to compute it, e.g. **stochastic gradient descent**\n",
    "- Nevertheless, we only obtain a **point-wise estimate**, and not a full distribution over parameters $\\pmb{w}$\n",
    "\n",
    "\n",
    "Consequently, **the objective is simply to implement and train a Logistic Regression model** with Pytorch and then compute $p(\\mathbf{y} = 1|\\pmb{x}^*,\\mathcal{D})$ on a new sample $\\pmb{x}^*$ as in a deterministic model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "0_I-DiH3tknv"
   },
   "outputs": [],
   "source": [
    "class LogisticRegression(nn.Module):\n",
    "    \"\"\" A Logistic Regression Model with sigmoid output in Pytorch\"\"\"\n",
    "    def __init__(self, input_size):\n",
    "        super().__init__()\n",
    "        self.fc = nn.Linear(input_size, 1)\n",
    "\n",
    "    def forward(self, x):\n",
    "        return self.fc(x)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "hx-vyfGdtknw"
   },
   "outputs": [],
   "source": [
    "#@title **[CODING TASK]** Train a Logistic Regression model with stochastic gradient descent for 20 epochs.\n",
    "\n",
    "net = LogisticRegression(input_size=X.shape[1])\n",
    "net.train()\n",
    "criterion = nn.BCEWithLogitsLoss()\n",
    "\n",
    "\n",
    "optimizer = torch.optim.SGD(net.parameters(), lr=0.2)\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(7,7))\n",
    "\n",
    "# ============ YOUR CODE HERE ============\n",
    "# Train previously defined network for 20 epochs with SGD \n",
    "# and plot result for each epoch by uncommenting function below\n",
    "\n",
    "for epoch in range(20):  # loop over the dataset multiple times\n",
    "    # ============ YOUR CODE HERE ============\n",
    "\n",
    "    # Loss function : binary cross entropy + add a L2 regularization \"manually\" (WEIGHT_DECAY ||w||^2)\n",
    "\n",
    "    # For plotting and showing learning process at each epoch\n",
    "    # plot_decision_boundary(net, X, y, epoch, ((output>=0.0) == y).float().mean(), \n",
    "    #                       model_type='classic', tloc=TEXT_LOCATION)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "FkFuifc5tknx"
   },
   "source": [
    "**[Question 1.1]: Analyze the results provided by previous plot. Looking at $p(\\mathbf{y}=1 | \\pmb{x}, \\pmb{w}_{\\textrm{MAP}})$, what can you say about points far from train distribution?**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "DsTozn8kQccU"
   },
   "source": [
    "### I.2 Laplace Approximation\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "BadWX2XfWEZe"
   },
   "source": [
    "We will use Laplace approximation to estimate the intractable posterior $p(\\pmb{w} | \\mathcal{D})$.\n",
    "\n",
    "Here, $p(\\pmb{w} | \\mathcal{D})$ is approximated with a normal distribution $\\mathcal{N}(\\pmb{w} ; \\pmb{\\mu}_{lap}, \\pmb{\\Sigma}_{lap}^2)$ where: \n",
    "\n",
    "- the mean of the normal distribution $\\pmb{\\mu}_{lap}$ corresponds to the mode of $p(\\pmb{w} | \\mathcal{D})$. In other words, it simply consists in taking the optimum weights of Maximum-A-Posteriori estimation : \n",
    "$$\\pmb{\\mu}_{lap} = \\pmb{w}_{\\textrm{MAP}} = \\arg \\min_{\\pmb{w}} -\\log p(\\pmb{w} | \\mathcal{D})$$. \n",
    "- the covariance matrix is obtained by computing the Hessian of the loss function $-\\log p(\\pmb{w} \\vert \\mathcal{D})$ at $\\pmb{w}=\\pmb{w}_{\\textrm{MAP}}$: \n",
    "$$(\\pmb{\\Sigma}^2_{lap})^{-1} = \\nabla\\nabla_{\\pmb{w}} [p(\\pmb{w} \\vert \\mathcal{D}) ]_{\\pmb{w}=\\pmb{w}_{\\textrm{MAP}}}$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "0LrZKf_dUs5q"
   },
   "outputs": [],
   "source": [
    "#@title **[CODING TASK]** Extract μ_lap from previously trained model. \n",
    "# NB: Select only weights parameters (without bias)\n",
    "\n",
    "# ============ YOUR CODE HERE ============\n",
    "w_map = None"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "brSZS4nZx4_U"
   },
   "source": [
    "To compute the Hessian, we first compute the gradient at $\\pmb{w}_{\\textrm{MAP}}$:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "UEOmQfiHS6L-"
   },
   "outputs": [],
   "source": [
    "# Computing first derivative w.r.t to model's weights\n",
    "optimizer.zero_grad()\n",
    "output = net(X).squeeze()\n",
    "loss = criterion(output, y) + WEIGHT_DECAY*net.fc.weight.norm()**2\n",
    "gradf_weight = grad(loss, net.fc.weight, create_graph=True)[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "ADH0i667exQY"
   },
   "outputs": [],
   "source": [
    "#@title **[CODING TASK]** Compute the Hessian from the previous derivative\n",
    "\n",
    "# ============ YOUR CODE HERE ============\n",
    "# Apply the same grad function on each scalar element of the gradient to get \n",
    "# each raw of the Hessian. Concatenate both and compute the covariance \n",
    "# by inverting the Hessian\n",
    "# NB: to avoid accumulated gradient when debugging and running the cell \n",
    "# multiple times, you should convert your grad results to numpy straight away\n",
    "hess_weights = None\n",
    "Sigma_laplace = None"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "_g7CKY0Myq1o"
   },
   "source": [
    "We now compute the posterior approximate $\\mathcal{N}(\\pmb{w} ; \\pmb{\\mu}_{lap}, \\pmb{\\Sigma}_{lap}^2)$ with the parameters found. \n",
    "\n",
    "Given this distribution, we can compute the posterior thanks to Monte-Carlo sampling and plot results for the last epoch corresponding to $\\pmb{w}_{\\textrm{MAP}}$:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "7xf9fF5OuB_K"
   },
   "outputs": [],
   "source": [
    "# Defining posterior distribution\n",
    "laplace_posterior =  np.random.multivariate_normal(w_map.detach().numpy().reshape(2,), Sigma_laplace.detach().numpy(), NB_SAMPLES)\n",
    "\n",
    "# Plotting results\n",
    "fig, ax = plt.subplots(figsize=(7,7))\n",
    "plot_decision_boundary(net, X, y, epoch, ((output.squeeze()>=0.5) == y).float().mean(), model_type='laplace', \n",
    "                       tloc=TEXT_LOCATION, nsamples=NB_SAMPLES, posterior=laplace_posterior)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "qUPjVjot4IKE"
   },
   "source": [
    "**[Question 1.2]: Analyze the results provided by previous plot. Compared to previous MAP estimate, how does the predictive distribution behave?**\n",
    "\n",
    "**[Question 1.3]: Comment the effect of the regularisation hyper-parameter WEIGHT_DECAY.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "qgNFac420Nh_"
   },
   "source": [
    "### I.3 Variational Inference"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "wuJcY6rMa7Ff"
   },
   "source": [
    "In this part, we will reimplement variational inference by hand with Pytorch tools. <br/><br/>\n",
    "\n",
    "**Optimization problem**  \n",
    "We define an approximating variational distribution $q_{\\pmb{\\theta}}(\\pmb{w})$ parametrized by $\\pmb{\\theta}$ and minimize its Kullback-Leibler (KL) divergence with the unknown true posterior $p(\\pmb{w} \\vert \\mathcal{D})$. This is equivalent to maximizing the **evidence lower bound (ELBO)** w.r.t to $q_{\\pmb{\\theta}}(\\pmb{w})$:\n",
    "\n",
    "$$ arg \\max_{\\pmb{\\theta}}~ \\mathbb{E}_{q_{\\pmb{\\theta}}(\\pmb{w})} \\big [\\underbrace{\\log p(\\mathcal{D} \\vert \\pmb{w})}_{likelihood} \\big ] - \\underbrace{\\textrm{KL}(q_{\\pmb{\\theta}}(\\pmb{w})\\vert\\vert p(\\pmb{w}))}_{regularization} $$\n",
    "where we have a likelihood term  and the KL divergence between the prior and the variational distribution. By assuming that samples are *i.i.d*, maximizing the ELBO is equivalent to minimizing the following loss:\n",
    "$$ \\mathcal{L}_{\\textrm{VI}}(\\pmb{\\theta}; \\mathcal{D}) = - \\sum_{n=1}^N \\mathbb{E}_{q_{\\pmb{\\theta}}(\\pmb{w})} \\Big [ \\log p(y_n \\vert \\pmb{x}_n, \\pmb{w}) \\Big ]+ \\textrm{KL}(q_{\\pmb{\\theta}}(\\pmb{w})\\vert\\vert p(\\pmb{w})) = NLL(\\pmb{\\theta}; \\mathcal{D}) + \\textrm{KL}(q_{\\pmb{\\theta}}(\\pmb{w})\\vert\\vert p(\\pmb{w}))$$\n",
    "\n",
    "\n",
    "- **Likelihood term:** computind the expectations for the negative log likelihhod $NLL(\\pmb{\\theta}; \\mathcal{D})$ can be tedious mathematics, or maybe not even possible. \n",
    "    - **Monte Carlo estimator:** luckily, we can get estimates of the expectations by taking samples from $q_{\\pmb{\\theta}}(\\pmb{w})$ and average over those results. Even more simple, we can show that using only one sample is stil an unbiased gradient estimator. Hence, $NLL(\\pmb{\\theta}; \\mathcal{D})$ rewrites:\n",
    "$$ NLL(\\pmb{\\theta}; \\mathcal{D}) =  \\sum_{n=1}^N - \\log p(y_n \\vert \\pmb{x}_n, \\pmb{w}_s),$$ \n",
    "where $\\pmb{w}_s \\sim q_{\\pmb{\\theta}}$ is a sample from the variational distribution. <br><br>\n",
    " \n",
    "- **Mean-field approximation:** assumes a factorisation over weights: $ q_{\\pmb{\\theta}}(\\pmb{w}) = \\prod\\limits_{i=1}^{N_w} q_{\\pmb{\\theta}}(w_{i}) =\\prod\\limits_{i=1}^{N_w} \\mathcal{N}(w_{i}; \\mu_{i}, \\sigma^2_{i}) $. We use this Mmean-field approximation for the variational posterior $q_{\\pmb{\\theta}}(\\pmb{w})$ and the prior $p(\\pmb{w})$. \n",
    "    - **Reparametrization trick:** if we start taking samples from $q_{\\pmb{\\theta}}$, we leave the deterministic world, and the gradient can not flow through the model anymore. We avoid this problem by reparameterizing the samples $\\pmb{w}_{i} \\sim \\mathcal{N}(\\mu_{i}, \\sigma_{i}^2)$ from the distribution. <br>\n",
    "Instead of sampling directly from the variational distribution, we sample from a centered isotropic multivariate Gaussian and recreate samples from the variational distribution. Now the stochasticity of $\\pmb{\\varepsilon}$ is external and will not prevent the flow of gradients.\n",
    "$$ \\pmb{w}_{i} = \\mu_{i}+ \\sigma_{i}\\odot\\pmb{\\varepsilon}_s$$\n",
    "where $\\pmb{\\varepsilon}_s \\sim \\mathcal{N}(0,1)$.\n",
    "<br/><br>  \n",
    "    - **Closed-for solution for the regularization term $ \\textrm{KL}(q_{\\pmb{\\theta}}(\\pmb{w})\\vert\\vert p(\\pmb{w})$**. For univariate Gaussian distribution, the KL term between the approximate variational distribution $q_{\\theta}(w)\\sim \\mathcal{N}(\\mu_{i},\\sigma_{i}^2)$ and the prior $p(w)\\sim \\mathcal{N}(0,\\sigma_{p}^2)$ rewrites: \n",
    "\n",
    "$$\n",
    "\\textrm{KL}\\left[q_{\\theta}(w))\\vert\\vert p(w) \\right]= log\\left(\\frac{\\sigma_{p}}{\\sigma_{i}}\\right) + \\frac{\\sigma_i^2+\\mu_i^2}{2\\sigma_{p}^2} - \\frac{1}{2}\n",
    "$$\n",
    "\n",
    "**Predictive distribution**  \n",
    "For a new sample $\\pmb{x^*}$, the predictive distribution can be approximated using **Monte Carlo sampling**:\n",
    "\\begin{equation}\n",
    "p(\\mathbf{y} =1|\\pmb{x}^*,\\mathcal{D}) \\approx \\int p(\\mathbf{y} = 1|x^*,w)q_\\theta^*(w) \\approx \\frac{1}{S} \\sum_{s=1}^S p(\\mathbf{y}=1|\\pmb{x}^*,\\pmb{w}_s)\n",
    "\\end{equation}\n",
    "where $\\pmb{w}_s \\sim q^*_{\\pmb{\\theta}}$ are samples from the optimum variational distribution."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "H2eNammRvajL"
   },
   "source": [
    "#### Step 1: Implement a variational layer\n",
    "\n",
    "Let's first implement variational inference for a single layer. Remind that we defined our Logistic regression model as $f(x) = \\sigma(w^T x + b)$ where $\\sigma(t)= \\frac{1}{1+\\exp(t)}$ is the sigmoid function. As such, we need to place Gaussian distributions on parameters $w$ and $b$.  \n",
    "\n",
    "**Implementation constraint** Variance can not be negative. To avoid numerical issues, we will use $\\rho$. Std can be retrieve with the following formula: $ \\sigma^2 = \\log(1 + e^{\\rho}) $"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Vb6ofPETDZOb"
   },
   "outputs": [],
   "source": [
    "#@title **[CODING TASK]** Implement a variational layer from scratch\n",
    "\n",
    "class LinearVariational(nn.Module):\n",
    "    \"\"\"\n",
    "    Mean field approximation of nn.Linear\n",
    "    \"\"\"\n",
    "    def __init__(self, input_size, output_size, prior_var):\n",
    "        super().__init__()\n",
    "        self.prior_var = prior_var # \\sigma_p^2\n",
    "      \n",
    "        # ============ YOUR CODE HERE ============\n",
    "        # Initialize the variational parameters for weight and bias\n",
    "        # with nn.Parameter.\n",
    "        # Mean and rho can be initialised to zeros . We use point-base estimate for biaises (no rho)\n",
    "        self.w_mu = None\n",
    "        self.w_rho = None\n",
    "        self.b_mu = None\n",
    "        \n",
    "    def sampling(self, mu, rho):\n",
    "        \"Sample weights using the reparametrization trick\"\n",
    "        # ============ YOUR CODE HERE ============\n",
    "        # Given parameter mu and rho, return sampling using \n",
    "        # the reparametrization trick.\n",
    "        # NB: you may look for torch.randn_like...\n",
    "        return None\n",
    "    \n",
    "    def kl_divergence(self): \n",
    "        \"Compute KL divergence between all univariates posterior q(w)~N(\\mu_i,\\sigma_i) and the prior p(w)~N(0,\\sigma_p) \"\n",
    "        prior_std=np.sqrt(self.prior_var)\n",
    "        rho_theta = self.w_rho\n",
    "        mu_theta = self.w_mu\n",
    "        \n",
    "        sigma_theta = torch.sqrt(torch.log(1 + torch.exp(rho_theta)))\n",
    "        sigmap = torch.zeros(mu_theta.shape)\n",
    "        sigmap.fill_(prior_std)\n",
    "        divs = torch.log(sigmap/sigma) + (sigma**2 + mu**2)/(2*sigmap**2) -0.5\n",
    "        \n",
    "        return torch.mean(divs)\n",
    "    \n",
    "    def forward(self, x):\n",
    "        \"Usual forward function for pytorch layer\"\n",
    "        # ============ YOUR CODE HERE ============\n",
    "        # Sample parameters w and b using self.sampling\n",
    "        # Then, perform a forward pass using those sampled parameters\n",
    "        out = None\n",
    "      \n",
    "        return out"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "mLazbzY0bIjw"
   },
   "source": [
    "#### Step 2: Variational Logistic Regression\n",
    "\n",
    "Now, let's use this `LinearVariational` layer in a Logistic regression model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "pYGaD-CaD-Zt"
   },
   "outputs": [],
   "source": [
    "class VariationalLogisticRegression(nn.Module):\n",
    "    def __init__(self, input_size, prior_var=4.0):\n",
    "        super().__init__()\n",
    "        self.prior_var = prior_var\n",
    "        self.fc_var =  LinearVariational(input_size, 1,self.prior_var)\n",
    "      \n",
    "    def forward(self, x):\n",
    "        out = self.fc_var(x)\n",
    "        return torch.sigmoid(out)\n",
    "\n",
    "    def kl_divergence(self):\n",
    "        return self.fc_var.kl_divergence()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**[Question 1.4]: Comment the code of the VariationalLogisticRegression and LinearVariational classes.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "tlRYhjTXxbvy"
   },
   "source": [
    "**We can now train our variational model as any other network in Pytorch**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "var_net = VariationalLogisticRegression(input_size=X.shape[1],prior_var=4)\n",
    "var_net.train()\n",
    "optimizer = torch.optim.SGD(var_net.parameters(), lr=0.1)\n",
    "criterion = nn.BCELoss()\n",
    "\n",
    "nbEpochs=40\n",
    "loss_plt = np.zeros((nbEpochs,3))\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(7,7))\n",
    "\n",
    "for epoch in range(nbEpochs): \n",
    "    optimizer.zero_grad()\n",
    "\n",
    "    # forward + backward + optimize\n",
    "    output = var_net(X).squeeze()\n",
    "    elbo = var_net.kl_divergence() + criterion(output, y)\n",
    "\n",
    "    elbo.backward()\n",
    "    optimizer.step()\n",
    "\n",
    "    # Computing prediction for visualization purpose\n",
    "    preds = torch.zeros(NB_SAMPLES, X.shape[0], 1)\n",
    "    for i in range(NB_SAMPLES):\n",
    "        preds[i] = var_net(X)\n",
    "    pred = preds.mean(0).squeeze()\n",
    "    accuracy = ((pred>=0.5) == y).float().mean()\n",
    "    \n",
    "\n",
    "    # For plotting and showing learning process at each epoch\n",
    "    plot_decision_boundary(var_net, X, y, epoch, accuracy, model_type='vi', tloc=TEXT_LOCATION)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "fRbXTWZU48w6"
   },
   "source": [
    "**[Question 1.5]: Comment the code of the training loop, especially the loss computation. Analyze the results provided by previous plot. Compared to previous MAP estimate, how does the predictive distribution behave? What is the main difference between the Variational approximation and the Laplace approximation?**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "uZV6pC6v7Bqh"
   },
   "source": [
    "## Part II: Bayesian Neural Networks\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "-pqscdtHyT97"
   },
   "source": [
    "Moving on to a non-linear dataset, we will leverage our variational implementation to a Multi-Layer Perceptron (MLP). Finally, we will also review one last approximate inference method which has the particularity to be very easy to implement: Monte-Carlo Dropout"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "_g0QmYJ7WJ8p"
   },
   "source": [
    "### II.0 Dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "id": "NVUva--mk2jk"
   },
   "outputs": [],
   "source": [
    "#@title Hyperparameters for model and approximate inference { form-width: \"30%\" }\n",
    "\n",
    "NOISE_MOON = 0.05 #@param\n",
    "WEIGHT_DECAY = 5e-2 #@param\n",
    "NB_SAMPLES = 100 #@param\n",
    "TEXT_LOCATION = (-1.5, -1.5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "FjIhWLPrkVdp"
   },
   "outputs": [],
   "source": [
    "# Load two moons dataset\n",
    "X, y = make_moons (n_samples=1000, noise=NOISE_MOON)\n",
    "X, y = torch.from_numpy(X), torch.from_numpy(y)\n",
    "X, y = X.type(torch.float), y.type(torch.float)\n",
    "torch_train_dataset = data.TensorDataset(X,y) # create your datset\n",
    "train_dataloader = data.DataLoader(torch_train_dataset, batch_size=len(torch_train_dataset))\n",
    "N_DIM = X.shape[1]\n",
    "\n",
    "# Visualize dataset\n",
    "plt.scatter(X[:,0], X[:,1], c=y, cmap='Paired_r', edgecolors='k')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "s0rMZBLxbHY9"
   },
   "source": [
    "### II.1 Variational Inference with Bayesian Neural Networks"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "XV8BP1t1yuu1"
   },
   "source": [
    "Such as for Logistic Regression, we will use `LinearVariational` layer to define a MLP with 1 hidden layer.\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "PCOP9aDQo7zh"
   },
   "outputs": [],
   "source": [
    "#@title **[CODING TASK]** Implement a Variational MLP\n",
    "\n",
    "class VariationalMLP(nn.Module):\n",
    "    def __init__(self, input_size, hidden_size, prior_var):\n",
    "        super().__init__()\n",
    "        self.prior_var = prior_var\n",
    "\n",
    "\n",
    "        # ============ YOUR CODE HERE ============\n",
    "        # Define a variational MLP with 1 hidden layer and ReLU activation\n",
    "    \n",
    "    def kl_divergence(self):\n",
    "         # ============ YOUR CODE HERE ============\n",
    "        return None\n",
    "    \n",
    "            \n",
    "    def forward(self, x):\n",
    "        # ============ YOUR CODE HERE ============\n",
    "        # Don't forget to apply the sigmoid function when returning the output\n",
    "        return None"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "be7NXnFs0AmB"
   },
   "source": [
    "**We can now train our variational model as any other network in Pytorch**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "ar5ZX1SfpdgV"
   },
   "outputs": [],
   "source": [
    "var_net = VariationalMLP(input_size=X.shape[1], hidden_size=50, prior_var=100)\n",
    "var_net.train()\n",
    "optimizer = torch.optim.Adam(var_net.parameters(), lr=0.1)\n",
    "\n",
    "criterion = nn.BCELoss()\n",
    "\n",
    "nbEpochs=10000\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(7,7))\n",
    "\n",
    "for epoch in range(nbEpochs):  # loop over the dataset multiple times\n",
    "    # zero the parameter gradients\n",
    "    optimizer.zero_grad()\n",
    "\n",
    "    # forward + backward + optimize\n",
    "    output = var_net(X).squeeze()\n",
    "    elbo = var_net.kl_divergence() + criterion(output, y)\n",
    "\n",
    "\n",
    "    elbo.backward()\n",
    "    optimizer.step()\n",
    "\n",
    "    # For plotting and showing learning process at each epoch\n",
    "    if (epoch+1)%500==0:\n",
    "        # Computing prediction for visualization purpose\n",
    "        preds = torch.zeros(NB_SAMPLES, X.shape[0], 1)\n",
    "        for i in range(NB_SAMPLES):\n",
    "            preds[i] = var_net(X)\n",
    "        pred = preds.mean(0).squeeze()\n",
    "        accuracy = ((pred>=0.5) == y).float().mean()\n",
    "\n",
    "        plot_decision_boundary(var_net, X, y, epoch, accuracy, model_type='vi', tloc=TEXT_LOCATION)\n",
    "    \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**[Question 2.1]: Analyze the results showed on plot. **"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "3i_scDA7bC1c"
   },
   "source": [
    "### II.2 Monte Carlo Dropout"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "O8J9UNT41TJ2"
   },
   "source": [
    "Training a neural network with randomly dropping some activations, such as with dropout layers, can actually be seen as an **approximate variational inference method**!\n",
    "\n",
    "[Gal and Ghahramani, 2016] showed this can be fullfilled for:\n",
    "- $p(\\pmb{w}) = \\prod_l p(\\pmb{W}_l) = \\prod_l \\mathcal{MN}(\\pmb{W}_l; 0, I/ l_i^2, I)$ $\\Rightarrow$ Multivariate Gaussian distribution factorized over layers\n",
    "- $q(\\pmb{w}) = \\prod_l q(\\pmb{W}_l) = \\prod_l \\textrm{diag}(\\varepsilon_l)\\odot\\pmb{M}_l $ with $\\varepsilon_l \\sim \\textrm{Ber}(1-p_l)$.\n",
    "\n",
    "We will now implement a MLP with dropout layers and perform Monte-Carlo sampling to obtain the predictive distribution $p(\\mathbf{y} \\vert \\pmb{x}^*, \\pmb{w})$ for a new sample $\\pmb{x}^*$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "xxGmkjd0kVgZ"
   },
   "outputs": [],
   "source": [
    "#@title **[CODING TASK]** Implement a MLP with dropout (p=0.5)\n",
    "# Code MLP with 1 hidden layer and a dropout layer. Be careful, the dropout \n",
    "# layer should be also activated during test time.\n",
    "# (Hint: we may want to look out at F.dropout())\n",
    "\n",
    "class MLP(nn.Module):\n",
    "    \"\"\" Pytorch MLP for binary classification model with an added dropout layer\"\"\"\n",
    "    def __init__(self, input_size, hidden_size):\n",
    "        super().__init__()\n",
    "        # ============ YOUR CODE HERE ============\n",
    "\n",
    "    def forward(self, x):\n",
    "        # ============ YOUR CODE HERE ============\n",
    "        # Don't forget to apply the sigmoid function when returning the output\n",
    "        return None"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "qMlasuRN08Yk"
   },
   "source": [
    "We train our model as usual:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "gqSZJDWIlf5e"
   },
   "outputs": [],
   "source": [
    "net = MLP(input_size=X.shape[1], hidden_size=50)\n",
    "net.train()\n",
    "criterion = nn.BCELoss()\n",
    "optimizer = torch.optim.Adam(net.parameters(), lr=0.01)\n",
    "fig, ax = plt.subplots(figsize=(7,7))\n",
    "\n",
    "for epoch in range(500):  # loop over the dataset multiple times\n",
    "\n",
    "    # zero the parameter gradients\n",
    "    optimizer.zero_grad()\n",
    "\n",
    "    # forward + backward + optimize\n",
    "    output = net(X).squeeze()\n",
    "    loss = criterion(output, y)\n",
    "    loss.backward()\n",
    "    optimizer.step()\n",
    "\n",
    "    # For plotting and showing learning process at each epoch, uncomment and indent line below\n",
    "    if (epoch+1)%50==0:\n",
    "        plot_decision_boundary(net, X, y, epoch, ((output.squeeze()>=0.5) == y).float().mean(), tloc=TEXT_LOCATION, model_type='classic')\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "U_RyglMRa4d5"
   },
   "source": [
    "Now let's look at the results given by MC Dropout:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "JRh6BQDcazFx"
   },
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(figsize=(7,7))\n",
    "plot_decision_boundary(net, X, y, epoch, ((output.squeeze()>=0.5) == y).float().mean(), tloc=TEXT_LOCATION, model_type='mcdropout', nsamples=500)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "05OBuwko5EvS"
   },
   "source": [
    "**[Question 2.1]: Again, analyze the results showed on plot. What is the benefit of MC Dropout variational inference over Bayesian Logistic Regression with variational inference?**"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "collapsed_sections": [
    "clhZGFZ2wOxg",
    "ZCp4ytXhV6IU",
    "sNGfQhFqtknu",
    "DsTozn8kQccU",
    "qgNFac420Nh_",
    "_g0QmYJ7WJ8p",
    "s0rMZBLxbHY9",
    "3i_scDA7bC1c"
   ],
   "name": "TP2_RDFIA_Approximate_Inference.ipynb",
   "provenance": [
    {
     "file_id": "1ObknbWrb32m4DzjuNkDLbYsPFrBQFxbY",
     "timestamp": 1611617981429
    },
    {
     "file_id": "1PLhcxazEKZ_VQRqt0p3RasaLedb5WNgP",
     "timestamp": 1611229201493
    }
   ]
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}