{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Probabilistic Programming 3: Regression and classification\n", "\n", "#### Goal \n", " - Learn how to infer a posterior distribution for a linear regression model using a probabilistic programming language.\n", " - Learn how to infer a posterior distribution for a linear classification model using a probabilistic programming language.\n", " \n", "#### Materials \n", " - Mandatory\n", " - This notebook.\n", " - Lecture notes on [regression](https://nbviewer.jupyter.org/github/bertdv/BMLIP/blob/master/lessons/notebooks/Regression.ipynb).\n", " - Lecture notes on [discriminative classification](https://nbviewer.jupyter.org/github/bertdv/BMLIP/blob/master/lessons/notebooks/Discriminative-Classification.ipynb).\n", " - Optional\n", " - Bayesian linear regression (Section 3.3 [Bishop](https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf))\n", " - Bayesian logistic regression (Section 4.5 [Bishop](https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf))\n", " - [Cheatsheets: how does Julia differ from Matlab / Python](https://docs.julialang.org/en/v1/manual/noteworthy-differences/index.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "using Pkg\n", "Pkg.activate(\"../../../lessons/\")\n", "Pkg.instantiate();\n", "IJulia.clear_output();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem: Economic growth\n", "\n", "In 2008, the credit crisis sparked a recession in the US, which spread to other countries in the ensuing years. It took most countries a couple of years to recover. \n", "Now, the year is 2011. The Turkish government is asking you to estimate whether Turkey is out of the recession. You decide to look at the data of the national stock exchange to see if there's a positive trend. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "using CSV\n", "using DataFrames\n", "using LinearAlgebra\n", "using Distributions\n", "using StatsFuns\n", "using RxInfer\n", "using Plots\n", "default(label=\"\", margin=10Plots.pt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data\n", "\n", "We are going to start with loading in a data set. We have daily measurements from Istanbul, from the 5th of January 2009 until 22nd of February 2011. The dataset comes from an online resource for machine learning data sets: the [UCI ML Repository](https://archive.ics.uci.edu/ml/datasets/ISTANBUL+STOCK+EXCHANGE)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
"536×2 DataFrame\n", " Row │ date       ISE\n", "     │ String15   Float64\n", "─────┼──────────────────────────\n", "   1 │ 5-Jan-09    0.0357537\n", "   2 │ 6-Jan-09    0.0254259\n", "   3 │ 7-Jan-09   -0.0288617\n", "   4 │ 8-Jan-09   -0.0622081\n", "   5 │ 9-Jan-09    0.00985991\n", "   6 │ 12-Jan-09  -0.029191\n", "   7 │ 13-Jan-09   0.0154453\n", "   8 │ 14-Jan-09  -0.0411676\n", "   9 │ 15-Jan-09   0.000661905\n", "  10 │ 16-Jan-09   0.0220373\n", "  11 │ 19-Jan-09  -0.0226925\n", "  ⋮  │     ⋮           ⋮\n", " 527 │ 9-Feb-11    0.00482299\n", " 528 │ 10-Feb-11  -0.0176644\n", " 529 │ 11-Feb-11   0.00478229\n", " 530 │ 14-Feb-11  -0.00249793\n", " 531 │ 15-Feb-11   0.00360638\n", " 532 │ 16-Feb-11   0.00859906\n", " 533 │ 17-Feb-11   0.00931031\n", " 534 │ 18-Feb-11   0.000190969\n", " 535 │ 21-Feb-11  -0.013069\n", " 536 │ 22-Feb-11  -0.00724632\n", "515 rows omitted" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Read CSV file\n", "df = DataFrame(CSV.File(\"../datasets/stock_exchange.csv\"))" ] },
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Count number of samples\n", "time_period = 436:536\n", "num_samples = length(time_period)\n", "\n", "# Extract columns\n", "dates_num = 1:num_samples\n", "dates_str = df[time_period,1]\n", "stock_val = df[time_period,2]\n", "\n", "# Set xticks\n", "xtick_points = Int64.(round.(range(1, stop=num_samples, length=5)))\n", "\n", "# Scatter exchange levels\n", "scatter(dates_num, \n", " stock_val, \n", " color=\"black\",\n", " label=\"\", \n", " ylabel=\"Stock Market Levels\", \n", " xlabel=\"time (days)\",\n", " xticks=(xtick_points, [dates_str[i] for i in xtick_points]), \n", " size=(800,300))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Model specification\n", "\n", "We have a date $x_i \\in \\mathbb{R}$, referred to as a \"covariate\" or input variable, and the value of the stock exchange at that time point $y_i \\in \\mathbb{R}$, referred to as a \"response\" or output variable. A regression model has parameters $\\theta$, used to predict $y = (y_1, \\dots, y_N)$ from $x = (x_1, \\dots, x_N)$. We are looking for a posterior distribution for the parameters $\\theta$:\n", "\n", "$$\\underbrace{p(\\theta \\mid y, x)}_{\\text{posterior}} \\propto\\ \\underbrace{p(y \\mid x, \\theta)}_{\\text{likelihood}} \\cdot \\underbrace{p(\\theta)}_{\\text{prior}}$$\n", "\n", "We assume each observation $y_i$ is generated via: \n", "\n", "$$ y_i = f_\\theta(x_i) + e_i$$ \n", "\n", "where $e_i$ is white noise, $e_i \\sim \\mathcal{N}(0, \\sigma^2_y)$, and the regression function $f_\\theta$ is linear: $f_\\theta(x) = x \\theta_1 + \\theta_2$. The parameters consist of a slope coefficient $\\theta_1$ and an intercept $\\theta_2$, which are summarized into the vector $\\theta = \\begin{bmatrix}\\theta_1 \\\\ \\theta_2 \\end{bmatrix}$. In practice, we augment the data point $x$ with a 1, i.e., $\\begin{bmatrix}x \\\\ 1 \\end{bmatrix}$, so that we may define $f_\\theta(x) = \\theta^{\\top}x$. \n", "\n", "#### Likelihood\n", "If we integrate out the noise $e$, then we obtain a Gaussian likelihood function centered on $f_\\theta(x)$ with variance $\\sigma^2_Y$:\n", "\n", "$$p(y_i \\mid x_i, \\theta) = \\mathcal{N}(y_i \\mid f_\\theta(x_i),\\sigma^2_y)\\, \\ .$$ \n", "\n", "But this is just for a single sample and we have an entire data set. 
"The likelihood of all $N$ samples is:\n", "\n", "$$\\begin{aligned} p(y \\mid x, \\theta) &= \\prod_{i=1}^{N} p(y_i \\mid x_i, \\theta) \\\\ &= \\prod_{i=1}^{N} \\mathcal{N}(y_i \\mid f_{\\theta}(x_i), \\sigma^2_y) \\, . \\end{aligned}$$\n", "\n", "#### Prior distribution\n", "We know that the weights are real numbers and that they can be negative. That motivates us to use a Gaussian prior:\n", "\n", "$$ p(\\theta) = \\mathcal{N}(\\theta \\mid \\mu_\\theta, \\Sigma_\\theta) \\, .$$\n", "\n", "We can specify these equations almost directly in our PPL." ] },
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "@model function linear_regression(μ_θ, Σ_θ, σ2_y; N=1)\n", "    \"Bayesian linear regression\"\n", "    \n", "    # Allocate data variables\n", "    X = datavar(Vector{Float64}, N)\n", "    y = datavar(Float64, N)\n", "    \n", "    # Prior distribution of coefficients\n", "    θ ~ MvNormalMeanCovariance(μ_θ, Σ_θ)\n", "    \n", "    for i = 1:N\n", "        \n", "        # Likelihood of i-th sample\n", "        y[i] ~ NormalMeanVariance(dot(θ, X[i]), σ2_y)\n", "        \n", "    end\n", "    return y, X, θ\n", "end" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Prior parameters\n", "μ_θ, Σ_θ = (zeros(2), diagm(ones(2)))\n", "\n", "# Likelihood variance\n", "σ2_y = 1.0;" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have our model, it is time to infer parameters." ] },
{ "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Inference results:\n", "  Posteriors       | available for (θ)\n" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Call inference function\n", "results = inference(\n", "    model = linear_regression(μ_θ, Σ_θ, σ2_y, N=num_samples),\n", "    data  = (y = stock_val, X = [[dates_num[i], 1.0] for i in 1:num_samples]),\n", "    returnvars = (θ = KeepLast(),),\n", ")" ] },
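{ "cell_type": "markdown", "metadata": {}, "source": [ "Since a Gaussian prior combined with a Gaussian likelihood (with known noise variance) is a conjugate pairing, this particular posterior is also available in closed form. As a sanity check on the toolbox result, the minimal sketch below computes it by hand, reusing `dates_num`, `stock_val`, `μ_θ`, `Σ_θ` and `σ2_y` from above." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Design matrix with a bias column (sanity check, not part of the inference above)\n", "X_mat = hcat(dates_num, ones(num_samples))\n", "\n", "# Closed-form Gaussian posterior for Bayesian linear regression:\n", "# Σ_N = (Σ_θ⁻¹ + XᵀX / σ2_y)⁻¹ and μ_N = Σ_N (Σ_θ⁻¹ μ_θ + Xᵀy / σ2_y)\n", "Σ_N = inv(inv(Σ_θ) + X_mat'X_mat ./ σ2_y)\n", "μ_N = Σ_N * (inv(Σ_θ)*μ_θ + X_mat'stock_val ./ σ2_y)\n", "\n", "println(\"Analytical posterior mean: \", μ_N)" ] },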
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's visualize the resulting posterior." ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Figure: contour plots of the prior (left) and posterior (right) over (θ1, θ2)]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Extract posterior weights\n", "post_θ = results.posteriors[:θ]\n", "\n", "# Define ranges for plot\n", "x1 = range(-1.5, length=500, stop=1.5)\n", "x2 = range(-2.5, length=500, stop=2.5)\n", "\n", "# Draw contour plots of distributions\n", "prior_θ = MvNormal(μ_θ, Σ_θ)\n", "p1a = contour(x1, x2, (x1,x2) -> pdf(prior_θ, [x1,x2]), xlabel=\"θ1\", ylabel=\"θ2\", title=\"prior\", label=\"\")\n", "p1b = contour(x1, x2, (x1,x2) -> pdf(post_θ, [x1,x2]), xlabel=\"θ1\", title=\"posterior\", label=\"\")\n", "plot(p1a, p1b, size=(900,300))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Compared to the prior, the posterior has become sharply peaked in a small area of parameter space.\n", "\n", "We can extract the MAP point estimate to compute and visualize the most probable regression function $f_\\theta$." ] },
{ "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Slope coefficient = -4.320468927288264e-5\n", "Intercept coefficient = 0.0022453122200430586\n" ] } ], "source": [ "# Extract estimated weights\n", "θ_MAP = mode(post_θ)\n", "\n", "# Report results\n", "println(\"Slope coefficient = \"*string(θ_MAP[1]))\n", "println(\"Intercept coefficient = \"*string(θ_MAP[2]))\n", "\n", "# Make predictions\n", "regression_estimated = dates_num * θ_MAP[1] .+ θ_MAP[2];" ] },
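{ "cell_type": "markdown", "metadata": {}, "source": [ "The MAP estimate discards the posterior uncertainty we just visualized. For a Gaussian posterior with mean $m$ and covariance $S$, the predictive distribution at an augmented input $x = \\begin{bmatrix} t \\\\ 1 \\end{bmatrix}$ is again Gaussian, with mean $m^{\\top}x$ and variance $\\sigma^2_y + x^{\\top} S x$. The sketch below computes the predictive standard deviation at each time point (assuming the posterior marginal supports `cov`); it could be overlaid on the plot that follows, e.g., via the `ribbon` keyword of Plots.jl." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Posterior predictive standard deviation at each input (sketch)\n", "Σ_post = cov(post_θ)\n", "pred_std = [sqrt(σ2_y + dot([t, 1.0], Σ_post * [t, 1.0])) for t in dates_num]\n", "println(\"Average predictive standard deviation = \", mean(pred_std))" ] },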
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's visualize the MAP regression function." ] },
{ "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Figure: observations (scatter) with the estimated regression line overlaid]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Visualize observations\n", "scatter(dates_num, stock_val, color=\"black\", xticks=(xtick_points, [dates_str[i] for i in xtick_points]), label=\"observations\", legend=:topleft)\n", "\n", "# Overlay regression function\n", "plot!(dates_num, regression_estimated, color=\"blue\", label=\"regression\", linewidth=2, size=(800,300))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The slope coefficient $\\theta_1$ is negative and the plot shows a decreasing line. The ISE experienced a negative linear trend from October 2010 through February 2011. Assuming the stock market is an indicator of economic growth, we may conclude that, as of February 2011, the Turkish economy is still in recession." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "#### Exercise\n", "\n", "Change the `time_period` variable. Re-run the regression and see how the results change.\n", "\n", "---" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Problem: Credit Assignment\n", "\n", "We will now look at a classification problem. Suppose you are a bank and have to decide whether to grant credit, e.g., a mortgage or a small business loan, to a customer. You have a historic data set where your experts have assigned credit to hundreds of people. You have asked them to report on which aspects of the problem are important. You hope to automate this decision process by training a classifier on the data set." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Data\n", "\n", "The data set we are going to use also comes from the [UCI ML Repository](https://archive.ics.uci.edu/ml/datasets/Credit+Approval). It consists of past credit assignments, labeled as successful (=1) or unsuccessful (=0). Many of the features have been anonymized for privacy reasons." ] },
{ "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# Read CSV file\n", "df = DataFrame(CSV.File(\"../datasets/credit_train.csv\"))\n", "\n", "# Split dataframe into features and labels\n", "features_train = Matrix(df[:,1:7])\n", "labels_train = Vector(df[:,8])\n", "\n", "# Store number of features\n", "num_features = size(features_train,2)\n", "\n", "# Number of training samples\n", "num_train = size(features_train,1);" ] },
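{ "cell_type": "markdown", "metadata": {}, "source": [ "Before visualizing, it is worth checking how balanced the two classes are, since a strongly skewed label distribution would change how we should judge classification accuracy later on. A quick count (a minimal sketch using the variables defined above):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Count successful (=1) and unsuccessful (=0) assignments in the training set\n", "println(\"successful:   \", sum(labels_train .== 1), \" / \", num_train)\n", "println(\"unsuccessful: \", sum(labels_train .== 0), \" / \", num_train)" ] },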
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's visualize the data and see if we can make sense of it." ] },
{ "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Figure: scatter of feature1 vs feature2, colored by class (unsuccessful = blue, successful = red)]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "scatter(features_train[labels_train .== 0, 1], features_train[labels_train .== 0, 2], color=\"blue\", label=\"unsuccessful\", xlabel=\"feature1\", ylabel=\"feature2\")\n", "scatter!(features_train[labels_train .== 1, 1], features_train[labels_train .== 1, 2], color=\"red\", label=\"successful\", size=(800,300))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Mmhh, it doesn't look like the samples can easily be separated. This will be challenging." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "#### Exercise\n", "\n", "The plot above shows features 1 and 2. Have a look at the other combinations of features.\n", "\n", "---" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Model specification\n", "\n", "We have features $X$, labels $Y$ and parameters $\\theta$. As with regression, we are looking for a posterior distribution of the classification parameters:\n", "\n", "$$\\underbrace{p(\\theta \\mid Y, X)}_{\\text{posterior}} \\propto\\ \\underbrace{p(Y \\mid X, \\theta)}_{\\text{likelihood}} \\cdot \\underbrace{p(\\theta)}_{\\text{prior}}$$\n", "\n", "The likelihood in this case will be of a probit form:\n", "\n", "$$ p(Y \\mid X, \\theta) = \\prod_{i=1}^{N} \\mathcal{B}\\big(Y_i \\mid \\Phi(f_\\theta(X_i))\\big) \\, .$$\n", "\n", "As you can see, it is a Bernoulli distribution with a cumulative normal distribution as transfer (a.k.a. _link_) function:\n", "\n", "$$ \\Phi(x) = \\frac{1}{\\sqrt{2\\pi}} \\int_{-\\infty}^{x} \\exp \\left(-\\frac{t^2}{2} \\right) \\mathrm{d}t \\, .$$\n", "\n", "The transfer function maps the input $f_\\theta(X_i)$ to the interval $(0,1)$, so that the result acts as the rate parameter of the Bernoulli distribution. Check Bert's lecture on discriminative classification for more information.\n", "\n", "We will use a Gaussian prior distribution for the classification parameters $\\theta$:\n", "\n", "$$ p(\\theta) = \\mathcal{N}(\\theta \\mid \\mu_\\theta, \\Sigma_\\theta) \\, .$$\n", "\n", "You have probably noticed that this combination of likelihood and prior is not a conjugate pairing. As such, we don't have an exact closed-form posterior. The Laplace approximation is one way to approximate it, but under the hood our toolbox performs a different procedure to obtain the posterior parameters: [moment matching](https://en.wikipedia.org/wiki/Method_of_moments_(statistics))." ] },
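{ "cell_type": "markdown", "metadata": {}, "source": [ "To build some intuition for what moment matching means, the sketch below works through a one-dimensional toy version with made-up numbers (not the credit data): the product of a standard-normal prior and a probit likelihood for a single positive observation is not Gaussian, but it can be approximated by the Gaussian with the same first two moments, computed here by numerical integration on a grid." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 1D toy example of moment matching (illustration only):\n", "# unnormalized posterior ∝ N(θ | 0, 1) ⋅ Φ(θ) for one observation y=1 with x=1\n", "θ_grid = range(-6, 6, length=2001)\n", "unnorm = pdf.(Normal(0, 1), θ_grid) .* normcdf.(θ_grid)\n", "\n", "# Normalize, then compute the first two moments on the grid\n", "Z = sum(unnorm) * step(θ_grid)\n", "m = sum(θ_grid .* unnorm) * step(θ_grid) / Z\n", "v = sum((θ_grid .- m).^2 .* unnorm) * step(θ_grid) / Z\n", "\n", "println(\"Moment-matched Gaussian: mean ≈ \", round(m, digits=3), \", variance ≈ \", round(v, digits=3))" ] },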
{ "cell_type": "markdown", "metadata": {}, "source": [ "For the probit likelihood, the cumulative normal distribution makes it possible to derive the first (mean) and second (variance) moments of the product of prior and likelihood by hand. The toolbox essentially looks up the analytically derived formula, which is computationally cheaper than performing the iterative steps necessary for the Laplace approximation." ] },
{ "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# Parameters for priors\n", "μ_θ = zeros(num_features+1,)\n", "Σ_θ = diagm(ones(num_features+1));" ] },
{ "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "@model function linear_classification(μ_θ, Σ_θ; N=1)\n", "    \"Bayesian classification model\"\n", "    \n", "    # Allocate data variables\n", "    X = datavar(Vector{Float64}, N)\n", "    y = datavar(Int64, N)\n", "    \n", "    # Weight prior distribution\n", "    θ ~ MvNormalMeanCovariance(μ_θ, Σ_θ)\n", "    \n", "    for i = 1:N\n", "        \n", "        # Binary likelihood\n", "        y[i] ~ Probit(dot(θ, X[i]))\n", "        \n", "    end\n", "    return y, X, θ\n", "end" ] },
{ "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Inference results:\n", "  Posteriors       | available for (θ)\n" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results = inference(\n", "    model = linear_classification(μ_θ, Σ_θ, N=num_train),\n", "    data  = (y = labels_train, X = [[features_train[i,:]; 1.0] for i in 1:num_train]),\n", "    returnvars = (θ = KeepLast(),),\n", ")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Unfortunately, we cannot visualize a distribution of more than 2 dimensions." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Predict test data\n", "\n", "The bank has some test data for you as well." ] },
{ "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# Read CSV file\n", "df = DataFrame(CSV.File(\"../datasets/credit_test.csv\"))\n", "\n", "# Split dataframe into features and labels\n", "features_test = Matrix(df[:,1:7])\n", "labels_test = Vector(df[:,8])\n", "\n", "# Number of test samples\n", "num_test = size(features_test,1);" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "You can classify test samples by taking the MAP estimate of the classification parameters, computing the linear function $f_\\theta$, and rounding the result to obtain the most probable label." ] },
{ "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test Accuracy = 49.0%\n" ] } ], "source": [ "# Extract MAP estimate of classification parameters\n", "θ_MAP = mode(results.posteriors[:θ])\n", "\n", "# Compute dot product between parameters and test data\n", "fθ_pred = [features_test ones(num_test,)] * θ_MAP\n", "\n", "# Predict labels through probit\n", "labels_pred = round.(normcdf.(fθ_pred));\n", "\n", "# Compute classification accuracy of test data\n", "accuracy_test = mean(labels_test .== labels_pred)\n", "\n", "# Report result\n", "println(\"Test Accuracy = \"*string(accuracy_test*100)*\"%\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Mmmhh... A test accuracy of 49% is no better than guessing. If you were a bank, you might decide that you don't want to assign credit to your customers automatically." ] },
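{ "cell_type": "markdown", "metadata": {}, "source": [ "Note that rounding the probit output at the MAP estimate discards the posterior uncertainty over $\\theta$. For a Gaussian posterior $\\mathcal{N}(\\theta \\mid m, S)$ and a probit likelihood, the Bayesian predictive probability has the closed form $p(Y = 1 \\mid X) = \\Phi\\big(m^{\\top}X / \\sqrt{1 + X^{\\top} S X}\\big)$, which averages over the posterior instead of plugging in a point estimate. A minimal sketch, assuming the inferred posterior supports `mean` and `cov`:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Bayesian predictive probabilities for the test set (sketch)\n", "post_θ = results.posteriors[:θ]\n", "m_θ, S_θ = mean(post_θ), cov(post_θ)\n", "X_test = [features_test ones(num_test,)]\n", "probs_pred = [normcdf(dot(m_θ, x) / sqrt(1 + dot(x, S_θ*x))) for x in eachrow(X_test)]\n", "\n", "# These probabilities could support decisions with an explicit confidence threshold\n", "println(\"First five predictive probabilities: \", round.(probs_pred[1:5], digits=3))"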
] } ], "metadata": { "@webio": { "lastCommId": null, "lastKernelId": null }, "anaconda-cloud": {}, "kernelspec": { "display_name": "Julia 1.8.1", "language": "julia", "name": "julia-1.8" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.8.1" } }, "nbformat": 4, "nbformat_minor": 4 }