{ "cells": [ { "cell_type": "markdown", "id": "3cf2b619", "metadata": {}, "source": [ "# Maximum Likelihood Estimation" ] }, { "cell_type": "markdown", "id": "a9cc6915", "metadata": {}, "source": [ "## Contents\n", "\n", "- [Maximum Likelihood Estimation](#Maximum-Likelihood-Estimation) \n", " - [Overview](#Overview) \n", " - [Set Up and Assumptions](#Set-Up-and-Assumptions) \n", " - [Conditional Distributions](#Conditional-Distributions) \n", " - [Maximum Likelihood Estimation](#Maximum-Likelihood-Estimation) \n", " - [MLE with Numerical Methods](#MLE-with-Numerical-Methods) \n", " - [Maximum Likelihood Estimation with `statsmodels`](#Maximum-Likelihood-Estimation-with-statsmodels) \n", " - [Summary](#Summary) \n", " - [Exercises](#Exercises) " ] }, { "cell_type": "markdown", "id": "2fb2c69c", "metadata": {}, "source": [ "## Overview\n", "\n", "In a [previous lecture](https://python.quantecon.org/ols.html), we estimated the relationship between\n", "dependent and explanatory variables using linear regression.\n", "\n", "But what if a linear relationship is not an appropriate assumption for our model?\n", "\n", "One widely used alternative is maximum likelihood estimation, which\n", "involves specifying a class of distributions, indexed by unknown parameters, and then using the data to pin down these parameter values.\n", "\n", "The benefit relative to linear regression is that it allows more flexibility in the probabilistic relationships between variables.\n", "\n", "Here we illustrate maximum likelihood by replicating Daniel Treisman’s (2016) paper, [Russia’s Billionaires](https://www.aeaweb.org/articles?id=10.1257/aer.p20161068), which connects the number of billionaires in a country to its economic characteristics.\n", "\n", "The paper concludes that Russia has a higher number of billionaires than\n", "economic factors such as market size and tax rate predict.\n", "\n", "We’ll require the following imports:" ] }, { "cell_type": "code", "execution_count": null, "id": "0a11195f", "metadata": { "hide-output": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "plt.rcParams[\"figure.figsize\"] = (11, 5) #set default figure size\n", "import numpy as np\n", "from numpy import exp\n", "from scipy.special import factorial\n", "import pandas as pd\n", "from mpl_toolkits.mplot3d import Axes3D\n", "import statsmodels.api as sm\n", "from statsmodels.api import Poisson\n", "from scipy.stats import norm\n", "from statsmodels.iolib.summary2 import summary_col" ] }, { "cell_type": "markdown", "id": "ab8d944e", "metadata": {}, "source": [ "### Prerequisites\n", "\n", "We assume familiarity with basic probability and multivariate calculus." ] }, { "cell_type": "markdown", "id": "e72e430d", "metadata": {}, "source": [ "## Set Up and Assumptions\n", "\n", "Let’s consider the steps we need to go through in maximum likelihood estimation and how they pertain to this study." ] }, { "cell_type": "markdown", "id": "4a1d7aa8", "metadata": {}, "source": [ "### Flow of Ideas\n", "\n", "The first step with maximum likelihood estimation is to choose the probability distribution believed to be generating the data.\n", "\n", "More precisely, we need to make an assumption as to which *parametric class* of distributions is generating the data.\n", "\n", "- e.g., the class of all normal distributions, or the class of all gamma distributions. 
\n", "\n", "\n", "Each such class is a family of distributions indexed by a finite number of parameters.\n", "\n", "- e.g., the class of normal distributions is a family of distributions\n", " indexed by its mean $ \\mu \\in (-\\infty, \\infty) $ and standard deviation $ \\sigma \\in (0, \\infty) $. \n", "\n", "\n", "We’ll let the data pick out a particular element of the class by pinning down the parameters.\n", "\n", "The parameter estimates so produced will be called **maximum likelihood estimates**." ] }, { "cell_type": "markdown", "id": "d1c7d9a8", "metadata": {}, "source": [ "### Counting Billionaires\n", "\n", "Treisman [[Treisman, 2016](https://python.quantecon.org/zreferences.html#id94)] is interested in estimating the number of billionaires in different countries.\n", "\n", "The number of billionaires is integer-valued.\n", "\n", "Hence we consider distributions that take values only in the nonnegative integers.\n", "\n", "(This is one reason least squares regression is not the best tool for the present problem, since the dependent variable in linear regression is not restricted\n", "to integer values)\n", "\n", "One integer distribution is the [Poisson distribution](https://en.wikipedia.org/wiki/Poisson_distribution), the probability mass function (pmf) of which is\n", "\n", "$$\n", "f(y) = \\frac{\\mu^{y}}{y!} e^{-\\mu},\n", "\\qquad y = 0, 1, 2, \\ldots, \\infty\n", "$$\n", "\n", "We can plot the Poisson distribution over $ y $ for different values of $ \\mu $ as follows" ] }, { "cell_type": "code", "execution_count": null, "id": "1e7c8987", "metadata": { "hide-output": false }, "outputs": [], "source": [ "poisson_pmf = lambda y, μ: μ**y / factorial(y) * exp(-μ)\n", "y_values = range(0, 25)\n", "\n", "fig, ax = plt.subplots(figsize=(12, 8))\n", "\n", "for μ in [1, 5, 10]:\n", " distribution = []\n", " for y_i in y_values:\n", " distribution.append(poisson_pmf(y_i, μ))\n", " ax.plot(y_values,\n", " distribution,\n", " label=f'$\\mu$={μ}',\n", " alpha=0.5,\n", " marker='o',\n", " markersize=8)\n", "\n", "ax.grid()\n", "ax.set_xlabel('$y$', fontsize=14)\n", "ax.set_ylabel('$f(y \\mid \\mu)$', fontsize=14)\n", "ax.axis(xmin=0, ymin=0)\n", "ax.legend(fontsize=14)\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "30a1784f", "metadata": {}, "source": [ "Notice that the Poisson distribution begins to resemble a normal distribution as the mean of $ y $ increases.\n", "\n", "Let’s have a look at the distribution of the data we’ll be working with in this lecture.\n", "\n", "Treisman’s main source of data is *Forbes’* annual rankings of billionaires and their estimated net worth.\n", "\n", "The dataset `mle/fp.dta` can be downloaded from [here](https://python.quantecon.org/_static/lecture_specific/mle/fp.dta)\n", "or its [AER page](https://www.aeaweb.org/articles?id=10.1257/aer.p20161068)." 
] }, { "cell_type": "code", "execution_count": null, "id": "b3ed63fa", "metadata": { "hide-output": false }, "outputs": [], "source": [ "pd.options.display.max_columns = 10\n", "\n", "# Load in data and view\n", "df = pd.read_stata('https://github.com/QuantEcon/lecture-python/blob/master/source/_static/lecture_specific/mle/fp.dta?raw=true')\n", "df.head()" ] }, { "cell_type": "markdown", "id": "2aaf8048", "metadata": {}, "source": [ "Using a histogram, we can view the distribution of the number of\n", "billionaires per country, `numbil0`, in 2008 (the United States is\n", "dropped for plotting purposes)" ] }, { "cell_type": "code", "execution_count": null, "id": "dedff5f5", "metadata": { "hide-output": false }, "outputs": [], "source": [ "numbil0_2008 = df[(df['year'] == 2008) & (\n", " df['country'] != 'United States')].loc[:, 'numbil0']\n", "\n", "plt.subplots(figsize=(12, 8))\n", "plt.hist(numbil0_2008, bins=30)\n", "plt.xlim(left=0)\n", "plt.grid()\n", "plt.xlabel('Number of billionaires in 2008')\n", "plt.ylabel('Count')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "a8565653", "metadata": {}, "source": [ "From the histogram, it appears that the Poisson assumption is not unreasonable (albeit with a very low $ \\mu $ and some outliers)." ] }, { "cell_type": "markdown", "id": "44b81355", "metadata": {}, "source": [ "## Conditional Distributions\n", "\n", "In Treisman’s paper, the dependent variable — the number of billionaires $ y_i $ in country $ i $ — is modeled as a function of GDP per capita, population size, and years membership in GATT and WTO.\n", "\n", "Hence, the distribution of $ y_i $ needs to be conditioned on the vector of explanatory variables $ \\mathbf{x}_i $.\n", "\n", "The standard formulation — the so-called *poisson regression* model — is as follows:\n", "\n", "\n", "\n", "$$\n", "f(y_i \\mid \\mathbf{x}_i) = \\frac{\\mu_i^{y_i}}{y_i!} e^{-\\mu_i}; \\qquad y_i = 0, 1, 2, \\ldots , \\infty . 
\\tag{72.1}\n", "$$\n", "\n", "$$\n", "\\text{where}\\ \\mu_i\n", " = \\exp(\\mathbf{x}_i' \\boldsymbol{\\beta})\n", " = \\exp(\\beta_0 + \\beta_1 x_{i1} + \\ldots + \\beta_k x_{ik})\n", "$$\n", "\n", "To illustrate the idea that the distribution of $ y_i $ depends on\n", "$ \\mathbf{x}_i $ let’s run a simple simulation.\n", "\n", "We use our `poisson_pmf` function from above and arbitrary values for\n", "$ \\boldsymbol{\\beta} $ and $ \\mathbf{x}_i $" ] }, { "cell_type": "code", "execution_count": null, "id": "77d34ea0", "metadata": { "hide-output": false }, "outputs": [], "source": [ "y_values = range(0, 20)\n", "\n", "# Define a parameter vector with estimates\n", "β = np.array([0.26, 0.18, 0.25, -0.1, -0.22])\n", "\n", "# Create some observations X\n", "datasets = [np.array([0, 1, 1, 1, 2]),\n", " np.array([2, 3, 2, 4, 0]),\n", " np.array([3, 4, 5, 3, 2]),\n", " np.array([6, 5, 4, 4, 7])]\n", "\n", "\n", "fig, ax = plt.subplots(figsize=(12, 8))\n", "\n", "for X in datasets:\n", " μ = exp(X @ β)\n", " distribution = []\n", " for y_i in y_values:\n", " distribution.append(poisson_pmf(y_i, μ))\n", " ax.plot(y_values,\n", " distribution,\n", " label=f'$\\mu_i$={μ:.1}',\n", " marker='o',\n", " markersize=8,\n", " alpha=0.5)\n", "\n", "ax.grid()\n", "ax.legend()\n", "ax.set_xlabel('$y \\mid x_i$')\n", "ax.set_ylabel(r'$f(y \\mid x_i; \\beta )$')\n", "ax.axis(xmin=0, ymin=0)\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "0ede2948", "metadata": {}, "source": [ "We can see that the distribution of $ y_i $ is conditional on\n", "$ \\mathbf{x}_i $ ($ \\mu_i $ is no longer constant)." ] }, { "cell_type": "markdown", "id": "68fa96d0", "metadata": {}, "source": [ "## Maximum Likelihood Estimation\n", "\n", "In our model for number of billionaires, the conditional distribution\n", "contains 4 ($ k = 4 $) parameters that we need to estimate.\n", "\n", "We will label our entire parameter vector as $ \\boldsymbol{\\beta} $ where\n", "\n", "$$\n", "\\boldsymbol{\\beta} = \\begin{bmatrix}\n", " \\beta_0 \\\\\n", " \\beta_1 \\\\\n", " \\beta_2 \\\\\n", " \\beta_3\n", " \\end{bmatrix}\n", "$$\n", "\n", "To estimate the model using MLE, we want to maximize the likelihood that\n", "our estimate $ \\hat{\\boldsymbol{\\beta}} $ is the true parameter $ \\boldsymbol{\\beta} $.\n", "\n", "Intuitively, we want to find the $ \\hat{\\boldsymbol{\\beta}} $ that best fits our data.\n", "\n", "First, we need to construct the likelihood function $ \\mathcal{L}(\\boldsymbol{\\beta}) $, which is similar to a joint probability density function.\n", "\n", "Assume we have some data $ y_i = \\{y_1, y_2\\} $ and\n", "$ y_i \\sim f(y_i) $.\n", "\n", "If $ y_1 $ and $ y_2 $ are independent, the joint pmf of these\n", "data is $ f(y_1, y_2) = f(y_1) \\cdot f(y_2) $.\n", "\n", "If $ y_i $ follows a Poisson distribution with $ \\lambda = 7 $,\n", "we can visualize the joint pmf like so" ] }, { "cell_type": "code", "execution_count": null, "id": "ba45c912", "metadata": { "hide-output": false }, "outputs": [], "source": [ "def plot_joint_poisson(μ=7, y_n=20):\n", " yi_values = np.arange(0, y_n, 1)\n", "\n", " # Create coordinate points of X and Y\n", " X, Y = np.meshgrid(yi_values, yi_values)\n", "\n", " # Multiply distributions together\n", " Z = poisson_pmf(X, μ) * poisson_pmf(Y, μ)\n", "\n", " fig = plt.figure(figsize=(12, 8))\n", " ax = fig.add_subplot(111, projection='3d')\n", " ax.plot_surface(X, Y, Z.T, cmap='terrain', alpha=0.6)\n", " ax.scatter(X, Y, Z.T, color='black', alpha=0.5, linewidths=1)\n", " 
ax.set(xlabel='$y_1$', ylabel='$y_2$')\n", "    ax.set_zlabel('$f(y_1, y_2)$', labelpad=10)\n", "    plt.show()\n", "\n", "plot_joint_poisson(μ=7, y_n=20)" ] }, { "cell_type": "markdown", "id": "a811c9e2", "metadata": {}, "source": [ "Similarly, the joint pmf of our data (which is distributed as a\n", "conditional Poisson distribution) can be written as\n", "\n", "$$\n", "f(y_1, y_2, \\ldots, y_n \\mid \\mathbf{x}_1, \\mathbf{x}_2, \\ldots, \\mathbf{x}_n; \\boldsymbol{\\beta})\n", " = \\prod_{i=1}^{n} \\frac{\\mu_i^{y_i}}{y_i!} e^{-\\mu_i}\n", "$$\n", "\n", "$ y_i $ is conditional on both the values of $ \\mathbf{x}_i $ and the\n", "parameters $ \\boldsymbol{\\beta} $.\n", "\n", "The likelihood function is the same as the joint pmf, but treats the\n", "parameter $ \\boldsymbol{\\beta} $ as the argument of the function, taking the observations\n", "$ (y_i, \\mathbf{x}_i) $ as given\n", "\n", "$$\n", "\\begin{split}\n", "\\mathcal{L}(\\beta \\mid y_1, y_2, \\ldots, y_n \\ ; \\ \\mathbf{x}_1, \\mathbf{x}_2, \\ldots, \\mathbf{x}_n) = &\n", "\\prod_{i=1}^{n} \\frac{\\mu_i^{y_i}}{y_i!} e^{-\\mu_i} \\\\ = &\n", "f(y_1, y_2, \\ldots, y_n \\mid \\ \\mathbf{x}_1, \\mathbf{x}_2, \\ldots, \\mathbf{x}_n ; \\beta)\n", "\\end{split}\n", "$$\n", "\n", "Now that we have our likelihood function, we want to find the $ \\hat{\\boldsymbol{\\beta}} $ that yields the maximum likelihood value\n", "\n", "$$\n", "\\underset{\\boldsymbol{\\beta}}{\\max} \\mathcal{L}(\\boldsymbol{\\beta})\n", "$$\n", "\n", "In doing so it is generally easier to maximize the log-likelihood (consider\n", "differentiating $ f(x) = x \\exp(x) $ vs. $ f(x) = \\log(x) + x $).\n", "\n", "Given that taking a logarithm is a monotone increasing transformation, a maximizer of the likelihood function will also be a maximizer of the log-likelihood function.\n", "\n", "In our case the log-likelihood is\n", "\n", "$$\n", "\\begin{split}\n", "\\log \\mathcal{L}(\\boldsymbol{\\beta}) = \\ &\n", " \\log \\Big(\n", " f(y_1 ; \\boldsymbol{\\beta})\n", " \\cdot\n", " f(y_2 ; \\boldsymbol{\\beta})\n", " \\cdot \\ldots \\cdot\n", " f(y_n ; \\boldsymbol{\\beta})\n", " \\Big) \\\\\n", " = &\n", " \\sum_{i=1}^{n} \\log{f(y_i ; \\boldsymbol{\\beta})} \\\\\n", " = &\n", " \\sum_{i=1}^{n}\n", " \\log \\Big( {\\frac{\\mu_i^{y_i}}{y_i!} e^{-\\mu_i}} \\Big) \\\\\n", " = &\n", " \\sum_{i=1}^{n} y_i \\log{\\mu_i} -\n", " \\sum_{i=1}^{n} \\mu_i -\n", " \\sum_{i=1}^{n} \\log y_i!\n", "\\end{split}\n", "$$\n", "\n", "The MLE of the Poisson regression model, $ \\hat{\\boldsymbol{\\beta}} $, can be obtained by solving\n", "\n", "$$\n", "\\underset{\\beta}{\\max} \\Big(\n", "\\sum_{i=1}^{n} y_i \\log{\\mu_i} -\n", "\\sum_{i=1}^{n} \\mu_i -\n", "\\sum_{i=1}^{n} \\log y_i! \\Big)\n", "$$\n", "\n", "However, no analytical solution exists to the above problem – to find the MLE\n", "we need to use numerical methods."
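, "\n\nBefore turning to a general algorithm, it is worth noting one special case that *does* have a closed-form solution (an aside with made-up data, not part of Treisman’s analysis): if the model contains only a constant, so that $ \\mu_i = \\exp(\\beta_0) $ for every $ i $, the first-order condition $ \\sum_{i=1}^{n} (y_i - e^{\\beta_0}) = 0 $ gives $ e^{\\hat{\\beta}_0} = \\bar{y} $, the sample mean. A quick numerical check:" ] }, { "cell_type": "code", "execution_count": null, "id": "d4e5f6a7", "metadata": { "hide-output": false }, "outputs": [], "source": [ "# Intercept-only Poisson model with a small made-up sample\n", "y_sample = np.array([2, 0, 1, 3, 4])\n", "\n", "# Closed-form MLE of μ = exp(β_0) is the sample mean\n", "print(y_sample.mean())\n", "\n", "# Numerical check: the log-likelihood over a grid of μ values\n", "# peaks at the sample mean\n", "μ_grid = np.linspace(0.5, 5, 1000)\n", "logL_grid = [np.sum(y_sample * np.log(m) - m - np.log(factorial(y_sample)))\n", "             for m in μ_grid]\n", "print(μ_grid[np.argmax(logL_grid)])"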
] }, { "cell_type": "markdown", "id": "1aa31430", "metadata": {}, "source": [ "## MLE with Numerical Methods\n", "\n", "Many distributions do not have nice, analytical solutions and therefore require\n", "numerical methods to solve for parameter estimates.\n", "\n", "One such numerical method is the Newton-Raphson algorithm.\n", "\n", "Our goal is to find the maximum likelihood estimate $ \\hat{\\boldsymbol{\\beta}} $.\n", "\n", "At $ \\hat{\\boldsymbol{\\beta}} $, the first derivative of the log-likelihood\n", "function will be equal to 0.\n", "\n", "Let’s illustrate this by supposing\n", "\n", "$$\n", "\\log \\mathcal{L(\\beta)} = - (\\beta - 10) ^2 - 10\n", "$$" ] }, { "cell_type": "code", "execution_count": null, "id": "170b3713", "metadata": { "hide-output": false }, "outputs": [], "source": [ "β = np.linspace(1, 20)\n", "logL = -(β - 10) ** 2 - 10\n", "dlogL = -2 * β + 20\n", "\n", "fig, (ax1, ax2) = plt.subplots(2, sharex=True, figsize=(12, 8))\n", "\n", "ax1.plot(β, logL, lw=2)\n", "ax2.plot(β, dlogL, lw=2)\n", "\n", "ax1.set_ylabel(r'$log \\mathcal{L(\\beta)}$',\n", " rotation=0,\n", " labelpad=35,\n", " fontsize=15)\n", "ax2.set_ylabel(r'$\\frac{dlog \\mathcal{L(\\beta)}}{d \\beta}$ ',\n", " rotation=0,\n", " labelpad=35,\n", " fontsize=19)\n", "ax2.set_xlabel(r'$\\beta$', fontsize=15)\n", "ax1.grid(), ax2.grid()\n", "plt.axhline(c='black')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "80c95a8f", "metadata": {}, "source": [ "The plot shows that the maximum likelihood value (the top plot) occurs\n", "when $ \\frac{d \\log \\mathcal{L(\\boldsymbol{\\beta})}}{d \\boldsymbol{\\beta}} = 0 $ (the bottom\n", "plot).\n", "\n", "Therefore, the likelihood is maximized when $ \\beta = 10 $.\n", "\n", "We can also ensure that this value is a *maximum* (as opposed to a\n", "minimum) by checking that the second derivative (slope of the bottom\n", "plot) is negative.\n", "\n", "The Newton-Raphson algorithm finds a point where the first derivative is\n", "0.\n", "\n", "To use the algorithm, we take an initial guess at the maximum value,\n", "$ \\beta_0 $ (the OLS parameter estimates might be a reasonable\n", "guess), then\n", "\n", "1. Use the updating rule to iterate the algorithm \n", " $$\n", " \\boldsymbol{\\beta}_{(k+1)} = \\boldsymbol{\\beta}_{(k)} - H^{-1}(\\boldsymbol{\\beta}_{(k)})G(\\boldsymbol{\\beta}_{(k)})\n", " $$\n", " where: \n", " $$\n", " \\begin{aligned}\n", " G(\\boldsymbol{\\beta}_{(k)}) = \\frac{d \\log \\mathcal{L(\\boldsymbol{\\beta}_{(k)})}}{d \\boldsymbol{\\beta}_{(k)}} \\\\\n", " H(\\boldsymbol{\\beta}_{(k)}) = \\frac{d^2 \\log \\mathcal{L(\\boldsymbol{\\beta}_{(k)})}}{d \\boldsymbol{\\beta}_{(k)}d \\boldsymbol{\\beta}'_{(k)}}\n", " \\end{aligned}\n", " $$\n", "1. Check whether $ \\boldsymbol{\\beta}_{(k+1)} - \\boldsymbol{\\beta}_{(k)} < tol $ \n", " - If true, then stop iterating and set\n", " $ \\hat{\\boldsymbol{\\beta}} = \\boldsymbol{\\beta}_{(k+1)} $ \n", " - If false, then update $ \\boldsymbol{\\beta}_{(k+1)} $ \n", "\n", "\n", "As can be seen from the updating equation,\n", "$ \\boldsymbol{\\beta}_{(k+1)} = \\boldsymbol{\\beta}_{(k)} $ only when\n", "$ G(\\boldsymbol{\\beta}_{(k)}) = 0 $ ie. 
where the first derivative is equal to 0.\n", "\n", "(In practice, we stop iterating when the difference is below a small\n", "tolerance threshold.)\n", "\n", "Let’s have a go at implementing the Newton-Raphson algorithm.\n", "\n", "First, we’ll create a class called `PoissonRegression` so we can\n", "easily recompute the values of the log-likelihood, gradient and Hessian\n", "for every iteration" ] }, { "cell_type": "code", "execution_count": null, "id": "8e9c8daa", "metadata": { "hide-output": false }, "outputs": [], "source": [ "class PoissonRegression:\n", "\n", "    def __init__(self, y, X, β):\n", "        self.X = X\n", "        self.n, self.k = X.shape\n", "        # Reshape y as a n_by_1 column vector\n", "        self.y = y.reshape(self.n, 1)\n", "        # Reshape β as a k_by_1 column vector\n", "        self.β = β.reshape(self.k, 1)\n", "\n", "    def μ(self):\n", "        return np.exp(self.X @ self.β)\n", "\n", "    def logL(self):\n", "        y = self.y\n", "        μ = self.μ()\n", "        return np.sum(y * np.log(μ) - μ - np.log(factorial(y)))\n", "\n", "    def G(self):\n", "        X = self.X  # use the instance's data, not a global X\n", "        y = self.y\n", "        μ = self.μ()\n", "        return X.T @ (y - μ)\n", "\n", "    def H(self):\n", "        X = self.X\n", "        μ = self.μ()\n", "        return -(X.T @ (μ * X))" ] }, { "cell_type": "markdown", "id": "f700b287", "metadata": {}, "source": [ "Our function `newton_raphson` will take a `PoissonRegression` object\n", "that has an initial guess of the parameter vector $ \\boldsymbol{\\beta}_{(0)} $.\n", "\n", "The algorithm will update the parameter vector according to the updating\n", "rule, and recalculate the gradient and Hessian matrices at the new\n", "parameter estimates.\n", "\n", "Iteration will end when either:\n", "\n", "- The difference between the parameter and the updated parameter is below a tolerance level. \n", "- The maximum number of iterations has been achieved (meaning convergence is not achieved). \n", "\n", "\n", "So that we can get an idea of what’s going on while the algorithm is running,\n", "an option `display=True` is added to print out values at each\n", "iteration." ] }, { "cell_type": "code", "execution_count": null, "id": "0c4c6094", "metadata": { "hide-output": false }, "outputs": [], "source": [ "def newton_raphson(model, tol=1e-3, max_iter=1000, display=True):\n", "\n", "    i = 0\n", "    error = 100  # Initial error value\n", "\n", "    # Print header of output\n", "    if display:\n", "        header = f'{\"Iteration_k\":<13}{\"Log-likelihood\":<16}{\"β\":<60}'\n", "        print(header)\n", "        print(\"-\" * len(header))\n", "\n", "    # While loop runs while any value in error is greater\n", "    # than the tolerance until max iterations are reached\n", "    while np.any(error > tol) and i < max_iter:\n", "        H, G = model.H(), model.G()\n", "        β_new = model.β - (np.linalg.inv(H) @ G)\n", "        error = np.abs(β_new - model.β)\n", "        model.β = β_new\n", "\n", "        # Print iterations\n", "        if display:\n", "            β_list = [f'{t:.3}' for t in list(model.β.flatten())]\n", "            update = f'{i:<13}{model.logL():<16.8}{β_list}'\n", "            print(update)\n", "\n", "        i += 1\n", "\n", "    print(f'Number of iterations: {i}')\n", "    print(f'β_hat = {model.β.flatten()}')\n", "\n", "    # Return a flat array for β (instead of a k_by_1 column vector)\n", "    return model.β.flatten()" ] }, { "cell_type": "markdown", "id": "e4874e05", "metadata": {}, "source": [ "Let’s try out our algorithm with a small dataset of 5 observations and 3\n", "variables in $ \\mathbf{X} $."
] }, { "cell_type": "code", "execution_count": null, "id": "72455d8e", "metadata": { "hide-output": false }, "outputs": [], "source": [ "X = np.array([[1, 2, 5],\n", " [1, 1, 3],\n", " [1, 4, 2],\n", " [1, 5, 2],\n", " [1, 3, 1]])\n", "\n", "y = np.array([1, 0, 1, 1, 0])\n", "\n", "# Take a guess at initial βs\n", "init_β = np.array([0.1, 0.1, 0.1])\n", "\n", "# Create an object with Poisson model values\n", "poi = PoissonRegression(y, X, β=init_β)\n", "\n", "# Use newton_raphson to find the MLE\n", "β_hat = newton_raphson(poi, display=True)" ] }, { "cell_type": "markdown", "id": "59daed46", "metadata": {}, "source": [ "As this was a simple model with few observations, the algorithm achieved\n", "convergence in only 7 iterations.\n", "\n", "You can see that with each iteration, the log-likelihood value increased.\n", "\n", "Remember, our objective was to maximize the log-likelihood function,\n", "which the algorithm has worked to achieve.\n", "\n", "Also, note that the increase in $ \\log \\mathcal{L}(\\boldsymbol{\\beta}_{(k)}) $\n", "becomes smaller with each iteration.\n", "\n", "This is because the gradient is approaching 0 as we reach the maximum,\n", "and therefore the numerator in our updating equation is becoming smaller.\n", "\n", "The gradient vector should be close to 0 at $ \\hat{\\boldsymbol{\\beta}} $" ] }, { "cell_type": "code", "execution_count": null, "id": "8593c26b", "metadata": { "hide-output": false }, "outputs": [], "source": [ "poi.G()" ] }, { "cell_type": "markdown", "id": "0ef6866d", "metadata": {}, "source": [ "The iterative process can be visualized in the following diagram, where\n", "the maximum is found at $ \\beta = 10 $" ] }, { "cell_type": "code", "execution_count": null, "id": "9452f14b", "metadata": { "hide-output": false }, "outputs": [], "source": [ "logL = lambda x: -(x - 10) ** 2 - 10\n", "\n", "def find_tangent(β, a=0.01):\n", " y1 = logL(β)\n", " y2 = logL(β+a)\n", " x = np.array([[β, 1], [β+a, 1]])\n", " m, c = np.linalg.lstsq(x, np.array([y1, y2]), rcond=None)[0]\n", " return m, c\n", "\n", "β = np.linspace(2, 18)\n", "fig, ax = plt.subplots(figsize=(12, 8))\n", "ax.plot(β, logL(β), lw=2, c='black')\n", "\n", "for β in [7, 8.5, 9.5, 10]:\n", " β_line = np.linspace(β-2, β+2)\n", " m, c = find_tangent(β)\n", " y = m * β_line + c\n", " ax.plot(β_line, y, '-', c='purple', alpha=0.8)\n", " ax.text(β+2.05, y[-1], f'$G({β}) = {abs(m):.0f}$', fontsize=12)\n", " ax.vlines(β, -24, logL(β), linestyles='--', alpha=0.5)\n", " ax.hlines(logL(β), 6, β, linestyles='--', alpha=0.5)\n", "\n", "ax.set(ylim=(-24, -4), xlim=(6, 13))\n", "ax.set_xlabel(r'$\\beta$', fontsize=15)\n", "ax.set_ylabel(r'$log \\mathcal{L(\\beta)}$',\n", " rotation=0,\n", " labelpad=25,\n", " fontsize=15)\n", "ax.grid(alpha=0.3)\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "ae6e54d9", "metadata": {}, "source": [ "Note that our implementation of the Newton-Raphson algorithm is rather\n", "basic — for more robust implementations see,\n", "for example, [scipy.optimize](https://docs.scipy.org/doc/scipy/reference/optimize.html)." 
] }, { "cell_type": "markdown", "id": "2b679b2c", "metadata": {}, "source": [ "## Maximum Likelihood Estimation with `statsmodels`\n", "\n", "Now that we know what’s going on under the hood, we can apply MLE to an interesting application.\n", "\n", "We’ll use the Poisson regression model in `statsmodels` to obtain\n", "a richer output with standard errors, test values, and more.\n", "\n", "`statsmodels` uses the same algorithm as above to find the maximum\n", "likelihood estimates.\n", "\n", "Before we begin, let’s re-estimate our simple model with `statsmodels`\n", "to confirm we obtain the same coefficients and log-likelihood value." ] }, { "cell_type": "code", "execution_count": null, "id": "67031556", "metadata": { "hide-output": false }, "outputs": [], "source": [ "X = np.array([[1, 2, 5],\n", " [1, 1, 3],\n", " [1, 4, 2],\n", " [1, 5, 2],\n", " [1, 3, 1]])\n", "\n", "y = np.array([1, 0, 1, 1, 0])\n", "\n", "stats_poisson = Poisson(y, X).fit()\n", "print(stats_poisson.summary())" ] }, { "cell_type": "markdown", "id": "1b98ddc1", "metadata": {}, "source": [ "Now let’s replicate results from Daniel Treisman’s paper, [Russia’s\n", "Billionaires](https://www.aeaweb.org/articles?id=10.1257/aer.p20161068),\n", "mentioned earlier in the lecture.\n", "\n", "Treisman starts by estimating equation [(72.1)](#equation-poissonreg), where:\n", "\n", "- $ y_i $ is $ {number\\ of\\ billionaires}_i $ \n", "- $ x_{i1} $ is $ \\log{GDP\\ per\\ capita}_i $ \n", "- $ x_{i2} $ is $ \\log{population}_i $ \n", "- $ x_{i3} $ is $ {years\\ in\\ GATT}_i $ – years membership in GATT and WTO (to proxy access to international markets) \n", "\n", "\n", "The paper only considers the year 2008 for estimation.\n", "\n", "We will set up our variables for estimation like so (you should have the\n", "data assigned to `df` from earlier in the lecture)" ] }, { "cell_type": "code", "execution_count": null, "id": "6b9d91d8", "metadata": { "hide-output": false }, "outputs": [], "source": [ "# Keep only year 2008\n", "df = df[df['year'] == 2008]\n", "\n", "# Add a constant\n", "df['const'] = 1\n", "\n", "# Variable sets\n", "reg1 = ['const', 'lngdppc', 'lnpop', 'gattwto08']\n", "reg2 = ['const', 'lngdppc', 'lnpop',\n", " 'gattwto08', 'lnmcap08', 'rintr', 'topint08']\n", "reg3 = ['const', 'lngdppc', 'lnpop', 'gattwto08', 'lnmcap08',\n", " 'rintr', 'topint08', 'nrrents', 'roflaw']" ] }, { "cell_type": "markdown", "id": "39f4f7a1", "metadata": {}, "source": [ "Then we can use the `Poisson` function from `statsmodels` to fit the\n", "model.\n", "\n", "We’ll use robust standard errors as in the author’s paper" ] }, { "cell_type": "code", "execution_count": null, "id": "106f53d1", "metadata": { "hide-output": false }, "outputs": [], "source": [ "# Specify model\n", "poisson_reg = sm.Poisson(df[['numbil0']], df[reg1],\n", " missing='drop').fit(cov_type='HC0')\n", "print(poisson_reg.summary())" ] }, { "cell_type": "markdown", "id": "6a7db03b", "metadata": {}, "source": [ "Success! 
The algorithm was able to achieve convergence in 9 iterations.\n", "\n", "Our output indicates that GDP per capita, population, and years of\n", "membership in the General Agreement on Tariffs and Trade (GATT) are\n", "positively related to the number of billionaires a country has, as\n", "expected.\n", "\n", "Let’s also estimate the author’s more fully specified models and display\n", "them in a single table" ] }, { "cell_type": "code", "execution_count": null, "id": "3346d85b", "metadata": { "hide-output": false }, "outputs": [], "source": [ "regs = [reg1, reg2, reg3]\n", "reg_names = ['Model 1', 'Model 2', 'Model 3']\n", "info_dict = {'Pseudo R-squared': lambda x: f\"{x.prsquared:.2f}\",\n", "             'No. observations': lambda x: f\"{int(x.nobs):d}\"}\n", "regressor_order = ['const',\n", "                   'lngdppc',\n", "                   'lnpop',\n", "                   'gattwto08',\n", "                   'lnmcap08',\n", "                   'rintr',\n", "                   'topint08',\n", "                   'nrrents',\n", "                   'roflaw']\n", "results = []\n", "\n", "for reg in regs:\n", "    result = sm.Poisson(df[['numbil0']], df[reg],\n", "                        missing='drop').fit(cov_type='HC0',\n", "                                            maxiter=100, disp=0)\n", "    results.append(result)\n", "\n", "results_table = summary_col(results=results,\n", "                            float_format='%0.3f',\n", "                            stars=True,\n", "                            model_names=reg_names,\n", "                            info_dict=info_dict,\n", "                            regressor_order=regressor_order)\n", "results_table.add_title('Table 1 - Explaining the Number of Billionaires in 2008')\n", "print(results_table)" ] }, { "cell_type": "markdown", "id": "e5e06dfd", "metadata": {}, "source": [ "The output suggests that the frequency of billionaires is positively\n", "correlated with GDP per capita, population size and stock market\n", "capitalization, and negatively correlated with the top marginal income tax\n", "rate.\n", "\n", "To analyze our results by country, we can plot the difference between\n", "the predicted and actual values, then sort from highest to lowest and\n", "plot the first 15" ] }, { "cell_type": "code", "execution_count": null, "id": "f5eee151", "metadata": { "hide-output": false }, "outputs": [], "source": [ "data = ['const', 'lngdppc', 'lnpop', 'gattwto08', 'lnmcap08', 'rintr',\n", "        'topint08', 'nrrents', 'roflaw', 'numbil0', 'country']\n", "results_df = df[data].dropna()\n", "\n", "# Use last model (model 3)\n", "results_df['prediction'] = results[-1].predict()\n", "\n", "# Calculate difference\n", "results_df['difference'] = results_df['numbil0'] - results_df['prediction']\n", "\n", "# Sort in descending order\n", "results_df.sort_values('difference', ascending=False, inplace=True)\n", "\n", "# Plot the first 15 data points\n", "results_df[:15].plot('country', 'difference', kind='bar',\n", "                     figsize=(12,8), legend=False)\n", "plt.ylabel('Number of billionaires above predicted level')\n", "plt.xlabel('Country')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "5fbd8b42", "metadata": {}, "source": [ "As we can see, Russia has by far the highest number of billionaires in\n", "excess of what is predicted by the model (around 50 more than expected).\n", "\n", "Treisman uses this empirical result to discuss possible reasons for\n", "Russia’s excess of billionaires, including the origination of wealth in\n", "Russia, the political climate, and the history of privatization in the\n", "years after the USSR."
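, "\n\nFor reference, the exact numbers behind the bar chart can be printed directly (a small aside, assuming `results_df` from the previous cell is still in scope):" ] }, { "cell_type": "code", "execution_count": null, "id": "e0f1a2b3", "metadata": { "hide-output": false }, "outputs": [], "source": [ "# Countries with the most billionaires relative to the model's prediction;\n", "# results_df is already sorted by 'difference', so Russia appears first\n", "print(results_df[['country', 'numbil0', 'prediction', 'difference']].head())"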
] }, { "cell_type": "markdown", "id": "f2db5dc6", "metadata": {}, "source": [ "## Summary\n", "\n", "In this lecture, we used Maximum Likelihood Estimation to estimate the\n", "parameters of a Poisson model.\n", "\n", "`statsmodels` contains other built-in likelihood models such as\n", "[Probit](https://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.Probit.html)\n", "and\n", "[Logit](https://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.Logit.html).\n", "\n", "For further flexibility, `statsmodels` provides a way to specify the\n", "distribution manually using the `GenericLikelihoodModel` class - an\n", "example notebook can be found\n", "[here](https://www.statsmodels.org/dev/examples/notebooks/generated/generic_mle.html)." ] }, { "cell_type": "markdown", "id": "f84abb76", "metadata": {}, "source": [ "## Exercises" ] }, { "cell_type": "markdown", "id": "373135f6", "metadata": {}, "source": [ "## Exercise 72.1\n", "\n", "Suppose we wanted to estimate the probability of an event $ y_i $\n", "occurring, given some observations.\n", "\n", "We could use a probit regression model, where the pmf of $ y_i $ is\n", "\n", "$$\n", "\\begin{aligned}\n", "f(y_i; \\boldsymbol{\\beta}) = \\mu_i^{y_i} (1-\\mu_i)^{1-y_i}, \\quad y_i = 0,1 \\\\\n", "\\text{where} \\quad \\mu_i = \\Phi(\\mathbf{x}_i' \\boldsymbol{\\beta})\n", "\\end{aligned}\n", "$$\n", "\n", "$ \\Phi $ represents the *cumulative normal distribution* and\n", "constrains the predicted $ y_i $ to be between 0 and 1 (as required\n", "for a probability).\n", "\n", "$ \\boldsymbol{\\beta} $ is a vector of coefficients.\n", "\n", "Following the example in the lecture, write a class to represent the\n", "Probit model.\n", "\n", "To begin, find the log-likelihood function and derive the gradient and\n", "Hessian.\n", "\n", "The `scipy` module `stats.norm` contains the functions needed to\n", "compute the cmf and pmf of the normal distribution." 
] }, { "cell_type": "markdown", "id": "04cf1b6e", "metadata": {}, "source": [ "## Solution to[ Exercise 72.1](https://python.quantecon.org/#mle_ex1)\n", "\n", "The log-likelihood can be written as\n", "\n", "$$\n", "\\log \\mathcal{L} = \\sum_{i=1}^n\n", "\\big[\n", "y_i \\log \\Phi(\\mathbf{x}_i' \\boldsymbol{\\beta}) +\n", "(1 - y_i) \\log (1 - \\Phi(\\mathbf{x}_i' \\boldsymbol{\\beta})) \\big]\n", "$$\n", "\n", "Using the **fundamental theorem of calculus**, the derivative of a\n", "cumulative probability distribution is its marginal distribution\n", "\n", "$$\n", "\\frac{ \\partial} {\\partial s} \\Phi(s) = \\phi(s)\n", "$$\n", "\n", "where $ \\phi $ is the marginal normal distribution.\n", "\n", "The gradient vector of the Probit model is\n", "\n", "$$\n", "\\frac {\\partial \\log \\mathcal{L}} {\\partial \\boldsymbol{\\beta}} =\n", "\\sum_{i=1}^n \\Big[\n", "y_i \\frac{\\phi(\\mathbf{x}'_i \\boldsymbol{\\beta})}{\\Phi(\\mathbf{x}'_i \\boldsymbol{\\beta)}} -\n", "(1 - y_i) \\frac{\\phi(\\mathbf{x}'_i \\boldsymbol{\\beta)}}{1 - \\Phi(\\mathbf{x}'_i \\boldsymbol{\\beta)}}\n", "\\Big] \\mathbf{x}_i\n", "$$\n", "\n", "The Hessian of the Probit model is\n", "\n", "$$\n", "\\frac {\\partial^2 \\log \\mathcal{L}} {\\partial \\boldsymbol{\\beta} \\partial \\boldsymbol{\\beta}'} =\n", "-\\sum_{i=1}^n \\phi (\\mathbf{x}_i' \\boldsymbol{\\beta})\n", "\\Big[\n", "y_i \\frac{ \\phi (\\mathbf{x}_i' \\boldsymbol{\\beta}) + \\mathbf{x}_i' \\boldsymbol{\\beta} \\Phi (\\mathbf{x}_i' \\boldsymbol{\\beta}) } { [\\Phi (\\mathbf{x}_i' \\boldsymbol{\\beta})]^2 } +\n", "(1 - y_i) \\frac{ \\phi (\\mathbf{x}_i' \\boldsymbol{\\beta}) - \\mathbf{x}_i' \\boldsymbol{\\beta} (1 - \\Phi (\\mathbf{x}_i' \\boldsymbol{\\beta})) } { [1 - \\Phi (\\mathbf{x}_i' \\boldsymbol{\\beta})]^2 }\n", "\\Big]\n", "\\mathbf{x}_i \\mathbf{x}_i'\n", "$$\n", "\n", "Using these results, we can write a class for the Probit model as\n", "follows" ] }, { "cell_type": "code", "execution_count": null, "id": "e1322f51", "metadata": { "hide-output": false }, "outputs": [], "source": [ "class ProbitRegression:\n", "\n", " def __init__(self, y, X, β):\n", " self.X, self.y, self.β = X, y, β\n", " self.n, self.k = X.shape\n", "\n", " def μ(self):\n", " return norm.cdf(self.X @ self.β.T)\n", "\n", " def ϕ(self):\n", " return norm.pdf(self.X @ self.β.T)\n", "\n", " def logL(self):\n", " μ = self.μ()\n", " return np.sum(y * np.log(μ) + (1 - y) * np.log(1 - μ))\n", "\n", " def G(self):\n", " μ = self.μ()\n", " ϕ = self.ϕ()\n", " return np.sum((X.T * y * ϕ / μ - X.T * (1 - y) * ϕ / (1 - μ)),\n", " axis=1)\n", "\n", " def H(self):\n", " X = self.X\n", " β = self.β\n", " μ = self.μ()\n", " ϕ = self.ϕ()\n", " a = (ϕ + (X @ β.T) * μ) / μ**2\n", " b = (ϕ - (X @ β.T) * (1 - μ)) / (1 - μ)**2\n", " return -(ϕ * (y * a + (1 - y) * b) * X.T) @ X" ] }, { "cell_type": "markdown", "id": "57a7df91", "metadata": {}, "source": [ "## Exercise 72.2\n", "\n", "Use the following dataset and initial values of $ \\boldsymbol{\\beta} $ to\n", "estimate the MLE with the Newton-Raphson algorithm developed earlier in\n", "the lecture\n", "\n", "$$\n", "\\mathbf{X} =\n", "\\begin{bmatrix}\n", "1 & 2 & 4 \\\\\n", "1 & 1 & 1 \\\\\n", "1 & 4 & 3 \\\\\n", "1 & 5 & 6 \\\\\n", "1 & 3 & 5\n", "\\end{bmatrix}\n", "\\quad\n", "y =\n", "\\begin{bmatrix}\n", "1 \\\\\n", "0 \\\\\n", "1 \\\\\n", "1 \\\\\n", "0\n", "\\end{bmatrix}\n", "\\quad\n", "\\boldsymbol{\\beta}_{(0)} =\n", "\\begin{bmatrix}\n", "0.1 \\\\\n", "0.1 \\\\\n", "0.1\n", "\\end{bmatrix}\n", "$$\n", "\n", "Verify your results 
with `statsmodels` - you can import the `Probit`\n", "function with the following import statement" ] }, { "cell_type": "code", "execution_count": null, "id": "35fd43bd", "metadata": { "hide-output": false }, "outputs": [], "source": [ "from statsmodels.discrete.discrete_model import Probit" ] }, { "cell_type": "markdown", "id": "1208d87d", "metadata": {}, "source": [ "Note that the simple Newton-Raphson algorithm developed in this lecture\n", "is very sensitive to initial values, and therefore you may fail to\n", "achieve convergence with different starting values." ] }, { "cell_type": "markdown", "id": "ce804923", "metadata": {}, "source": [ "## Solution to [Exercise 72.2](https://python.quantecon.org/#mle_ex2)\n", "\n", "Here is one solution" ] }, { "cell_type": "code", "execution_count": null, "id": "e834dee0", "metadata": { "hide-output": false }, "outputs": [], "source": [ "X = np.array([[1, 2, 4],\n", "              [1, 1, 1],\n", "              [1, 4, 3],\n", "              [1, 5, 6],\n", "              [1, 3, 5]])\n", "\n", "y = np.array([1, 0, 1, 1, 0])\n", "\n", "# Take a guess at initial βs\n", "β = np.array([0.1, 0.1, 0.1])\n", "\n", "# Create instance of Probit regression class\n", "prob = ProbitRegression(y, X, β)\n", "\n", "# Run Newton-Raphson algorithm\n", "newton_raphson(prob)" ] }, { "cell_type": "code", "execution_count": null, "id": "a052e556", "metadata": { "hide-output": false }, "outputs": [], "source": [ "# Use statsmodels to verify results\n", "\n", "print(Probit(y, X).fit().summary())" ] } ], "metadata": { "date": 1714442507.873115, "filename": "mle.md", "kernelspec": { "display_name": "Python", "language": "python3", "name": "python3" }, "title": "Maximum Likelihood Estimation" }, "nbformat": 4, "nbformat_minor": 5 }