---
title: "Likelihood Theory (ELMR Appendix)"
author: "Dr. Jin Zhou @ UCLA"
subtitle: Biostat 200C
date: today
format:
  html:
    theme: cosmo
    embed-resources: true
    number-sections: true
    toc: true
    toc-depth: 4
    toc-location: left
    code-fold: false
engine: knitr
knitr:
  opts_chunk: 
    fig.align: 'center'
    # fig.width: 6
    # fig.height: 4
    message: FALSE
    cache: false
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(fig.align = 'center', cache = FALSE)
```

Display system information and load `tidyverse` and `faraway` packages
```{r}
sessionInfo()
library(tidyverse)
library(faraway)
```
`faraway` package contains the datasets in the ELMR book.

## Maximum Likelihood

- Consider $n$ independent observations $Y_1, Y_2, \ldots, Y_n$ from a distribution with density $f(y\mid \theta)$, where $\theta$ is the, possibly vector-valued, parameter. Suppose we observe $\mathbf{y} = (y_1,\ldots, y_n)^T$, then we define the likelihood as   
$$
L(\theta\mid y) = P(\mathbf{Y} = \mathbf{y}) = f(y_1,\ldots, y_n\mid \theta) = \prod_{i=1}^n f(y_i\mid \theta)
$$
- The likelihood is a function of the parameter(s) given the data and is the probability of the observed data given a specified value of the parameter(s).

- For continuous data, the likelihood is a density function and is not a probability. For continuous random variables, $Y_1, Y_2, \ldots, Y_n$, with probability density function $f(y\mid \theta)$. For $y_i$, 
$$
P(Y_i = y_i) = P(y_i \leq Y_i \leq y_i + dy) = f(y_i\mid \theta)dy
$$
- The likelihood is the joint density of the data, $f(y_1,\ldots, y_n\mid \theta)$, evaluated at the observed data, $\mathbf{y}$.
$$
L(\theta\mid y) \approx \prod_{i=1}^n f(y_i\mid \theta)
$$

- For example, suppose that $Y$ is binomially distributed $B(n, p)$. The likelihood is
$$
L(p\mid y) = P(Y = y) = {n \choose y} p^y(1-p)^{n-y}
$$
- The maximum likelihood estimate (MLE) is the the parameter(s) ($\theta$) that gives the largest probability to the observed data, or in other words, MLE of $\theta$ is the value of $\theta$ that maximizes the likelihood function. The MLE is denoted by $\hat{\theta}$.

- In most cases, it is easier to maximize the log of likelihood function, $l(\theta\mid y) = logL(\theta \mid y)$. Since log is a monotone increasing function, the maximum occurs at the same $\hat \theta$.

- Example, for binoimal distribution, the log likelihood is
$$
l(p\mid y) = logL(p\mid y) = \log{n \choose y} + y\log(p) + (n-y)\log(1-p)
$$

## Estimation

- The **score function** is the derivative of the log likelihood with respect to the parameter(s). 
$$
u(p\mid y) = \frac{\partial l(p\mid y)}{\partial p} = \frac{y}{p} - \frac{n-y}{1-p}
$$

- We can find the maximum likelihood estimate $\hat p$ by solving $u(p) = 0$. We get $\hat p = y/n$. We should also verify that this stationary point actually represents a maximum, i.e., second derivative is negative.

- Other properties of maximum likelihood estimators include consistency, sufficiency, asymptotic efficiency and asymptotic normality. These are discussed in books such as [A Course in Large Sample Theory](https://www.routledge.com/A-Course-in-Large-Sample-Theory/Ferguson/p/book/9781138445765?source=shoppingads&locale=en-USD&gad_source=1&gclid=Cj0KCQjwq86wBhDiARIsAJhuphkKo5BGKtpJBvvmWRg7DOhlPwHk-yHSpaV-7B06F-UVqfDzZASbGcIaAli0EALw_wcB) or [Kalbfleisch (1985, Chapters 1 and 2)](https://link.springer.com/book/10.1007/978-1-4612-1096-2).

- Usually we want more than an estimate; some measure of the uncertainty in the estimate is valuable. This can be obtained via the Fisher information which is:
$$
I(\theta) = \mbox{var}\,\, u(\theta)  = -\mbox{E}\left[\frac{\partial^2 l(\theta\mid y)}{\partial \theta^2}\right]
$$
- One can show that the variance of $\hat{\theta}$ can be estimated by:
$$
\mbox{var}(\hat{\theta}) = \frac{1}{I(\theta)}
$$
under mild regularity conditions. For the binomial example this gives:
$$
\mbox{var}(\hat{p}) = \frac{\hat{p}(1-\hat{p})}{n}
$$

- Examples where likelihood can be maximized explicitly are confined to simple cases. Typically, numerical optimization is necessary. The Newton–Raphson method is the most well-known technique.  
$$
\theta_{k+1} = \theta_k - H^{-1}(\theta_k) u(\theta_k)
$$
where $H(\theta)$ is the Hessian matrix of second derivatives of the log likelihood function evaluated at $\theta$.
$$
H(\theta) = \frac{\partial^2 l(\theta\mid y)}{\partial \theta \partial \theta^T}
$$
We iterate this method, putting $\theta_{k+1}$ in place of $\theta_{k}$ and so on, until the procedure (hopefully) converges. The Fisher scoring method replaces $H$ with $-I$ and sometimes gives superior results. This method is used in fitting GLMs and is equivalent to iteratively reweighted least squares.