---
title: 'Convexity, Duality, and Optimality Conditions (KL Chapter 11, BV Chapter 5)'
subtitle: Biostat/Biomath M257
author: Dr. Hua Zhou @ UCLA
date: today
format:
  html:
    theme: cosmo
    embed-resources: true
    number-sections: true
    toc: true
    toc-depth: 4
    toc-location: left
    code-fold: false
jupyter:
  jupytext:
    formats: 'ipynb,qmd'
    text_representation:
      extension: .qmd
      format_name: quarto
      format_version: '1.0'
    jupytext_version: 1.14.5
  kernelspec:
    display_name: Julia 1.9.0
    language: julia
    name: julia-1.9
---
## Convexity
* A function $f: \mathbb{R}^n \mapsto \mathbb{R}$ is **convex** if
1. $\text{dom} f$ is a convex set: $\lambda \mathbf{x} + (1-\lambda) \mathbf{y} \in \text{dom} f$ for all $\mathbf{x},\mathbf{y} \in \text{dom} f$ and any $\lambda \in (0, 1)$, and
2. $f(\lambda \mathbf{x} + (1-\lambda) \mathbf{y}) \le \lambda f(\mathbf{x}) + (1-\lambda) f(\mathbf{y})$, for all $\mathbf{x},\mathbf{y} \in \text{dom} f$ and $\lambda \in (0,1)$.
* $f$ is **strictly convex** if the inequality is strict for all $\mathbf{x} \ne \mathbf{y} \in \text{dom} f$ and all $\lambda \in (0,1)$.
* **Supporting hyperplane inequality**. A differentiable function $f$ is convex if and only if
$$
f(\mathbf{x}) \ge f(\mathbf{y}) + \nabla f(\mathbf{y})^T (\mathbf{x}-\mathbf{y})
$$
for all $\mathbf{x}, \mathbf{y} \in \text{dom} f$.
Proof: For the "only if" part, convexity gives
$$
f[\lambda \mathbf{x} + (1 - \lambda) \mathbf{y}] \le \lambda f(\mathbf{x}) + (1 - \lambda) f(\mathbf{y}).
$$
Thus
$$
\frac{f(\mathbf{y} + \lambda (\mathbf{x} - \mathbf{y})) - f(\mathbf{y})}{\lambda} \le f(\mathbf{x}) - f(\mathbf{y}).
$$
Taking the limit $\lambda \to 0$ yields the supporting hyperplane inequality. For the "if" part, let $\mathbf{z} = \lambda \mathbf{x} + (1 - \lambda) \mathbf{y}$. Then
\begin{eqnarray*}
f(\mathbf{x}) - f(\mathbf{z}) &\ge& \nabla f(\mathbf{z})^T (\mathbf{x} - \mathbf{z}) \quad (1) \\
f(\mathbf{y}) - f(\mathbf{z}) &\ge& \nabla f(\mathbf{z})^T (\mathbf{y} - \mathbf{z}). \quad (2)
\end{eqnarray*}
Taking $\lambda \cdot (1) + (1 - \lambda) \cdot (2)$ and noting that $\lambda (\mathbf{x} - \mathbf{z}) + (1 - \lambda) (\mathbf{y} - \mathbf{z}) = \mathbf{0}$ yields
$$
f(\lambda \mathbf{x} + (1 - \lambda) \mathbf{y}) \le \lambda f(\mathbf{x}) + (1 - \lambda) f(\mathbf{y}).
$$
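To see the inequality numerically, here is a minimal sketch (it assumes the `ForwardDiff` package for gradients and uses the arbitrary smooth convex test function $f(\mathbf{x}) = \log(1 + e^{x_1}) + x_2^2$):

```{julia}
using ForwardDiff, LinearAlgebra

# an arbitrary smooth convex test function
f(x) = log(1 + exp(x[1])) + x[2]^2

# supporting hyperplane inequality: f(x) ≥ f(y) + ∇f(y)'(x - y),
# checked at random point pairs up to a small roundoff tolerance
check(x, y) = f(x) ≥ f(y) + dot(ForwardDiff.gradient(f, y), x - y) - 1e-10
all(check(randn(2), randn(2)) for _ in 1:1000)  # true
```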
* **Second-order condition for convexity**. A twice differentiable function $f$ is convex if and only if $\nabla^2f(\mathbf{x})$ is psd for all $\mathbf{x} \in \text{dom} f$. If $\nabla^2f(\mathbf{x})$ is pd for all $\mathbf{x} \in \text{dom} f$, then $f$ is strictly convex; the converse fails, e.g., $f(x) = x^4$ is strictly convex but $f''(0) = 0$.
* Convexity and global optima. Suppose $f$ is a convex function.
1. Any stationary point $\mathbf{y}$, i.e., $\nabla f(\mathbf{y})=\mathbf{0}$, is a global minimum. (Proof: By supporting hyperplane inequality, $f(\mathbf{x}) \ge f(\mathbf{y}) + \nabla f(\mathbf{y})^T (\mathbf{x} - \mathbf{y}) = f(\mathbf{y})$ for all $\mathbf{x} \in \text{dom} f$.)
2. Any local minimum is a global minimum.
3. The set of (global) minima is convex.
4. If $f$ is strictly convex, then the global minimum, if it exists, is unique.
* Example: Least squares estimate. $f(\beta) = \frac 12 \| \mathbf{y} - \mathbf{X} \beta \|_2^2$ has Hessian $\nabla^2f = \mathbf{X}^T \mathbf{X}$ which is psd. So $f$ is convex and any stationary point (solution to the normal equation) is a global minimum. When $\mathbf{X}$ is rank deficient, the set of solutions is convex.
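A numerical illustration of this example (a minimal sketch with simulated data; the dimensions are arbitrary):

```{julia}
using LinearAlgebra

X = randn(100, 5)
y = randn(100)
H = X'X            # Hessian of f(β) = ½‖y - Xβ‖₂²
eigmin(H) ≥ 0      # true: X'X is psd, so f is convex
β̂ = H \ (X'y)      # solves the normal equation X'Xβ = X'y; a global minimum
```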
* **Jensen's inequality**. If $h$ is convex and $\mathbf{W}$ is a random vector taking values in $\text{dom} h$, then
$$
\mathbf{E}[h(\mathbf{W})] \ge h [\mathbf{E}(\mathbf{W})],
$$
provided both expectations exist. For a strictly convex $h$, equality holds if and only if $\mathbf{W} = \mathbf{E}(\mathbf{W})$ almost surely.
Proof: Take $\mathbf{x} = \mathbf{W}$ and $\mathbf{y} = \mathbf{E} (\mathbf{W})$ in the supporting hyperplane inequality.
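A quick Monte Carlo illustration (a sketch using the convex function $h(w) = e^w$ and $W \sim N(0,1)$, for which $\mathbf{E}[e^W] = e^{1/2} > e^0 = e^{\mathbf{E}(W)}$):

```{julia}
using Statistics

w = randn(10^6)              # draws of W ~ N(0, 1)
mean(exp.(w)), exp(mean(w))  # ≈ (1.65, 1.00): E[h(W)] ≥ h(E[W])
```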
* **Information inequality**. Let $f$ and $g$ be two densities with respect to a common measure $\mu$ and $f, g>0$ almost everywhere relative to $\mu$. Then
$$
\mathbf{E}_f (\log f) \ge \mathbf{E}_f (\log g),
$$
with equality if and only if $f = g$ almost everywhere relative to $\mu$.
Proof: Apply Jensen's inequality to the convex function $-\ln(t)$ and the random variable $W = g(X)/f(X)$, where $X \sim f$:
$$
\mathbf{E}_f (\log g) - \mathbf{E}_f (\log f) = \mathbf{E}_f \left( \log \frac{g}{f} \right) \le \log \mathbf{E}_f \left( \frac{g}{f} \right) = \log \int g \, d\mu = 0.
$$
Important applications of the information inequality include M-estimation and the EM algorithm.
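A Monte Carlo check of the information inequality (a sketch assuming the `Distributions` package, with $f$ and $g$ two arbitrary normal densities):

```{julia}
using Distributions, Statistics

f, g = Normal(0, 1), Normal(1, 2)
x = rand(f, 10^6)                     # X ~ f
# E_f(log f) - E_f(log g) = KL(f‖g) ≥ 0, with equality iff f = g
mean(logpdf.(f, x) .- logpdf.(g, x))  # ≈ 0.44 > 0 here since f ≠ g
```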
## Duality
* Consider optimization problem
\begin{eqnarray*}
&\text{minimize}& f_0(\mathbf{x}) \\
&\text{subject to}& f_i(\mathbf{x}) \le 0, \quad i = 1,\ldots,m \\
& & h_i(\mathbf{x}) = 0, \quad i = 1,\ldots,p.
\end{eqnarray*}
* The **Lagrangian** is
\begin{eqnarray*}
L(\mathbf{x}, \lambda, \nu) = f_0(\mathbf{x}) + \sum_{i=1}^m \lambda_i f_i(\mathbf{x}) + \sum_{i=1}^p \nu_i h_i(\mathbf{x}).
\end{eqnarray*}
The vectors $\lambda = (\lambda_1,\ldots, \lambda_m)^T$ and $\nu = (\nu_1,\ldots,\nu_p)^T$ are called the **Lagrange multiplier vectors** or **dual variables**.
* The **Lagrange dual function** is the infimum of the Lagrangian over $\mathbf{x}$:
\begin{eqnarray*}
g(\lambda, \nu) = \inf_{\mathbf{x}} L(\mathbf{x}, \lambda, \nu) = \inf_{\mathbf{x}} \left( f_0(\mathbf{x}) + \sum_{i=1}^m \lambda_i f_i(\mathbf{x}) + \sum_{i=1}^p \nu_i h_i(\mathbf{x}) \right).
\end{eqnarray*}
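As a worked example (a standard one, cf. BV Chapter 5), consider minimizing $\mathbf{x}^T \mathbf{x}$ subject to $\mathbf{A} \mathbf{x} = \mathbf{b}$. Setting the gradient of $L(\mathbf{x}, \nu) = \mathbf{x}^T \mathbf{x} + \nu^T (\mathbf{A} \mathbf{x} - \mathbf{b})$ to zero gives the minimizer $\mathbf{x} = -\frac{1}{2} \mathbf{A}^T \nu$, so
$$
g(\nu) = -\frac{1}{4} \nu^T \mathbf{A} \mathbf{A}^T \nu - \mathbf{b}^T \nu,
$$
a concave quadratic in $\nu$.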
* Denote the optimal value of the original problem by $p^\star$. For any $\lambda \succeq \mathbf{0}$ and any $\nu$, we have
\begin{eqnarray*}
g(\lambda, \nu) \le p^\star.
\end{eqnarray*}
Proof: For any feasible point $\tilde{\mathbf{x}}$,
\begin{eqnarray*}
L(\tilde{\mathbf{x}}, \lambda, \nu) = f_0(\tilde{\mathbf{x}}) + \sum_{i=1}^m \lambda_i f_i(\tilde{\mathbf{x}}) + \sum_{i=1}^p \nu_i h_i(\tilde{\mathbf{x}}) \le f_0(\tilde{\mathbf{x}})
\end{eqnarray*}
because the second term is non-positive and the third term is zero. Then
\begin{eqnarray*}
g(\lambda, \nu) = \inf_{\mathbf{x}} L(\mathbf{x}, \lambda, \nu) \le L(\tilde{\mathbf{x}}, \lambda, \nu) \le f_0(\tilde{\mathbf{x}}).
\end{eqnarray*}
* Each pair $(\lambda, \nu)$ with $\lambda \succeq \mathbf{0}$ gives a lower bound on the optimal value $p^\star$, so it is natural to ask for the best lower bound the Lagrange dual function can provide. This leads to the **Lagrange dual problem**
\begin{eqnarray*}
&\text{maximize}& g(\lambda, \nu) \\
&\text{subject to}& \lambda \succeq \mathbf{0},
\end{eqnarray*}
which is a convex problem regardless of whether the primal problem is convex.
* We denote the optimal value of the Lagrange dual problem by $d^\star$, which satisfies the **weak duality**
\begin{eqnarray*}
d^\star \le p^\star.
\end{eqnarray*}
The difference $p^\star - d^\star$ is called the **optimal duality gap**.
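A toy illustration of these bounds (a sketch using the one-dimensional problem: minimize $x^2$ subject to $x \ge 1$, i.e., $f_1(x) = 1 - x \le 0$, with $p^\star = 1$ at $x^\star = 1$; minimizing $L(x, \lambda) = x^2 + \lambda(1 - x)$ over $x$ at $x = \lambda/2$ gives $g(\lambda) = \lambda - \lambda^2/4$):

```{julia}
g(λ) = λ - λ^2 / 4                    # Lagrange dual function of the toy problem
pstar = 1.0                           # p⋆ = 1, attained at x⋆ = 1
all(g(λ) ≤ pstar for λ in 0:0.01:10)  # true: weak duality g(λ) ≤ p⋆
g(2.0)                                # = 1.0 = p⋆, so d⋆ = p⋆ (zero duality gap)
```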
* If the primal problem is convex, that is
\begin{eqnarray*}
&\text{minimize}& f_0(\mathbf{x}) \\
&\text{subject to}& f_i(\mathbf{x}) \le 0, \quad i = 1,\ldots,m \\
& & \mathbf{A} \mathbf{x} = \mathbf{b},
\end{eqnarray*}
with $f_0,\ldots,f_m$ convex, we usually (but not always) have **strong duality**, i.e., $d^\star = p^\star$.
* The conditions under which strong duality holds are called **constraint qualifications**. A commonly used one is **Slater's condition**: There exists a point in the relative interior of the domain such that
\begin{eqnarray*}
f_i(\mathbf{x}) < 0, \quad i = 1,\ldots,m, \quad \mathbf{A} \mathbf{x} = \mathbf{b}.
\end{eqnarray*}
Such a point is also called **strictly feasible**.
## KKT optimality conditions
KKT is "one of the great triumphs of 20th century applied mathematics" (KL Chapter 11).
### Nonconvex problems
* Assume $f_0,\ldots,f_m,h_1,\ldots,h_p$ are differentiable. Let $\mathbf{x}^\star$ and $(\lambda^\star, \nu^\star)$ be any primal and dual optimal points with zero duality gap, i.e., strong duality holds.
* Since $\mathbf{x}^\star$ minimizes $L(\mathbf{x}, \lambda^\star, \nu^\star)$ over $\mathbf{x}$, its gradient vanishes at $\mathbf{x}^\star$. Together with feasibility of $\mathbf{x}^\star$ and $\lambda^\star$, this yields the **Karush-Kuhn-Tucker (KKT) conditions**
\begin{eqnarray*}
f_i(\mathbf{x}^\star) &\le& 0, \quad i = 1,\ldots,m \\
h_i(\mathbf{x}^\star) &=& 0, \quad i = 1,\ldots,p \\
\lambda_i^\star &\ge& 0, \quad i = 1,\ldots,m \\
\lambda_i^\star f_i(\mathbf{x}^\star) &=& 0, \quad i=1,\ldots,m \\
\nabla f_0(\mathbf{x}^\star) + \sum_{i=1}^m \lambda_i^\star \nabla f_i(\mathbf{x}^\star) + \sum_{i=1}^p \nu_i^\star \nabla h_i(\mathbf{x}^\star) &=& \mathbf{0}.
\end{eqnarray*}
* The fourth condition (**complementary slackness**) follows from
\begin{eqnarray*}
f_0(\mathbf{x}^\star) &=& g(\lambda^\star, \nu^\star) \\
&=& \inf_{\mathbf{x}} \left( f_0(\mathbf{x}) + \sum_{i=1}^m \lambda_i^\star f_i(\mathbf{x}) + \sum_{i=1}^p \nu_i^\star h_i(\mathbf{x}) \right) \\
&\le& f_0(\mathbf{x}^\star) + \sum_{i=1}^m \lambda_i^\star f_i(\mathbf{x}^\star) + \sum_{i=1}^p \nu_i^\star h_i(\mathbf{x}^\star) \\
&\le& f_0(\mathbf{x}^\star).
\end{eqnarray*}
The first equality holds by strong duality, forcing both inequalities in the chain to be tight; in particular $\sum_{i=1}^m \lambda_i^\star f_i(\mathbf{x}^\star) = 0$. Since each term is non-positive, $\lambda_i^\star f_i(\mathbf{x}^\star) = 0$ for $i = 1,\ldots,m$.
* To summarize, for any optimization problem with differentiable objective and constraint functions for which strong duality obtains, any pair of primal and dual optimal points must satisfy the KKT conditions.
### Convex problems
* When the primal problem is convex, the KKT conditions are also sufficient for the points to be primal and dual optimal.
* If $f_i$ are convex and $h_i$ are affine, and $(\tilde{\mathbf{x}}, \tilde \lambda, \tilde \nu)$ satisfy the KKT conditions, then $\tilde{\mathbf{x}}$ and $(\tilde \lambda, \tilde \nu)$ are primal and dual optimal, with zero duality gap.
* The KKT conditions play an important role in optimization. Many algorithms for convex optimization are conceived as, or can be interpreted as, methods for solving the KKT conditions.
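For instance, for the equality-constrained convex quadratic program, minimize $\frac{1}{2} \mathbf{x}^T \mathbf{P} \mathbf{x} + \mathbf{q}^T \mathbf{x}$ subject to $\mathbf{A} \mathbf{x} = \mathbf{b}$ with $\mathbf{P}$ psd, the KKT conditions reduce to a single linear system in $(\mathbf{x}^\star, \nu^\star)$, and solving it yields the global optimum. A minimal sketch with simulated data (dimensions arbitrary):

```{julia}
using LinearAlgebra

n, p = 5, 2
Q = randn(n, n); P = Q'Q + I          # P is pd, hence the QP is convex
q, A, b = randn(n), randn(p, n), randn(p)
# KKT: Px⋆ + q + A'ν⋆ = 0 (stationarity), Ax⋆ = b (feasibility)
K = [P A'; A zeros(p, p)]             # the KKT linear system
xν = K \ [-q; b]
x, ν = xν[1:n], xν[n+1:end]
norm(A * x - b), norm(P * x + q + A' * ν)  # both ≈ 0: KKT satisfied
```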