# SMILE — BFGS Optimization

The `smile.math.BFGS` class implements three closely related quasi-Newton
minimization algorithms, all accessible through overloaded `minimize()` static
methods:

| Variant | Method signature | Memory | Constraints |
|---------|-----------------|--------|-------------|
| **BFGS** | `minimize(func, x, gtol, maxIter)` | O(n²) | none |
| **L-BFGS** | `minimize(func, m, x, gtol, maxIter)` | O(m·n) | none |
| **L-BFGS-B** | `minimize(func, m, x, l, u, gtol, maxIter)` | O(m·n) | box bounds |

---

## Background

### BFGS

The *Broyden–Fletcher–Goldfarb–Shanno* (BFGS) algorithm is the gold standard
for moderate-sized unconstrained smooth minimization. It belongs to the class
of **quasi-Newton** methods: instead of computing the exact Hessian ∇²f(x)
(expensive) it maintains a running approximation **H** of the *inverse* Hessian,
updated cheaply at each iteration using only gradient differences.

Each BFGS iteration performs three steps:

1. **Descent direction**: compute **d** = −**H** ∇f(x).
2. **Line search**: find a step length α satisfying the Wolfe conditions
   (sufficient decrease + curvature); update x ← x + α **d**.
3. **Hessian update**: update **H** using the rank-2 formula:

$$
H_{k+1} = H_k
  + \frac{s_k s_k^\top}{s_k^\top y_k}
  - \frac{H_k y_k y_k^\top H_k}{y_k^\top H_k y_k}
$$

where **s**ₖ = xₖ₊₁ − xₖ and **y**ₖ = ∇f(xₖ₊₁) − ∇f(xₖ).

Because **H** is stored as a dense n × n matrix, the memory cost is **O(n²)**,
which makes BFGS impractical for problems with more than a few thousand
variables.

### L-BFGS

*Limited-memory BFGS* (L-BFGS) avoids storing the full inverse Hessian
approximation. Instead it keeps a rolling window of the last **m** pairs of
(position difference **s**, gradient difference **y**) vectors and uses a
two-loop recursion to compute the matrix–vector product **H**ₖ **g** implicitly.

Memory is **O(m·n)** — independent of the problem dimension. A history size of
`m = 5` (the default suggested range is 3–7) is sufficient for most problems.

### L-BFGS-B

*Limited-memory BFGS with Bounds* (L-BFGS-B) extends L-BFGS to handle
**box constraints**:

```
lᵢ ≤ xᵢ ≤ uᵢ    for each variable xᵢ
```

At each iteration the algorithm:

1. Computes the **Generalized Cauchy Point** (GCP) — the first local minimizer
   of a piecewise-linear model along the projected steepest-descent path.  This
   identifies which variables are at their bounds ("fixed") and which are free.
2. Solves a **reduced subspace problem** over the free variables using the
   L-BFGS inverse Hessian approximation.
3. Performs a **projected line search** that keeps the iterate inside the box.

---

## API

All variants share the same pattern: the function value at the minimum is
returned, and the solution vector `x` is updated **in-place**.

### Implementing the Objective Function

All three variants require a
`smile.util.function.DifferentiableMultivariateFunction`, which has two methods:

```java
/** Returns f(x). */
double f(double[] x);

/** Computes the gradient g = ∇f(x) and returns f(x). */
double g(double[] x, double[] g);
```

The `g()` method must fill the supplied array `g` with the partial derivatives
and also return the function value. This dual-purpose design avoids redundant
evaluations when the value and gradient are computed together.

### BFGS — Unconstrained, Dense Hessian

```java
double minimum = BFGS.minimize(func, x, gtol, maxIter);
```

| Parameter | Meaning |
|-----------|---------|
| `func` | `DifferentiableMultivariateFunction` to minimize |
| `x` | initial guess; **overwritten** with the solution on return |
| `gtol` | gradient convergence tolerance (e.g. `1e-5`); must be `> 0` |
| `maxIter` | maximum number of outer iterations |

**Returns** the function value at the best point found.

Convergence is declared when either:
- the scaled gradient norm falls below `gtol`, or
- the scaled step length falls below `4 × ε` (machine epsilon × 10⁻⁸).

If `maxIter` is reached without convergence a warning is logged and the best
value found so far is returned.

### L-BFGS — Unconstrained, Limited Memory

```java
double minimum = BFGS.minimize(func, m, x, gtol, maxIter);
```

| Parameter | Meaning |
|-----------|---------|
| `m` | number of correction pairs to retain (`3 ≤ m ≤ 7` recommended; `m = 5` is a common choice) |
| (others) | same as BFGS |

Use L-BFGS instead of BFGS when **n > ~1000** or when memory is limited.  For
small n the dense BFGS method typically converges in fewer iterations.

### L-BFGS-B — Bound-Constrained, Limited Memory

```java
double minimum = BFGS.minimize(func, m, x, l, u, gtol, maxIter);
```

| Parameter | Meaning |
|-----------|---------|
| `l` | lower bounds array, length `n` |
| `u` | upper bounds array, length `n` |
| (others) | same as L-BFGS |

Both `l` and `u` must have the same length as `x`.  The initial point `x` need
not satisfy the bounds; `x` is clamped to `[l, u]` before the first iteration.
Use `Double.NEGATIVE_INFINITY` / `Double.POSITIVE_INFINITY` to leave individual
components unbounded.

---

## Configuration

The internal convergence constants can be tuned via JVM system properties:

| Property | Default | Meaning |
|----------|---------|---------|
| `smile.bfgs.epsilon` | `1e-8` | Base epsilon used to derive `TOLX` and `TOLF` |

Set on the command line with `-Dsmile.bfgs.epsilon=1e-10`.

---

## Examples

### Example 1 — BFGS on the Rosenbrock Function

The *Rosenbrock banana function* `f(x, y) = (1-x)² + 100(y-x²)²` has a
narrow curved valley; its global minimum is 0 at (1, 1).

```java
import smile.math.BFGS;
import smile.util.function.DifferentiableMultivariateFunction;

DifferentiableMultivariateFunction rosenbrock = new DifferentiableMultivariateFunction() {
    @Override
    public double f(double[] x) {
        double t1 = 1.0 - x[0];
        double t2 = 10.0 * (x[1] - x[0] * x[0]);
        return t1 * t1 + t2 * t2;
    }

    @Override
    public double g(double[] x, double[] g) {
        double t1 = 1.0 - x[0];
        double t2 = 10.0 * (x[1] - x[0] * x[0]);
        g[1] = 20.0 * t2;
        g[0] = -2.0 * (x[0] * g[1] + t1);
        return t1 * t1 + t2 * t2;
    }
};

double[] x = {-1.2, 1.0};             // starting point
double fmin = BFGS.minimize(rosenbrock, x, 1e-5, 500);

System.out.printf("Minimum: f = %.2e at (%.6f, %.6f)%n", fmin, x[0], x[1]);
// Minimum: f = 3.45e-10 at (1.000000, 1.000000)
```

### Example 2 — L-BFGS for Large-Scale Problems

```java
int n = 10_000;
// Quadratic bowl: f(x) = Σ i·xᵢ² — minimum is 0 at the origin
DifferentiableMultivariateFunction bowl = new DifferentiableMultivariateFunction() {
    @Override
    public double f(double[] x) {
        double sum = 0;
        for (int i = 0; i < x.length; i++) sum += (i + 1.0) * x[i] * x[i];
        return sum;
    }

    @Override
    public double g(double[] x, double[] g) {
        double sum = 0;
        for (int i = 0; i < x.length; i++) {
            g[i] = 2.0 * (i + 1.0) * x[i];
            sum += (i + 1.0) * x[i] * x[i];
        }
        return sum;
    }
};

double[] x = new double[n];
java.util.Arrays.fill(x, 1.0);        // start at all-ones

// m = 5 history pairs — dense BFGS would need a 10000×10000 matrix
double fmin = BFGS.minimize(bowl, 5, x, 1e-8, 1000);
System.out.printf("Minimum: %.6e%n", fmin);
```

### Example 3 — L-BFGS-B with Box Constraints

Find the minimum of `f(x₀, x₁) = (x₀ − 3)² + (x₁ − 4)² + 1` subject to
`3.5 ≤ x₀ ≤ 5` and `3.5 ≤ x₁ ≤ 5`.  The unconstrained minimum (3, 4) is
outside the box, so the constrained minimum is at the boundary (3.5, 4).

```java
DifferentiableMultivariateFunction func = new DifferentiableMultivariateFunction() {
    @Override
    public double f(double[] x) {
        double d0 = x[0] - 3, d1 = x[1] - 4;
        return d0 * d0 + d1 * d1 + 1;
    }

    @Override
    public double g(double[] x, double[] g) {
        double d0 = x[0] - 3, d1 = x[1] - 4;
        g[0] = 2 * d0;
        g[1] = 2 * d1;
        return d0 * d0 + d1 * d1 + 1;
    }
};

double[] x = {0.0, 0.0};              // starting point (outside bounds is fine)
double[] l = {3.5, 3.5};
double[] u = {5.0, 5.0};

double fmin = BFGS.minimize(func, 5, x, l, u, 1e-8, 500);
System.out.printf("x = (%.4f, %.4f), f = %.4f%n", x[0], x[1], fmin);
// x = (3.5000, 4.0000), f = 1.2500
```

---

## Choosing the Right Variant

```
Problem size         Constraints      Recommended variant
──────────────────────────────────────────────────────────
n ≤ 1 000            none             BFGS
n > 1 000            none             L-BFGS (m = 5)
any n                box bounds       L-BFGS-B (m = 5)
```

**General guidelines:**

- For **smooth, well-conditioned** problems BFGS usually converges in O(n)
  iterations; L-BFGS may need slightly more but uses far less memory.
- Prefer **L-BFGS** whenever `n > 1000` to avoid the O(n²) Hessian storage.
- Use **L-BFGS-B** any time variables have natural bounds (e.g. probabilities
  in [0, 1], non-negative weights, angles in [0, 2π]).
- All three variants require the **gradient** to be analytically supplied.
  If only a function value is available, consider finite-difference approximation
  or automatic differentiation to construct the `DifferentiableMultivariateFunction`.

---

## Convergence Criteria

All variants use the same two stopping tests:

1. **Step-length test** — converged when the largest relative step component
   falls below `TOLX = 4ε`:

   ```
   max_i |Δxᵢ| / max(|xᵢ|, 1) < TOLX
   ```

2. **Gradient test** — converged when the scaled infinity-norm of the gradient
   falls below `gtol`:

   ```
   max_i |∂f/∂xᵢ| · max(|xᵢ|, 1) / max(f, 1) < gtol
   ```

   L-BFGS-B uses the **projected gradient** norm instead, which respects the
   box constraints:

   ```
   max_i |proj_gradient_i| < gtol
   ```

   where `proj_gradient_i = clip(xᵢ − gᵢ, lᵢ, uᵢ) − xᵢ`.

---

## Line Search

All three variants share the same Wolfe-condition backtracking line search.
Starting from a full step α = 1, the step is reduced using:

- **Quadratic interpolation** on the first backtrack, then
- **Cubic interpolation** on subsequent backtracks,

with the minimum step bounded below at `0.1 × α` to prevent steps from
shrinking too fast, and bounded above at `stpmax = 100 × max(‖x‖, n)` to
prevent excessively large jumps into undefined regions.

The sufficient decrease condition (Armijo) is checked with `ftol = 1e-4`.

---

## References

1. R. Fletcher, *Practical Methods of Optimization*, 2nd ed., Wiley, 1987.
2. D. C. Liu and J. Nocedal, "On the limited memory BFGS method for large
   scale optimization," *Mathematical Programming B*, 45 (1989), 503–528.
3. R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu, "A limited memory algorithm for
   bound constrained optimization," *SIAM Journal on Scientific Computing*,
   16(5) (1995), 1190–1208.


---

*SMILE — © 2010-2026 Haifeng Li. GNU GPL licensed.*