Chapter 15: Applications – Econometrics, Finance, and Machine Learning

In this final chapter, we see how the Bayesian methods we've learned are applied to solve complex, real-world problems. Each section provides a concrete, runnable code example for the specific techniques used in each field.

15.1 Econometrics

Econometrics relies heavily on time-series and panel data, where Bayesian methods offer principled ways to handle complexity, prevent overfitting, and model latent structures.

Bayesian Vector Autoregression (BVAR)

Vector autoregressions (VARs) model the dynamics of multiple time series jointly. However, they have many parameters, which leads to overfitting. A BVAR uses priors to "shrink" the coefficients towards simpler, more plausible values (e.g., a random walk), improving forecast performance.

Python
import pymc as pm
import numpy as np
import arviz as az
import matplotlib.pyplot as plt

# 1. Simulate a 2-dimensional VAR(1) process
np.random.seed(42)
T = 100

# True coefficient matrix A1
A1_true = np.array([[0.7, 0.2],
                    [-0.1, 0.9]])
# True covariance of errors
cov_true = np.array([[1.0, 0.3],
                     [0.3, 1.5]])

# Generate data
Y = np.zeros((T, 2))
for t in range(1, T):
    Y[t, :] = A1_true @ Y[t-1, :] + np.random.multivariate_normal([0, 0], cov_true)

# 2. Define the BVAR model in PyMC
with pm.Model() as bvar_model:
    # Priors on the coefficient matrix (like a simplified Minnesota prior),
    # shrinking coefficients towards zero
    A1 = pm.Normal("A1", mu=0, sigma=0.5, shape=(2, 2))

    # Prior on the error covariance matrix using an LKJ Cholesky decomposition
    chol, _, _ = pm.LKJCholeskyCov(
        "chol", n=2, eta=2.0, sd_dist=pm.Exponential.dist(1.0, shape=2),
        compute_corr=True
    )
    cov = pm.Deterministic("cov", chol @ chol.T)

    # Likelihood: the mean for Y[t] is A1 @ Y[t-1]
    mu = pm.math.dot(A1, Y[:-1, :].T).T
    likelihood = pm.MvNormal("likelihood", mu=mu, chol=chol, observed=Y[1:, :])

    trace_bvar = pm.sample(1000, tune=1500)

# 3. Analyze the results
az.plot_posterior(trace_bvar, var_names=["A1", "cov"], figsize=(10, 8))
plt.show()

# Compare true vs. posterior mean of A1
print("True A1 matrix:\n", A1_true)
print("\nPosterior mean of A1 matrix:\n",
      trace_bvar.posterior["A1"].mean(dim=("chain", "draw")).values)

The posterior distributions for the elements of A1 will be centered near their true values. The priors prevent the estimates from overfitting to the noise in the 100 data points.
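Since the main payoff of the shrinkage priors is better forecasting, the sketch below is a minimal illustration of a one-step-ahead forecast. It assumes trace_bvar and Y from the example above are still in scope and simply propagates posterior draws of A1 and the error covariance through the VAR(1) equation; the variable names (A1_draws, forecasts) are illustrative.

Python
# A minimal sketch of a one-step-ahead forecast from the fitted BVAR,
# assuming `trace_bvar` and `Y` from the example above are in scope.
post = az.extract(trace_bvar)             # stacks chain/draw into one sample dimension
A1_draws = post["A1"].values              # shape (2, 2, n_samples)
cov_draws = post["cov"].values            # shape (2, 2, n_samples)

rng = np.random.default_rng(0)
forecasts = []
for i in range(A1_draws.shape[-1]):
    # Propagate both parameter uncertainty and shock uncertainty one step ahead
    mean_next = A1_draws[:, :, i] @ Y[-1, :]
    forecasts.append(rng.multivariate_normal(mean_next, cov_draws[:, :, i]))
forecasts = np.array(forecasts)

print("One-step-ahead forecast mean:", forecasts.mean(axis=0))
print("90% interval, series 1:", np.percentile(forecasts[:, 0], [5, 95]))

The spread of these forecasts reflects parameter uncertainty as well as shock variance, which is exactly what a point-estimate VAR forecast ignores.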
Hierarchical Model for Panel Data

Panel data follows multiple units (e.g., firms, countries) over time. A hierarchical (or multilevel) model is a natural fit, allowing us to estimate both population-level effects and unit-specific deviations by "partially pooling" information.

Python
import pandas as pd

# 1. Simulate panel data (e.g., sales for 5 stores over 20 years)
np.random.seed(123)
n_stores = 5
n_years = 20
store_idx = np.repeat(np.arange(n_stores), n_years)

# Create different intercepts for each store
store_intercepts_true = np.random.normal(50, 10, n_stores)
# Assume a common trend (slope)
slope_true = 2.5

X = np.tile(np.arange(n_years), n_stores)
y = store_intercepts_true[store_idx] + slope_true * X + np.random.normal(0, 5, size=n_stores * n_years)

# 2. Define the hierarchical model
with pm.Model() as panel_model:
    # Hyperpriors for the group-level (store) intercepts
    mu_a = pm.Normal('mu_a', mu=50, sigma=10)
    sigma_a = pm.HalfNormal('sigma_a', sigma=10)

    # Common slope for all stores (fixed effect)
    b = pm.Normal('b', mu=0, sigma=5)

    # Store-specific intercepts (random effects)
    a = pm.Normal('a', mu=mu_a, sigma=sigma_a, shape=n_stores)

    # Model error
    sigma_y = pm.HalfNormal('sigma_y', sigma=10)

    # Expected value
    mu = a[store_idx] + b * X

    # Likelihood
    likelihood = pm.Normal('likelihood', mu=mu, sigma=sigma_y, observed=y)

    trace_panel = pm.sample(1000, tune=1000)

# 3. Plot the random intercepts
az.plot_forest(trace_panel, var_names=['a'], combined=True)
plt.axvline(x=store_intercepts_true.mean(), color='red', linestyle='--', label='True Mean Intercept')
plt.title("Posterior for Store-Specific Intercepts (Random Effects)")
plt.legend()
plt.show()

The forest plot shows the posterior for each store's intercept. These estimates "borrow strength" from each other, resulting in more stable and realistic values than if we had analyzed each store completely separately.

State-Space Models (Dynamic Linear Models)

State-space models are used to infer latent (unobserved) states that evolve over time. The Bayesian framework is a natural fit for estimating the latent state and the model's parameters simultaneously. The Nile river flow data is a classic application.

Python
import statsmodels.api as sm

# 1. Load the annual Nile river flow data (ships with statsmodels; columns: year, volume)
nile_data = sm.datasets.nile.load_pandas().data
nile_years = nile_data["year"].values
nile_flow = nile_data["volume"].values

# 2. Define the state-space model (DLM)
with pm.Model() as nile_dlm:
    # Prior on the initial state (unobserved river level)
    level_init = pm.Normal("level_init", mu=np.mean(nile_flow), sigma=200)

    # Priors for process and observation variances
    sigma_level = pm.HalfNormal("sigma_level", sigma=50)  # How much the true level can change (process error)
    sigma_obs = pm.HalfNormal("sigma_obs", sigma=100)      # How noisy our measurements are (observation error)

    # The latent random walk for the river level
    level = pm.GaussianRandomWalk("level", mu=0, sigma=sigma_level,
                                  init_dist=pm.Normal.dist(mu=level_init, sigma=200),
                                  shape=len(nile_flow))

    # Likelihood: observations are noisy measurements of the true level
    likelihood = pm.Normal("likelihood", mu=level, sigma=sigma_obs, observed=nile_flow)

    trace_nile = pm.sample(2000, tune=2000)

# 3. Plot the inferred latent state
plt.figure(figsize=(12, 5))
plt.plot(nile_years, nile_flow, 'o', color='gray', alpha=0.7, label="Observed Flow")
plt.plot(nile_years, trace_nile.posterior['level'].mean(dim=("chain", "draw")), 'C0', lw=2,
         label="Inferred Underlying Level")
az.plot_hdi(nile_years, trace_nile.posterior['level'], color='C0', fill_kwargs={'alpha': 0.2})
plt.title("Bayesian State-Space Model of Nile River Flow")
plt.xlabel("Year")
plt.ylabel("Volume")
plt.legend()
plt.show()

The model correctly infers the famous structural break in the Nile's flow around 1898, demonstrating its ability to track a changing, unobserved process over time.
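To put a number on that break, the short sketch below (assuming trace_nile and nile_years from the example above are in scope, and taking 1898 as the cutoff) summarizes the posterior distribution of the drop in the average level.

Python
# Quantify the structural break: posterior of the drop in average level after 1898.
# Assumes `trace_nile` and `nile_years` from the example above are in scope.
level_post = az.extract(trace_nile)["level"].values   # shape (n_years, n_samples)
pre_mask = nile_years <= 1898
post_mask = nile_years > 1898

drop = level_post[pre_mask, :].mean(axis=0) - level_post[post_mask, :].mean(axis=0)
print(f"Posterior mean drop in level: {drop.mean():.0f}")
print("94% interval for the drop:", np.quantile(drop, [0.03, 0.97]))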
15.2 Finance

In finance, uncertainty is paramount. Bayesian methods provide a superior framework for handling parameter uncertainty in asset pricing, portfolio construction, and risk management.

Bayesian Model Averaging for Asset Pricing

Which factors (market, size, value, momentum, etc.) truly explain stock returns? Instead of picking one model, Bayesian Model Averaging (BMA) combines predictions from multiple models, weighted by the evidence for each model. We can use LOO/WAIC as a practical proxy for model evidence.

Python
# 1. Simulate data from a 3-factor model where one factor is irrelevant
np.random.seed(42)
N = 100
# Three potential factors (uncorrelated for simplicity)
F1 = np.random.randn(N)
F2 = np.random.randn(N)
F3 = np.random.randn(N)  # Irrelevant factor
# Asset excess returns
R = 0.02 + 1.2 * F1 + 0.5 * F2 + 0.0 * F3 + np.random.normal(0, 0.5, N)

# 2. Define and fit multiple regression models
def fit_factor_model(X_factors):
    """Bayesian linear regression of the returns R on the given factor matrix."""
    with pm.Model():
        alpha = pm.Normal('alpha', mu=0, sigma=1)
        betas = pm.Normal('betas', mu=0, sigma=1, shape=X_factors.shape[1])
        sigma = pm.HalfNormal('sigma', sigma=1)
        mu = alpha + pm.math.dot(X_factors, betas)
        pm.Normal('R', mu=mu, sigma=sigma, observed=R)
        # Pointwise log-likelihood is needed for LOO/WAIC model comparison
        trace = pm.sample(500, tune=500, progressbar=False,
                          idata_kwargs={"log_likelihood": True})
    return trace

traces = {
    'F1': fit_factor_model(F1[:, None]),                     # Model 1: Factor 1 only
    'F1_F2': fit_factor_model(np.vstack([F1, F2]).T),         # Model 2: Factors 1 and 2
    'F1_F2_F3': fit_factor_model(np.vstack([F1, F2, F3]).T),  # Model 3: All three factors
}

# 3. Compare models using ArviZ's compare function (which also computes model weights)
comparison_df = az.compare(traces)
print(comparison_df)
az.plot_compare(comparison_df)
plt.show()

The comparison table should rank the F1_F2 model at the top (highest expected log predictive density, elpd_loo). It also provides a weight column, which plays the role of BMA weights: by default ArviZ computes them by stacking the models' predictive distributions, so each model is weighted by its predictive performance. This tells us how much we should trust each model when making predictions.
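To make the averaging concrete, the sketch below (assuming traces and comparison_df from above are in scope) combines the posterior mean of the pricing intercept alpha across models using the weights from az.compare; the same weighted combination can be applied to any prediction of interest.

Python
# A minimal sketch of model-averaged inference, assuming `traces` and
# `comparison_df` from the example above are in scope.
weights = comparison_df["weight"]  # pandas Series indexed by model name
alpha_by_model = {name: float(tr.posterior["alpha"].mean()) for name, tr in traces.items()}

# Weighted combination of the per-model posterior mean intercepts
alpha_bma = sum(weights[name] * alpha_by_model[name] for name in traces)
print("Per-model posterior mean alpha:", alpha_by_model)
print(f"Model-averaged alpha: {alpha_bma:.4f}")

In the same way, full predictive distributions can be mixed by drawing from each model's posterior predictive in proportion to its weight.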
Bayesian Portfolio Optimization

Classic portfolio optimization uses single point estimates of expected returns and covariances, ignoring the massive uncertainty around them. A Bayesian approach yields a full posterior distribution for the optimal asset weights, giving a more realistic view.

Python
# 1. Simulate some stock returns
np.random.seed(42)
true_means = np.array([0.05, 0.08])  # Annual returns for two assets
true_cov = np.array([[0.1**2, 0.5*0.1*0.15],
                     [0.5*0.1*0.15, 0.15**2]])
returns = np.random.multivariate_normal(true_means, true_cov, size=20)

# 2. Fit a Bayesian model to get posteriors for the means and covariance
with pm.Model() as portfolio_model:
    chol, _, _ = pm.LKJCholeskyCov("chol", n=2, eta=2.0,
                                   sd_dist=pm.Exponential.dist(1.0, shape=2),
                                   compute_corr=True)
    cov = pm.Deterministic("cov", chol @ chol.T)
    mu = pm.Normal("mu", mu=0, sigma=0.2, shape=2)
    obs = pm.MvNormal("obs", mu=mu, chol=chol, observed=returns)
    trace_portfolio = pm.sample(1000, tune=1000)

# 3. For each posterior sample, calculate the optimal max-Sharpe weights
#    (tangency weights are proportional to inv(cov) @ mu)
posterior_samples = az.extract(trace_portfolio)
mu_draws = posterior_samples["mu"].values    # shape (2, n_samples)
cov_draws = posterior_samples["cov"].values  # shape (2, 2, n_samples)

optimal_weights = []
for i in range(mu_draws.shape[-1]):
    w = np.linalg.inv(cov_draws[:, :, i]) @ mu_draws[:, i]
    optimal_weights.append(w / np.sum(w))  # Normalize to sum to 1
optimal_weights = np.array(optimal_weights)

# 4. Plot the distribution of optimal weights
plt.hist(optimal_weights[:, 0], bins=30, alpha=0.7, label="Posterior dist. of weight for Asset 1", density=True)
plt.hist(optimal_weights[:, 1], bins=30, alpha=0.7, label="Posterior dist. of weight for Asset 2", density=True)
plt.title("Distribution of Optimal Portfolio Weights")
plt.xlabel("Weight")
plt.legend()
plt.show()

Instead of a single "optimal" allocation, the histograms show the full range of plausible optimal weights, directly visualizing the uncertainty inherent in the portfolio construction problem.

Bayesian Risk Management (Value-at-Risk)

Value-at-Risk (VaR) is a measure of downside risk. A Bayesian approach generates a full predictive distribution for future returns, from which we can calculate a VaR that properly accounts for parameter uncertainty.

Python
# 1. Simulate some asset returns with fat tails (Student-T distribution)
np.random.seed(101)
# nu=4 implies fatter tails than a normal distribution
asset_returns = np.random.standard_t(df=4, size=200) * 0.02  # Scaled to be like daily % returns

# 2. Fit a Bayesian Student-T model to the returns
with pm.Model() as var_model:
    mu = pm.Normal('mu', mu=0, sigma=0.01)
    sigma = pm.HalfNormal('sigma', sigma=0.05)
    nu = pm.Gamma('nu', alpha=2, beta=0.1)  # Prior on the degrees of freedom
    likelihood = pm.StudentT('likelihood', nu=nu, mu=mu, sigma=sigma, observed=asset_returns)
    trace_var = pm.sample(1000, tune=1000)

# 3. Generate samples from the posterior predictive distribution
with var_model:
    ppc = pm.sample_posterior_predictive(trace_var)

# 4. Calculate the 5% VaR from the predictive distribution
future_returns = ppc.posterior_predictive['likelihood'].values.flatten()
VaR_95 = np.quantile(future_returns, 0.05)

# 5. Plot the results
plt.hist(future_returns, bins=100, density=True, label="Posterior Predictive Distribution of Returns")
plt.axvline(VaR_95, color='red', linestyle='--', lw=2, label=f"95% Value-at-Risk = {VaR_95*100:.2f}%")
plt.title("Bayesian Value-at-Risk")
plt.xlabel("1-Day Return")
plt.legend()
plt.show()

The histogram shows our belief about the distribution of tomorrow's return. The red line marks the 5th percentile: we are 95% confident that the loss will not be worse than this value, with our confidence properly reflecting the uncertainty in the model's parameters.

15.3 Machine Learning

Bayesian methods provide powerful tools for machine learning, offering principled regularization, robust uncertainty quantification, and flexible non-parametric models.

Priors as Regularization: The Bayesian Lasso

In machine learning, regularization is used to prevent overfitting, especially with high-dimensional data. Placing specific priors on regression coefficients is equivalent to regularization: a Laplace prior corresponds to L1 (Lasso) regularization. The classical Lasso (the MAP estimate under this prior) performs feature selection by setting irrelevant coefficients exactly to zero; the full Bayesian posterior instead concentrates them tightly around zero.

Python
# 1. Simulate high-dimensional data where most features are irrelevant
np.random.seed(24)
N = 100  # samples
K = 20   # features
# True coefficients (only 3 are non-zero)
true_coeffs = np.array([5, -3, 2] + [0] * (K - 3))
X = np.random.randn(N, K)
y_lasso = X @ true_coeffs + np.random.normal(0, 2, N)

# 2. Define the Bayesian Lasso model (using Laplace priors)
with pm.Model() as lasso_model:
    # Laplace prior encourages sparsity
    beta = pm.Laplace('beta', mu=0, b=1, shape=K)
    intercept = pm.Normal('intercept', mu=0, sigma=10)
    sigma = pm.HalfNormal('sigma', sigma=5)

    mu = intercept + pm.math.dot(X, beta)
    likelihood = pm.Normal('likelihood', mu=mu, sigma=sigma, observed=y_lasso)

    trace_lasso = pm.sample(1000, tune=1500)

# 3. Plot the posteriors of the coefficients
az.plot_forest(trace_lasso, var_names=['beta'], combined=True, r_hat=True)
plt.axvline(x=0, color='red', linestyle='--')
plt.title("Posterior Distributions of Coefficients (Bayesian Lasso)")
plt.show()

The forest plot will show that the posteriors for the first three coefficients are clearly away from zero. In contrast, the posteriors for the 17 irrelevant coefficients are tightly concentrated around zero, demonstrating the shrinkage and feature-selection behavior of the Laplace prior.
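Because the full posterior never sits exactly at zero, a common practical selection rule is to keep the coefficients whose highest-density interval excludes zero. A minimal sketch, assuming trace_lasso from the example above is in scope:

Python
# Flag coefficients whose 94% HDI excludes zero, assuming `trace_lasso` is in scope.
hdi_beta = az.hdi(trace_lasso, var_names=["beta"], hdi_prob=0.94)["beta"]
lower = hdi_beta.sel(hdi="lower").values
upper = hdi_beta.sel(hdi="higher").values

selected = np.where((lower > 0) | (upper < 0))[0]
# With the simulated data above, indices 0, 1, and 2 should be selected
print("Coefficients whose 94% HDI excludes zero:", selected)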
Bayesian Neural Networks (BNN)

A BNN places priors on the weights and biases of a neural network. Instead of learning a single set of weights, it infers a full posterior distribution over them. This allows the network to quantify its predictive uncertainty, indicating when it is "not sure" about a prediction.

Python
from sklearn.datasets import make_moons

# 1. Generate non-linear classification data
X_bnn, Y_bnn = make_moons(n_samples=200, noise=0.1, random_state=42)

# 2. Define the BNN architecture in PyMC
n_hidden = 5
with pm.Model() as bnn_model:
    # Data containers so we can swap in a prediction grid later
    ann_input = pm.Data("ann_input", X_bnn)
    ann_output = pm.Data("ann_output", Y_bnn)

    # Input -> Hidden layer
    w1 = pm.Normal('w1', mu=0, sigma=1, shape=(X_bnn.shape[1], n_hidden))
    b1 = pm.Normal('b1', mu=0, sigma=1, shape=n_hidden)

    # Hidden layer -> Output
    w2 = pm.Normal('w2', mu=0, sigma=1, shape=n_hidden)
    b2 = pm.Normal('b2', mu=0, sigma=1)

    # Define the network logic
    act_1 = pm.math.tanh(pm.math.dot(ann_input, w1) + b1)
    output = pm.math.sigmoid(pm.math.dot(act_1, w2) + b2)

    # Likelihood
    likelihood = pm.Bernoulli('likelihood', p=output, observed=ann_output)

    # 3. Fit using fast variational inference (ADVI)
    approx = pm.fit(n=30000, method='advi')

trace_bnn = approx.sample(1000)

# 4. Evaluate the network on a grid to plot the decision boundary with uncertainty
grid_x, grid_y = np.mgrid[-2:3:100j, -1.5:2:100j]
grid_2d = np.column_stack([grid_x.ravel(), grid_y.ravel()])
dummy_out = np.zeros(grid_2d.shape[0], dtype=int)  # placeholder labels for prediction

with bnn_model:
    pm.set_data({"ann_input": grid_2d, "ann_output": dummy_out})
    ppc_bnn = pm.sample_posterior_predictive(trace_bnn, var_names=['likelihood'], random_seed=42)

# Posterior mean probability of class 1 at each grid point
p_class1 = ppc_bnn.posterior_predictive['likelihood'].mean(dim=("chain", "draw")).values.reshape(grid_x.shape)

# Plotting
plt.figure(figsize=(8, 6))
contour = plt.contourf(grid_x, grid_y, p_class1, cmap='coolwarm', alpha=0.8)
plt.scatter(X_bnn[Y_bnn == 0][:, 0], X_bnn[Y_bnn == 0][:, 1], label='Class 0')
plt.scatter(X_bnn[Y_bnn == 1][:, 0], X_bnn[Y_bnn == 1][:, 1], label='Class 1')
plt.title("BNN Decision Boundary and Uncertainty")
plt.colorbar(contour, label="Posterior Mean Probability of Class 1")
plt.legend()
plt.show()

The plot shows a smooth decision boundary. The color saturation indicates certainty: deep red/blue regions are confident predictions, while lighter, purplish regions near the boundary represent high uncertainty.

Topic Models: Latent Dirichlet Allocation (LDA)

LDA is a fully Bayesian hierarchical model used to discover abstract "topics" in a collection of documents. It assumes each document is a mixture of topics, and each topic is a distribution over words. While it is possible to build LDA in PyMC, standard practice is to use optimized libraries like scikit-learn.

Python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# 1. Sample text data
corpus = [
    "The stock market fell due to interest rate hikes and inflation fears",
    "Investors are moving capital from equity to bonds and cash",
    "The central bank announced a new monetary policy to fight inflation",
    "NASA launched a new rocket to the moon for scientific exploration",
    "The Hubble telescope captured stunning images of distant galaxies",
    "Space exploration is crucial for scientific advancement and new discoveries",
]

# 2. Vectorize the text data (convert words to counts)
vectorizer = CountVectorizer(stop_words='english')
X_lda = vectorizer.fit_transform(corpus)

# 3. Fit the LDA model (a Bayesian model under the hood)
lda = LatentDirichletAllocation(n_components=2, random_state=42)  # 2 topics: Finance & Space
lda.fit(X_lda)

# 4. Inspect the topics by printing the top words for each
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = f"Topic #{topic_idx}: "
        message += " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)

print("Discovered Topics:")
print_top_words(lda, vectorizer.get_feature_names_out(), 5)

The output will show that the algorithm has discovered one topic dominated by words like "inflation," "market," "bank," and "stock," and another dominated by "space," "exploration," "scientific," and "rocket," successfully identifying the underlying themes in the corpus.
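Because LDA treats each document as a mixture of topics, we can also inspect those mixtures directly. A small sketch, assuming lda, X_lda, and corpus from the example above are in scope:

Python
# Per-document topic mixtures, assuming `lda`, `X_lda`, and `corpus` are in scope.
doc_topics = lda.transform(X_lda)  # shape (n_documents, n_topics); rows sum to 1
for doc, weights in zip(corpus, doc_topics):
    print(f"{doc[:45]:<45} topic mix: {np.round(weights, 2)}")

The finance sentences should load mostly on one topic and the space sentences mostly on the other.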
Gaussian Processes (GPs)

GPs are a flexible, non-parametric Bayesian approach to regression. Instead of fitting a single function, a GP infers a distribution over functions that are consistent with the data, providing excellent uncertainty quantification.

Python
# 1. Generate some non-linear data
np.random.seed(1)
X_gp = np.linspace(0, 10, 20)[:, None]
y_gp = np.sin(X_gp).ravel() + np.random.normal(0, 0.2, 20)

# 2. Define and fit the Gaussian Process model in PyMC
with pm.Model() as gp_model:
    # Priors for the covariance function hyperparameters
    length_scale = pm.Gamma("length_scale", alpha=2, beta=1)
    eta = pm.HalfCauchy("eta", beta=5)

    # Specify the covariance function (kernel)
    cov_func = eta**2 * pm.gp.cov.ExpQuad(1, ls=length_scale)
    gp = pm.gp.Marginal(cov_func=cov_func)

    # Prior for the observation noise
    sigma = pm.HalfNormal("sigma", sigma=5)

    # Likelihood
    y_ = gp.marginal_likelihood("y", X=X_gp, y=y_gp, sigma=sigma)

    trace_gp = pm.sample(1000, tune=1000)

# 3. Predict on new data points
X_new = np.linspace(-2, 12, 100)[:, None]
with gp_model:
    f_pred = gp.conditional("f_pred", Xnew=X_new)
    pred_samples = pm.sample_posterior_predictive(trace_gp, var_names=["f_pred"])

# 4. Plot the GP fit and its uncertainty
plt.figure(figsize=(10, 5))
plt.plot(X_gp, y_gp, 'o', color='C0', ms=8, label="Observed Data")
plt.plot(X_new, pred_samples.posterior_predictive["f_pred"].mean(dim=("chain", "draw")), "C1", lw=2,
         label="Posterior Mean Function")
az.plot_hdi(X_new.flatten(), pred_samples.posterior_predictive["f_pred"], color='C1',
            hdi_prob=0.95, fill_kwargs={'alpha': 0.2})
plt.title("Gaussian Process Regression")
plt.xlabel("X")
plt.ylabel("Y")
plt.legend()
plt.show()

The plot illustrates the power of GPs. The model learns the underlying sine wave and, crucially, the uncertainty (the shaded band) is narrow where we have data and widens dramatically where we don't, capturing the model's confidence in its predictions.