---
name: asymptotic-theory
description: M-estimation, influence functions, and semiparametric efficiency theory for causal inference
---

# Asymptotic Theory

**Rigorous framework for statistical inference and efficiency in modern methodology**

Use this skill when working on: asymptotic properties of estimators, influence functions, semiparametric efficiency, double robustness, variance estimation, confidence intervals, hypothesis testing, M-estimation, or deriving limiting distributions.

---

## Efficiency Bounds

### Semiparametric Efficiency Theory

**Cramér-Rao Lower Bound**: For any unbiased estimator based on $n$ iid observations,
$$\text{Var}(\hat{\theta}) \geq \frac{1}{nI(\theta)}$$
where $I(\theta)$ is the Fisher information.

**Semiparametric Efficiency Bound**: The variance of the efficient influence function:
$$V_{eff} = E[\phi^*(O)^2]$$
where $\phi^*$ is the efficient influence function (EIF).

**Influence Function Notation**: $IF(O; \theta, P)$ represents the influence of observation $O$ on the parameter $\theta = T(P)$ under distribution $P$:
$$IF(O; \theta, P) = \lim_{\epsilon \to 0} \frac{T((1-\epsilon)P + \epsilon \delta_O) - T(P)}{\epsilon}$$

**Semiparametric Variance**: For regular asymptotically linear (RAL) estimators,
$$\sqrt{n}(\hat{\theta} - \theta_0) \xrightarrow{d} N(0, E[IF(O)^2])$$

**Estimating Equations**: M-estimators solve $\sum_{i=1}^n \psi(O_i; \theta) = 0$, with asymptotic variance:
$$V = \left(\frac{\partial}{\partial \theta} E[\psi(O; \theta)]\right)^{-1} E[\psi(O; \theta)\psi(O; \theta)^T] \left(\frac{\partial}{\partial \theta} E[\psi(O; \theta)]\right)^{-T}$$

### Efficiency for Mediation Estimands

| Estimand | Efficient Influence Function | Efficiency Bound |
|----------|------------------------------|------------------|
| ATE | $\phi_{ATE} = \frac{A}{\pi}(Y-\mu_1) - \frac{1-A}{1-\pi}(Y-\mu_0) + \mu_1 - \mu_0 - \psi$ | $V_{ATE} = E[\phi_{ATE}^2]$ |
| NDE | Complex (VanderWeele & Tchetgen, 2014) | Higher than ATE |
| NIE | Complex (VanderWeele & Tchetgen, 2014) | Higher than ATE |

```r
# Compute the semiparametric efficiency bound for the ATE as the sample
# variance of the estimated efficient influence function
compute_efficiency_bound <- function(data, estimand = "ATE") {
  n <- nrow(data)

  if (estimand != "ATE") stop("Only estimand = 'ATE' is implemented.")

  # Estimate nuisance functions: propensity score and outcome regressions
  ps_model <- glm(A ~ X, data = data, family = binomial)
  pi_hat <- predict(ps_model, type = "response")

  mu1_model <- lm(Y ~ X, data = subset(data, A == 1))
  mu0_model <- lm(Y ~ X, data = subset(data, A == 0))
  mu1_hat <- predict(mu1_model, newdata = data)
  mu0_hat <- predict(mu0_model, newdata = data)

  # Plug-in estimate of the ATE and the efficient influence function
  psi_hat <- mean(mu1_hat - mu0_hat)
  phi <- with(data, {
    A / pi_hat * (Y - mu1_hat) -
      (1 - A) / (1 - pi_hat) * (Y - mu0_hat) +
      mu1_hat - mu0_hat - psi_hat
  })

  # Efficiency bound = variance of the EIF
  list(
    efficiency_bound = var(phi),
    standard_error = sqrt(var(phi) / n),
    eif_values = phi
  )
}
```

---

## Empirical Process Theory

### Key Concepts

**Empirical Process**: $\mathbb{G}_n(f) = \sqrt{n}(\mathbb{P}_n - P)f = \frac{1}{\sqrt{n}}\sum_{i=1}^n (f(O_i) - Pf)$

**Uniform Convergence**: For a Donsker class $\mathcal{F}$,
$$\sup_{f \in \mathcal{F}} |\mathbb{G}_n(f)| \xrightarrow{d} \sup_{f \in \mathcal{F}} |\mathbb{G}(f)|$$
where $\mathbb{G}$ is a Gaussian process.
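For a single fixed $f$, $\mathbb{G}_n(f)$ is just a centered, scaled sample mean, so its limiting normality can be checked directly by simulation. A minimal sketch; the choices $f(o) = o^2$ and $O \sim \text{Exp}(1)$ are illustrative assumptions, not prescribed above:

```r
# Monte Carlo check that G_n(f) = sqrt(n) * (P_n - P)f is approximately
# N(0, Var(f(O))). For O ~ Exp(1) and f(o) = o^2: Pf = E[O^2] = 2 and
# Var(O^2) = E[O^4] - (E[O^2])^2 = 24 - 4 = 20 (illustrative choices)
set.seed(1)
f <- function(o) o^2
Pf <- 2
n <- 500
G_n <- replicate(2000, {
  O <- rexp(n)
  sqrt(n) * (mean(f(O)) - Pf)
})
c(mc_var = var(G_n), theory_var = 20)  # the two should be close
```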
### Complexity Measures

| Measure | Definition | Use |
|---------|------------|-----|
| VC dimension | Max shattered set size | Classification |
| Covering number | $N(\epsilon, \mathcal{F}, \|\cdot\|)$ | General classes |
| Bracketing number | $N_{[]}(\epsilon, \mathcal{F}, L_2)$ | Entropy bounds |
| Rademacher complexity | $\mathcal{R}_n(\mathcal{F}) = E[\sup_{f \in \mathcal{F}} |\frac{1}{n}\sum_i \epsilon_i f(X_i)|]$ | Generalization |

```r
# Estimate Rademacher complexity via Monte Carlo for a finite list of
# functions f_class, each mapping the data to a numeric vector of length n
estimate_rademacher <- function(f_class, data, n_reps = 1000) {
  n <- nrow(data)
  sup_values <- replicate(n_reps, {
    # Random Rademacher signs
    epsilon <- sample(c(-1, 1), n, replace = TRUE)
    # Supremum (here: maximum) over the finite function class
    max(sapply(f_class, function(f) {
      abs(mean(epsilon * f(data)))
    }))
  })
  mean(sup_values)
}
```

---

## Donsker Classes

### Definition and Importance

A function class $\mathcal{F}$ is **Donsker** if $\mathbb{G}_n \rightsquigarrow \mathbb{G}$ in $\ell^\infty(\mathcal{F})$, where $\mathbb{G}$ is a tight Gaussian process.

### Key Donsker Classes

| Class | Description | Application |
|-------|-------------|-------------|
| VC classes | Finite VC dimension | Classification functions |
| Smooth functions | Bounded derivatives | Regression estimators |
| Monotone functions | Uniformly bounded | Distribution functions |
| Lipschitz functions | Bounded Lipschitz constant | M-estimators |

### Donsker Theorem Applications

**For M-estimation**: If $\{\psi(\cdot, \theta)\}$ belongs to a Donsker class (together with consistency and differentiability conditions), then
$$\sqrt{n}(\hat{\theta} - \theta_0) \xrightarrow{d} N(0, V)$$
where $V = (\partial_\theta E[\psi])^{-1} \text{Var}(\psi) (\partial_\theta E[\psi])^{-T}$.

```r
# Numerically examine the bracketing-entropy Donsker condition:
# F is Donsker if the entropy integral
#   integral_0^1 sqrt(log N_[](eps, F, L_2)) d(eps)
# is finite. `bracketing_fn(eps)` must return an estimate of
# N_[](eps, F, L_2) and is supplied by the user for the class at hand.
check_donsker_conditions <- function(bracketing_fn, eps_min = 0.01) {
  epsilon_grid <- seq(eps_min, 1, by = 0.01)
  bracket_numbers <- sapply(epsilon_grid, bracketing_fn)

  # Approximate the entropy integral on [eps_min, 1]; a finite numerical
  # value cannot by itself prove convergence near 0, so also inspect the
  # growth of N_[](eps) as eps -> 0
  N_interp <- approxfun(epsilon_grid, bracket_numbers, rule = 2)
  entropy_integral <- integrate(
    function(eps) sqrt(pmax(log(N_interp(eps)), 0)),
    lower = eps_min, upper = 1
  )

  list(
    entropy_integral = entropy_integral$value,
    bracket_numbers = data.frame(epsilon = epsilon_grid, N = bracket_numbers)
  )
}
```
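To make the M-estimation variance formula above concrete, the sketch below computes the sandwich variance for logistic regression by hand, where $\psi(O; \beta) = x(y - \text{expit}(x^\top\beta))$ is the score. The simulated data-generating process is an illustrative assumption; the hand-computed standard errors should closely match `vcov(fit)` since the model is correctly specified:

```r
# Sandwich variance for logistic regression viewed as an M-estimator:
# A = E[-d psi / d beta] = E[x x' p(1-p)], B = E[psi psi'], V = A^-1 B A^-1
set.seed(2)
n <- 2000
x <- cbind(1, rnorm(n))           # design matrix with intercept
beta0 <- c(-0.5, 1)               # illustrative true coefficients
y <- rbinom(n, 1, plogis(x %*% beta0))

fit <- glm(y ~ x[, 2], family = binomial)
beta_hat <- coef(fit)

p_hat <- as.vector(plogis(x %*% beta_hat))
psi <- x * (y - p_hat)                              # n x 2 matrix of scores
A_hat <- crossprod(x, x * (p_hat * (1 - p_hat))) / n  # minus mean Hessian
B_hat <- crossprod(psi) / n                           # outer product of scores
V_hat <- solve(A_hat) %*% B_hat %*% solve(A_hat)

sqrt(diag(V_hat) / n)   # sandwich SEs; compare with sqrt(diag(vcov(fit)))
```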
---

## Core Concepts

### Why Asymptotics?

1. **Exact distributions** often unavailable for complex estimators
2. **Large-sample approximations** provide tractable inference
3. **Efficiency theory** guides optimal estimator construction
4. **Robustness** properties clarified through asymptotic analysis

### Fundamental Sequence

```
Estimator θ̂ₙ → Consistency → Asymptotic Normality → Efficiency → Inference
                    ↓                 ↓                   ↓           ↓
                θ̂ₙ →ᵖ θ₀    √n(θ̂ₙ-θ₀) →ᵈ N(0,V)      V = V_eff   CIs, tests
```

---

## Modes of Convergence

### Convergence in Probability ($\xrightarrow{p}$)

$X_n \xrightarrow{p} X$ if $\forall \epsilon > 0$: $P(|X_n - X| > \epsilon) \to 0$

**Consistency**: $\hat{\theta}_n \xrightarrow{p} \theta_0$

### Convergence in Distribution ($\xrightarrow{d}$)

$X_n \xrightarrow{d} X$ if $F_{X_n}(x) \to F_X(x)$ at all continuity points

**Asymptotic normality**: $\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} N(0, V)$

### Almost Sure Convergence ($\xrightarrow{a.s.}$)

$X_n \xrightarrow{a.s.} X$ if $P(\lim_{n\to\infty} X_n = X) = 1$

**Relationship**: $\xrightarrow{a.s.} \Rightarrow \xrightarrow{p} \Rightarrow \xrightarrow{d}$

### Stochastic Order Notation

| Notation | Meaning | Example |
|----------|---------|---------|
| $O_p(1)$ | Bounded in probability | $\hat{\theta}_n = O_p(1)$ |
| $o_p(1)$ | Converges to 0 in probability | $\hat{\theta}_n - \theta_0 = o_p(1)$ |
| $O_p(a_n)$ | $X_n/a_n = O_p(1)$ | $\hat{\theta}_n - \theta_0 = O_p(n^{-1/2})$ |
| $o_p(a_n)$ | $X_n/a_n = o_p(1)$ | Remainder terms |

---

## Key Theorems

### Laws of Large Numbers

**Weak LLN**: If $X_1, \ldots, X_n$ iid with $E|X| < \infty$:
$$\bar{X}_n \xrightarrow{p} E[X]$$

**Strong LLN**: If $X_1, \ldots, X_n$ iid with $E|X| < \infty$:
$$\bar{X}_n \xrightarrow{a.s.} E[X]$$

**Uniform LLN**: For $\sup_{\theta \in \Theta}$ convergence, additional conditions are needed (compactness, continuity, an integrable envelope).

### Central Limit Theorem

**Classical CLT**: If $X_1, \ldots, X_n$ iid with $E[X] = \mu$, $Var(X) = \sigma^2 < \infty$:
$$\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2)$$

**Lindeberg-Feller CLT**: For triangular arrays $\{X_{ni}\}$ with $E[X_{ni}] = 0$ and $\sum_{i=1}^n \text{Var}(X_{ni}) \to \sigma^2$, if the Lindeberg condition holds,
$$\sum_{i=1}^n E[X_{ni}^2 \mathbf{1}(|X_{ni}| > \epsilon)] \to 0 \quad \forall \epsilon > 0,$$
then $\sum_{i=1}^n X_{ni} \xrightarrow{d} N(0, \sigma^2)$.

**Multivariate CLT**:
$$\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \Sigma)$$

### Slutsky's Theorem

If $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{p} c$ (constant):
- $X_n + Y_n \xrightarrow{d} X + c$
- $X_n Y_n \xrightarrow{d} cX$
- $X_n/Y_n \xrightarrow{d} X/c$ (if $c \neq 0$)

### Continuous Mapping Theorem

If $X_n \xrightarrow{d} X$ and $g$ continuous:
$$g(X_n) \xrightarrow{d} g(X)$$

### Delta Method

If $\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} N(0, V)$ and $g$ is differentiable at $\theta_0$:
$$\sqrt{n}(g(\hat{\theta}_n) - g(\theta_0)) \xrightarrow{d} N(0, g'(\theta_0)^\top V g'(\theta_0))$$

**Multivariate**: Replace $g'(\theta_0)$ with the Jacobian matrix.
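As a quick worked instance, the sketch below applies the delta method to $g(\theta) = \log\theta$ for a sample mean and checks the implied standard error against simulation; the Poisson data-generating choice is an illustrative assumption:

```r
# Delta method for g(theta) = log(theta) applied to a sample mean:
# for Poisson(lambda) data, sqrt(n)(Xbar - lambda) -> N(0, lambda), so
# sqrt(n)(log Xbar - log lambda) -> N(0, lambda / lambda^2) = N(0, 1/lambda)
set.seed(3)
lambda <- 4
n <- 400
log_means <- replicate(5000, log(mean(rpois(n, lambda))))
c(
  mc_se = sd(log_means),            # simulated SE of log(Xbar)
  delta_se = sqrt(1 / (n * lambda)) # delta-method SE: g'(lambda) * sd(Xbar)
)
```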
---

## M-Estimation Theory

### Setup

Estimator $\hat{\theta}_n$ solves:
$$\hat{\theta}_n = \arg\max_{\theta \in \Theta} M_n(\theta)$$
where $M_n(\theta) = n^{-1} \sum_{i=1}^n m(O_i; \theta)$

### Consistency Conditions

1. **Uniform convergence**: $\sup_\theta |M_n(\theta) - M(\theta)| \xrightarrow{p} 0$
2. **Identification**: $M(\theta)$ uniquely maximized at $\theta_0$
3. **Compactness**: $\Theta$ compact (or $\theta_0$ a well-separated maximizer)

**Result**: $\hat{\theta}_n \xrightarrow{p} \theta_0$

### Asymptotic Normality Conditions

1. $\theta_0$ interior point of $\Theta$
2. $M(\theta)$ twice differentiable at $\theta_0$
3. $\ddot{M}(\theta_0)$ non-singular
4. $\sqrt{n} \dot{M}_n(\theta_0) \xrightarrow{d} N(0, V)$

**Result**:
$$\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} N(0, [-\ddot{M}(\theta_0)]^{-1} V [-\ddot{M}(\theta_0)]^{-1})$$

### Standard Errors

**Sandwich estimator**:
$$\hat{V} = \hat{A}^{-1} \hat{B} \hat{A}^{-1}$$
where:
- $\hat{A} = -n^{-1} \sum_i \ddot{m}(O_i; \hat{\theta}_n)$ (Hessian)
- $\hat{B} = n^{-1} \sum_i \dot{m}(O_i; \hat{\theta}_n) \dot{m}(O_i; \hat{\theta}_n)^\top$ (outer product)

---

## Influence Functions

### Definition

The **influence function** of a functional $T(P)$ at distribution $P$ is:
$$\phi(o) = \lim_{\epsilon \to 0} \frac{T((1-\epsilon)P + \epsilon \delta_o) - T(P)}{\epsilon}$$
where $\delta_o$ is point mass at $o$.

### Properties

1. **Mean zero**: $E_P[\phi(O)] = 0$
2. **Variance = asymptotic variance**: If $\sqrt{n}(\hat{T}_n - T) \xrightarrow{d} N(0, V)$, then $V = E[\phi(O)^2]$
3. **Linearization**: $\sqrt{n}(\hat{T}_n - T) = \sqrt{n} \mathbb{P}_n[\phi] + o_p(1)$

### Examples

| Functional | Influence Function |
|------------|-------------------|
| Mean $E[Y]$ | $\phi(y) = y - E[Y]$ |
| Variance $Var(Y)$ | $\phi(y) = (y - \mu)^2 - \sigma^2$ |
| Quantile $Q_p$ | $\phi(y) = \frac{p - \mathbf{1}(y \leq Q_p)}{f(Q_p)}$ |
| Regression coefficient | $\phi = E[XX^\top]^{-1} X(Y - X^\top\beta)$ |

### Deriving Influence Functions

**Method 1: Gateaux derivative** (the definition above)

**Method 2: Estimating equation approach**

If $\hat{\theta}$ solves $\mathbb{P}_n[\psi(O; \theta)] = 0$, then:
$$\phi(O) = -E[\partial_\theta \psi]^{-1} \psi(O; \theta_0)$$

**Method 3: Functional delta method**

For $\psi = g(T_1, T_2, \ldots)$:
$$\phi_\psi = \sum_j \frac{\partial g}{\partial T_j} \phi_{T_j}$$

---

## Semiparametric Efficiency

### Semiparametric Models

Model $\mathcal{P}$ contains distributions satisfying:
$$\theta = \Psi(P), \quad P \in \mathcal{P}$$

The "nuisance" is infinite-dimensional (e.g., an unknown baseline distribution).

### Tangent Space

**Parametric submodels**: One-dimensional smooth paths $\{P_t : t \in \mathbb{R}\}$ through $P_0$.

**Score**: $S = \partial_t \log p_t \big|_{t=0}$

**Tangent space** $\mathcal{T}$: Closed linear span of all such scores.

### Efficiency Bound

The **efficient influence function** (EIF) is the projection of any influence function onto the tangent space.

**Semiparametric efficiency bound**:
$$V_{eff} = E[\phi_{eff}(O)^2]$$

No regular estimator can have asymptotic variance smaller than $V_{eff}$.

### Achieving Efficiency

An estimator is **semiparametrically efficient** if its influence function equals the EIF:
$$\phi_{\hat{\theta}} = \phi_{eff}$$

**Strategies**:
1. Solve the efficient score equation
2. Targeted learning (TMLE)
3. One-step estimator with EIF-based correction

---

## Double Robustness

### Concept

An estimator is **doubly robust** if it is consistent when **either**:
- the outcome model is correctly specified, OR
- the treatment model (propensity score) is correctly specified

### AIPW Estimator

For ATE $\psi = E[Y(1) - Y(0)]$:
$$\hat{\psi}_{DR} = \mathbb{P}_n\left[\frac{A(Y - \hat{\mu}_1(X))}{\hat{\pi}(X)} + \hat{\mu}_1(X)\right] - \mathbb{P}_n\left[\frac{(1-A)(Y - \hat{\mu}_0(X))}{1-\hat{\pi}(X)} + \hat{\mu}_0(X)\right]$$

where:
- $\hat{\mu}_a(X) = \hat{E}[Y|A=a,X]$ (outcome model)
- $\hat{\pi}(X) = \hat{P}(A=1|X)$ (propensity score)

### Why It Works

**Bias decomposition**:
$$\hat{\psi}_{DR} - \psi = \text{(outcome error)} \times \text{(propensity error)} + o_p(n^{-1/2})$$

If either error is zero, the leading bias term is zero.
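A minimal AIPW sketch with an influence-function-based standard error, assuming the same illustrative column names `A`, `X`, `Y` used in `compute_efficiency_bound()` above; with flexible (machine-learning) nuisance estimators one would add cross-fitting:

```r
# Doubly robust (AIPW) ATE estimate: the one-step estimator is the
# empirical mean of the uncentered efficient influence function
aipw_ate <- function(data) {
  pi_hat <- predict(glm(A ~ X, data = data, family = binomial),
                    type = "response")
  mu1_hat <- predict(lm(Y ~ X, data = subset(data, A == 1)), newdata = data)
  mu0_hat <- predict(lm(Y ~ X, data = subset(data, A == 0)), newdata = data)

  phi_unc <- with(data,
    A * (Y - mu1_hat) / pi_hat -
      (1 - A) * (Y - mu0_hat) / (1 - pi_hat) +
      mu1_hat - mu0_hat
  )
  psi_hat <- mean(phi_unc)
  se_hat <- sd(phi_unc) / sqrt(nrow(data))  # IF-based standard error

  c(ate = psi_hat, se = se_hat,
    lower = psi_hat - 1.96 * se_hat, upper = psi_hat + 1.96 * se_hat)
}
```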
### Efficiency Under Double Robustness

When **both** models are correct:
- Achieves the semiparametric efficiency bound
- Asymptotic variance = $E[\phi_{eff}^2]$

When **one** model is wrong:
- Still consistent
- But less efficient than when both are correct

---

## Variance Estimation

### Analytic (Sandwich)

$$\hat{V} = \frac{1}{n} \sum_{i=1}^n \hat{\phi}(O_i)^2$$

where $\hat{\phi}$ is the estimated influence function.

### Bootstrap

**Nonparametric bootstrap**:
1. Resample $n$ observations with replacement
2. Compute $\hat{\theta}^*_b$ for $b = 1, \ldots, B$
3. $\hat{V} = \text{Var}(\hat{\theta}^*_1, \ldots, \hat{\theta}^*_B)$

**Bootstrap validity**: Requires $\sqrt{n}$-consistent, regular estimators.

### Influence Function-Based Bootstrap

More stable than full recomputation:
$$\hat{\theta}^*_b = \hat{\theta} + n^{-1} \sum_{i=1}^n (W_i^* - 1) \hat{\phi}(O_i)$$
where $W_i^*$ are bootstrap weights.

---

## Inference

### Confidence Intervals

**Wald interval**:
$$\hat{\theta} \pm z_{1-\alpha/2} \cdot \hat{SE}$$

**Percentile bootstrap**:
$$[\hat{\theta}^*_{(\alpha/2)}, \hat{\theta}^*_{(1-\alpha/2)}]$$

**BCa bootstrap** (bias-corrected accelerated): Corrects for bias and skewness.

### Hypothesis Testing

**Wald test**: $W = (\hat{\theta} - \theta_0)^2 / \widehat{SE}^2 \xrightarrow{d} \chi^2_1$ under $H_0$

**Score test**: Based on the score evaluated at the null.

**Likelihood ratio test**: $2(\ell(\hat{\theta}) - \ell(\theta_0)) \xrightarrow{d} \chi^2_k$ under $H_0$, with $k$ the number of restrictions.

---

## Product of Coefficients (Mediation)

### Setup

Mediation effect = $\alpha \beta$ (or $\alpha_1 \beta_1 \gamma_2$ for sequential mediation)

### Distribution of Products

**Not normal**: The product of two normals is NOT normal.

**Exact distribution**: Complex (involves Bessel functions for two normals).

**Approximations**:
1. **Sobel test**: Normal approximation via the delta method
2. **PRODCLIN**: Distribution-of-the-product method (RMediation)
3. **Monte Carlo**: Simulate from the joint distribution

### Delta Method Variance

For $\psi = \alpha\beta$:
$$Var(\hat{\alpha}\hat{\beta}) \approx \beta^2 Var(\hat{\alpha}) + \alpha^2 Var(\hat{\beta}) + Var(\hat{\alpha})Var(\hat{\beta})$$

The last term is often omitted (Sobel) but matters when the effects are small.

### Product of Three

For sequential mediation $\psi = \alpha_1 \beta_1 \gamma_2$:
- The distribution is more complex
- Monte Carlo or specialized methods are needed
- Your "product of three" manuscript addresses this
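A Monte Carlo confidence interval for the product of three coefficients, in the spirit of approximation 3 above. The point estimates and standard errors in the usage line are placeholders, and independence of the three estimates is an assumed simplification:

```r
# Monte Carlo CI for psi = alpha1 * beta1 * gamma2: draw each coefficient
# from its estimated sampling distribution and take empirical quantiles.
# Independence across the three estimates is an assumed simplification;
# with coefficients from one fitted model, draw from the joint normal
# using the estimated covariance matrix instead.
mc_ci_product3 <- function(alpha1, se_a, beta1, se_b, gamma2, se_g,
                           n_draws = 1e5, level = 0.95) {
  draws <- rnorm(n_draws, alpha1, se_a) *
           rnorm(n_draws, beta1, se_b) *
           rnorm(n_draws, gamma2, se_g)
  quantile(draws, c((1 - level) / 2, 1 - (1 - level) / 2))
}

# Placeholder estimates and SEs, for illustration only
mc_ci_product3(alpha1 = 0.4, se_a = 0.1,
               beta1 = 0.3, se_b = 0.1,
               gamma2 = 0.5, se_g = 0.12)
```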
---

## Regularity Conditions Checklist

### For Consistency

- [ ] Parameter space compact (or bounded away from boundary)
- [ ] Objective function continuous in $\theta$
- [ ] Uniform convergence of the criterion
- [ ] Unique maximizer at $\theta_0$

### For Asymptotic Normality

- [ ] $\theta_0$ interior point
- [ ] Twice differentiable criterion
- [ ] Non-singular Hessian
- [ ] CLT applies to the score
- [ ] Lindeberg/Lyapunov conditions if non-iid

### For Efficiency

- [ ] Model correctly specified
- [ ] Nuisance parameters consistently estimated
- [ ] Sufficient smoothness for influence function calculation
- [ ] Rate conditions on nuisance estimation (for doubly robust)

---

## Common Pitfalls

### 1. Ignoring Estimation of Nuisance Parameters

Wrong: Treat $\hat{\eta}$ as known when computing the variance.

Right: Account for the uncertainty in $\hat{\eta}$ or use cross-fitting.

### 2. Slow Nuisance Estimation

For doubly robust estimators, we need:
$$\|\hat{\mu} - \mu_0\| \cdot \|\hat{\pi} - \pi_0\| = o_p(n^{-1/2})$$

If both converge faster than $n^{-1/4}$, the product is $o_p(n^{-1/2})$.

### 3. Bootstrap Failure

The bootstrap can fail for:
- Non-differentiable functionals
- Super-efficient estimators
- Boundary parameters

### 4. Underestimating Variance

The sandwich estimator assumes the influence function is correct. Model misspecification → wrong variance.

---

## Template: Asymptotic Result

```latex
\begin{theorem}[Asymptotic Distribution]
Under Assumptions \ref{A1}--\ref{An}:
\begin{enumerate}
  \item (Consistency) $\hat{\theta}_n \xrightarrow{p} \theta_0$
  \item (Asymptotic normality) $\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} N(0, V)$
  \item (Variance) $V = E[\phi(O)^2]$ where $\phi$ is the influence function
  \item (Variance estimation) $\hat{V} \xrightarrow{p} V$
\end{enumerate}
\end{theorem}

\begin{proof}
\textbf{Step 1 (Consistency):} [Apply M-estimation or direct argument]

\textbf{Step 2 (Expansion):} Taylor expand around $\theta_0$:
\[
0 = \mathbb{P}_n[\psi(O; \hat{\theta})]
  = \mathbb{P}_n[\psi(O; \theta_0)]
  + \mathbb{P}_n[\dot{\psi}(\tilde{\theta})](\hat{\theta} - \theta_0)
\]

\textbf{Step 3 (Rearrangement):}
\[
\sqrt{n}(\hat{\theta} - \theta_0)
  = -[\mathbb{P}_n[\dot{\psi}]]^{-1} \sqrt{n}\mathbb{P}_n[\psi(O; \theta_0)]
\]

\textbf{Step 4 (CLT):} $\sqrt{n}\mathbb{P}_n[\psi(O; \theta_0)] \xrightarrow{d} N(0, E[\psi\psi^\top])$ by the CLT.

\textbf{Step 5 (Slutsky):} $\mathbb{P}_n[\dot{\psi}] \xrightarrow{p} E[\dot{\psi}]$ by the WLLN. Apply Slutsky.

\textbf{Step 6 (Identify $V$):} $V = E[\dot{\psi}]^{-1} E[\psi\psi^\top] E[\dot{\psi}]^{-\top}$.
\end{proof}
```

---

## Integration with Other Skills

This skill works with:
- **proof-architect** - For structuring asymptotic proofs
- **identification-theory** - Identification precedes estimation/inference
- **simulation-architect** - Validate asymptotic approximations
- **methods-paper-writer** - Present results in manuscripts

---

## Key References

- Bickel, P.J., Klaassen, C.A.J., Ritov, Y., & Wellner, J.A. (1993). *Efficient and Adaptive Estimation for Semiparametric Models*
- Newey, W.K. (1990). Semiparametric Efficiency Bounds. *Journal of Applied Econometrics*
- Robins, J.M., Rotnitzky, A., & Zhao, L.P. (1994). Estimation of Regression Coefficients When Some Regressors Are Not Always Observed. *JASA*
- van der Vaart, A.W. (1998). *Asymptotic Statistics*
- Tsiatis, A.A. (2006). *Semiparametric Theory and Missing Data*
- Kennedy, E.H. (2016). Semiparametric Theory and Empirical Processes in Causal Inference
- van der Laan, M.J. & Rose, S. (2011). *Targeted Learning*

---

**Version**: 1.0
**Created**: 2025-12-08
**Domain**: Asymptotic Statistics, Semiparametric Inference