---
name: empirical-playbook
argument-hint: ""
description: >-
  This skill covers applied microeconomic empirical methods and research
  design. Use when the user is selecting an identification strategy,
  comparing estimators, running diagnostics, designing a research study,
  or evaluating an empirical strategy. Triggers on "which method", "what
  estimator", "how to choose", "method comparison", "empirical strategy",
  "research design", "applied micro", "identification strategy", "power
  analysis", "design-based", "model-based", "minimum detectable effect",
  "specification".
---

# Applied Micro Toolkit

Reference for applied micro research design: method selection, diagnostics, inference, pitfalls, reporting standards, and power analysis.

## When to Use This Skill

Use when the user is:

- Choosing between empirical methods for a causal question
- Evaluating which identification strategy fits their data and setting
- Running standard diagnostic tests and unsure which ones apply
- Designing a study and needing to calculate statistical power
- Reviewing or critiquing an empirical strategy
- Preparing the "Empirical Strategy" section of a paper
- Downloading macroeconomic or cross-national data (see `references/data-sources.md` for FRED/World Bank API access)

Skip when:

- Implementation details for a specific method are needed (use the `causal-inference` skill for IV, DiD, RDD, SC, matching)
- The task is structural estimation (use the `structural-modeling` skill)
- The task is manuscript preparation or journal logistics (use the `submission-guide` skill)
- The task is a formal identification proof (use the `identification-proofs` skill)
- The task is Bayesian model specification (use the `bayesian-estimation` skill)

After selecting a method, the `econometric-reviewer` agent can review the implementation and the `identification-critic` agent can evaluate the identification argument.

## Method Selection Decision Tree

Start with the fundamental question: **What source of variation identifies the causal effect?**

### Step 1: What is your source of variation?

| Source of Variation | Method Family | Key Assumption |
|--------------------|---------------|----------------|
| Randomized assignment (with full compliance) | Experimental analysis (OLS on treatment indicator) | Random assignment |
| Randomized assignment (with imperfect compliance) | IV / 2SLS using random assignment as instrument | Exclusion restriction, monotonicity |
| Policy change at a sharp threshold | Sharp RDD | Continuity of potential outcomes at cutoff |
| Policy change at a threshold with imperfect compliance | Fuzzy RDD (= IV at the cutoff) | Continuity + monotonicity at cutoff |
| Policy change at a point in time, with affected and unaffected groups | Difference-in-differences | Parallel trends |
| Staggered policy adoption across units over time | Staggered DiD (Callaway-Sant'Anna, Sun-Abraham, etc.) | Parallel trends (conditional on group and time) |
| Rare event affecting a single unit, long pre-treatment data | Synthetic control | Pre-treatment fit implies post-treatment counterfactual |
| Exogenous shifter of treatment that does not affect outcome directly | IV / 2SLS / GMM | Exclusion restriction, relevance, monotonicity |
| Rich set of observables that plausibly captures all confounders | Matching, IPW, AIPW (selection on observables) | Conditional independence (no unobserved confounders) |
| No credible exogenous variation | Sensitivity analysis, bounds, partial identification | Depends on bounding assumptions |
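To make the imperfect-compliance row of this table concrete: with a randomized offer and partial take-up, the Wald estimator (ITT divided by the first stage) recovers the LATE, and with a single binary instrument and no covariates it coincides with 2SLS. A minimal sketch on simulated data — every variable name and parameter value here is illustrative, not part of this skill:

```python
# Sketch: LATE under imperfect compliance (second row of the table above).
# z = randomized offer, d = actual take-up, y = outcome. Simulated data.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
z = rng.integers(0, 2, n)                    # randomized offer
d = (z & (rng.random(n) < 0.6)).astype(int)  # 60% compliance, no always-takers
y = 1.5 * d + rng.normal(size=n)             # true effect on compliers = 1.5

itt = y[z == 1].mean() - y[z == 0].mean()          # reduced form (ITT)
first_stage = d[z == 1].mean() - d[z == 0].mean()  # compliance rate
late = itt / first_stage                           # Wald estimator = 2SLS here
print(f"ITT={itt:.3f}, first stage={first_stage:.3f}, LATE={late:.3f}")
```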
### Step 2: Refinements Within Method Families

**Within DiD:**

```
Is treatment timing staggered?
├── No → Classic 2x2 DiD (TWFE is fine)
└── Yes
    └── Can treatment turn off (reversals)?
        ├── Yes → de Chaisemartin-D'Haultfoeuille (2020)
        └── No
            ├── Do you have never-treated units?
            │   ├── Yes → Callaway-Sant'Anna (2021) with never-treated controls
            │   └── No → Callaway-Sant'Anna with not-yet-treated controls
            │            or Sun-Abraham (2021)
            └── Are effects likely heterogeneous across cohorts?
                ├── Yes → Callaway-Sant'Anna or Sun-Abraham (NOT TWFE)
                └── No → TWFE is OK, but report Bacon decomposition
```

**Within IV:**

```
How many instruments for how many endogenous regressors?
├── Exactly identified (K instruments = K endogenous)
│   └── 2SLS (= IV = Wald estimator for a single instrument)
├── Over-identified (K instruments > K endogenous)
│   ├── 2SLS (default)
│   ├── GMM (efficient; use if heteroskedasticity is suspected)
│   └── LIML (less biased with weak instruments)
└── Under-identified (K instruments < K endogenous)
    └── Cannot identify all parameters — need more instruments
        or fewer endogenous regressors
```

**Within RDD:**

```
Does crossing the threshold guarantee treatment?
├── Yes → Sharp RDD
└── No → Fuzzy RDD
    └── Is the running variable continuous?
        ├── Yes → Standard rdrobust
        └── No (discrete / few mass points)
            └── Cattaneo-Idrobo-Titiunik (2019) discrete RD methods
```

**Within Matching / Selection on Observables:**

```
Is the selection-on-observables assumption plausible?
├── No → Need a different identification strategy
└── Yes
    ├── Do you need ATE or ATT?
    │   ├── ATE → IPW or AIPW
    │   └── ATT → Matching or IPW with ATT weights
    ├── Is the propensity score model well-specified?
    │   ├── Uncertain → Use AIPW (doubly robust)
    │   └── Confident → IPW or regression adjustment
    └── Many covariates or nonlinear confounding?
        ├── Yes → ML-based methods (causal forests, DML)
        └── No → Parametric PS model + AIPW
```
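As one concrete instance of the doubly robust branch in the tree above, a minimal AIPW sketch on simulated data. It assumes scikit-learn is available; the data-generating process and variable names are placeholders:

```python
# Sketch: AIPW (doubly robust) ATE on simulated confounded data.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
n = 5_000
X = rng.normal(size=(n, 3))
e_true = 1 / (1 + np.exp(-X[:, 0]))           # true propensity score
T = (rng.random(n) < e_true).astype(int)      # treatment depends on X
Y = X @ np.array([1.0, 0.5, -0.5]) + 2.0 * T + rng.normal(size=n)

e = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]      # PS model
mu1 = LinearRegression().fit(X[T == 1], Y[T == 1]).predict(X)  # E[Y|X,T=1]
mu0 = LinearRegression().fit(X[T == 0], Y[T == 0]).predict(X)  # E[Y|X,T=0]

# AIPW score: consistent if either the PS or the outcome model is correct
psi = mu1 - mu0 + T * (Y - mu1) / e - (1 - T) * (Y - mu0) / (1 - e)
print(f"AIPW ATE estimate: {psi.mean():.3f} (truth = 2.0)")
```

In practice, cross-fitting the nuisance models (as in DML) guards against overfitting; this sketch omits it for brevity.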
## Standard Diagnostics by Method

Key diagnostics to run for each method family. For full reporting checklists and minimum standards, see `references/reporting-standards.md`.

| Method | Must-Run Diagnostics | Key Concern |
|--------|---------------------|-------------|
| IV / 2SLS | First-stage F (Kleibergen-Paap), reduced form, overidentification test | Weak instruments (F < 10), exclusion restriction |
| DiD (classic) | Pre-trend F-test, event study plot, raw means by group/period | Parallel trends violation |
| Staggered DiD | Bacon decomposition, Callaway-Sant'Anna group-time ATTs | Negative TWFE weights with heterogeneous effects |
| RDD | McCrary density test, covariate balance at cutoff, bandwidth sensitivity | Manipulation of running variable, extrapolation bias |
| Synthetic Control | Pre-fit RMSPE, permutation p-value, leave-one-out | Pre-period fit quality, donor pool sensitivity |
| Matching / AIPW | Overlap plots, Love plot (SMD before/after), Oster/Rosenbaum bounds | Lack of overlap, unobserved confounders |
| Structural | Convergence, identification rank condition, robustness to starting values | Global vs local optimum, identification failure |

For implementation details and diagnostic code by method, see the `causal-inference` skill.

## Inference Frameworks

### Clustering Decision Rule

1. Identify the level at which treatment is assigned → cluster at that level (minimum)
2. If there are within-cluster correlations beyond treatment (e.g., spatial), consider multi-way clustering
3. If the number of clusters is small (< 30–40), use the wild cluster bootstrap (Cameron-Gelbach-Miller 2008)
4. If the number of clusters is very small (< 10), cluster-robust methods may not work at all — consider randomization inference (sketched below) or aggregate to the cluster level

| Mistake | Consequence | Fix |
|---------|------------|-----|
| Clustering too fine (individual when treatment is at state level) | SEs too small; over-rejection | Cluster at the level of treatment assignment |
| Few clusters (< 30–40) with standard cluster-robust SEs | Poor finite-sample properties | Wild cluster bootstrap |
| Not clustering when treatment varies at group level | SEs dramatically understated | Always cluster at the level of treatment assignment |

### Design-Based vs Model-Based Inference

| Dimension | Design-Based | Model-Based |
|-----------|-------------|-------------|
| Source of randomness | Treatment assignment mechanism | Outcome draws from a superpopulation |
| Key assumption | Known or modeled treatment assignment | Correct outcome model specification |
| Examples | Experiments, RCTs, RDD, DiD, natural experiments | Structural models, matching, cross-sectional surveys |
| Advantages | Transparent; does not require an outcome model | More powerful; extends to complex settings |

Design-based inference is appropriate when the assignment mechanism is known (experiments, lotteries, cutoffs); model-based inference when random sampling from a superpopulation is reasonable. The standard in applied micro is a hybrid: design-based identification plus model-based inference. Doubly robust methods (AIPW) combine both.

## Power Analysis

The key quantity is the Minimum Detectable Effect (MDE) — the smallest effect detectable with 80% power at alpha = 0.05.

**Quick MDE formula (two equal groups, two-sided test, N = total sample):**

```
MDE = 2.8 × 2 × sigma / sqrt(N)
Required N = (2.8 × 2 × sigma / MDE)²
```

Here 2.8 ≈ z_{1-alpha/2} + z_{power} = 1.96 + 0.84, and 2 × sigma / sqrt(N) is the standard error of the difference in means when N is split equally between two groups.

For IV designs, the effective MDE is inflated by the inverse of the first-stage coefficient: `MDE_IV ≈ MDE_OLS / |pi|`. A weak first stage (small pi) dramatically reduces power.

For DiD designs, effective power increases with more post-treatment periods and with higher within-group correlation absorbed by the fixed effects.

For RDD, use the effective N (observations within the bandwidth), not the total N.

For cluster-randomized designs, the design effect `(1 + (m - 1) × ICC)` inflates the variance — with ICC = 0.05 and cluster size m = 50, you need 3.45x as many observations.
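A minimal sketch of the MDE arithmetic above as code (assuming scipy is available; the sample sizes, sigma, first-stage coefficient, and ICC are placeholder values):

```python
# Sketch: quick MDE calculations (two-sided alpha = 0.05, 80% power).
import numpy as np
from scipy import stats

def mde(sigma, n_total, alpha=0.05, power=0.80):
    """MDE for a two-arm comparison with n_total split into equal halves."""
    m = stats.norm.ppf(1 - alpha / 2) + stats.norm.ppf(power)  # ~2.8
    return m * 2 * sigma / np.sqrt(n_total)

sigma, N = 1.0, 2_000
base = mde(sigma, N)
print(f"MDE (RCT, N={N}): {base:.3f} sd")

pi = 0.3                          # hypothetical first-stage coefficient
print(f"MDE (IV, |pi|={pi}): {base / abs(pi):.3f} sd")

icc, m = 0.05, 50                 # cluster-randomized design effect
deff = 1 + (m - 1) * icc          # = 3.45 with these values
print(f"Design effect: {deff:.2f}; effective N: {N / deff:.0f}")
```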
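And for the very-few-clusters case flagged in step 4 of the clustering decision rule, a minimal randomization-inference sketch: treatment is re-randomized at the cluster level and the observed difference is compared against the resulting null distribution. All data here are simulated placeholders:

```python
# Sketch: randomization inference with 8 clusters (too few to cluster SEs).
import numpy as np

rng = np.random.default_rng(0)
G, n_g = 8, 100                            # 8 clusters, 100 units each
cluster = np.repeat(np.arange(G), n_g)
T = np.isin(cluster, [0, 1, 2, 3]).astype(int)   # 4 of 8 clusters treated
y = 0.2 * T + rng.normal(size=G * n_g) + rng.normal(size=G)[cluster]

obs = y[T == 1].mean() - y[T == 0].mean()
null = []
for _ in range(5_000):                     # permute treatment across clusters
    fake = rng.choice(G, size=4, replace=False)
    Tf = np.isin(cluster, fake).astype(int)
    null.append(y[Tf == 1].mean() - y[Tf == 0].mean())
pval = np.mean(np.abs(null) >= abs(obs))
print(f"diff = {obs:.3f}, randomization-inference p = {pval:.3f}")
```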
For full MDE formulas (DiD, IV, RDD, cluster-randomized), power simulation code, and MDE interpretation tables, see `references/reporting-standards.md`.

## Research Design Checklist

### Before Touching Data

- [ ] **Research question**: What causal parameter are you trying to estimate? Write it as a formal estimand.
- [ ] **Identification strategy**: What source of variation identifies the effect? Draw the DAG.
- [ ] **Assumptions**: List all identification assumptions explicitly. Which are testable?
- [ ] **Threats**: For each assumption, what is the most plausible violation? How would you detect it?
- [ ] **Power**: Given your expected sample size, what is the MDE? Is it policy-relevant?
- [ ] **Pre-analysis plan**: For prospective studies, register the plan before seeing outcomes.

### During Analysis

- [ ] **Data cleaning documented**: Every sample restriction justified and recorded.
- [ ] **Summary statistics**: Know your data before running regressions.
- [ ] **Main specification**: Run the main spec first. Resist the urge to search for significance.
- [ ] **Diagnostics**: Run all standard diagnostics for your method (see the table above).
- [ ] **Robustness**: Vary specification choices systematically.
- [ ] **Magnitude interpretation**: Can you explain the coefficient in plain language?

### Before Submission

- [ ] **All diagnostics reported**: See the method-specific standards in `references/reporting-standards.md`.
- [ ] **Replication package**: Code runs from raw data to all tables and figures.
- [ ] **Seeds set**: All random number generators seeded for reproducibility.
- [ ] **Limitations discussed**: What are the strongest objections? Address them in the paper.
- [ ] **Literature positioned**: Have you cited and compared to the five closest papers?

## Common Pitfalls

### Bad Controls

A "bad control" is a variable that is itself an outcome of treatment. Conditioning on it introduces selection bias.

| Variable Type | Example | Why It Is Bad |
|--------------|---------|---------------|
| Post-treatment outcome | Controlling for occupation when estimating returns to education | Education affects occupation; conditioning selects on an outcome of treatment |
| Mediator | Controlling for wages when estimating the effect of training on employment | Blocks part of the causal effect |
| Collider | Conditioning on "survived" when estimating health effects | Opens a non-causal path |

**Rule of thumb:** If you cannot be sure a variable is determined before treatment, do not include it as a control. When in doubt, draw the DAG.

### Staggered DiD with Heterogeneous Effects

| Mistake | Consequence | Fix |
|---------|------------|-----|
| Running TWFE with staggered timing | Already-treated units used as controls; negative weights; the estimate can have the wrong sign | Use Callaway-Sant'Anna, Sun-Abraham, or another modern DiD estimator |
| Using a single post-treatment indicator for all cohorts | Masks heterogeneity in treatment effects across cohorts | Estimate group-time ATTs separately, then aggregate |
| Not reporting the Bacon decomposition | Reader cannot assess how much of the TWFE estimate comes from problematic comparisons | Report `bacondecomp` output |

### Forbidden Regressions

- Never plug a manual first stage into an OLS second stage (the second-stage SEs are wrong — use proper 2SLS).
- Never use a nonlinear first stage with a linear second stage (not consistent — use a control function).
- Never include generated regressors without bootstrapping the full two-step procedure, as shown in the sketch below.
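A minimal sketch of that last rule (assuming numpy and statsmodels; data simulated): the two-step estimator here is the manual first-stage plug-in whose naive second-stage SEs the first rule warns about, and the fix is to resample units and redo both steps inside the bootstrap.

```python
# Sketch: pairs bootstrap of a full two-step procedure.
# Step 1: OLS of x on instrument z; step 2: OLS of y on fitted values.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1_000
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)        # endogenous regressor
y = 1.0 * x + rng.normal(size=n)

def two_step(y, x, z):
    xhat = sm.OLS(x, sm.add_constant(z)).fit().fittedvalues    # step 1
    return sm.OLS(y, sm.add_constant(xhat)).fit().params[1]    # step 2

beta = two_step(y, x, z)
draws = []
for _ in range(999):                    # resample units, redo BOTH steps
    idx = rng.integers(0, n, n)
    draws.append(two_step(y[idx], x[idx], z[idx]))
se = np.std(draws, ddof=1)              # valid SE; the naive step-2 SE is not
print(f"beta = {beta:.3f}, bootstrap SE = {se:.3f}")
```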
## Integration

For full minimum reporting standards (method-specific checklists for IV, DiD, RDD, SC, matching) and complete power analysis code, see `references/reporting-standards.md`. For sensitivity analysis procedures (Oster bounds, Conley bounds, breakdown frontiers, specification curves), see `references/sensitivity-analysis.md`.

**Agents:**

- `econometric-reviewer`: Reviews the identification strategy, standard errors, and diagnostic results
- `identification-critic`: Evaluates identification argument completeness and exclusion restrictions
- `numerical-auditor`: Designs power simulations for nonstandard study designs
- `journal-referee`: Reviews whether the empirical strategy meets journal standards

**Cross-references:**

- `identification-proofs` skill: Formalize an identification argument for the chosen method
- `references/diagnostic-battery.md`: Run the full diagnostic battery for the estimated specification
- `references/sensitivity-analysis.md`: Run sensitivity analysis (Oster bounds, specification curve, breakdown frontier)
- `publication-output` skill: Format regression tables and diagnostic output for publication