* stata_codebook.do - attach long-form notes to the .dta files (run once in Stata).
* Generated by build_data_dictionary.py - do not edit by hand.

* ---- cml_data.dta ----
use "cml_data.dta", clear
label data "Observed synthetic ALMP cohort: 6 covariates + treatment + outcome"
note _dta: Synthetic Flanders-ALMP-style cross-section, N=5,000. Six pre-treatment covariates X, binary training indicator D, and months employed over a 30-month window Y. Joined positionally (row i) to cml_truth.csv.
note age: Jobseeker age in years at programme entry (pre-treatment covariate). Construction: Drawn age ~ Uniform(20, 60). Units: years. Source: Simulation
note edu_years: Completed years of formal education (pre-treatment covariate). Construction: Drawn N(12, 3), clipped to [6, 20]. Units: years. Source: Simulation
note prior_emp_months: Months employed during the pre-programme look-back window (pre-treatment covariate). Construction: Drawn 60 * Beta(2, 5). Units: months. Source: Simulation
note dutch_prof: Ordinal Dutch proficiency: 0=no, 1=low, 2=intermediate, 3=native. The GATE stratifier. Construction: Drawn from {0,1,2,3} with probabilities (0.25, 0.30, 0.30, 0.15). Units: 0-3 (ordinal). Source: Simulation
note female: Sex indicator, 1 if female else 0 (pre-treatment covariate). Construction: Drawn Bernoulli(0.48). Units: 0/1. Source: Simulation
note migrant: Migrant-background indicator, 1 if migrant else 0 (pre-treatment covariate and a moderator of the effect). Construction: Drawn Bernoulli(0.30). Units: 0/1. Source: Simulation
note D: Binary training indicator - 1 if the jobseeker received the ALMP training, else 0. Construction: D ~ Bernoulli(pi_true), with pi_true the true logistic propensity in the covariates. Units: 0/1. Source: Simulation
note Y: Observed months employed over the 30-month follow-up - the realised potential outcome under the assigned treatment. Construction: Y = D*Y1 + (1-D)*Y0, where Y0/Y1 are the clipped potential outcomes. Units: months (0-30). Source: Simulation
save "cml_data.dta", replace
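
* Optional check (illustrative sketch, not emitted by build_data_dictionary.py):
* list every note now attached to the dataset still in memory, to confirm that
* the long-form descriptions above were stored with cml_data.dta.
notes list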

* ---- cml_truth.dta ----
use "cml_truth.dta", clear
label data "Hidden ground truth: potential outcomes, individual effect, true propensity"
note _dta: Known only because the data are simulated. Contains both potential outcomes Y0/Y1, the individual treatment effect tau, the true propensity pi_true, and dutch_prof (carried over so that group-bys are self-contained). These are NOT observable predictors - use them for scoring only.
note Y0: Months employed the jobseeker WOULD have over 30 months WITHOUT training - counterfactual, observable only in the simulation. Construction: Y0 = clip(12 + 0.20*prior_emp + 0.30*edu + 1.0*dutch_prof - 0.005*(age-40)^2 + N(0,2.5), 0, 30). Units: months (0-30). Source: Simulation (ground truth)
note Y1: Months employed the jobseeker WOULD have over 30 months WITH training - counterfactual, observable only in the simulation. Construction: Y1 = clip(Y0 + tau, 0, 30). Units: months (0-30). Source: Simulation (ground truth)
note tau: The true per-jobseeker effect of training, tau = Y(1) - Y(0) (pre-clipping). Benchmark for the IATE and input to the oracle/welfare rule. Construction: tau = 3.0 + 1.5*(3 - dutch_prof) + 0.4*migrant - 0.02*(age - 40) + N(0, 0.3). Units: months. Source: Simulation (ground truth)
note pi_true: The true probability of receiving training given covariates, used to generate D. The estimand a propensity model tries to recover. Construction: logistic(-0.6 + 0.020*(40-age) + 0.05*(12-edu) + 0.015*(30-prior_emp) + 0.30*(3-dutch) + 0.20*migrant - 0.10*female), clipped to [0.05, 0.95]. Units: 0-1 (probability). Source: Simulation (ground truth)
note dutch_prof: Ordinal Dutch proficiency: 0=no, 1=low, 2=intermediate, 3=native. The GATE stratifier. Construction: Drawn from {0,1,2,3} with probabilities (0.25, 0.30, 0.30, 0.15). Units: 0-3 (ordinal). Source: Simulation
save "cml_truth.dta", replace
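
* ---- optional consistency check (illustrative sketch, not emitted by build_data_dictionary.py) ----
* The notes above say the two files are joined positionally (row i) and that
* Y = D*Y1 + (1-D)*Y0. The lines below sketch one way to verify both claims.
* Assumptions: both .dta files sit in the working directory with the same row
* order and row count; the numeric tolerances are arbitrary choices meant only
* to absorb float storage. Nothing is saved, so the files are left unchanged.
use "cml_data.dta", clear
* Positional (row-by-row) merge; dutch_prof exists in both files and the copy
* from cml_data.dta is the one kept.
merge 1:1 _n using "cml_truth.dta"
* Every row should have a partner in both files.
assert _merge == 3
* The observed outcome should equal the potential outcome under the assigned treatment.
assert abs(Y - (D*Y1 + (1-D)*Y0)) < 1e-4
* The true propensity is documented as clipped to [0.05, 0.95].
assert inrange(pi_true, 0.049, 0.951)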