# R Validation Report

This document provides complete transparency into how the Rust statistical implementations are validated against R.

## Overview

All statistical tests in this library are validated using **Test-Driven Development (TDD)** with R as the ground truth oracle. The validation process ensures numerical agreement between Rust implementations and established R packages.

## Validation Architecture

```
┌─────────────────────────┐     ┌─────────────────────────┐
│   R/generate_refs.R     │     │   tests/tdd_*.rs        │
│   (Reference Generator) │     │   (Rust Test Suite)     │
└───────────┬─────────────┘     └───────────┬─────────────┘
            │                               │
            │  set.seed(42)                 │  load CSV files
            │  compute expected values      │  compute Rust values
            │                               │  assert_relative_eq!
            ▼                               ▼
┌─────────────────────────────────────────────────────────┐
│                    R/data/*.csv                         │
│              (76 Reference Data Files)                  │
│                                                         │
│   Test Vectors:  Input data generated in R              │
│   Expected Values: R-computed reference results         │
└─────────────────────────────────────────────────────────┘
```

## How to Reproduce Validation

### Step 1: Generate R Reference Data

```bash
cd /path/to/anofox-statistics
Rscript R/generate_refs.R
```

This generates 76 CSV files in `R/data/` containing:
- Test vectors (input data)
- Expected results (R-computed values)

### Step 2: Run Rust Tests

```bash
cargo test
```

Tests load the CSV files and compare Rust results against R references.

## Reference Data Files

### Test Vector Files (Input Data)

| File | Description | R Generation Code |
|------|-------------|-------------------|
| `vec_short.csv` | 5-element vector | `c(1.2, 2.3, 3.4, 4.5, 5.6)` |
| `vec_long.csv` | 100-element normal sample | `rnorm(100, mean=10, sd=2)` |
| `vec_outlier.csv` | Vector with outlier | `c(1, 1, 1, 1, 100)` |
| `vec_even.csv` | Even-length vector | `c(1.0, 2.0, 3.0, 4.0)` |
| `vec_skewed.csv` | Right-skewed data | `c(1,1,1,1,1,2,2,3,5,10,20)` |
| `ttest_g1.csv` | T-test group 1 (n=20) | `rnorm(20, mean=5, sd=1)` |
| `ttest_g2.csv` | T-test group 2 (n=25) | `rnorm(25, mean=5.5, sd=1.5)` |
| `ttest_g1_paired.csv` | Paired t-test group 1 | `rnorm(15, mean=10, sd=2)` |
| `ttest_g2_paired.csv` | Paired t-test group 2 | `g1_paired + rnorm(15, mean=0.5, sd=0.5)` |
| `yuen_g1.csv` | Yuen test group 1 | `c(g1, 50)` (with outlier) |
| `yuen_g2.csv` | Yuen test group 2 | `c(g2, -20)` (with outlier) |
| `levene_g3.csv` | Third group for Brown-Forsythe | `rnorm(22, mean=6, sd=2)` |
| `rank_simple.csv` | Simple ranking data | `c(3, 1, 4, 1, 5, 9, 2, 6)` |
| `rank_ties.csv` | Data with ties | `c(1, 2, 2, 3, 3, 3, 4)` |
| `mw_x.csv` | Mann-Whitney X (n=30) | `rnorm(30, mean=5, sd=1)` |
| `mw_y.csv` | Mann-Whitney Y (n=35) | `rnorm(35, mean=5.8, sd=1.2)` |
| `mw_x_exact.csv` | Mann-Whitney exact X (n=8) | `c(1.1, 2.3, 3.5, 4.7, 5.9, 6.2, 7.4, 8.6)` |
| `mw_y_exact.csv` | Mann-Whitney exact Y (n=9) | `c(2.2, 3.4, 4.6, 5.8, 7.0, 8.2, 9.4, 10.6, 11.8)` |
| `wsr_x.csv` | Wilcoxon signed-rank X | `rnorm(25, mean=10, sd=2)` |
| `wsr_y.csv` | Wilcoxon signed-rank Y | `wsr_x + rnorm(25, mean=0.5, sd=0.8)` |
| `wsr_x_exact.csv` | Wilcoxon SR exact X (n=10) | `seq(10, 20, length.out=10)` |
| `wsr_y_exact.csv` | Wilcoxon SR exact Y (n=10) | `wsr_x_exact - c(0.3, 0.5, ..., 2.1)` |
| `kw_a.csv` | Kruskal-Wallis group A | `rnorm(20, mean=5, sd=1)` |
| `kw_b.csv` | Kruskal-Wallis group B | `rnorm(22, mean=6, sd=1.2)` |
| `kw_c.csv` | Kruskal-Wallis group C | `rnorm(18, mean=5.5, sd=0.9)` |
| `sw_normal_small.csv` | Shapiro-Wilk normal (n=20) | `rnorm(20, mean=5, sd=2)` |
| `sw_normal_medium.csv` | Shapiro-Wilk normal (n=50) | `rnorm(50, mean=10, sd=3)` |
| `sw_normal_large.csv` | Shapiro-Wilk normal (n=100) | `rnorm(100, mean=0, sd=1)` |
| `sw_uniform.csv` | Shapiro-Wilk uniform (n=30) | `runif(30, min=0, max=10)` |
| `sw_exp.csv` | Shapiro-Wilk exponential (n=40) | `rexp(40, rate=1)` |
| `sw_n3.csv` | Shapiro-Wilk edge case (n=3) | `rnorm(3)` |
| `sw_n4.csv` | Shapiro-Wilk edge case (n=4) | `rnorm(4)` |
| `sw_n5.csv` | Shapiro-Wilk edge case (n=5) | `rnorm(5)` |
| `sw_n10.csv` | Shapiro-Wilk edge case (n=10) | `rnorm(10)` |
| `dag_normal.csv` | D'Agostino normal data | `rnorm(50, mean=10, sd=2)` |
| `dag_skewed.csv` | D'Agostino skewed data | `rexp(50, rate=0.5)` |
| `bm_x.csv` | Brunner-Munzel X | `c(1,2,1,1,1,1,1,1,1,1,2,4,1,1)` |
| `bm_y.csv` | Brunner-Munzel Y | `c(3,3,4,3,1,2,3,1,1,5,4)` |
| `dm_e1.csv` | Diebold-Mariano errors 1 | `actual - forecast1` |
| `dm_e2.csv` | Diebold-Mariano errors 2 | `actual - forecast2` |
| `spa_benchmark.csv` | SPA benchmark losses | `abs(rnorm(100, 2, 0.5))` |
| `spa_model1.csv` | SPA model 1 losses | `abs(rnorm(100, 1, 0.3))` |
| `spa_model2.csv` | SPA model 2 losses | `abs(rnorm(100, 2.1, 0.5))` |
| `spa_model3.csv` | SPA model 3 losses | `abs(rnorm(100, 3, 0.6))` |
| `mcs_model0.csv` | MCS model 0 losses | `abs(rnorm(200, 1.0, 0.3))` |
| `mcs_model1.csv` | MCS model 1 losses | `abs(rnorm(200, 1.1, 0.35))` |
| `mcs_model2.csv` | MCS model 2 losses | `abs(rnorm(200, 2.0, 0.5))` |
| `mcs_model3.csv` | MCS model 3 losses | `abs(rnorm(200, 3.5, 0.8))` |
| `ed_x.csv` | Energy distance X | `rnorm(30, mean=0, sd=1)` |
| `ed_y.csv` | Energy distance Y | `rnorm(30, mean=3, sd=1)` |
| `perm_x.csv` | Permutation test X | `rnorm(20, mean=5, sd=1)` |
| `perm_y.csv` | Permutation test Y | `rnorm(25, mean=7, sd=1.2)` |

### Expected Value Files (R Results)

| File | Test | R Function | Values Stored |
|------|------|------------|---------------|
| `math_basic.csv` | Math primitives | `mean()`, `var()`, `median()`, `mean(trim=0.2)` | mean, variance, median, trimmed mean for each test vector |
| `math_extended.csv` | Extended math | `sd()`, `e1071::skewness()`, `e1071::kurtosis()` | std_dev, skewness (type=2), kurtosis (type=2) |
| `math_skewed.csv` | Skewed data stats | `e1071::skewness()`, `e1071::kurtosis()` | skewness, kurtosis for skewed vector |
| `ttest_welch.csv` | Welch t-test | `t.test(var.equal=FALSE)` | statistic, df, p_value for all 3 alternatives, mu=0.5, conf.int at 95/90/99% |
| `ttest_student.csv` | Student t-test | `t.test(var.equal=TRUE)` | statistic, df, p_value for all 3 alternatives |
| `ttest_paired.csv` | Paired t-test | `t.test(paired=TRUE)` | statistic, df, p_value for all 3 alternatives |
| `yuen.csv` | Yuen's test | `WRS2::yuen()` | statistic, df, p_value, diff for 20% and 10% trim |
| `brown_forsythe.csv` | Brown-Forsythe | `car::leveneTest(center=median)` | F-statistic, df1, df2, p_value for 2 and 3 groups |
| `ranks.csv` | Ranking | `rank(ties.method="average")` | All rank values for simple and tied data |
| `mann_whitney.csv` | Mann-Whitney U | `wilcox.test(exact=FALSE, correct=FALSE/TRUE)` | statistic, p_value, p_value_less, p_value_greater, p_value_corrected, estimate, conf_low, conf_high, p_value_mu (mu=0.5) |
| `mann_whitney_exact.csv` | Mann-Whitney exact | `wilcox.test(exact=TRUE, conf.int=TRUE)` | statistic, p_value, p_value_less, p_value_greater, estimate, conf_low_95, conf_high_95, conf_low_90, conf_high_90 |
| `wilcoxon_signed_rank.csv` | Wilcoxon SR | `wilcox.test(paired=TRUE, exact=FALSE, correct=FALSE/TRUE)` | statistic, p_value, p_value_less, p_value_greater, p_value_corrected, estimate, conf_low, conf_high, p_value_mu (mu=0.3) |
| `wilcoxon_signed_rank_exact.csv` | Wilcoxon SR exact | `wilcox.test(paired=TRUE, exact=TRUE, conf.int=TRUE)` | statistic, p_value, p_value_less, p_value_greater, estimate, conf_low_95, conf_high_95, conf_low_90, conf_high_90 |
| `kruskal_wallis.csv` | Kruskal-Wallis | `kruskal.test()` | statistic, df, p_value |
| `shapiro_wilk.csv` | Shapiro-Wilk | `shapiro.test()` | W, p_value for 5 test vectors |
| `shapiro_wilk_edge.csv` | Shapiro-Wilk edge | `shapiro.test()` | W, p_value for n=3,4,5,10 |
| `dagostino.csv` | D'Agostino | `moments::agostino.test()`, `moments::anscombe.test()` | Z_skew, Z_kurt, p_values |
| `brunner_munzel.csv` | Brunner-Munzel | `lawstat::brunner.munzel.test()` | statistic, df, p_value, estimate, p_value_less, p_value_greater, conf.int at 95/90/99% |
| `diebold_mariano.csv` | Diebold-Mariano | `forecast::dm.test()` | statistic, p_value for SE/AE loss, h=1/h=3, alternative=less/greater, varestimator=acf/bartlett |
| `ttest_welch.csv` | Welch t-test with mu | `t.test(var.equal=FALSE, mu=0.5)` | statistic_mu, p_value_mu |
| `clark_west.csv` | Clark-West | Manual HAC computation | statistic, p_value for h=1 and h=3 |
| `spa.csv` | SPA test | Manual computation | t-statistics, means for all models |
| `mcs.csv` | Model Confidence Set | Manual computation | pairwise t-statistics, range statistic |
| `energy_distance.csv` | Energy distance | Manual computation | statistic, mean_xy, mean_xx, mean_yy |
| `permutation_t.csv` | Permutation t-test | Manual t-statistic | t_statistic, means, variances |

## Validation by Test Category

### 1. Math Primitives

**R Packages:** `base`, `e1071`

| Function | R Equivalent | Tolerance | Test Cases |
|----------|-------------|-----------|------------|
| `mean()` | `base::mean()` | 1e-12 | short, long, outlier, single |
| `variance()` | `base::var()` | 1e-12 | short, long, outlier |
| `std_dev()` | `base::sd()` | 1e-12 | short, long, outlier |
| `median()` | `base::median()` | 1e-12 | short, long, even, single |
| `trimmed_mean()` | `base::mean(trim=0.2)` | 1e-12 | short, long, outlier |
| `skewness()` | `e1071::skewness(type=2)` | 1e-12 | short, long, outlier, skewed |
| `kurtosis()` | `e1071::kurtosis(type=2)` | 1e-12 | short, long, outlier, skewed |

### 2. Parametric Tests

**R Packages:** `stats`, `WRS2`, `car`

| Function | R Equivalent | Tolerance | Test Cases |
|----------|-------------|-----------|------------|
| `t_test(..., Welch)` | `t.test(var.equal=FALSE)` | 1e-10 | two-sided, less, greater, mu=0.5, conf.level=0.95/0.90/0.99, null_value |
| `t_test(..., Student)` | `t.test(var.equal=TRUE)` | 1e-10 | two-sided, less, greater |
| `t_test(..., Paired)` | `t.test(paired=TRUE)` | 1e-10 | two-sided, less, greater |
| `yuen_test()` | `WRS2::yuen()` | 1e-10 | 20% trim, 10% trim, less, greater, conf.level=0.95 |
| `brown_forsythe()` | `car::leveneTest(center=median)` | 1e-10 | 2 groups, 3 groups |

### 3. Nonparametric Tests

**R Packages:** `stats`, `lawstat`

| Function | R Equivalent | Tolerance | Test Cases |
|----------|-------------|-----------|------------|
| `rank()` | `base::rank(ties.method="average")` | 1e-10 | simple, with ties |
| `mann_whitney_u()` | `wilcox.test(exact=FALSE/TRUE)` | 1e-6 | two-sided, less, greater, continuity correction, exact p-value, mu=0.5, conf.level=0.95/0.90, null_value |
| `wilcoxon_signed_rank()` | `wilcox.test(paired=TRUE, exact=FALSE/TRUE)` | 1e-6 | two-sided, less, greater, continuity correction, exact p-value, mu=0.3, conf.level=0.95/0.90, null_value |
| `kruskal_wallis()` | `kruskal.test()` | 1e-10 | 3 groups |
| `brunner_munzel()` | `lawstat::brunner.munzel.test()` | 1e-6 | two-sided, less, greater, alpha=0.05/0.10/0.01 (conf.int) |

### 4. Distributional Tests

**R Packages:** `stats`, `moments`

| Function | R Equivalent | Tolerance | Notes |
|----------|-------------|-----------|-------|
| `shapiro_wilk()` | `shapiro.test()` | W: 1e-3, p: 0.05 | Algorithm variations across implementations |
| `dagostino_k_squared()` | `moments::agostino.test()` + `moments::anscombe.test()` | 1e-6 | Combines skewness and kurtosis |

**Note on Shapiro-Wilk Tolerance:** The relaxed tolerances are intentional. Different implementations of the Royston AS R94 algorithm can produce slightly different results due to:
- Different polynomial approximations for small samples
- Different p-value approximation methods
- Floating-point precision in the algorithm's iterative calculations

### 5. Forecast Evaluation Tests

**R Packages:** `forecast`

| Function | R Equivalent | Tolerance | Test Cases |
|----------|-------------|-----------|------------|
| `diebold_mariano()` | `forecast::dm.test()` | 1e-6 | SE/AE loss, h=1/h=3, alternative=two-sided/less/greater, varestimator=acf/bartlett, horizon, loss_function, varestimator (output fields) |
| `clark_west()` | Manual HAC computation | 1e-6 | h=1, h=3 |
| `spa_test()` | Manual computation | 1e-6 | 3 competing models |
| `model_confidence_set()` | Manual computation | 1e-6 | 4 models |

### 6. Modern Distribution Tests

| Function | R Equivalent | Tolerance | Notes |
|----------|-------------|-----------|-------|
| `energy_distance_test()` | Manual distance calculation | 1e-6 | Permutation p-value |
| `mmd_test()` | Kernel-based computation | 1e-6 | Multiple kernel types |

## Example: Complete Validation Flow

### Welch T-Test Validation

**Step 1: R generates test data** (`R/generate_refs.R`, lines 83-84)
```r
set.seed(42)
g1 <- rnorm(20, mean = 5, sd = 1)
g2 <- rnorm(25, mean = 5.5, sd = 1.5)
```

**Step 2: R computes expected values** (`R/generate_refs.R`, lines 98-100)
```r
welch_two <- t.test(g1, g2, var.equal = FALSE, alternative = "two.sided")
welch_less <- t.test(g1, g2, var.equal = FALSE, alternative = "less")
welch_greater <- t.test(g1, g2, var.equal = FALSE, alternative = "greater")
```

**Step 3: R saves to CSV** (`R/data/ttest_welch.csv`)
```csv
"statistic_two","df_two","p_value_two","mean_x_two","mean_y_two",...
-0.0643125318393203,42.9656281823889,0.949019668509795,5.01533768967797,5.03745840727566,...
```

**Step 4: Rust test loads and validates** (`tests/tdd_parametric.rs`, lines 13-30)
```rust
#[test]
fn test_welch_two_sided() {
    let refs = common::load_reference_scalars("ttest_welch.csv");
    let g1 = common::load_reference_vector("ttest_g1.csv");
    let g2 = common::load_reference_vector("ttest_g2.csv");

    let result = t_test(&g1, &g2, TTestKind::Welch, Alternative::TwoSided, 0.0)
        .expect("t_test should succeed");

    assert_relative_eq!(result.statistic, refs["statistic_two"], epsilon = 1e-10);
    assert_relative_eq!(result.df, refs["df_two"], epsilon = 1e-10);
    assert_relative_eq!(result.p_value, refs["p_value_two"], epsilon = 1e-10);
}
```

## R Package Dependencies

The following R packages are required to regenerate reference data:

| Package | CRAN | Purpose |
|---------|------|---------|
| `stats` | Base R | Core statistical functions (t.test, wilcox.test, etc.) |
| `WRS2` | [CRAN](https://cran.r-project.org/package=WRS2) | Robust statistics (yuen) |
| `car` | [CRAN](https://cran.r-project.org/package=car) | Companion to Applied Regression (leveneTest) |
| `lawstat` | [CRAN](https://cran.r-project.org/package=lawstat) | Brunner-Munzel test |
| `e1071` | [CRAN](https://cran.r-project.org/package=e1071) | Skewness, kurtosis (type 2) |
| `moments` | [CRAN](https://cran.r-project.org/package=moments) | D'Agostino tests |
| `forecast` | [CRAN](https://cran.r-project.org/package=forecast) | Diebold-Mariano test |

Install all dependencies in R:
```r
install.packages(c("WRS2", "car", "lawstat", "e1071", "moments", "forecast"))
```

## Tolerance Rationale

| Tolerance | Used For | Rationale |
|-----------|----------|-----------|
| 1e-12 | Math primitives | Machine epsilon level; direct floating-point operations |
| 1e-10 | Most statistical tests | High precision; allows for minor floating-point accumulation |
| 1e-6 | Bootstrap/permutation tests | Randomness introduces variation; seed controls but precision varies |
| 1e-3 | Shapiro-Wilk W statistic | Known algorithm variations between implementations |
| 0.05 | Shapiro-Wilk p-value | Different p-value approximation methods produce larger variation |

## Test Coverage Summary

| Category | Tests | Test Cases | Reference Files |
|----------|-------|------------|-----------------|
| Math Primitives | 7 | 59 | 8 |
| Parametric | 5 | 27 | 13 |
| Nonparametric | 5 | 42 | 14 |
| Distributional | 2 | 17 | 12 |
| Resampling | 3 | 13 | 4 |
| Modern | 2 | 8 | 3 |
| Forecast | 5 | 25 | 22 |
| **Total** | **29** | **191** | **76** |

## Reproducibility

All reference data is reproducible:

1. **Fixed seed:** `set.seed(42)` in R ensures identical random data
2. **Version controlled:** CSV files are committed to the repository
3. **CI/CD verified:** GitHub Actions runs `cargo test` against these references
4. **Transparent:** All R code and CSV data are publicly available

## Verifying Your Installation

```bash
# 1. Clone repository
git clone https://github.com/sipemu/anofox-statistics-rs.git
cd anofox-statistics-rs

# 2. (Optional) Regenerate R references
Rscript R/generate_refs.R

# 3. Run validation tests
cargo test

# All 251 tests should pass (60 unit + 191 TDD integration)
```