# ARCUS-H 1.0

## Adaptive Reinforcement Coherence Under Stress

### Open Benchmark for Behavioral Stability in Reinforcement Learning

[![DOI](https://img.shields.io/badge/DOI-10.5281%2Fzenodo.19075167-024BA0)](https://zenodo.org/records/19075898)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/)
[![SB3](https://img.shields.io/badge/SB3-compatible-green.svg)](https://stable-baselines3.readthedocs.io/)

> **ARCUS-H is an open-source evaluation harness that adds a second axis to RL benchmarking: behavioral stability under structured stress, not just reward.**

---

## Why another benchmark?

Standard RL optimizes the expected discounted return:

$$J(\pi) = \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$$

But return alone does not reveal how an agent behaves when execution assumptions are violated. ARCUS-H evaluates behavioral stability under controlled stress and shows that **reward and stability can diverge dramatically**.

**Key empirical finding:** Pearson $r = +0.14$, $p = 0.364$ between normalized reward and collapse rate under valence inversion across 9 environments and 7 algorithms. High-reward agents are not necessarily stable agents.


---

## What's in ARCUS-H 1.0

| Dimension | Coverage |
|-----------|----------|
| Environments | 9: 6 classic control, 2 MuJoCo, 1 Atari (Pong) |
| Algorithms | 7: PPO, A2C, TRPO, DQN, DDPG, SAC, TD3 |
| Stressors | 4: concept drift, resource constraint, trust violation, valence inversion |
| Seeds | 10 per configuration |
| Eval modes | Deterministic + stochastic |
| Total runs | ~830 (env × algo × seed × mode × schedule) |

---

## Evaluation Protocol

Each evaluation run is divided into three contiguous phases:

$$\textbf{PRE} \;\rightarrow\; \textbf{SHOCK} \;\rightarrow\; \textbf{POST}$$

Each run has 120 episodes (40 per phase). Stress transformations apply **only during SHOCK**.

---

## Stress Schedules

### 1. Concept Drift (CD)

The observation distribution shifts during SHOCK via an auto-calibrated additive drift (a Gaussian random walk):

$$s_t^{exec} = s_t + \delta_t, \quad \delta_t = \delta_{t-1} + \varepsilon_t, \quad \varepsilon_t \sim \mathcal{N}(0, \sigma_{obs}^2 I)$$

$\sigma_{obs}$ is calibrated from the reference pass, so there are no free parameters.

### 2. Resource Constraint (RC)

Models reduced control authority.

Continuous: $a_t^{exec} = \kappa a_t, \quad 0 < \kappa < 1$

Discrete: $a_t^{exec} = \begin{cases} a_t & \text{with prob } 1-p \\ a_{default} & \text{otherwise} \end{cases}$

### 3. Trust Violation (TV)

Models action-execution mismatch.

Continuous: $a_t^{exec} = \mathbf{M} a_t + \varepsilon_t$

Discrete: $a_t^{exec} = \pi_f(a_t)$ (fixed non-identity permutation)

### 4. Valence Inversion (VI)

Corrupts reward feedback:

$$r_t^{exec} = -r_t$$

Designed as the most severe stressor: reward sign inversion renders the agent's optimization objective inconsistent with its learned policy.

---

## Behavioral Stability Channels

ARCUS-H constructs a per-episode stability score $I_e \in [0,1]$ from five interpretable channels. No internal model access is required: all channels are computed from episode statistics.

| Channel | What it measures |
|---------|-----------------|
| **Competence** | Reward improvement relative to recent EMA trend |
| **Coherence** | Action smoothness (switch rate / jerk) |
| **Continuity** | Self-consistency across consecutive episodes |
| **Integrity** | Fidelity to pre-phase behavioral anchor |
| **Meaning** | Constraint respect and regret-free behavior |

$$I_e = w_c c_e + w_h h_e + w_t t_e + w_i i_e + w_m m_e$$

Here $c_e, h_e, t_e, i_e, m_e$ are the per-episode competence, coherence, continuity, integrity, and meaning scores. Weights are derived from per-channel baseline MADs, so noisier channels receive lower weight automatically.

---

## Adaptive Calibration

A key contribution is the **adaptive p95 threshold**: the binary collapse-event threshold $\eta$ is set to the 95th percentile of the per-episode collapse scores $S_e$ computed over the pre-phase of each run:

$$\eta = \mathrm{p95}\!\left(\{S_e : e \in \mathcal{T}_{pre}\}\right)$$

This targets a false-positive rate (FPR) of $\approx \alpha = 0.05$ by construction, without any environment-specific tuning. Empirical mean FPR across 83 runs: **2.0%**.

---

## Metrics

**Shock collapse rate:**

$$CR_{shock} = \frac{1}{|\mathcal{T}_{shock}|}\sum_{e \in \mathcal{T}_{shock}} \mathbf{1}[S_e \geq \eta]$$

**Pre-to-shock stability drop:**

$$\Delta I = \mu_{pre}(I) - \mu_{shock}(I)$$

**Leaderboard score** (stability-weighted):

$$\mathrm{robust} = 0.55 \cdot \bar{I} + 0.30 \cdot (1 - CR_{shock}) + 0.15 \cdot \mathrm{rwd\_norm}$$

---

## Key Results

### Reward does not predict stability

![Reward vs Stability](runs/plots/reward_vs_collapse_scatter.png)

Pearson $r = +0.14$, $p = 0.364$: no significant correlation between normalized reward and collapse rate. High-reward MuJoCo agents collapse at 73–84% under stress; DQN on MountainCar collapses near 0%.

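The calibration rule and metrics defined in the Adaptive Calibration and Metrics sections reduce to a few array operations. The following is a minimal NumPy sketch, not the harness's actual code: the function names are hypothetical, and it assumes the per-episode collapse scores $S_e$ and stability scores $I_e$ are already available as arrays. It also assumes that the leaderboard's `Identity` column corresponds to $\bar{I}$.

```python
import numpy as np


def adaptive_threshold(pre_scores, q=95.0):
    """eta: the q-th percentile (p95 by default) of PRE-phase collapse scores S_e."""
    return float(np.percentile(np.asarray(pre_scores, dtype=float), q))


def shock_collapse_rate(shock_scores, eta):
    """CR_shock: fraction of SHOCK episodes whose collapse score satisfies S_e >= eta."""
    return float(np.mean(np.asarray(shock_scores, dtype=float) >= eta))


def stability_drop(pre_identity, shock_identity):
    """Delta I: mean stability score over PRE minus mean stability score over SHOCK."""
    return float(np.mean(pre_identity) - np.mean(shock_identity))


def robust_score(mean_identity, cr_shock, rwd_norm):
    """Stability-weighted leaderboard score, using the weights stated in Metrics."""
    return 0.55 * mean_identity + 0.30 * (1.0 - cr_shock) + 0.15 * rwd_norm


# Inputs taken from the MountainCar-v0 / dqn leaderboard row
# (Identity = 0.950, CR_shock = 0.001, Rew_norm = 1.000).
print(round(robust_score(0.950, 0.001, 1.000), 3))  # 0.972
```

Under the stated assumption about the `Identity` column, the printed value matches the `Robust` entry of the MountainCar-v0 DQN row in the leaderboard. Because $\eta$ is recomputed from each run's own pre-phase, no per-environment constant appears anywhere in the sketch.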

---

### Collapse rate heatmap (all envs × stressors)

![Heatmap](runs/plots/heatmap_collapse_rate.png)

---

### Each stressor has a distinct channel signature

![Radar](runs/plots/identity_components_radar.png)

- **CD** depresses integrity (observation shift breaks behavioral anchor)
- **TV** suppresses all channels uniformly
- **VI** attacks meaning (inverted reward generates constraint-violating behavior)
- **RC** reduces competence and coherence

---

### Deterministic vs stochastic verdicts agree strongly

![Det vs Stoc](runs/plots/stochastic_vs_deterministic.png)

Pearson $r = 0.82$–$0.96$ across stressors: the choice of eval mode does not change ARCUS-H rankings.

---

### Discrete vs continuous action spaces

![Action Space](runs/plots/collapse_by_action_space.png)

Continuous action spaces are significantly more vulnerable under RC, TV, and VI (Mann-Whitney $p < 0.001$ for VI).

---

### Suite-level comparison

![Suite](runs/plots/mujoco_vs_classic_depth.png)

MuJoCo agents collapse most severely (73–84%) despite achieving the highest reward, which is the clearest demonstration of reward/stability divergence.

---

## Leaderboard (baseline schedule, deterministic)

Top performers per environment:

| Environment | Algo | Robust | Identity | CR_shock | Rew_norm |
|-------------|------|--------|----------|----------|----------|
| MountainCar-v0 | dqn | 0.972 | 0.950 | 0.001 | 1.000 |
| MountainCarContinuous-v0 | trpo | 0.940 | 0.891 | 0.000 | 1.000 |
| Hopper-v4 | sac | 0.904 | 0.848 | 0.042 | 1.000 |
| Acrobot-v1 | trpo | 0.920 | 0.856 | 0.005 | 1.000 |
| CartPole-v1 | trpo | 0.891 | 0.833 | 0.056 | 1.000 |
| Pendulum-v1 | sac | 0.866 | 0.756 | 0.000 | 1.000 |
| Pong (ALE) | ppo | 0.859 | 0.772 | 0.053 | 0.912 |

Full leaderboard: [`runs/leaderboard.csv`](runs/leaderboard.csv)

Full tables (LaTeX-ready): [`runs/plots/tables/`](runs/plots/tables/)

---

## All Plots

All 15 benchmark plots are generated automatically in `runs/plots/`, in both PNG (300 dpi) and PDF (vector, for LaTeX inclusion).

| Plot | Description |
|------|-------------|
| `heatmap_collapse_rate` | Global env × stressor collapse rate matrix |
| `reward_vs_collapse_scatter` | Core finding: reward ≠ stability |
| `identity_components_radar` | Per-stressor channel signatures |
| `vulnerability_heatmap` | Worst stressor per algo × env |
| `collapse_by_action_space` | Discrete vs continuous (Mann-Whitney) |
| `stochastic_vs_deterministic` | Eval mode robustness |
| `fpr_validation` | Scoring calibration (FPR = 2.0%) |
| `per_seed_consistency` | Seed stability (CV < 0.15) |
| `score_by_schedule_per_env` | Collapse score curves per env |
| `collapse_rate_by_algo` | Per-algo stressor profiles |
| `on_policy_vs_off_policy` | Policy family comparison |
| `seed_variance_boxplot` | Distribution over seeds |
| `leaderboard_bar` | Full leaderboard |
| `reward_degradation_heatmap` | Normalized reward drop |
| `mujoco_vs_classic_depth` | Suite-level CI comparison |

---

## Reproducibility

### Install

```bash
git clone https://github.com/karimzn00/ARCUSH_1.0.git
cd ARCUSH_1.0
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

### Train

```bash
# Classic control (300k steps)
python -m arcus.harness_rl.run_train \
  --env CartPole-v1 --algo ppo \
  --timesteps 300000 --seeds 0-9

# MuJoCo (300k steps)
python -m arcus.harness_rl.run_train \
  --env HalfCheetah-v4 --algo sac \
  --timesteps 300000 --seeds 0-9

# Atari (3M steps)
python -m arcus.harness_rl.run_train \
  --env ALE/Pong-v5 --algo ppo \
  --timesteps 3000000 --seeds 0-9
```

### Evaluate

```bash
python -m arcus.harness_rl.run_eval \
  --run_dir runs/YOUR_RUN_DIR \
  --env CartPole-v1 --algo ppo \
  --episodes 120 --seeds 0-9 --both \
  --save_per_episode
```

### Generate all plots and tables

```bash
python -m arcus.harness_rl.compare \
  --root runs \
  --leaderboard \
  --print \
  --write_csv \
  --plots
```

Output:

- `runs/leaderboard.csv`: full leaderboard
- `runs/plots/*.png` + `runs/plots/*.pdf`: 15 plots (300 dpi PNG + vector PDF)
- `runs/plots/tables/*.tex`: LaTeX-ready tables

---

## Supported Algorithms

| Algorithm | Family | Action Space | Library |
|-----------|--------|-------------|---------|
| PPO | On-policy | Discrete + Continuous | stable-baselines3 |
| A2C | On-policy | Discrete + Continuous | stable-baselines3 |
| TRPO | On-policy | Discrete + Continuous | sb3-contrib |
| DQN | Off-policy | Discrete | stable-baselines3 |
| DDPG | Off-policy | Continuous | stable-baselines3 |
| SAC | Off-policy | Continuous | stable-baselines3 |
| TD3 | Off-policy | Continuous | stable-baselines3 |

---

## Supported Environments

| Environment | Suite | Action Space | Obs Type |
|-------------|-------|-------------|----------|
| CartPole-v1 | Classic | Discrete | State |
| Acrobot-v1 | Classic | Discrete | State |
| FrozenLake-v1 | Classic | Discrete | State |
| MountainCar-v0 | Classic | Discrete | State |
| MountainCarContinuous-v0 | Classic | Continuous | State |
| Pendulum-v1 | Classic | Continuous | State |
| HalfCheetah-v4 | MuJoCo | Continuous | State |
| Hopper-v4 | MuJoCo | Continuous | State |
| ALE/Pong-v5 | Atari | Discrete | Pixels |

---

## Limitations

**Stationarity assumption.** ARCUS-H assumes that, absent any stressor, the agent's behavioral distribution is stationary across the PRE, SHOCK, and POST phases. Procedurally generated environments (e.g., Procgen) violate this: each episode draws a new level, so shock-phase behavior can differ from the pre-phase calibration even without a stressor.

**Image-based off-policy.** DQN on Atari requires ~10× more training steps than on-policy methods to reach competence, making a matched comparison infeasible.

**Stressor scope.** Current stressors cover additive observation drift (CD), action perturbations (RC, TV), and reward perturbation (VI). Other forms of observation corruption and latent dynamics shift are not yet included.


---

## Citation

If you use ARCUS-H in your research, please cite:

```bibtex
@misc{zinebi2025arcush,
  title  = {ARCUS-H: Behavioral Stability Under Controlled Stress as a Complementary RL Evaluation Axis},
  author = {ZINEBI, Karim},
  year   = {2025},
  url    = {https://github.com/karimzn00/ARCUSH}
}
```

---

## License

MIT License. See [LICENSE](LICENSE).