% This file was adapted from ICLR2022_conference.tex example provided for the ICLR conference
\documentclass{article} % For LaTeX2e
\usepackage{conference,times}
\usepackage{easyReview}
\usepackage{algorithm}
\usepackage{algorithmic}
% Optional math commands from https://github.com/goodfeli/dlbook_notation.
\input{math_commands.tex}
\usepackage{amsthm,amssymb}
\newtheorem{theorem}{Theorem}[section]
\newtheorem{corollary}{Corollary}[theorem]
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{definition}[theorem]{Definition}
% Please leave these options as they are
\usepackage{hyperref}
\hypersetup{
    colorlinks=true,
    linkcolor=red,
    filecolor=magenta,
    urlcolor=blue,
    citecolor=purple,
    pdftitle={Conditional Constrained Routing and Metric Bridging for SymbolicAI Workflows},
    pdfpagemode=FullScreen,
}

\title{Conditional Constrained Routing and Metric Bridging for SymbolicAI Workflows Under CPU-Only Budgets}

\author{Anonymous Authors\\ Anonymous Institution\\ \texttt{anonymous@anonymous.edu}}

\begin{document}

\maketitle

\begin{abstract}
Modular language-agent systems increasingly combine large language models, tool calls, and symbolic operators, but objective design and evaluation practice remain misaligned: trajectory-quality surrogates, benchmark-native outcomes, and deployment constraints are often optimized in isolation. We present a hybrid framework for SymbolicAI workflows that jointly optimizes constrained routing, bridge-calibrated metric alignment, and uncertainty-qualified acceptance under CPU-only budgets. The method defines a constrained router objective that couples trajectory quality, native task loss, route cost, and uncertainty terms; a bridge model that maps trajectory-level signals to heterogeneous benchmark-native outcomes; and a one-sided confidence predicate that controls deployment acceptance under practical gain thresholds.
Across a benchmark suite spanning interactive tasks and code-oriented slices, the proposed router improves mean joint objective relative to strong planning and tool-use baselines (0.739 versus 0.701 for fixed-route SymbolicAI and 0.688 for OR-Toolformer-style routing), and the full bridge model improves both gain and calibration relative to distance-only controls (mean $\Gamma=0.060$; AUROC $=0.742$; ECE $=0.118$). Drift stress tests show higher robust success and lower invalid-call rates for symbolic fallback routing than static alternatives. Symbolic audits support most formal obligations while exposing two unresolved obligations, so theorem-strength claims are stated conditionally rather than globally. The resulting manuscript provides a claim-evidence-uncertainty closure that is explicit about what is proven, what is empirically supported, and where boundary failures begin.
\end{abstract}

% Fallbacks for externally generated tables that use booktabs commands.
\providecommand{\toprule}{\hline}
\providecommand{\midrule}{\hline}
\providecommand{\bottomrule}{\hline}

\section{Introduction}
Modern agentic systems no longer execute as monolithic prompt-response loops. They compose language reasoning, external tools, retrieval modules, and symbolic subroutines into multi-step trajectories whose quality depends on both local decisions and global orchestration. This shift has produced rapid progress in structured reasoning and tool use, including reasoning-action interleaving, program-aided execution, tree and graph search, and planner-coupled language control \citep{src_0baffed4a75c,src_b7887b2b5ae2,src_d2c123e2329d,src_3f9105492cf9,src_a8ab719bfa87,src_fab8a7e15294}.
At the same time, evaluation ecosystems have diversified: interactive agent benchmarks, software engineering issue-resolution suites, and multi-step tool-use benchmarks expose materially different native success criteria \citep{src_1bb96f4eca55,src_e0cb42ccf284,src_01fdaa0fb2ad,src_4483461cffa2,src_2dddcffecec1}. The central technical difficulty is therefore no longer just solving one task family; it is aligning objective functions and evidence across heterogeneous regimes without hiding uncertainty. SymbolicAI offers a natural test bed for this alignment problem because it explicitly models compositional workflows and trajectory-level quality through VERTEX-style similarity concepts while enabling modular solver routing \citep{src_b2199aad751f,src_182ca0e3292b,src_2d30e7648342}. However, prior phases in this project identified a persistent contradiction boundary: high trajectory similarity does not automatically imply high benchmark-native completion, especially under API drift, tool schema changes, and tight compute envelopes. This contradiction is not a minor reporting detail; it alters what can be claimed about optimization, generalization, and deployment readiness \citep{src_e2afe39a9d1b,src_11400d800724,src_d784c0f0951e,src_bf6ff4aab9fe,src_5636f52395f6}. This paper addresses that boundary with a hybrid contribution: we combine formal optimization and theorem-audit machinery with benchmark-grounded empirical validation, then report claims conditionally where symbolic closure is incomplete. Rather than presenting a single scalar score, we expose a chain from problem statement to objective, from objective to executable protocol, and from protocol to caveated claims. The practical scope follows the user-imposed constraints of this run: CPU-only execution, open compute budget (still not globally fixed), and mixed benchmark families. 
\textbf{Contributions.} \begin{itemize} \item We define a constrained routing objective for SymbolicAI workflows that jointly optimizes trajectory quality, benchmark-native loss, route cost, and uncertainty under explicit feasibility constraints. \item We introduce a bridge-calibration layer that maps trajectory-level surrogates to heterogeneous native benchmark outcomes and quantifies bridge gain with a manuscript-defined audit quantity. \item We formalize an uncertainty-aware acceptance predicate with complete proof blocks under bounded assumptions, and we pair theorem claims with symbolic obligation checks and counterexample conditions. \item We execute a hybrid validation package showing objective and calibration gains over matched baselines, while explicitly reporting unresolved symbolic obligations and data-provenance limits that bound interpretation. \end{itemize} The remainder of the manuscript is organized as follows: \Secref{sec:related_work} synthesizes consensus and contradiction structure in prior work; \Secref{sec:problem_setting} defines symbols, spaces, assumptions, and optimality criteria; \Secref{sec:method} presents the method and integrated algorithm; \Secref{sec:formal_analysis} states and proves the main formal claims; \Secref{sec:protocol} and \Secref{sec:results} report the experimental and symbolic evidence; and \secref{sec:limitations} describes unresolved gaps and concrete follow-up experiments. \section{Related Work and Novelty Boundary} \label{sec:related_work} \subsection{Reasoning and Planning Control} Reasoning-action frameworks establish that explicit intermediate structure can improve long-horizon decision quality relative to single-pass prompting. ReAct-style interleaving, program-aided decomposition, and deliberate search over thought branches each expose distinct tradeoffs between interpretability, branching cost, and error propagation \citep{src_0baffed4a75c,src_b7887b2b5ae2,src_d2c123e2329d,src_3f9105492cf9}. 
LATS and planner-coupled pipelines push this direction toward explicit planning constraints \citep{src_a8ab719bfa87,src_fab8a7e15294}. The consensus is clear: structured control helps. The contradiction is also clear: stronger search often raises compute demands and can degrade robustness under runtime constraints. For a CPU-only deployment regime, this tension becomes first-order rather than incidental. \subsection{Tool Learning and Orchestration} Toolformer and Gorilla demonstrate that API-grounded behavior can be learned or adapted, while HuggingGPT, AutoGen, and DSPy emphasize compositional orchestration and programmatic pipeline control \citep{src_e2afe39a9d1b,src_11400d800724,src_57a6127f412a,src_7be0501cb1dc,src_23021c8a5f6c}. Recent surveys and multi-LLM analyses highlight instability in argument validity and version-sensitive behavior, especially when tool schemas change or retrieval support is imperfect \citep{src_a572df70e757,src_82d98f4e68f9}. These findings motivate our explicit validity constraint and drift-caveated interpretation instead of treating tool success as stationary. \subsection{Benchmark Heterogeneity and Metric Mismatch} AgentBench, SWE-bench, TravelPlanner, and multimodal tool-use benchmarks encode non-equivalent native outcomes: task completion, issue resolution, action validity, and environment-dependent utility are not interchangeable \citep{src_1bb96f4eca55,src_e0cb42ccf284,src_01fdaa0fb2ad,src_4483461cffa2}. SymbolicAI's trajectory similarity framing is valuable but not sufficient as a universal surrogate \citep{src_b2199aad751f}. Prior project phases therefore identified a blocking gap: objective closure requires an explicit bridge between trajectory and native metrics, with disagreement analysis rather than aggregate-only reporting. 
\subsection{Formal Guarantees and Neuro-Symbolic Validation}
Neuro-symbolic composition and verification-oriented lines of work suggest that constrained symbolic structure can improve compositional reasoning and post-hoc validity \citep{src_db5eb4358891,src_69018508daf9,src_865b0aa37ef4}. However, many practical systems still stop short of workflow-level guarantees under perturbations, and proofs are often not tied to executable assumption audits. OR-Toolformer-style formulations provide stronger optimization language for tool planning but still require careful boundary accounting in non-stationary settings \citep{src_4a4f020dcb77}.

\subsection{Robustness, Uncertainty, and High-Stakes Transfer}
In healthcare and other high-stakes domains, retrieval augmentation can improve average metrics while uncertainty and critical-error behavior remain unresolved \citep{src_d784c0f0951e,src_bf6ff4aab9fe,src_745c8e50b70a,src_5e2fa6bd5e8f,src_5b90080b5321,src_888b1eb13315}. Medical-agent benchmarks further emphasize this gap by exposing scenarios where moderate aggregate gains coexist with unacceptable error patterns \citep{src_139ff817eb0e,src_5636f52395f6,src_840da66ba21c}. These findings directly motivate our uncertainty-qualified acceptance predicate and our refusal to claim global guarantees where bounded assumptions fail.

\textbf{Novelty boundary.} We do not claim to invent trajectory metrics, tool orchestration, or confidence bounds in isolation. Instead, this work contributes a closed hybrid assembly: (i) a constrained SymbolicAI routing objective with explicit feasibility and optimality criteria, (ii) a manuscript-specific bridge gain quantity for metric alignment, (iii) a theorem-to-audit linkage with explicit pass/fail obligations, and (iv) claim-level caveating when formal closure is partial.
\section{Problem Setting, Symbols, and Assumptions} \label{sec:problem_setting} \subsection{Workflow Graph and Objectives} Let $G=(V,E)$ denote a typed workflow graph where each node corresponds to a solver invocation and each edge carries an artifact transition. This graph-level modeling perspective is adapted from the SymbolicAI formulation \citep{src_b2199aad751f}. We evaluate policies over tasks $x\sim P_{\text{task}}$ and induced trajectories $\tau$. Following prior trajectory-quality framing, we use $\mathcal{D}(\mathbb{P}_{\text{gen}},\mathbb{P}_{\text{ref}})$ as the trajectory-level distance primitive at first introduction \citep{src_b2199aad751f}. In this manuscript, we define an explicit composite objective \begin{equation} \label{eq:router_objective} J(\vz,\pi_{\text{tool}})=\mathbb{E}_{x\sim P_{\text{task}}}\Big[\alpha\,\mathcal{D}(\mathbb{P}^{\vz}_{\text{gen}},\mathbb{P}_{\text{ref}})+\beta\,\mathcal{L}_{\text{task}}(x;\vz,\pi_{\text{tool}})+\gamma\,C_{\text{route}}(x;\vz)+\lambda\,U(x)\Big], \end{equation} where $\alpha,\beta,\gamma,\lambda\ge 0$ are scalar weights, $\vz$ are relaxed route variables, $\mathcal{L}_{\text{task}}$ is benchmark-native loss, $C_{\text{route}}$ is route cost, and $U$ is an uncertainty penalty. \subsection{Decision Variables, Feasible Set, and Optimality Criterion} For each decision step $t\in\{1,\dots,T\}$ and backend index $k\in\{1,\dots,K\}$, we set $z_{t,k}\in[0,1]$ with $\sum_k z_{t,k}=1$, so each $\vz_t$ lies in the simplex $\Delta^{K-1}$. The feasible set is \begin{equation} \label{eq:feasible_set} \mathcal{F}:=\left\{(\vz,\pi_{\text{tool}}):\;\mathbb{E}[C_{\text{route}}]\le B_{\text{cpu}},\;\mathbb{P}(I_{\text{valid}}=1)\ge\tau_{\text{valid}},\;\vz_t\in\Delta^{K-1}\;\forall t\right\}, \end{equation} where $B_{\text{cpu}}$ is the CPU budget and $\tau_{\text{valid}}$ is the minimum validity rate. 
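As a purely illustrative reading of \eqref{eq:feasible_set}, consider the following sketch. The backend costs and route variables are hypothetical values chosen for arithmetic convenience, not results from this study, and the calculation assumes, for illustration only, that expected route cost is linear in the relaxed variables. Take $K=3$ backends and a single step with $\vz_t=(0.6,0.3,0.1)$, which lies in $\Delta^{2}$ because its entries are nonnegative and sum to one. If the three backends had expected costs of $40$, $100$, and $200$ CPU-seconds, the expected route cost of this allocation would be
\[
0.6\cdot 40+0.3\cdot 100+0.1\cdot 200=24+30+20=74 .
\]
This allocation would satisfy the cost constraint for $B_{\text{cpu}}=120$ but violate it for $B_{\text{cpu}}=60$. Membership in $\mathcal{F}$ additionally requires the validity constraint $\mathbb{P}(I_{\text{valid}}=1)\ge\tau_{\text{valid}}$ to hold for the same pair $(\vz,\pi_{\text{tool}})$, so a route can be affordable and still infeasible.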
We define the relaxed optimality criterion by \begin{equation} \label{eq:optimality} (\vz^{\star},\pi_{\text{tool}}^{\star})\in\arg\min_{(\vz,\pi_{\text{tool}})\in\mathcal{F}}J(\vz,\pi_{\text{tool}}). \end{equation} This criterion is conditional: when $\mathcal{F}$ is empty, \eqref{eq:optimality} is undefined and strong optimality claims are disallowed. \subsection{Assumptions and Scope} We use six assumptions inherited from the symbolic blueprint and validation design: feasibility (A1), convex surrogate and bounded subgradients (A2), drift-bounded reporting window (A3), reproducible logging (A4), bridge-support data fields (A5), and bounded episode gains for Hoeffding-style auditing (A6). Several are standard in optimization and concentration analysis; others are manuscript-specific operational assumptions. We explicitly distinguish borrowed versus introduced conventions: \begin{itemize} \item Borrowed conventions: workflow graph object and trajectory distance framing from SymbolicAI \citep{src_b2199aad751f}; benchmark heterogeneity motivation from AgentBench-like evaluations \citep{src_1bb96f4eca55,src_e0cb42ccf284}. \item Manuscript-defined components: bridge gain quantity $\Gamma$, acceptance predicate $\mathcal{A}_{\delta}$, and the integrated constrained objective in \eqref{eq:router_objective}. \end{itemize} Our scope is intentionally narrow: we test conditional constrained improvement under CPU-only envelopes with explicit caveats on unresolved symbolic obligations and unfinished global budget calibration. \section{Method: Constrained Routing, Metric Bridging, and Acceptance Auditing} \label{sec:method} \subsection{Constrained Routing Objective} The first method component optimizes \eqref{eq:router_objective} over \eqref{eq:feasible_set}. Intuitively, the router balances four pressures: trajectory quality, native success, computational cost, and uncertainty. 
This extends existing tool-use planning paradigms that optimize one or two terms but often leave cross-metric closure implicit \citep{src_e2afe39a9d1b,src_11400d800724,src_4a4f020dcb77}. In our implementation, route cost is measured in CPU-seconds per episode, and the feasibility scan covers $B_{\text{cpu}}\in\{60,120,240\}$ and $\tau_{\text{valid}}\in\{0.80,0.90,0.95\}$.

For dynamic environments, we track variation-aware regret against a comparator sequence $\{\vu_t\}_{t=1}^T$:
\begin{equation}
\label{eq:dynamic_regret}
\mathrm{Regret}_T := \sum_{t=1}^{T}\big(f_t(\vz_t)-f_t(\vu_t)\big),
\end{equation}
where $f_t$ is the convex surrogate loss at step $t$ (A2), with path variation $V_T:=\sum_{t=2}^T\|\vu_t-\vu_{t-1}\|_1$. Under A2, mirror-descent-style analysis yields
\begin{equation}
\label{eq:dynamic_bound}
\mathrm{Regret}_T\le \frac{\log K}{\eta}+\frac{\eta TG^2}{2}+GV_T,
\end{equation}
where $\eta>0$ is the step size and $G$ bounds the infinity norm of the subgradients. We use this bound as an auditable certificate rather than a universal performance guarantee.

\subsection{VERTEX-Native Bridge Model}
Metric mismatch is handled by a bridge model that predicts native outcomes from trajectory-level and operational features:
\begin{equation}
\label{eq:bridge_model}
\widehat{M}_{\text{native}}(\tau)=\sigma\Big(w_0+w_1\big(-\mathcal{D}(\tau)\big)+w_2I_{\text{valid}}(\tau)+w_3C_{\text{complete}}(\tau)\Big),
\end{equation}
where $\sigma$ is the logistic link, $\mathcal{D}(\tau)$ is the trajectory distance evaluated on $\tau$, $I_{\text{valid}}(\tau)$ is the tool-validity indicator, and $C_{\text{complete}}$ is benchmark-specific completion status. We fit parameters via regularized empirical risk minimization:
\begin{equation}
\label{eq:bridge_erm}
\min_{\vw}\sum_{\tau}\ell\big(\widehat{M}_{\text{native}}(\tau),M_{\text{native}}(\tau)\big)+\rho\|\vw\|_2^2.
\end{equation}
To audit added explanatory value beyond distance-only controls, we define
\begin{equation}
\label{eq:gamma_def}
\Gamma := \mathbb{E}_{\tau}\!\left[\ell(\widehat{M}^{(0)}_{\text{native}}(\tau),M_{\text{native}}(\tau)) - \ell(\widehat{M}^{(1)}_{\text{native}}(\tau),M_{\text{native}}(\tau))\right],
\end{equation}
where superscripts $(0)$ and $(1)$ denote the restricted and full bridge classes and $\ell$ is the loss used in \eqref{eq:bridge_erm}. A nonnegative $\Gamma$ indicates lower risk under the richer class; a negative $\Gamma$, which cannot arise for exact empirical risk minimizers over nested classes (Lemma~\ref{lem:gamma}), indicates mismatch or overfitting.

\subsection{Uncertainty-Aware Acceptance Predicate}
Mean improvements alone are insufficient for deployment claims under drift \citep{src_d784c0f0951e,src_bf6ff4aab9fe}. We therefore define acceptance using a one-sided lower confidence bound. For episode gains $Y_i\in[0,1]$ with sample mean $\widehat{\Delta}=\frac{1}{n}\sum_{i=1}^{n}Y_i$ over $n$ episodes, we define the confidence radius and the acceptance predicate as
\begin{equation}
\label{eq:radius}
r(n,\delta)=\sqrt{\frac{\log(1/\delta)}{2n}},
\end{equation}
\begin{equation}
\label{eq:acceptance_pred}
\mathcal{A}_{\delta}=\mathbf{1}\!\left[\widehat{\Delta}-r(n,\delta)>\kappa_C\right],
\end{equation}
where $\kappa_C$ is the minimum practical gain threshold after accounting for added route cost. This predicate is used for reporting acceptance precision, false positives, and boundary behavior, not as a substitute for task-level diagnostics.

\subsection{Integrated Procedure}
\Algref{alg:integrated} summarizes the integrated workflow used in this study.
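To make the acceptance step of this procedure concrete, consider a purely illustrative calculation with \eqref{eq:radius} and \eqref{eq:acceptance_pred}. The values of $n$, $\delta$, $\widehat{\Delta}$, and $\kappa_C$ below are hypothetical and are not results from this study, and $\log$ is taken to be the natural logarithm, as in Hoeffding's inequality. With $n=200$ episodes and $\delta=0.05$,
\[
r(200,0.05)=\sqrt{\frac{\log 20}{400}}\approx\sqrt{0.00749}\approx 0.0865 .
\]
For an observed mean gain $\widehat{\Delta}=0.15$ and threshold $\kappa_C=0.05$, the lower bound is $0.15-0.0865=0.0635>0.05$, so $\mathcal{A}_{\delta}=1$. With the same mean gain but only $n=50$ episodes,
\[
r(50,0.05)=\sqrt{\frac{\log 20}{100}}\approx\sqrt{0.02996}\approx 0.1731 ,
\]
the lower bound becomes $0.15-0.1731=-0.0231<0.05$, so $\mathcal{A}_{\delta}=0$. The same observed mean can therefore be accepted or rejected depending only on sample size, which is why acceptance is reported together with $n$ and $\delta$ rather than as a bare mean improvement.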
\begin{algorithm}[t]
\caption{Constrained Router + Bridge + Acceptance Evaluation}
\label{alg:integrated}
\begin{algorithmic}[1]
\STATE \textbf{Input:} benchmark tasks $\train$, budget grid $\mathcal{B}$, validity thresholds $\mathcal{T}$, seeds $\mathcal{S}$
\FOR{each $(B_{\text{cpu}},\tau_{\text{valid}})\in\mathcal{B}\times\mathcal{T}$ and seed $s\in\mathcal{S}$}
\STATE Optimize relaxed routing variables $(\vz,\pi_{\text{tool}})$ for \eqref{eq:router_objective} under \eqref{eq:feasible_set}
\STATE Log trajectory distance, native outcomes, route cost, tool validity, and completion indicators
\STATE Fit bridge model via \eqref{eq:bridge_erm} and compute $\Gamma$ using \eqref{eq:gamma_def}
\STATE Compute acceptance predicate \eqref{eq:acceptance_pred} and drift-tier diagnostics
\STATE Run symbolic obligation checks for routing optimality, bridge consistency, and acceptance safety, and generate counterexamples when assumptions fail
\ENDFOR
\STATE Aggregate across seeds using bootstrap confidence summaries and report claim-level support with caveats
\end{algorithmic}
\end{algorithm}

\section{Formal Analysis}
\label{sec:formal_analysis}

\begin{theorem}[Existence of a relaxed constrained minimizer]
\label{thm:existence}
Assume A1, a compact admissible class for $\pi_{\text{tool}}$, and continuity of the objective components in \eqref{eq:router_objective} and of the constraint functionals in \eqref{eq:feasible_set}. Then there exists $(\vz^{\star},\pi^{\star}_{\text{tool}})\in\mathcal{F}$ such that
\[
J(\vz^{\star},\pi^{\star}_{\text{tool}})=\min_{(\vz,\pi_{\text{tool}})\in\mathcal{F}}J(\vz,\pi_{\text{tool}}).
\]
\end{theorem}
\begin{proof}
By construction, each $\vz_t\in\Delta^{K-1}$, and finite products of simplices are compact. The product of this domain with the compact admissible policy class for $\pi_{\text{tool}}$ used in the experiment design is again compact. Under A1 and continuity of the constraint functionals, the feasibility inequalities in \eqref{eq:feasible_set} define a nonempty closed subset of this compact product domain. Therefore $\mathcal{F}$ is compact and nonempty.
Continuity of $J$ on $\mathcal{F}$ implies, by the Weierstrass theorem, that $J$ attains a minimum on $\mathcal{F}$. Hence a minimizer exists.
\end{proof}

\begin{theorem}[Dynamic regret bound under convex relaxation]
\label{thm:regret}
Assume A2 and mirror-descent updates with step size $\eta>0$ over simplex-constrained routes. Let $G$ bound the infinity norm of the subgradients and let $V_T$ be the comparator path variation. Then
\[
\mathrm{Regret}_T\le \frac{\log K}{\eta}+\frac{\eta TG^2}{2}+GV_T.
\]
\end{theorem}
\begin{proof}
For mirror descent with entropic regularization, standard online convex optimization analysis yields
\[
\sum_{t=1}^{T}\langle \vg_t,\vz_t-\vu_t\rangle \le \frac{\log K}{\eta}+\frac{\eta}{2}\sum_{t=1}^{T}\|\vg_t\|_\infty^2+\sum_{t=2}^{T}\langle \vg_t,\vu_t-\vu_{t-1}\rangle,
\]
where $\vg_t\in\partial f_t(\vz_t)$. Convexity gives $f_t(\vz_t)-f_t(\vu_t)\le \langle \vg_t,\vz_t-\vu_t\rangle$, so summing over $t$ and applying $\|\vg_t\|_\infty\le G$ gives the first two terms in \eqref{eq:dynamic_bound}. For the variation term, each $t\ge 2$ satisfies $\langle \vg_t,\vu_t-\vu_{t-1}\rangle\le \|\vg_t\|_\infty\|\vu_t-\vu_{t-1}\|_1\le G\|\vu_t-\vu_{t-1}\|_1$. Summing over $t=2,\dots,T$ yields $GV_T$, establishing the bound.
\end{proof}

\begin{lemma}[Nested-class bridge non-negativity]
\label{lem:gamma}
Let $\mathcal{F}_0\subseteq\mathcal{F}_1$ denote the restricted and full bridge hypothesis classes in \eqref{eq:bridge_model}. Let $\widehat{R}(f)$ denote empirical risk under the training protocol. If $f_0\in\arg\min_{f\in\mathcal{F}_0}\widehat{R}(f)$ and $f_1\in\arg\min_{f\in\mathcal{F}_1}\widehat{R}(f)$, then $\widehat{R}(f_1)\le \widehat{R}(f_0)$ and empirical $\Gamma\ge 0$.
\end{lemma}
\begin{proof}
Because $\mathcal{F}_0\subseteq\mathcal{F}_1$, minimization over $\mathcal{F}_1$ cannot produce larger optimal risk than minimization over $\mathcal{F}_0$. Therefore
\[
\widehat{R}(f_1)=\min_{f\in\mathcal{F}_1}\widehat{R}(f)\le\min_{f\in\mathcal{F}_0}\widehat{R}(f)=\widehat{R}(f_0).
\]
By definition, empirical bridge gain is $\Gamma=\widehat{R}(f_0)-\widehat{R}(f_1)$, hence $\Gamma\ge 0$.
\end{proof}

\begin{theorem}[One-sided acceptance safety]
\label{thm:acceptance}
Assume A6: i.i.d. bounded gains $Y_i\in[0,1]$ with mean $\mu$. Then, with probability at least $1-\delta$ over the sample, acceptance under \eqref{eq:acceptance_pred} implies $\mu>\kappa_C$. Equivalently, if $\mu\le\kappa_C$, then $\Pr(\mathcal{A}_{\delta}=1)\le\delta$.
\end{theorem}
\begin{proof}
Hoeffding's inequality gives
\[
\Pr\!\left(\mu\ge \widehat{\Delta}-\sqrt{\frac{\log(1/\delta)}{2n}}\right)\ge 1-\delta.
\]
Define $L_\delta:=\widehat{\Delta}-\sqrt{\log(1/\delta)/(2n)}$. If $\mathcal{A}_{\delta}=1$, then by \eqref{eq:acceptance_pred}, $L_\delta>\kappa_C$. Hence, on the event $\{\mu\ge L_\delta\}$, acceptance gives $\mu\ge L_\delta>\kappa_C$, and therefore
\[
\Pr\!\left(\mathcal{A}_{\delta}=0\ \text{or}\ \mu>\kappa_C\right)\ge\Pr(\mu\ge L_\delta)\ge 1-\delta.
\]
In particular, when $\mu\le\kappa_C$ the event $\{\mathcal{A}_{\delta}=1\}$ has probability at most $\delta$. Hence acceptance implies a confidence-qualified practical gain.
\end{proof}

These theorems provide conditional guarantees. They do not supersede empirical checks for assumption violations. In particular, if A1 fails (empty feasible set) or A6 fails (heavy-tailed or non-i.i.d. gains), the relevant theorem cannot be used as stated and must be replaced by alternative diagnostics.

\section{Experimental Protocol}
\label{sec:protocol}

\subsection{Datasets and Benchmark Families}
We evaluate on benchmark slices that collectively stress planning, tool invocation, and long-horizon execution. The design combines SymbolicAI's benchmark context with interactive-agent and software-task settings to avoid single-regime overfitting \citep{src_b2199aad751f,src_2d30e7648342,src_1bb96f4eca55,src_e0cb42ccf284}. This mixed design reflects the project objective: test whether constrained routing and metric bridging generalize across heterogeneous native objectives instead of optimizing only one benchmark family.

\subsection{Baselines and Ablations}
Baseline families include fixed-route SymbolicAI, OR-Toolformer-style routing, ReAct, Tree-of-Thoughts, Graph-of-Thoughts, LATS, and additional bridge and fallback ablations.
The comparisons are matched under CPU-envelope accounting to isolate policy quality from unconstrained compute expansion. We also include bridge ablations that remove completion or validity features, because those terms are central to our metric-closure hypothesis. This protocol is aligned with recommendations for comparator diversity and failure-mode reporting in recent agent-evaluation surveys \citep{src_2dddcffecec1,src_a572df70e757}.

\subsection{Metrics, Uncertainty, and Claim Tests}
Primary optimization metrics are joint objective $J$, native success rate, route cost, and validity rate. Bridge metrics include $\Gamma$, AUROC, and expected calibration error (ECE). Drift and uncertainty diagnostics include robust success under perturbations, invalid-call rate, acceptance false-positive rate, and critical-error rate. We aggregate across seeds using bootstrap summaries and evaluate theorem obligations through symbolic checks and counterexample tables. This claim structure intentionally binds each major claim to explicit evidence artifacts and caveats.

\subsection{Compute Budget and Reproducibility Design}
The current run is CPU-only, with per-experiment envelopes and five fixed seeds $\{11,23,37,73,101\}$. The global budget cap is still unresolved, so all conclusions are reported as conditional on the tested envelope rather than as universal scaling claims. Reproducibility instrumentation includes provider/version controls, deterministic config manifests, and symbolic audit outputs. This design follows repository-level reproducibility practices from benchmark and framework companions \citep{src_2d30e7648342,src_182ca0e3292b}.

\section{Results}
\label{sec:results}

\subsection{Router Objective, Feasibility, and Regret Behavior}
\Figref{fig:router_panels} and Table~\ref{tab:router_objective} evaluate the constrained routing claim.
The proposed router attains a higher mean joint objective than the comparators (0.739 versus 0.701 for fixed-route SymbolicAI and 0.688 for OR-style routing), which gives directional support for constrained improvement under the tested envelope. The panelized figure further shows that feasibility degrades at stricter validity thresholds with low CPU budgets, which indicates that claim strength must remain conditional on A1-like feasibility.

\begin{figure}[t]
\centering
\includegraphics[width=0.68\linewidth]{figures/iter_1/router_panels_iter1.pdf}
\caption{This figure summarizes constrained routing behavior across objective, feasibility, and drift axes. The left panel compares joint objective across budget tiers, the middle panel shows the feasibility frontier over CPU budget and validity threshold, and the right panel reports dynamic-regret proxy behavior under drift perturbations. The key interpretation is that the proposed router improves objective quality while preserving feasibility in broad but not universal regions, so optimality language must be scoped to feasible regimes rather than stated globally.}
\label{fig:router_panels}
\end{figure}

\begin{table}[t]
\centering
\small
\renewcommand{\arraystretch}{1.1}
\setlength{\tabcolsep}{4pt}
\input{tables/iter_1/router_objective_comparison.tex}
\caption{This table reports mean joint objective values for the routing baselines under matched CPU-envelope settings. The ranking supports the constrained-routing claim in aggregate, but it does not by itself certify universal dominance because feasibility and drift assumptions still govern when the formal guarantees apply.}
\label{tab:router_objective}
\end{table}

\subsection{Metric Bridging and Calibration Reliability}
\Figref{fig:bridge_panels} and Table~\ref{tab:bridge_results} address the metric-closure question. The full bridge model shows higher mean bridge gain and stronger calibration than distance-only controls (mean $\Gamma=0.060$, AUROC $=0.742$, ECE $=0.118$).
This result indicates that trajectory distance alone is informative but incomplete, and that completion and validity channels carry significant additional signal for native outcomes. Importantly, the disagreement panel shows residual error clusters, which means the bridge improves alignment but does not remove all mismatch modes. \begin{figure}[t] \centering \includegraphics[width=0.68\linewidth]{figures/iter_1/metric_bridge_panels_iter1.pdf} \caption{This figure reports bridge calibration behavior with three complementary views: bridge gain by model family, calibration error by model family, and disagreement structure across benchmark strata. The combined interpretation is that the full bridge improves both discriminative and calibration behavior relative to distance-only controls, while the disagreement atlas identifies remaining high-risk regions associated with invalid tool calls and incomplete plans. This panelized presentation is essential because aggregate gain alone can hide systematic disagreement clusters.} \label{fig:bridge_panels} \end{figure} \begin{table}[t] \centering \small \renewcommand{\arraystretch}{1.1} \setlength{\tabcolsep}{4pt} \input{tables/iter_1/metric_bridge_results.tex} \caption{This table summarizes bridge gain, discrimination, and calibration across full and ablated models. The full bridge dominates the distance-only and native-only controls in this run, supporting the metric-bridging claim as an empirical relation rather than an unconditional theorem about all future benchmark families.} \label{tab:bridge_results} \end{table} \subsection{Drift Robustness and Uncertainty-Qualified Acceptance} \Figref{fig:uncertainty_panels} and Table~\ref{tab:uncertainty_results} evaluate cross-cut robustness. Symbolic fallback achieves the strongest drift-robust success with lower invalid-call and critical-error rates than static alternatives in the tested perturbation tiers. 
Acceptance false-positive rates are low in this run, but symbolic boundary checks still show unresolved monotonicity obligations under certain parameter regimes, so acceptance guarantees are reported with explicit bounded-assumption caveats. \begin{figure}[t] \centering \includegraphics[width=0.68\linewidth]{figures/iter_1/uncertainty_panels_iter1.pdf} \caption{This figure integrates robust success trends, acceptance behavior, and critical-error dynamics under increasing drift tiers. The left and right panels show that symbolic fallback preserves a better performance-safety tradeoff than static policies in this experimental envelope, while the center panel demonstrates that stricter confidence settings produce more conservative acceptance decisions. Together, these panels support robustness gains while also clarifying that acceptance sensitivity depends on confidence and distributional assumptions.} \label{fig:uncertainty_panels} \end{figure} \begin{table}[t] \centering \small \renewcommand{\arraystretch}{1.1} \setlength{\tabcolsep}{4pt} \input{tables/iter_1/uncertainty_acceptance_table.tex} \caption{This table reports drift-robust success, invalid API call rate, acceptance false positives, and critical-error rate by baseline. The aggregate pattern favors symbolic fallback, but the table is interpreted jointly with symbolic boundary checks because acceptance claims rely on bounded assumptions that can fail under heavy-tail or non-stationary conditions.} \label{tab:uncertainty_results} \end{table} \subsection{Claim-Evidence Closure} The evidence profile is hybrid and asymmetric. Empirical evidence is strongest: all three main result tables and the first three multi-panel figures support the direction of core claims in the tested envelope. Formal evidence is partial: symbolic obligations pass in most cases but not all, with unresolved failures in one router-related and one acceptance-related obligation. 
Counterexample infrastructure correctly detects violated assumptions for feasibility, drift envelope, and heavy-tail gain conditions. Therefore, we report claim support as \emph{supported}, \emph{supported}, and \emph{mixed} for routing, bridge, and uncertainty cross-cut claims, respectively. \section{Ablation, Contradiction Resolution, and Sensitivity Analyses} \label{sec:ablation_breakdown} \subsection{Resolving the Metric-Alignment Contradiction} One of the central contradictions from upstream literature phases was whether trajectory-level similarity can serve as a reliable proxy for benchmark-native success. Our results suggest a precise answer: trajectory distance is useful as a first-order signal, but by itself it is under-specified for cross-benchmark closure. This is visible in Table~\ref{tab:bridge_results}, where distance-only variants remain above native-only controls in some metrics but are consistently weaker than the full bridge that integrates validity and completion channels. The practical interpretation is that the bridge is not merely a calibration tweak; it is a structural correction for heterogeneous benchmark semantics. This conclusion aligns with benchmark literature that reports objective heterogeneity as a recurring source of misinterpretation in agent evaluation \citep{src_1bb96f4eca55,src_e0cb42ccf284,src_2dddcffecec1}. It also aligns with tool-learning literature showing that argument correctness and completion state carry substantial explanatory power beyond coarse similarity or retrieval-only confidence proxies \citep{src_e2afe39a9d1b,src_11400d800724,src_a572df70e757}. In other words, our bridge result does not claim that trajectory signals are weak; it claims they are incomplete for heterogeneous outcomes unless operational success indicators are modeled jointly. 
An important byproduct of this conclusion is methodological: disagreements should be treated as scientific objects, not error bars to be compressed into a single aggregate score. The disagreement atlas in \figref{fig:bridge_panels} provides evidence for this claim by localizing systematic mismatch around invalid tool calls and incomplete plans. These regimes are exactly where prior work warns that agent pipelines can look superficially coherent while failing on action-level correctness \citep{src_82d98f4e68f9,src_2dddcffecec1}. A reporting pipeline that omits this decomposition would likely overstate metric closure. \subsection{Baseline Lineage and Causal Attribution} The baseline design in this study is intentionally broad because causal claims about routing quality are fragile under narrow comparators. ReAct, Tree-of-Thoughts, Graph-of-Thoughts, and LATS represent distinct reasoning-control lineages \citep{src_0baffed4a75c,src_d2c123e2329d,src_3f9105492cf9,src_a8ab719bfa87}; Toolformer and Gorilla capture API-grounded tool-use behavior \citep{src_e2afe39a9d1b,src_11400d800724}; fixed-route SymbolicAI and OR-style tool routing stress the optimization framing most directly \citep{src_b2199aad751f,src_4a4f020dcb77}. By combining these families under matched CPU accounting, we reduce the risk that gains are artifacts of one comparator design choice. Even with broad baselines, causal attribution remains conditional. Our routing gains could reflect multiple mechanisms: better route allocation, better feasibility handling, or improved uncertainty penalties. The method design addresses this by coupling objective and feasibility views in the same figure and by requiring negative-case containers in the appendix. However, negative-case coverage in this iteration is still limited to selected stress slices, which weakens fine-grained attribution. 
Accordingly, we avoid claiming complete mechanism isolation and instead report mechanism-consistent evidence with explicit residual ambiguity. This conservative stance matches recommendations from recent survey work on agent evaluation and tool learning, which repeatedly emphasizes that uncontrolled protocol differences can masquerade as method superiority \citep{src_2dddcffecec1,src_a572df70e757}. In practice, the right standard is not ``one headline metric plus one baseline,'' but a comparator lineage that spans planning, tool-use, and orchestration families under harmonized constraints. \subsection{Sensitivity to Budget, Feasibility, and Drift} The feasibility frontier in \figref{fig:router_panels} is crucial for interpreting optimization claims. As $\tau_{\text{valid}}$ tightens and $B_{\text{cpu}}$ shrinks, feasible policy mass decreases. This behavior is expected from \eqref{eq:feasible_set}, because tightening either constraint can only remove policies from the feasible set; it is not a failure of the optimization algorithm. The scientific implication is that constrained-improvement claims should be conditioned on feasible operating regions, not extrapolated across infeasible corners. Drift sensitivity adds a second layer: even when feasibility holds, non-stationary tools and provider shifts can change objective landscapes. The uncertainty panel and counterexample outputs show that stress regimes can trigger boundary behaviors that are invisible in average-case tables. This is consistent with domain findings where retrieval-augmented methods improve means but still exhibit brittle behavior under distribution or interface shifts \citep{src_d784c0f0951e,src_bf6ff4aab9fe,src_745c8e50b70a,src_5e2fa6bd5e8f}. For this reason, our dynamic-regret interpretation remains explicitly conditional on drift-envelope assumptions rather than presented as blanket temporal robustness. An additional sensitivity axis concerns reporting granularity.
Mean objective improvements are meaningful, but policy-selection decisions in practice often hinge on tail behavior and failure concentration. Our appendix structure intentionally separates obligation failures, counterexamples, and failure-case tables so that readers can audit both central tendency and adverse regimes. This design choice follows high-stakes evaluation lessons: mean performance without critical-error visibility is insufficient for deployment-facing inference \citep{src_139ff817eb0e,src_5636f52395f6,src_888b1eb13315}. \subsection{Uncertainty, Acceptance, and False-Positive Control} The acceptance predicate in \eqref{eq:acceptance_pred} provides a practical control knob, but it is not a free safety certificate. Its validity depends on bounded and sufficiently well-behaved gain distributions (A6). Symbolic checks and counterexample generation are therefore integral to interpretation, not peripheral technicalities. In this run, acceptance false positives are low for the tested tiers, but unresolved symbolic monotonicity and heavy-tail counterexample triggers mean that conservative language remains mandatory. This distinction is particularly important when comparing with mean-only reporting pipelines. Mean-only summaries can appear stable even when confidence-qualified acceptance should fail under plausible distributional changes. The requirement to provide both acceptance metrics and boundary-condition evidence directly addresses this risk. In short, uncertainty qualification should be evaluated as a first-class algorithmic component with explicit failure criteria, rather than as an appendix-only confidence interval. 
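
\paragraph{Illustrative acceptance radius (sketch under assumptions).} To make the dependence on sample size and confidence level concrete, consider a Hoeffding-style one-sided check. This is a sketch of the general pattern and not a restatement of \eqref{eq:acceptance_pred}, which remains authoritative. Assume $n$ i.i.d.\ gains $g_1,\dots,g_n$ bounded in $[a,b]$ with empirical mean $\bar{g}_n$, and let $\gamma_{\min}$ denote a practical gain threshold; the symbols $a$, $b$, $\bar{g}_n$, $\gamma_{\min}$, and $r_n(\delta)$ are introduced here for illustration only. A one-sided lower confidence bound then gives the rule
\[
\text{accept} \iff \bar{g}_n - r_n(\delta) \ge \gamma_{\min},
\qquad
r_n(\delta) = (b-a)\sqrt{\frac{\ln(1/\delta)}{2n}} .
\]
Under these assumptions, if the true mean gain lies below $\gamma_{\min}$, the rule accepts with probability at most $\delta$. With the hypothetical values $b-a=1$, $\delta=0.05$, and $n=200$, the radius is $r_n(\delta)=\sqrt{\ln(20)/400}\approx 0.087$; quadrupling the sample to $n=800$ halves it to roughly $0.043$, while tightening to $\delta=0.01$ at $n=200$ raises it to about $0.107$. The bound offers no such control when boundedness or independence fails, which is the heavy-tail caveat attached to A6 above. The monotone behavior of this simple radius also does not settle the open monotonicity obligation, which concerns the predicate as actually defined.
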
\section{Practical Implications and Generalization Scope} \label{sec:practical_scope} \subsection{Operational Guidance for SymbolicAI-Style Systems} The most immediate practical takeaway is that constrained adaptive routing is useful when it is deployed with three safeguards: feasibility auditing, metric bridging, and uncertainty-qualified acceptance. Without feasibility auditing, optimization routines can report impressive objective values in regions that are operationally invalid. Without metric bridging, trajectory-quality gains can misrepresent native outcome quality. Without uncertainty-qualified acceptance, deployment decisions can overfit mean improvements and underweight adverse conditions. For practitioners building SymbolicAI-style pipelines, this translates to a concrete workflow: first, define the feasible envelope and verify that candidate policies satisfy budget and validity constraints; second, fit and monitor a bridge model that explicitly relates trajectory and native outcomes; third, gate acceptance through confidence-qualified predicates and counterexample checks. This workflow is more demanding than single-score evaluation, but it is also more robust to the exact contradictions documented across prior benchmark studies. The workflow also clarifies module responsibilities in a way that supports maintainability. Route optimization is responsible for tradeoff navigation under constraints. Bridge calibration is responsible for objective closure across heterogeneous benchmarks. Acceptance auditing is responsible for confidence-qualified decision thresholds. Separating these responsibilities helps teams diagnose failures and update only the affected module rather than recalibrating an entire stack after every drift event. \subsection{Scope of Generalization Claims} Generalization claims in this manuscript are intentionally layered. At the strongest layer, we report empirical directional improvements in the executed benchmark envelope. 
At a second layer, we report conditional formal statements under explicit assumptions. At the weakest layer, we document unresolved obligations and data-provenance gaps that prevent global extrapolation. This layered approach is a direct response to the common overgeneralization pattern in agent literature, where benchmark-specific gains are often narrated as architecture-wide superiority. Our cross-benchmark setup supports moderate generalization within the tested family diversity, but it does not justify universal transfer across all domains, all tool ecosystems, or all compute regimes. In particular, unresolved global compute caps and still-limited negative-result coverage constrain external validity. The manuscript therefore frames itself as a reproducible, condition-aware contribution rather than a terminal benchmark victory statement. The same principle applies to formal claims: theorem statements are scientifically useful only when their assumptions remain operationally credible. If assumptions fail, formal language must be narrowed accordingly. This discipline is essential for honest cross-domain communication, especially when readers may interpret theorem syntax as stronger than the evidence actually permits. \subsection{Cross-Domain Relevance Beyond the Core Benchmarks} Although this work is centered on SymbolicAI workflows, the design pattern extends to other modular agent settings. Scientific workflows, software engineering agents, and clinical support pipelines all exhibit the same triad of challenges: objective mismatch, tool unreliability, and uncertainty-sensitive decisions. Recent domain studies reinforce that mean-score gains can coexist with unacceptable error profiles, especially when retrieval and orchestration layers shift faster than evaluation protocols \citep{src_d784c0f0951e,src_bf6ff4aab9fe,src_5b90080b5321,src_840da66ba21c,src_d695132510c4}. 
In these domains, our contribution should be read as a protocol template rather than a domain-specific numerical benchmark. The key transferable idea is explicit claim decomposition: every major claim should be tied to (i) a formal or algorithmic object, (ii) an executable measurement protocol, (iii) a traceable evidence artifact, and (iv) a caveat condition. This decomposition reduces ambiguity when interdisciplinary teams evaluate whether a model is fit for purpose. \subsection{Reporting Standard Suggested by This Study} Based on this run, we propose a minimal reporting standard for hybrid agent manuscripts: \begin{itemize} \item Define objective, decision variables, feasible set, and optimality criterion before any performance claims. \item Provide at least one explicit bridge mechanism when surrogate and native metrics differ. \item Report uncertainty-qualified acceptance criteria and document failure conditions under violated assumptions. \item Include both positive and negative evidence containers, with seeded records that can be queried and replayed. \item State unresolved symbolic or empirical gaps in the main text, not only in supplementary notes. \end{itemize} This standard is intentionally conservative. It increases manuscript length and effort, but it substantially improves scientific interpretability and claim-evidence closure. \section{Comparison Matrix and Contradiction Map in Narrative Form} \label{sec:comparison_narrative} \subsection{Reasoning-Control Lineages Versus Constrained Routing} The comparison matrix assembled in upstream phases distinguished two broad families: prompt-structured reasoning controllers and optimization-driven routing frameworks. Prompt-structured controllers such as ReAct, Tree-of-Thoughts, Graph-of-Thoughts, and LATS improve exploration quality but often expose unstable compute-quality frontiers under tight budgets \citep{src_0baffed4a75c,src_d2c123e2329d,src_3f9105492cf9,src_a8ab719bfa87}. 
Optimization-driven formulations such as OR-style tool routing and planner-coupled approaches provide stronger objective language but can be brittle when benchmark-native objectives are heterogeneous or when tool APIs drift \citep{src_4a4f020dcb77,src_fab8a7e15294}. Our constrained router should be interpreted as a bridge between these families rather than a replacement for either. The relaxed policy variables and feasibility constraints import optimization structure, while the benchmark-facing protocol preserves the empirical stress-testing discipline of agent benchmarks. This hybridization matters because previous contradictions in the literature often arise when one family is evaluated by the other family's assumptions. For example, search-heavy methods can look unfavorable under strict CPU caps if the protocol does not normalize branching cost, while constrained optimizers can look overly strong if evaluated only on one objective family without disagreement auditing. In this manuscript, the contradiction is addressed by explicit protocol commitments: matched CPU envelopes, heterogeneous benchmarks, and joint reporting of objective and native outcomes. The resulting evidence does not prove universal superiority of constrained routing, but it does show that conditional gains remain after controlling for several confounders that routinely invalidate narrow benchmark claims. This is a practical scientific contribution because it demonstrates a route to fairer cross-family comparisons in future agent studies. \subsection{Tool Learning, API Reliability, and the Role of Validity Signals} Tool-use literature has repeatedly shown that API-call correctness is a dominant driver of downstream task success, especially when model reasoning appears semantically plausible but syntactically invalid at the interface boundary \citep{src_e2afe39a9d1b,src_11400d800724,src_a572df70e757}. 
The contradiction map from prior phases highlighted a specific version of this issue: retrieval or tool grounding can improve means while leaving important failure strata untouched. Our bridge model addresses this by making validity and completion explicit covariates rather than post-hoc diagnostics. The practical effect is visible in the gap between distance-only and full bridge variants. If trajectory distance alone captured all relevant quality information, additional operational features should offer limited gains. Instead, the observed gain and calibration improvements indicate that a substantial part of benchmark-native variance is mediated by operational factors that trajectory distance does not encode directly. This finding is consistent with benchmarks that report error concentrations in action execution and environment interaction rather than pure reasoning coherence \citep{src_1bb96f4eca55,src_4483461cffa2,src_01fdaa0fb2ad}. This perspective has implications beyond the present study. Future benchmark design should treat tool-validity observables as first-class evaluation channels, and method papers should avoid conflating semantic trajectory quality with action-level reliability. In manuscript terms, this means ``better trajectory metric'' and ``better benchmark-native utility'' should be treated as related but distinct claims, each requiring direct evidence. \subsection{Formal Guarantees, Symbolic Audits, and Failure Semantics} Formal statements in agent papers are increasingly common, but the contradiction map indicates that theorem language often outruns executable assumption checks in practical pipelines \citep{src_db5eb4358891,src_69018508daf9,src_865b0aa37ef4}. We intentionally avoid that pattern by pairing every theorem-level claim with symbolic obligations and explicit failure semantics. The key methodological point is that symbolic audits are not just verification accessories; they determine which statements remain admissible. 
In this run, most obligations pass, but two do not. Rather than burying this in supplementary material, we propagate the consequence into main-text claim strength and limitation language. This practice directly improves interpretability: readers can distinguish proved properties, empirically supported tendencies, and open formal obligations. It also prevents a common failure mode where theorem statements are interpreted as global deployment guarantees despite unresolved assumption checks. Failure semantics are equally important. Counterexample detection under A1, A3, and A6 violations gives the study a falsification pathway rather than a success-only narrative. In scientific terms, this expands the paper from a benchmark report to a constrained explanatory model: when assumptions hold, certain guarantees are admissible; when assumptions fail, guarantees must be replaced by fallback analyses. This conditional structure is more demanding but also more honest. \subsection{Domain-Transfer Contradictions and High-Stakes Interpretation} The final contradiction axis concerns cross-domain transfer. Domain papers in healthcare and biomedical settings show a recurring pattern: retrieval or orchestration gains in aggregate metrics can coexist with persistent critical-error risk and uncertainty miscalibration \citep{src_d784c0f0951e,src_bf6ff4aab9fe,src_888b1eb13315,src_5636f52395f6,src_139ff817eb0e}. This pattern warns against translating benchmark gains directly into deployment claims. Our manuscript responds by separating three levels of transfer interpretation. First, we claim benchmark-level directional gains in the executed envelope. Second, we claim conditional formal properties where assumptions and symbolic obligations allow. Third, we explicitly do not claim high-stakes readiness because unresolved obligations, still-limited negative-case coverage, and open budget calibration create real uncertainty. 
This layered interpretation is intentionally conservative and should be seen as a template for cross-domain reporting. More broadly, the contradiction-map narrative suggests that future work should evaluate transfer by stress-tested claim bundles rather than by single aggregate scores. A transfer claim should specify: which assumptions are imported, which metrics are bridged, which uncertainty mechanism is used, and which failure strata remain open. Without this structure, cross-domain comparisons will continue to overstate certainty and under-report risk. \section{Discussion} \label{sec:discussion} A central outcome of this study is methodological rather than purely numerical: trajectory-quality optimization, metric bridging, and uncertainty auditing must be treated as one system. Optimizing any one component in isolation can create false confidence. For example, stronger routing objective values can coexist with degraded validity under strict budgets; improved bridge gain can coexist with disagreement clusters; and low acceptance false positives can coexist with unresolved symbolic monotonicity checks. This is why we represent conclusions as conditional claim bundles rather than a single global score. The broader implication for LLM-agent research is that mixed-mode evaluation is not optional for compositional systems. Benchmark-native metrics and trajectory surrogates answer different scientific questions. In practice, this means algorithmic papers should include explicit bridge logic when surrogate metrics are central, and formal papers should include executable assumption audits when theorem claims influence deployment decisions. Recent work across planning, tool-use, and domain transfer already suggests this need, but often in separate communities \citep{src_fab8a7e15294,src_a8ab719bfa87,src_e2afe39a9d1b,src_11400d800724,src_d784c0f0951e,src_bf6ff4aab9fe}. 
For SymbolicAI specifically, our results support a practical reading: constrained adaptive routing is promising under CPU-limited settings, but only if metric closure and uncertainty checks are integrated into the same reporting loop. The framework's modularity is an advantage here because it enables explicit separation between route optimization, bridge calibration, and acceptance auditing, which in turn makes failure provenance easier to localize. \subsection{Decision-Theoretic Interpretation of the Hybrid Objective} The hybrid objective in \eqref{eq:router_objective} can be interpreted as a constrained decision-theoretic contract between accuracy, cost, and epistemic caution. From this perspective, the route policy is not simply optimizing quality; it is allocating limited computational and tool-interaction budget across uncertain trajectories. This matters because many contemporary agent evaluations implicitly assume that more search or more tool calls are always beneficial. Our results and feasibility frontiers suggest otherwise: under strict budgets, additional exploration can quickly become infeasible or even counterproductive unless guided by explicit constraints. A useful way to read the objective is as a policy-level utility decomposition. The trajectory term rewards semantic alignment with reference behavior, the native loss term enforces benchmark-grounded utility, the route-cost term penalizes operational burden, and the uncertainty term discourages brittle overcommitment. If any one term is removed, the policy can optimize an incomplete target. For example, removing the uncertainty term can produce fragile gains that do not survive drift; removing the native loss term can overfit trajectory resemblance; removing the route-cost term can hide impractical compute behavior. This decomposition therefore provides an interpretable control surface for practitioners who must tune policies under real deployment constraints.
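
\paragraph{Schematic reading of the decomposition (assumed notation).} As a reading aid, the four roles described above can be written in a schematic scalarized form. The symbols below are placeholders introduced for this sketch and are not the notation of \eqref{eq:router_objective}, which remains the authoritative definition; in particular, the nonnegative weights $w_{\mathrm{traj}}, w_{\mathrm{nat}}, w_{\mathrm{cost}}, w_{\mathrm{unc}}$, the generic policy symbol $\pi$, and the sign convention (larger is better) are assumptions.
\[
U(\pi) \;=\; w_{\mathrm{traj}}\,Q_{\mathrm{traj}}(\pi) \;-\; w_{\mathrm{nat}}\,L_{\mathrm{nat}}(\pi) \;-\; w_{\mathrm{cost}}\,C_{\mathrm{route}}(\pi) \;-\; w_{\mathrm{unc}}\,R_{\mathrm{unc}}(\pi),
\qquad \pi \in \mathcal{F}.
\]
Under this reading, each ablation mentioned above corresponds to setting one weight to zero: $w_{\mathrm{unc}}=0$ removes the penalty on brittle overcommitment, $w_{\mathrm{nat}}=0$ leaves only trajectory resemblance and cost, and $w_{\mathrm{cost}}=0$ makes compute-heavy routes look free. In this sketch, restricting $\pi$ to the feasible set $\mathcal{F}$ of \eqref{eq:feasible_set} keeps the budget and validity constraints outside the weighted sum, so no choice of weights can trade them away.
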
The decision-theoretic reading also clarifies why conditional claims are a feature rather than a weakness. In safety-critical or cost-limited settings, unconditional claims based on average gains are often misleading. Conditional claims tied to assumptions and feasibility can be directly operationalized as decision rules: if assumptions hold and feasibility remains above threshold, deploy policy A; if assumptions fail, switch to fallback or reject deployment. This creates a more auditable chain from research evidence to operations policy. \subsection{Benchmark Governance and Reporting Implications} A second implication concerns benchmark governance. Agent benchmarks are evolving rapidly, but governance norms for cross-metric reporting remain inconsistent. Some studies optimize benchmark-native success without trajectory introspection; others emphasize trajectory coherence without robust native outcome grounding; still others report means without uncertainty-qualified decision criteria. The contradiction map from this project suggests that these choices are not harmless stylistic differences: they materially change what conclusions are scientifically defensible. We therefore argue that benchmark governance should include explicit requirements for metric provenance and claim traceability. Metric provenance means that each reported scalar can be traced to a clear observational pipeline, with known dependence on tool validity, completion logic, and perturbation settings. Claim traceability means that each headline claim links to concrete artifacts (tables, figures, proofs, or caveats) rather than relying on broad narrative interpretation. These requirements reduce ambiguity for reviewers and downstream users, especially in interdisciplinary domains where methodological assumptions are not shared by default. There is also a governance role for negative-result infrastructure. In many benchmark ecosystems, negative examples are underreported or reported only narratively. 
Our run shows why this is problematic: even when directional gains are strong, limited negative-case coverage can leave unresolved ambiguity about edge-case reliability. Requiring structured negative-result tables and machine-readable failure logs would improve reproducibility and reduce optimism bias in model comparisons. Finally, governance should treat uncertainty reporting as mandatory for deployment-facing claims. Confidence intervals are necessary but not sufficient; acceptance predicates, boundary checks, and counterexample protocols should be explicitly documented when claims affect practical decision thresholds. This recommendation is consistent with emerging high-stakes evaluation guidance and helps align agent benchmarking with broader scientific standards for evidence quality under uncertainty. \section{Limitations and Future Work} \label{sec:limitations} This manuscript has five concrete limitations that bound interpretation. First, two symbolic obligations remain unresolved (one in the routing simplex/subgradient checkpoint and one in acceptance monotonicity), so formal guarantees are incomplete and must stay conditional. Second, claim-evidence payload hygiene in upstream validation artifacts is incomplete, requiring manual cross-linking for manuscript traceability. Third, negative-result logs are now populated with structured records, but coverage is still limited to selected threshold-miss and disagreement slices, which leaves failure-case granularity incomplete. Fourth, the global compute budget remains open (no global cap has been fixed), so we cannot claim budget-invariant superiority outside the tested CPU envelopes. Fifth, optional external benchmark manifests and checksum bundles were not fully materialized in this run, limiting strict provenance portability.
These limits directly constrain the conclusions: we can claim directional constrained improvement and bridge/calibration benefits for the executed setup, but we cannot claim universal optimality, universal monotonic uncertainty behavior, or final deployment readiness across all drift regimes. \subsection{Future Work} Three follow-up tracks are necessary for closure. The first is formal: close the open router simplex/subgradient and acceptance-monotonicity checks, or narrow theorem statements so each formal claim has full symbolic closure. The second is empirical: populate richer negative-result logs and disagreement taxonomies, especially for heavy-drift and low-budget slices where conditional failures emerge. The third is operational: fix a global compute cap and rerun sensitivity analyses so claims can be reported against a finalized cost envelope. Additional domain-focused transfer studies in medical and scientific workflows should include critical-error auditing from the outset \citep{src_139ff817eb0e,src_5636f52395f6,src_888b1eb13315,src_d695132510c4}. \section{Conclusion} This paper presented a hybrid, condition-aware methodology for SymbolicAI workflow evaluation under CPU-only constraints. The technical contribution is not a claim of universal dominance; it is a reproducible closure pattern linking constrained routing optimization, trajectory-to-native metric bridging, uncertainty-qualified acceptance, and explicit theorem-audit caveats. Empirical evidence supports directional gains in objective quality, calibration, and drift robustness for the executed benchmark envelope. Formal analysis supports key statements under explicit assumptions, while unresolved obligations are surfaced rather than hidden.
We therefore conclude that constrained adaptive routing plus bridge-aware evaluation is a defensible and practically useful direction for SymbolicAI-style systems, provided that feasibility, uncertainty, and counterexample boundaries remain first-class reporting objects. \clearpage \phantomsection \label{sec:end_of_main} \bibliographystyle{conference} \bibliography{references} \appendix \clearpage \phantomsection \label{sec:appendix_start} \section{Extended Formal Material} \subsection{Symbol Glossary and Equation Provenance} Table~\ref{tab:symbols_appendix} records symbol meanings used throughout \secref{sec:problem_setting} and \secref{sec:method}. The provenance column separates borrowed conventions from manuscript-defined quantities so that readers can trace where each formal component originates. \begin{table}[h] \centering \small \renewcommand{\arraystretch}{1.1} \setlength{\tabcolsep}{4pt} \begin{tabular}{p{0.14\linewidth}p{0.27\linewidth}p{0.49\linewidth}} \toprule Symbol & Meaning & Provenance \\ \midrule $G=(V,E)$ & typed workflow graph & adapted from SymbolicAI framework conventions \\ $\mathcal{D}(\mathbb{P}_{\text{gen}},\mathbb{P}_{\text{ref}})$ & trajectory-level distance primitive & adapted from VERTEX-style trajectory evaluation framing \\ $J(\vz,\pi_{\text{tool}})$ & composite constrained objective in \eqref{eq:router_objective} & defined in this manuscript \\ $\mathcal{F}$ & feasible policy set in \eqref{eq:feasible_set} & defined in this manuscript \\ $\Gamma$ & bridge gain in \eqref{eq:gamma_def} & defined in this manuscript \\ $\mathcal{A}_{\delta}$ & acceptance predicate in \eqref{eq:acceptance_pred} & defined in this manuscript \\ $V_T$ & comparator path variation for dynamic regret & standard online convex optimization background \\ \bottomrule \end{tabular} \caption{This glossary consolidates symbol definitions and provenance for the manuscript's formal core. 
The table is intentionally explicit about manuscript-defined versus adapted quantities to avoid conflating prior conventions with new formal contributions.} \label{tab:symbols_appendix} \end{table} \subsection{Additional Proof Notes} \paragraph{On Theorem~\ref{thm:existence}.} The existence argument requires nonempty feasibility and continuity. In practice, nonemptiness is the operationally fragile condition; if strict $(B_{\text{cpu}},\tau_{\text{valid}})$ settings eliminate all feasible policies, optimization is still computationally executable but the theorem statement no longer applies. This is why feasibility-frontier reporting in \figref{fig:router_panels} is part of claim interpretation rather than supplementary decoration. \paragraph{On Theorem~\ref{thm:regret}.} The dynamic-regret expression in \eqref{eq:dynamic_bound} is useful because it separates static complexity, stochastic optimization noise, and non-stationarity through $V_T$. However, an unresolved symbolic checkpoint in the relaxed simplex/subgradient validation remains open. We therefore treat the bound as conditionally informative and avoid global guarantee language. \paragraph{On Theorem~\ref{thm:acceptance}.} Acceptance safety in \eqref{eq:acceptance_pred} relies on bounded i.i.d.\ gains. The symbolic counterexample table identifies heavy-tailed regimes where this assumption fails, in which case empirical-Bernstein-style alternatives should replace Hoeffding-style bounds. This is a methodological requirement, not a post-hoc preference. \section{Extended Evidence and Negative Results} \subsection{Symbolic Obligation Audit Summary} Table~\ref{tab:obligation_matrix_appendix} summarizes the obligation outcomes used in this manuscript.
\begin{table}[h] \centering \small \renewcommand{\arraystretch}{1.1} \setlength{\tabcolsep}{4pt} \begin{tabular}{p{0.38\linewidth}cp{0.48\linewidth}} \toprule Symbolic Audit Check & Status & Interpretation \\ \midrule Router simplex check & fail & symbolic checkpoint unresolved \\ Router regret non-negativity & pass & decomposition non-negativity verified \\ Router static boundary case & pass & static-comparator behavior verified \\ Bridge class inclusion & pass & inclusion property verified \\ Bridge gain derivation & pass & algebraic consistency verified \\ Bridge regularization monotonicity & pass & split-consistent check verified \\ Acceptance predicate rearrangement & pass & predicate algebra verified \\ Acceptance monotonicity in $n,\delta$ & fail & monotonicity remains unresolved \\ Acceptance finite-sample boundary & pass & boundary handling verified \\ \bottomrule \end{tabular} \caption{This table provides the symbolic obligation status used to scope theorem language in the main text. The two failed obligations are explicitly surfaced so that formal claims remain conditional and readers can distinguish verified algebraic structure from open proof-audit tasks.} \label{tab:obligation_matrix_appendix} \end{table} \begin{figure}[h] \centering \includegraphics[width=0.65\linewidth]{figures/iter_1/theorem_audit_iter1.pdf} \caption{This appendix figure visualizes confidence-radius shrinkage with increasing sample size in the uncertainty audit workflow. The plot provides an interpretable boundary check for acceptance behavior and complements the symbolic obligation table by showing where finite-sample conservatism changes materially with $n$. The figure supports conditional use of acceptance guarantees rather than unconditional deployment claims.} \label{fig:theorem_audit} \end{figure} \subsection{Failure-Case Tables} The next two tables provide concrete failure cases extracted from threshold-miss routing slices and high-divergence bridge episodes. 
They are not exhaustive, but they convert prior placeholder artifacts into queryable evidence for falsification follow-up.

\begin{table}[h]
\centering
\small
\renewcommand{\arraystretch}{1.1}
\setlength{\tabcolsep}{4pt}
\input{tables/iter_1/router_objective_failure_cases.tex}
\caption{This table reports routing slices where at least one planned improvement threshold is missed under strict budget-validity settings. These rows bound the constrained-improvement claim and identify where fallback analyses remain necessary.}
\label{tab:router_failures_appendix}
\end{table}

\begin{table}[h]
\centering
\small
\renewcommand{\arraystretch}{1.1}
\setlength{\tabcolsep}{4pt}
\input{tables/iter_1/metric_bridge_negative_cases.tex}
\caption{This table reports high-divergence bridge episodes with invalid-tool or incomplete-plan failure tags. The rows identify disagreement strata that remain difficult even when aggregate bridge metrics are favorable.}
\label{tab:bridge_negatives_appendix}
\end{table}

\section{Reproducibility and Implementation Details}

\subsection{Seeds, Sweeps, and Statistical Procedure}
All core experiments were run with seeds $\{11,23,37,73,101\}$ and budget-validity sweeps over $B_{\text{cpu}}\in\{60,120,240\}$ and $\tau_{\text{valid}}\in\{0.80,0.90,0.95\}$. Bridge analyses used bootstrap replicate sweeps and protocol-stratified transfer checks. Drift analyses used perturbation tiers and confidence-level sweeps to characterize acceptance sensitivity. These settings are sufficient for directional comparisons but not for final scaling-law claims.

\subsection{Compute and Environment Constraints}
The experimental package was executed under CPU-only conditions, with bounded per-run core-hour envelopes. This constraint is scientifically relevant because some planning-heavy baselines can trade substantially higher compute for quality. We therefore report cost-aware comparisons and avoid unrestricted-depth baselines.
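As a rough size check on the sweep described above under Seeds, Sweeps, and Statistical Procedure, and assuming a full factorial crossing of seeds and budget-validity settings (an assumption; the crossing scheme is not stated explicitly), the number of router runs per method and benchmark slice would be
\[
|\{11,23,37,73,101\}| \times |\{60,120,240\}| \times |\{0.80,0.90,0.95\}| = 5 \times 3 \times 3 = 45 .
\]
Bootstrap replicates and drift perturbation tiers would multiply this count further; their sizes are not specified here, so no total is given.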
\subsection{Symbolic Reproducibility Details}
Symbolic checks evaluate theorem obligations, boundary conditions, and counterexample detection under violations of assumptions A1, A3, and A6. Boundary checks include small-$n$ and high-$\delta$ settings for acceptance auditing. Counterexample detection is treated as a required output, not an optional stress test, because it determines whether theorem language can remain strong or must be caveated.

\subsection{Code Artifact Summary}
The implementation stack contains dedicated components for experiment orchestration, simulation, plotting, and symbolic audits. This modular structure mirrors the manuscript decomposition: optimization, metric bridging, uncertainty evaluation, and formal-check auditing are implemented as distinct modules so that claim-level failures can be localized. Future iterations of the release package should increase negative-result logging density and tighten claim-evidence metadata closure.

\end{document}