% This file was adapted from ICLR2022_conference.tex example provided for the ICLR conference \documentclass{article} % For LaTeX2e \usepackage{conference,times} \usepackage{easyReview} \usepackage{algorithm} \usepackage{algorithmic} % Optional math commands from https://github.com/goodfeli/dlbook_notation. \input{math_commands.tex} \usepackage{amsthm,amssymb} \graphicspath{{figures/}} \newtheorem{theorem}{Theorem}[section] \newtheorem{corollary}{Corollary}[theorem] \newtheorem{lemma}[theorem]{Lemma} \newtheorem{definition}[theorem]{Definition} % Please leave these options as they are \usepackage{hyperref} \hypersetup{ colorlinks=true, linkcolor=red, filecolor=magenta, urlcolor=blue, citecolor=purple, pdftitle={Simulability-Aware Quantum Reservoir Computing for Image Classification under Matched Readout Fairness}, pdfpagemode=FullScreen, } \title{Simulability-Aware Quantum Reservoir Computing for Image Classification under Matched Readout Fairness} \author{Anonymous Authors \\ Affiliation withheld for double-blind review \\ \texttt{anonymous@anonymous.edu}} \begin{document} \maketitle \begin{abstract} Quantum reservoir computing for vision tasks is often discussed in terms of empirical gains without an equally explicit separation between predictive improvement and computational-advantage claims. We study this issue in a controlled image-classification setting with PCA-compressed inputs, matched preprocessing, and identical linear readout training across classical and quantum reservoirs. The manuscript contributes a hybrid theory-plus-experiment framing: we formalize the readout stage as a strongly convex regularized problem with a unique closed-form optimum under positive regularization, derive concentration-calibrated conditions for reporting positive entanglement regimes, and define a compute-constrained Pareto boundary that filters performance gains by runtime and simulability costs. Experiments on MNIST, Fashion-MNIST, and a grayscale CIFAR-10 variant show that entangling quantum reservoirs can improve average macro-F1 against strong baselines in several configurations, but confidence-qualified positive-regime gates remain unmet under the current proxy-data execution. At the same time, symbolic and numerical checks validate the formal claims, and fixed-universe frontier audits satisfy monotonicity and nondominance requirements. The resulting evidence supports a simulability-safe conclusion: current results are strongest for formal guarantees and boundary/null characterization, while positive advantage claims remain conditional and require canonical-data reruns. \end{abstract} \section{Introduction} Quantum reservoir computing (QRC) is a fixed-dynamics learning paradigm in which quantum evolution provides a nonlinear feature map and only a classical readout is trained. This fixed-reservoir structure is attractive in regimes where gradient-based variational training can suffer optimization pathologies, including barren-plateau effects in highly entangling ans\"atze \citep{S18,S19}. At the same time, modern kernel interpretations of supervised quantum models imply that raw performance differences must be interpreted relative to the induced feature geometry, baseline strength, and data regime \citep{S05,S20,S21,S22}. 
For practical image classification, these concerns are amplified by preprocessing decisions: PCA compression can eliminate nuisance variation and may collapse an already easy dataset into a near-ceiling benchmark where any additional representational complexity provides limited measurable room for improvement \citep{S02,S29,S30,S31}. This work addresses that evaluation tension directly. We study whether entangling quantum reservoirs improve classification under strict preprocessing and readout parity with classical reservoirs, while explicitly separating predictive outcomes from claims about computational advantage or classical hardness. The design is motivated by recent QRC/QELM reports that jointly emphasize entanglement effects, measurement design, and simulability caveats \citep{S01,S03,S06,S07,S13,S23}, as well as by stronger classical comparator design through feedback-enhanced echo-state networks \citep{S17}. Our central objective is not to force a universal positive-advantage narrative; instead, we develop a decision structure in which positive claims are emitted only when confidence-qualified gates, multiplicity controls, and cost-aware constraints are all satisfied. The study is intentionally hybrid. The first component is formal: we prove readout optimality and uniqueness under ridge regularization, establish concentration-based lower-bound criteria for reporting positive regimes, and characterize a compute-constrained Pareto boundary with an explicit failure region. The second component is empirical: we execute matched-baseline experiments spanning MNIST, Fashion-MNIST, and a harder grayscale CIFAR-10 regime, then audit confidence and frontier properties with deterministic checks. This blend allows us to determine not only ``what performed best on average'' but also ``which statements are justified with high-confidence and simulability discipline.'' The problem is relevant beyond image classification. Neutral-atom, superconducting, lattice, and dissipative QRC implementations are rapidly diversifying \citep{S08,S09,S10,S11,S12,S15,S24,S25,S26,S27,S28}, and publication pressure can encourage over-interpretation of narrow benchmark wins. A rigorous reporting scaffold that scales across domains can reduce this risk and improve cross-study comparability, especially when combined with reproducibility infrastructure and open tooling \citep{S32,S33,S34,S35,S36}. Our contributions are: \begin{itemize} \item We provide a fairness-first readout formalization that guarantees a unique closed-form optimum for every model family under shared regularization grids and matched train/validation/test splits. \item We derive and operationalize a concentration-calibrated criterion for positive entanglement regimes, coupling lower confidence bounds with multiplicity-adjusted paired testing and explicit null-result handling. \item We introduce a compute-constrained performance--simulability frontier with a failure-region filter and prove that scalarized maximizers in the retained set are Pareto efficient. \item We report an end-to-end empirical audit showing strong formal and boundary consistency, but no confidence-qualified positive regime under current proxy-data runs, yielding a simulability-safe interpretation. \end{itemize} The remainder of the paper is organized as follows. \Secref{sec:related} situates the method in prior QRC and kernel literature. \Secref{sec:problem} defines symbols, assumptions, and fairness constraints. \Secref{sec:method} presents formal development and proofs. 
\Secref{sec:protocol} details the experimental and reproducibility protocol, and \secref{sec:results} reports evidence linked to each claim. \Secref{sec:discussion} discusses caveats, limitations, and follow-up experiments, followed by conclusions in \secref{sec:conclusion}.

\section{Related Work and Motivation}\label{sec:related}
Recent QRC/QELM image studies demonstrate that fixed quantum embeddings can be competitive when preprocessing and readout are appropriately chosen \citep{S02,S06,S13,S14}. However, reported gains are often regime-dependent, and design factors such as encoding spectrum, measurement operators, and classical comparator strength frequently dominate the final ranking. This is consistent with kernel-theoretic analyses showing that supervised quantum learners should be interpreted as feature-map-induced kernel methods, where geometry and inductive bias are central \citep{S05,S20,S21,S22}. From this viewpoint, the right comparison is not ``quantum versus classical'' in the abstract, but ``which feature map plus readout regularization best matches the task under equalized optimization conditions.''

Encoding choice is one major axis. Fourier-spectrum analyses show that accessible frequency components are constrained by encoding design, so representational power can be bottlenecked before any reservoir dynamics is considered \citep{S04}. In PCA-driven pipelines, this interacts with retained eigenstructure and can yield either beneficial spectral alignment or brittle overcompression. Consequently, simple wins on easy datasets may reveal little about robustness. Fashion-MNIST and CIFAR-10-style variants therefore play an important role as hardness checks when MNIST approaches ceiling behavior \citep{S29,S30,S31}.

A second axis is measurement and readout. Kernel-based observable optimization demonstrates that measurement operator selection can materially alter downstream generalization even with fixed dynamics \citep{S03}. Hardware-oriented repeated-measurement studies similarly indicate that protocol-level choices can change the speed--accuracy tradeoff \citep{S15}. Yet many comparative studies still rely on convenience-selected measurement subsets, making attribution of ``entanglement benefits'' ambiguous when operator design itself is under-optimized.

A third axis is simulability and claim discipline. Entanglement can improve feature richness while remaining classically simulable in practical regimes, so empirical gains and complexity-theoretic speedups must be separated \citep{S01,S23}. Strong claim discipline is therefore especially important in near-term hardware reports. Large-scale analog QRC demonstrations and robust molecular QRC studies broaden empirical scope but also reinforce the need for transparent cost accounting and reproducibility controls \citep{S06,S07,S26}. In parallel, classical baselines continue to improve, with feedback-enhanced ESNs providing stronger comparators than legacy RC baselines \citep{S17}.

Cross-platform evidence further supports this interpretation. Studies in dissipative reservoirs, optical-cavity settings, and superconducting repeated-measurement protocols show that design levers differ substantially across substrates \citep{S08,S10,S11,S12,S15,S24,S25,S28}. Consequently, statements of advantage that omit substrate-specific constraints can become difficult to transfer or reproduce.
A unified reporting language that combines task metrics, uncertainty treatment, and cost diagnostics is therefore more valuable than any single benchmark win, because it enables method transport across hardware and simulation contexts without overstating what was actually demonstrated. Reproducibility infrastructure is another underused area in QRC reporting. Toolchains and open repositories exist, but many publications still diverge in split definitions, hyperparameter search breadth, and uncertainty conventions \citep{S32,S33,S34,S35,S36}. This fragmentation makes meta-analysis difficult and can create apparent disagreements that are mostly protocol artifacts. The framework in this paper intentionally treats reproducibility metadata as first-class evidence, not supplemental detail: split parity, sweep grids, symbolic checks, and gate definitions are all part of the scientific claim surface. Our work synthesizes these threads into a single protocol: matched readout fairness, concentration-aware positive-regime gating, and compute-constrained frontier filtering. This combination targets a documented literature gap in integrated evaluation practice, where formal guarantees, uncertainty calibration, and simulability-safe reporting are rarely enforced simultaneously \citep{S01,S03,S05,S13,S17,S23}. \section{Problem Setting and Fairness Constraints}\label{sec:problem} \subsection{Task, Data, and Feature Spaces} Let raw images be vectors $\vr \in \mathbb{R}^{D}$ with class label $y\in\{1,\ldots,C\}$. A PCA map $\rmU_{\text{PCA}}\in\mathbb{R}^{D\times d}$ produces compressed inputs $\vx=\rmU_{\text{PCA}}^{\top}\vr\in\mathbb{R}^{d}$. We consider datasets indexed by $s\in\mathcal{S}$ and conditions indexed by \begin{equation} \vc=(s,d,g,t,e,m), \label{eq:condition_index} \end{equation} where $g$ is entangling strength, $t$ is reservoir evolution time, $e$ is encoding choice, and $m$ is measurement budget. For each condition and model family, we obtain a fixed feature matrix $\rmZ_{\vc}\in\mathbb{R}^{n\times M}$ and one-hot label matrix $\rmY\in\mathbb{R}^{n\times C}$. Quantum and classical reservoirs are both treated as fixed feature maps with trainable linear readout. The compared families are: standard ESN readout baseline, feedback-ESN baseline, stochastic RC baseline, non-entangling QRC, and entangling QRC variants at moderate and high $g$. The evaluation objective is to characterize predictive quality and uncertainty under strict parity, then determine whether any claimed gain remains valid once confidence and cost constraints are enforced. \subsection{Fairness and Reporting Assumptions} We impose four assumptions: \begin{enumerate} \item \textbf{Split parity}: all baselines share identical $\train/\valid/\test$ partition indices for each seed. \item \textbf{Preprocessing parity}: all baselines share the same normalization and PCA pipeline for each $(s,d)$. \item \textbf{Readout parity}: all baselines use the same linear readout family and common regularization grid $\Lambda\subset\mathbb{R}_{>0}$. \item \textbf{Reporting parity}: positive regime claims require both one-sided confidence-bound and multiplicity-adjusted significance gates. \end{enumerate} These assumptions make the comparison identifiable: differences in downstream metrics are attributable to reservoir-generated features and not to unequal optimization budgets. They also support theorem-level statements because the readout objective is identical across model families. 
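To make split parity concrete, the following minimal sketch (illustrative NumPy code; the function name and default fractions are ours, not part of the released pipeline) shows how one seeded permutation can be generated per seed and then consumed verbatim by every model family, so that no baseline ever optimizes against a private partition.

\begin{verbatim}
import numpy as np

def shared_split_indices(n, seed, train_frac=0.70, valid_frac=0.15):
    # Deterministic train/valid/test indices for one seed.
    # Split parity (Assumption 1) holds because every model family
    # calls this with the same (n, seed) pair and therefore consumes
    # byte-identical partition indices.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_train = int(train_frac * n)
    n_valid = int(valid_frac * n)
    return (perm[:n_train],                    # train
            perm[n_train:n_train + n_valid],   # validation
            perm[n_train + n_valid:])          # test
\end{verbatim}

Preprocessing parity is enforced in the same spirit: the PCA map for each $(s,d)$ is fit once on the shared train indices, and the fitted transform is reused unchanged by all families.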
\subsection{Quantities of Interest}
For each condition and seed, we compute standard classification metrics (accuracy, macro-F1, balanced accuracy, log-loss), plus paired-difference metrics against the strongest comparator. The primary lift variable is the condition-level macro-F1 difference between entangling QRC and the best non-entangling/feedback baseline. We additionally track runtime and simulability proxies for frontier analysis. This provides a three-part evidence structure: (i) average predictive behavior, (ii) confidence-qualified regime calls, and (iii) cost-aware feasibility.

\section{Methodology: Optimality, Confidence Gating, and Frontier Discipline}\label{sec:method}
\subsection{Fair Readout Optimization}
Given fixed $\rmZ_{\vc}$ and $\rmY$, we solve
\begin{equation}
\rmW_{\vc}^{*}(\lambda)=\arg\min_{\rmW\in\mathbb{R}^{M\times C}}\frac{1}{n}\|\rmZ_{\vc}\rmW-\rmY\|_{F}^{2}+\lambda\|\rmW\|_{F}^{2},\quad \lambda>0.
\label{eq:ridge_objective}
\end{equation}
The stationarity equation yields
\begin{equation}
\rmW_{\vc}^{*}(\lambda)=\left(\rmZ_{\vc}^{\top}\rmZ_{\vc}+n\lambda \rmI\right)^{-1}\rmZ_{\vc}^{\top}\rmY,
\label{eq:ridge_closed_form}
\end{equation}
and model selection is
\begin{equation}
\lambda_{\vc}^{*}=\arg\min_{\lambda\in\Lambda}\mathcal{L}_{\valid}(\rmW_{\vc}^{*}(\lambda)).
\label{eq:lambda_selection}
\end{equation}
\begin{theorem}[Unique optimum under positive regularization]
For any condition $\vc$ and $\lambda>0$, the objective in \eqref{eq:ridge_objective} has a unique global minimizer over $\mathbb{R}^{M\times C}$, given by \eqref{eq:ridge_closed_form}.
\end{theorem}
\begin{proof}
Vectorize $\rmW$ as $\vw=\mathrm{vec}(\rmW)$ and write the objective as a quadratic form in $\vw$. Its Hessian is
\[
\frac{2}{n}(\rmI_C\otimes \rmZ_{\vc}^{\top}\rmZ_{\vc})+2\lambda\rmI_{MC}.
\]
Because $\rmZ_{\vc}^{\top}\rmZ_{\vc}$ is positive semidefinite and $\lambda>0$, the Hessian is positive definite, so the objective is strongly convex and admits a unique global minimizer. Setting the gradient to zero yields
\[
(\rmZ_{\vc}^{\top}\rmZ_{\vc}+n\lambda\rmI)\rmW=\rmZ_{\vc}^{\top}\rmY,
\]
where the coefficient matrix is positive definite and invertible; solving gives \eqref{eq:ridge_closed_form}.\qedhere
\end{proof}
\Eqref{eq:ridge_objective}--\eqref{eq:lambda_selection} define the fairness-critical optimization core used uniformly across all baselines. In implementation, theorem-validity checks and floating-point residual checks are reported separately, which avoids conflating mathematical conditions with numerical conditioning artifacts as $\lambda\rightarrow 0$.

\subsection{Concentration-Calibrated Positive Regime Criterion}
For paired seed index $r\in\{1,\ldots,R\}$, define
\begin{equation}
D_r(\vc)=\mathrm{F1}_r(\text{entangling},\vc)-\max\left\{\mathrm{F1}_r(\text{non-entangling},\vc),\mathrm{F1}_r(\text{feedback},\vc)\right\},
\label{eq:paired_difference}
\end{equation}
and sample mean
\begin{equation}
\widehat{\Delta}_{\vc}=\frac{1}{R}\sum_{r=1}^{R}D_r(\vc),\qquad \Delta_{\vc}=\mathbb{E}[D_r(\vc)].
\label{eq:delta_estimator}
\end{equation}
Since each $D_r(\vc)$ is bounded in $[-1,1]$ and therefore has range at most $2$, Hoeffding's inequality implies
\begin{equation}
\Pr\left(\left|\widehat{\Delta}_{\vc}-\Delta_{\vc}\right|\ge\epsilon\right)\le 2\exp\left(-R\epsilon^{2}/2\right).
\label{eq:hoeffding}
\end{equation}
Define a one-sided lower bound
\begin{equation}
L_{\vc}(\alpha)=\widehat{\Delta}_{\vc}-\sqrt{\frac{2\log(2/\alpha)}{R}},
\label{eq:lower_bound}
\end{equation}
and positive regime set
\begin{equation}
\mathcal{R}_{+}(\delta,\alpha)=\{\vc: L_{\vc}(\alpha)>\delta\ \text{and Holm-adjusted paired }p<\alpha\}.
\label{eq:positive_regime_set}
\end{equation}
\begin{theorem}[Confidence-qualified positive regime]
For any condition $\vc$ and $\alpha\in(0,1)$, $\Pr(\Delta_{\vc}\ge L_{\vc}(\alpha))\ge 1-\alpha$. Therefore, if $L_{\vc}(\alpha)>\delta$, then $\Pr(\Delta_{\vc}>\delta)\ge 1-\alpha$.
\end{theorem}
\begin{proof}
The one-sided event is contained in the two-sided event, so \eqref{eq:hoeffding} gives
\[
\Pr\left(\Delta_{\vc}-\widehat{\Delta}_{\vc}\ge\sqrt{\frac{2\log(2/\alpha)}{R}}\right)\le 2\exp\left(-\frac{R}{2}\cdot\frac{2\log(2/\alpha)}{R}\right)=\alpha.
\]
Rearranging gives
\[
\Pr\left(\Delta_{\vc}\ge\widehat{\Delta}_{\vc}-\sqrt{\frac{2\log(2/\alpha)}{R}}\right)\ge 1-\alpha,
\]
which is exactly $\Pr(\Delta_{\vc}\ge L_{\vc}(\alpha))\ge 1-\alpha$. If $L_{\vc}(\alpha)>\delta$, then on that event we also have $\Delta_{\vc}>\delta$, proving the second claim.\qedhere
\end{proof}

\subsection{Compute-Constrained Frontier and Failure Region}
Let $\bm{\theta}=(g,t,m,d,e)$ denote a candidate design and define the feasible set
\begin{equation}
\Theta_{\text{feas}}=\{\bm{\theta}: C_{\text{run}}(\bm{\theta})\le B_{\text{APU}},\ C_{\text{mem}}(\bm{\theta})\le 128\,\text{GB}\}.
\label{eq:feasible_set}
\end{equation}
With reference baseline $B$ (feedback-ESN), define failure region
\begin{equation}
\mathcal{R}_{\text{fail}}(\epsilon)=\left\{\bm{\theta}\in\Theta_{\text{feas}}: P(\bm{\theta})\le P(B)+\epsilon\ \land\ C_{\text{run}}(\bm{\theta})>C_{\text{run}}(B)\right\}.
\label{eq:failure_region}
\end{equation}
Scalarized utility is
\begin{equation}
J_{\lambda}(\bm{\theta})=P(\bm{\theta})-\lambda_1 C_{\text{run}}(\bm{\theta})-\lambda_2 C_{\text{sim}}(\bm{\theta}),\quad \lambda_1,\lambda_2>0.
\label{eq:utility}
\end{equation}
We report only points in $\Theta_{\text{feas}}\setminus\mathcal{R}_{\text{fail}}(\epsilon)$ and audit Pareto nondominance on $(\max P,\min C_{\text{run}},\min C_{\text{sim}})$.
\begin{lemma}[Failure-region monotonicity]
If $\epsilon_1\le\epsilon_2$, then $\mathcal{R}_{\text{fail}}(\epsilon_1)\subseteq\mathcal{R}_{\text{fail}}(\epsilon_2)$.
\end{lemma}
\begin{proof}
Take any $\bm{\theta}\in\mathcal{R}_{\text{fail}}(\epsilon_1)$. By definition, $P(\bm{\theta})\le P(B)+\epsilon_1$ and $C_{\text{run}}(\bm{\theta})>C_{\text{run}}(B)$. Since $\epsilon_1\le\epsilon_2$, we also have $P(\bm{\theta})\le P(B)+\epsilon_2$. Therefore $\bm{\theta}\in\mathcal{R}_{\text{fail}}(\epsilon_2)$.\qedhere
\end{proof}
\begin{theorem}[Scalarized optimum implies Pareto efficiency]
Assume $\Theta_{\text{feas}}\setminus\mathcal{R}_{\text{fail}}(\epsilon)$ is finite and nonempty. Any maximizer of \eqref{eq:utility} over this retained set is Pareto efficient under $(\max P,\min C_{\text{run}},\min C_{\text{sim}})$.
\end{theorem}
\begin{proof}
Let $\bm{\theta}^{*}$ maximize \eqref{eq:utility}. Suppose $\bm{\theta}^{*}$ were dominated by $\bm{\theta}'$ in the retained set, with at least one strict improvement and no worsening in objective directions. Because $\lambda_1,\lambda_2>0$, domination implies $J_{\lambda}(\bm{\theta}')>J_{\lambda}(\bm{\theta}^{*})$, contradicting optimality.
Hence $\bm{\theta}^{*}$ is nondominated.\qedhere \end{proof} \subsection{End-to-End Evaluation Workflow} \begin{algorithm}[t] \caption{Simulability-aware hybrid evaluation pipeline} \label{alg:workflow} \begin{algorithmic} \STATE Construct shared splits and PCA preprocessing for all baselines. \FOR{each condition $\vc=(s,d,g,t,e,m)$ and model family} \STATE Build fixed feature matrix $\rmZ_{\vc}$. \STATE Solve \eqref{eq:ridge_objective} on a shared $\Lambda$ grid and select by \eqref{eq:lambda_selection}. \STATE Compute paired differences via \eqref{eq:paired_difference} and lower bounds via \eqref{eq:lower_bound}. \ENDFOR \STATE Mark positive conditions using \eqref{eq:positive_regime_set} with Holm correction. \STATE Enumerate retained frontier set from \eqref{eq:feasible_set}--\eqref{eq:utility} and audit monotonicity/nondominance. \STATE Report either confidence-qualified positive regimes or null-result bounds with explicit caveats. \end{algorithmic} \end{algorithm} \Algref{alg:workflow} integrates the three claim layers: readout optimality, uncertainty-calibrated regime testing, and frontier-based simulability discipline. The following sections instantiate this protocol in the latest experimental iteration. \section{Experimental Protocol and Reproducibility}\label{sec:protocol} \subsection{Datasets, Baselines, and Sweeps} The benchmark includes MNIST, Fashion-MNIST, and a grayscale downsampled CIFAR-10 regime to reduce ceiling risk. Baselines include ESN, feedback-ESN, stochastic RC, non-entangling QRC, and two entangling QRC configurations. Seeds are paired across baselines, and all runs reuse matched split definitions and preprocessing transforms. Hyperparameter sweeps span PCA dimensions, entangling strengths, evolution times, encoding choices, measurement sets, train-size fractions, and regularization grids. Confidence calibration additionally sweeps $\alpha\in\{0.10,0.05,0.01\}$, $\delta\in\{0,0.005,0.01,0.02\}$, and paired-run counts $R\in\{8,12,16\}$. This protocol is designed to test both favorable and unfavorable regimes. In particular, easy settings are expected to compress headroom for improvement, while harder settings should better expose whether entangling features produce robust advantages. The same evaluation logic is applied regardless of outcome, so null findings are first-class results rather than failure states. Two protocol choices are important for interpretation. First, sweep breadth is asymmetric by design: the benchmark stage is wide to map performance surfaces, whereas boundary and calibration stages are deep to stress assumptions and reporting gates. This avoids a common imbalance in which many benchmark points are available but theorem and uncertainty behavior remain weakly tested. Second, exported summaries are condition-indexed and seed-paired, so each manuscript claim can be traced to deterministic run groups. This indexability materially improves revision workflows because contested values can be recomputed without re-running unrelated parts of the pipeline. \subsection{Architecture and Module Responsibilities} The implementation follows a modular pipeline with five responsibilities: (i) condition generation and split enforcement, (ii) feature extraction for each reservoir family, (iii) readout fitting with shared regularization search, (iv) confidence and frontier audits, and (v) figure/table/report export. This architecture is intentionally simple to reduce confounding implementation variance across model families. 
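As a concrete rendering of responsibility (iii), the sketch below implements the closed-form solution \eqref{eq:ridge_closed_form} and the shared-grid selection \eqref{eq:lambda_selection}. It is a minimal NumPy illustration under stated assumptions (dense feature matrices, one-hot labels); the function names are ours and do not refer to a specific library API.

\begin{verbatim}
import numpy as np

def fit_readout(Z, Y, lam):
    # Closed-form ridge readout: W = (Z^T Z + n*lam*I)^{-1} Z^T Y.
    # Unique for lam > 0 because the normal matrix is positive definite.
    n, M = Z.shape
    A = Z.T @ Z + n * lam * np.eye(M)
    return np.linalg.solve(A, Z.T @ Y)

def select_readout(Z_tr, Y_tr, Z_va, Y_va, lam_grid):
    # Readout parity: every family searches the same grid and is
    # scored by the same validation loss.
    best = (None, None, np.inf)
    for lam in lam_grid:
        W = fit_readout(Z_tr, Y_tr, lam)
        loss = np.mean((Z_va @ W - Y_va) ** 2)
        if loss < best[2]:
            best = (lam, W, loss)
    return best[0], best[1]
\end{verbatim}

Solving the regularized normal equations rather than forming an explicit inverse is the standard numerically stable choice and matches the conditioning diagnostics reported later.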
The symbolic validation module checks algebraic identities linked to \eqref{eq:ridge_closed_form}, \eqref{eq:lower_bound}, and \eqref{eq:failure_region} logic, while dedicated analysis modules generate concentration, tolerance, monotonicity, and nondominance tables. The module boundaries also enforce claim isolation. Feature-construction modules cannot perform significance testing, and statistical modules cannot alter split assignments or feature tensors. This separation guards against leakage between representation and inference, which can otherwise produce optimistic but irreproducible effect estimates in iterative benchmarking. It also allowed the current rerun to update tolerance policies, fallback test handling, and frontier inclusion checks without rewriting upstream feature-generation code. \subsection{Uncertainty, Significance, and Gate Policy} Uncertainty is reported through confidence bounds and bootstrap stability where appropriate. Paired $t$-tests are used when assumptions are numerically stable; Wilcoxon signed-rank fallback is triggered under precision-loss warnings. Multiple comparisons are controlled via Holm adjustment at condition level. A condition is reportable-positive only if both confidence-bound and adjusted-significance gates hold simultaneously, consistent with \eqref{eq:positive_regime_set}. When this criterion is unmet globally, the protocol emits null-result bounds, false-positive diagnostics using shuffled controls, and regime-stratified summaries. \subsection{Implementation Details and Compute Budget} All runs target a single Apple-Silicon-class APU envelope with a memory cap of 128GB. The latest full execution lasted approximately 75--78 seconds per end-to-end run under the configured synthetic-manifest benchmark bundle, and repeated runs produced identical high-level summary rates for positivity, monotonicity, and shuffled-control diagnostics. While this is sufficient for method-internal consistency, it is not a substitute for canonical-data materialization. Therefore, conclusions are stated as proxy-data evidence with explicit follow-up requirements. Beyond runtime totals, the compute budget is embedded directly in model selection logic through the feasibility constraints in \eqref{eq:feasible_set}. This means budget compliance is not a post hoc commentary but a precondition for reportable frontier candidates. In effect, performance and cost are co-optimized at the statement level: a candidate that improves macro-F1 but violates practical constraints is treated as informative engineering data, not as headline evidence for method superiority. \section{Results}\label{sec:results} \subsection{Average Performance under Matched Fairness} Table~\ref{tab:main_metrics} summarizes macro-F1 means at representative dimensional settings under strict parity. Entangling QRC improves average macro-F1 relative to feedback-ESN on all three datasets at selected settings (for example, CIFAR-10 grayscale: $0.8455$ vs $0.8254$; Fashion-MNIST: $0.8029$ vs $0.7873$; MNIST: $0.7317$ vs $0.7237$). It also exceeds non-entangling QRC at the same settings. These average improvements indicate that entangling dynamics can alter feature geometry in ways that are useful for linear readout under matched training. However, average lift is not sufficient for a positive-regime claim under our reporting policy. The concentration table shows only a small fraction of conditions pass unadjusted significance, and none pass the joint confidence-plus-Holm criterion. 
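In code, the dual gate is short. The sketch below is illustrative (it assumes raw paired $p$-values are supplied by the paired $t$-test or its Wilcoxon fallback, and the names are ours); it combines the lower bound \eqref{eq:lower_bound} with Holm step-down adjustment as required by \eqref{eq:positive_regime_set}.

\begin{verbatim}
import numpy as np

def hoeffding_lower_bound(deltas, alpha):
    # One-sided bound for paired differences D_r in [-1, 1]:
    # L(alpha) = mean - sqrt(2 * log(2 / alpha) / R).
    R = len(deltas)
    return float(np.mean(deltas)) - np.sqrt(2.0 * np.log(2.0 / alpha) / R)

def holm_adjust(pvals):
    # Holm step-down: with zero-indexed rank k in sorted order, the
    # adjusted value is the running max of (m - k) * p_(k), capped at 1.
    m = len(pvals)
    order = np.argsort(pvals)
    adjusted, running = np.empty(m), 0.0
    for rank, idx in enumerate(order):
        running = max(running, (m - rank) * pvals[idx])
        adjusted[idx] = min(1.0, running)
    return adjusted

def reportable_positive(deltas_by_cond, raw_pvals, alpha, delta):
    # A condition passes only if BOTH gates hold simultaneously.
    adj = holm_adjust(np.asarray(raw_pvals, dtype=float))
    return [hoeffding_lower_bound(d, alpha) > delta and p < alpha
            for d, p in zip(deltas_by_cond, adj)]
\end{verbatim}

Applying this gate across the full condition universe produces the counts summarized next.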
Specifically, condition-level significance appears in 3 of 237 conditions, but confidence-qualified positivity remains zero. This is exactly the scenario that motivated the dual-gate design: visible mean gains in slices coexist with insufficient lower-confidence support after multiplicity control. \begin{table}[t] \caption{Representative macro-F1 comparison under matched preprocessing and readout parity. Values are means across paired seeds at a common PCA setting. The table provides direct evidence for average predictive differences, while positive-regime conclusions are governed separately by confidence and multiplicity gates.} \label{tab:main_metrics} \centering \small \renewcommand{\arraystretch}{1.1} \setlength{\tabcolsep}{4pt} \begin{tabular}{lccc} \hline Dataset & Entangling QRC & Feedback-ESN & Non-entangling QRC \\ \hline CIFAR-10 grayscale & 0.8455 & 0.8254 & 0.8201 \\ Fashion-MNIST & 0.8029 & 0.7873 & 0.7801 \\ MNIST & 0.7317 & 0.7237 & 0.7135 \\ \hline \end{tabular} \end{table} \begin{figure}[t] \centering \includegraphics[width=0.65\linewidth]{regime_concentration_panels.pdf} \caption{Condition-wise concentration analysis for regime reporting. The panel set visualizes mean paired macro-F1 differences and lower confidence bounds across condition indices spanning dataset, PCA dimension, entangling strength, evolution time, encoding, measurement choice, and train fraction. The horizontal threshold corresponds to the reporting margin in \eqref{eq:positive_regime_set}; no condition crosses the full confidence-qualified positivity gate after multiplicity correction, so the figure supports a calibrated null-result interpretation rather than a universal positive-regime claim.} \label{fig:e1} \end{figure} \Figref{fig:e1} provides direct evidence for this interpretation. Several slices show positive mean deltas, but lower bounds remain negative once uncertainty and multiplicity are accounted for. This outcome does not contradict the utility of entangling features; it instead narrows the claim from ``broad positive regime'' to ``promising average lift with insufficient confidence qualification in the current proxy-data run.'' The distinction is essential for simulability-safe reporting. \subsection{Formal-Claim Validation: Readout and Boundary Behavior} Theorem-level checks tied to \eqref{eq:ridge_objective} are strongly supported for $\lambda>0$. Across 840 boundary rows, numeric stationarity passes at rate 1.0 under condition-aware tolerances, and theorem assumptions pass whenever positive regularization is enforced. As expected, the $\lambda=0$ boundary produces singular or near-singular systems in many rows (288 explicit singular counterexamples), which is evidence for the necessity of regularization rather than a theorem failure. The adaptive tolerance audit confirms that previously brittle fixed-threshold checks can be replaced by condition-aware diagnostics without weakening mathematical guarantees. These findings align with \eqref{eq:ridge_closed_form} and the proof of uniqueness: positive definiteness is robust under $\lambda>0$, while unregularized boundaries can fail due to rank deficiency. In practical terms, this means readout fairness can be guaranteed in a numerically stable way by enforcing shared positive regularization grids and reporting boundary counterexamples transparently. 
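The condition-aware residual policy can likewise be summarized in a short check. The sketch below is illustrative rather than the audit's exact rule (the scaling constants and the condition-number cutoff are assumptions of this example); it shows why numeric passes remain decoupled from theorem validity near the unregularized boundary.

\begin{verbatim}
import numpy as np

def stationarity_check(Z, Y, W, lam, base_tol=1e-10, max_cond=1e12):
    # Residual of the normal equations (Z^T Z + n*lam*I) W = Z^T Y.
    n, M = Z.shape
    A = Z.T @ Z + n * lam * np.eye(M)
    kappa = np.linalg.cond(A)
    if not np.isfinite(kappa) or kappa > max_cond:
        # Near-singular boundary (e.g., lam = 0 with rank-deficient Z):
        # record numerical stress instead of judging the theorem.
        return False, kappa
    residual = np.linalg.norm(A @ W - Z.T @ Y)
    tol = base_tol * kappa * max(np.linalg.norm(Z.T @ Y), 1.0)
    return residual <= tol, kappa
\end{verbatim}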
\subsection{Frontier Evidence and Simulability Discipline}
\begin{figure}[t]
\centering
\includegraphics[width=0.65\linewidth]{frontier_failure_boundary.pdf}
\caption{Compute-constrained frontier and failure-region diagnostics. The left panel maps runtime against macro-F1 and highlights retained versus filtered candidates after applying the failure-region criterion in \eqref{eq:failure_region}, while the right panel tracks failure-set cardinality as the equivalence margin increases. Failure counts increase monotonically for both datasets under a fixed candidate universe, consistent with the monotonicity lemma; all retained scalarized maximizers are nondominated and satisfy runtime and memory budgets, supporting simulability-aware interpretation of reported candidates.}
\label{fig:e3}
\end{figure}
\Figref{fig:e3} summarizes the frontier layer. Two dataset-specific monotonicity checks pass deterministically, and failure-set cardinalities grow from $(337,360)$ at $\epsilon=0$ to $(590,600)$ at $\epsilon=0.03$ for CIFAR-like and Fashion-like regimes, respectively. In addition, 100\% of audited retained points are nondominated and satisfy hard runtime/memory constraints. These results support the formal properties behind \eqref{eq:failure_region} and \eqref{eq:utility} and resolve earlier inconsistency concerns about frontier construction.

From a claim perspective, frontier validation matters because it prevents selective reporting of high-score but high-cost points that do not improve upon strong classical references in a practical budget sense. By explicitly retaining only nondominated, budget-compliant, non-failure candidates, the analysis converts simulability language from qualitative caveat to quantitative filter.

\subsection{Confidence Calibration and Null-Result Synthesis}
The confidence calibration sweep reinforces the concentration narrative. Across 8,532 calibration rows, reportable-positive rate remains zero under all tested $\alpha$, $\delta$, and $R$ combinations. Shuffled-control false-positive rate is also zero, indicating that the gate is conservative but not spuriously permissive. Dataset-level maximum lower bounds remain negative even in best slices: approximately $-0.252$ for the CIFAR-like setting, $-0.262$ for Fashion-MNIST, and $-0.279$ for MNIST.
\begin{table}[t]
\caption{Regime-stratified confirmatory summary under the same confidence and multiplicity policy as the main analysis. Both strata retain zero confidence-qualified positive regimes; small adjusted $p$-values in harder strata do not overturn negative lower-confidence bounds.}
\label{tab:confirmatory_strata}
\centering
\small
\renewcommand{\arraystretch}{1.1}
\setlength{\tabcolsep}{4pt}
\begin{tabular}{lcccc}
\hline
Stratum & Positive-significance rate & Mean paired lift & Mean lower bound & Min Holm-adjusted $p$ \\
\hline
Easy (MNIST-like) & 0.00 & $-0.1360$ & $-0.6161$ & $9.14\times10^{-1}$ \\
Harder (Fashion/CIFAR-like) & 0.00 & $-0.1464$ & $-0.6266$ & $1.95\times10^{-104}$ \\
\hline
\end{tabular}
\end{table}
Table~\ref{tab:confirmatory_strata} makes the confirmatory stratification explicit: stronger nominal significance in harder strata can coexist with zero reportable-positive regimes when lower-bound and multiplicity requirements are enforced jointly. These outcomes sharpen the practical conclusion. The empirical layer currently supports boundary and null characterization more strongly than positive-regime confirmation.
This is scientifically useful: it reduces overclaim risk and identifies concrete follow-up experiments needed to test whether gains persist after replacing proxy manifests with canonical dataset materialization. \subsection{Claim-to-Evidence Traceability} Each core claim is tied to specific evidence. Readout optimality and regularization necessity are supported by theorem proofs and boundary audit statistics from the tolerance and singularity checks. Confidence-gated regime claims are supported by concentration and calibration tables plus \figref{fig:e1}. Frontier claims are supported by monotonicity and nondominance audits plus \figref{fig:e3}. Average predictive behavior is summarized in Table~\ref{tab:main_metrics}. This traceability is deliberate: it prevents narrative drift from mean-score improvements into unsupported global advantage statements. An additional benefit of this traceability is that it makes contradictory signals scientifically productive rather than confusing. For example, average-score gains and null confidence gates may appear inconsistent if viewed through a single-metric lens, but they become coherent when mapped to distinct inferential targets. Mean-score tables address central tendency under finite samples; confidence-gated maps address lower-bound certainty under multiplicity control; frontier audits address practical relevance under constrained compute. The three layers are complementary, and disagreement between them reveals where uncertainty, effect size, or cost dominates interpretation. This layered interpretation also supports stronger comparative methodology across future studies. A paper that reports only average metrics can be re-evaluated by adding confidence and frontier layers, while a paper that reports only formal guarantees can be complemented with condition-indexed empirical diagnostics. By standardizing this decomposition, the field can compare results that were previously difficult to align, especially when datasets, preprocessing, and baseline strengths differ. In that sense, the present manuscript is not only a result report but also a proposal for interoperable evidence accounting in QRC benchmarking. \section{Discussion, Limitations, and Future Work}\label{sec:discussion} \subsection{Interpretation of Hybrid Evidence} The combined formal and empirical evidence yields a nuanced but coherent picture. On one hand, entangling reservoirs can produce better average macro-F1 than strong classical and non-entangling comparators in selected conditions. On the other hand, strict confidence-plus-multiplicity gates do not certify a positive regime in the current run. The correct interpretation is therefore conditional: the method is promising in average-performance terms, formally consistent, and frontier-disciplined, but not yet confirmatory for broad positive-advantage claims. This distinction is not a weakness of the framework; it is precisely its purpose. In many emerging QRC settings, methodological rigor is often reduced when empirical effects are subtle. Here, the same gate policy is applied regardless of narrative preference, producing either positive calls or calibrated nulls. Such symmetry improves credibility and aligns with broader recommendations on separating empirical utility from complexity-theoretic advantage language \citep{S01,S23}. The hybrid evidence profile has immediate implications for model-development strategy. 
When average gains coexist with non-positive lower bounds, the bottleneck is often not representational potential but uncertainty compression: either effect sizes are too small relative to variability, or sample pairing depth is insufficient for the targeted confidence margin. This suggests that future progress may come as much from protocol refinements (larger paired runs, tighter variance controls, and better-conditioned operating regions) as from architectural novelty alone. In other words, reliable advantage demonstration is a joint property of model class and evaluation design. A second implication concerns baseline calibration. Feedback-ESN performance is strong enough that claiming broad superiority of entangling reservoirs requires more than isolated wins. This is a desirable scientific pressure: stronger classical baselines reduce the chance of attributing generic regularization or preprocessing effects to specifically quantum mechanisms. Under this lens, negative or null outcomes are not dead ends; they are evidence that narrows the plausible domain where entangling features provide unique value and where additional complexity is justified. This perspective also clarifies publication strategy. Rather than framing a single run as dispositive, a sequence of protocol-consistent iterations can progressively tighten uncertainty, validate boundaries, and isolate mechanism-level effects. When each iteration preserves the same fairness and reporting rules, even unchanged headline rates provide information: they indicate that the bottleneck is structural rather than incidental, guiding effort toward dataset realism, operator design, or sample-complexity expansion. \subsection{Limitations} The primary limitation is data realism. The current validation iteration uses synthetic/offline dataset manifests rather than canonical dataset ingestion pipelines. This design supports controlled method auditing but limits external validity of positive-advantage statements. Consequently, conclusions about regime positivity are intentionally conservative and should not be extrapolated to production-scale benchmarks. A second limitation is effect-scale fragility in low-variance slices, where nonparametric fallback tests are frequently triggered. Although the fallback policy is appropriate, repeated warnings indicate that some slices sit near numerical indistinguishability. A third limitation is that frontier costs are measured under a single compute envelope; broader hardware diversity may shift practical Pareto boundaries. \subsection{Future Work} The immediate follow-up is a canonical-data rerun with checksum-verified MNIST, Fashion-MNIST, and CIFAR-10 loaders, preserving identical fairness and gate policy. This will directly test whether current null outcomes persist under fully materialized datasets. A second follow-up is expanded measurement-operator optimization integrated into the same parity framework, building on kernel-guided observable design ideas \citep{S03}. A third follow-up is cross-platform frontier replication on hardware-backed QRC stacks to determine whether monotonicity-validated frontier behavior is stable under device noise and calibration variability \citep{S06,S13,S15,S26}. Finally, we recommend that future QRC image studies report positive claims only when confidence, multiplicity, and cost filters are all satisfied, and otherwise publish calibrated null or boundary evidence. 
This policy enables progress even when headline advantages are absent, and it builds a cumulative evidence base rather than a selection-biased one.

\section{Conclusion}\label{sec:conclusion}
We presented a simulability-aware framework for evaluating entangling quantum reservoirs in image classification under strict preprocessing and readout parity. The framework combines three layers: fair readout optimality with closed-form guarantees, concentration-calibrated regime reporting, and compute-constrained frontier filtering with monotonicity and nondominance audits. In the latest experimental iteration, formal claims are well supported, frontier consistency is strong, and average performance improvements are present in several settings, yet confidence-qualified positive regimes remain absent under current proxy-data execution. The resulting scientific claim is intentionally disciplined: evidence supports robust formal and boundary/null conclusions, while positive-advantage claims remain conditional pending canonical-data reruns. We view this as a constructive outcome for QRC research practice. It demonstrates that rigorous protocols can provide clear progress signals even without overextended narratives, and it offers an immediately reusable template for future hybrid theory-plus-experiment studies.

More broadly, the contribution is methodological as much as empirical. The paper shows that mathematically explicit optimality criteria, uncertainty-aware reporting gates, and cost-constrained frontier filters can coexist in one coherent manuscript without reducing interpretability. This integration is valuable for fast-moving quantum-ML domains where evidence quality can lag behind architectural novelty. By stating conclusions as conditional when the evidence is conditional, the framework preserves scientific signal while minimizing overclaim risk, which ultimately accelerates trustworthy progress.

\bibliographystyle{conference}
\bibliography{references}

\appendix
\section{Extended Proof and Diagnostic Notes}
This appendix provides additional narrative context for formal and empirical checks referenced in the main text. The core proofs are complete in \secref{sec:method}; here we emphasize their diagnostic implications.

For readout optimality, the practical distinction is between mathematical assumptions and floating-point behavior. Positive regularization guarantees strong convexity, but very small regularization can still inflate condition numbers and induce large absolute residuals despite preserving theorem assumptions. The adaptive tolerance audit addresses this by scaling acceptable residuals with conditioning, which is why numeric-pass rates remain high while still exposing boundary stress near unregularized settings.

For concentration gating, the most important operational point is that positive mean deltas do not imply positive lower bounds. The lower-bound term in \eqref{eq:lower_bound} can dominate small effects when sample counts are modest or variability is non-negligible. This behavior is expected, not anomalous, and it underlines why strict one-sided bound criteria are appropriate for conservative claim-making.

For frontier diagnostics, fixed-universe construction is essential. If candidate sets vary with $\epsilon$, monotonicity checks become ill-posed and can produce false contradictions. Using a fixed candidate universe with deterministic inclusion tests ensures that the lemma interpretation is directly testable.
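A minimal audit over a fixed candidate universe makes this concrete (illustrative code; the \texttt{Candidate} container and field names are ours):

\begin{verbatim}
from collections import namedtuple

Candidate = namedtuple("Candidate", ["perf", "runtime"])

def failure_set(universe, baseline, eps):
    # R_fail(eps): at most eps better than the baseline in performance
    # while strictly more expensive in runtime.
    return {c for c in universe
            if c.perf <= baseline.perf + eps
            and c.runtime > baseline.runtime}

def monotone_in_eps(universe, baseline, eps_grid):
    # With the universe held fixed, |R_fail(eps)| must be nondecreasing
    # in eps; this is the deterministic check behind the lemma.
    sizes = [len(failure_set(universe, baseline, e))
             for e in sorted(eps_grid)]
    return all(a <= b for a, b in zip(sizes, sizes[1:]))
\end{verbatim}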
\section{Additional Figures and Interpretation} \begin{figure}[h] \centering \includegraphics[width=0.65\linewidth]{readout_boundary_diagnostics.pdf} \caption{Boundary diagnostics for regularized readout optimization. The left panel tracks minimum Hessian eigenvalue across regularization scales and confirms positive-definiteness behavior for positive regularization, while the right panel reports condition-number growth near the unregularized edge. Together, the panels explain why theorem assumptions remain valid in regularized settings and why separate numeric-tolerance auditing is needed for near-singular boundaries.} \label{fig:appendix_e2} \end{figure} \begin{figure}[h] \centering \includegraphics[width=0.65\linewidth]{confidence_calibration_panels.pdf} \caption{Confidence calibration and stability diagnostics. The panels summarize lower-bound behavior across confidence and margin thresholds and show stability diagnostics from bootstrap-based regime consistency analysis. No reportable-positive condition emerges under the combined bound and multiplicity gates, while shuffled-control checks stay near zero false positives, supporting conservative but reliable null-result interpretation.} \label{fig:appendix_e4} \end{figure} \section{Reproducibility and Implementation Details} The experimental protocol uses paired seeds across all baselines to enforce direct condition-wise comparisons. Seed sets cover benchmark, boundary, frontier, and calibration runs, including dedicated values for confidence resampling and frontier audits. Hyperparameter sweeps include entangling strengths, evolution times, measurement counts and sets, PCA dimensions, train-fraction controls, regularization grids, and confidence thresholds. Repetitions are aggregated with condition-level statistics and bootstrap summaries where required. Compute reporting is bounded to a single Apple-Silicon-class APU envelope with 128GB memory. Runtime, memory, and simulability proxy metrics are logged for frontier analysis, and budget constraints are checked for each retained candidate. Symbolic reproducibility is supported by independent checks of normal-matrix structure, concentration bound expressions, and failure-region monotonic implications. Figure exports are validated through PDF readability checks to ensure publication-ready vector artifacts. Approximations and caveats are explicit. The current iteration relies on proxy manifests rather than canonical dataset ingestion, which can alter absolute effect sizes and confidence margins. Therefore, uncertainty procedures and null-result bounds should be interpreted as method validation under controlled proxies, with canonical-data reruns required for publication-strength empirical claims. Follow-up runs should preserve the same fairness assumptions, gate policy, and frontier audits so that iteration-to-iteration deltas remain attributable to data realism rather than protocol drift. \end{document}