% This file was adapted from the ICLR2022_conference.tex example provided for the ICLR conference
\documentclass{article} % For LaTeX2e
\usepackage{conference,times}
\usepackage{easyReview}
\usepackage{algorithm}
\usepackage{algorithmic}
% Optional math commands from https://github.com/goodfeli/dlbook_notation.
\input{math_commands.tex}
\usepackage{amsthm,amssymb}
\newtheorem{theorem}{Theorem}[section]
\newtheorem{corollary}{Corollary}[theorem]
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{definition}[theorem]{Definition}
% Please leave these options as they are
\usepackage{hyperref}
\hypersetup{
  colorlinks=true,
  linkcolor=red,
  filecolor=magenta,
  urlcolor=blue,
  citecolor=purple,
  pdftitle={Stability-Aware Bilevel Source Dataset Selection for Importance-Weighted Least Squares in Unsupervised Domain Adaptation},
  pdfpagemode=FullScreen,
}

\title{Stability-Aware Bilevel Source Dataset Selection for Importance-Weighted Least Squares in Unsupervised Domain Adaptation}

\author{Anonymous Authors \\
Affiliation Withheld for Double-Blind Review \\
\texttt{anonymous@anonymous.edu}}

\begin{document}

\maketitle

\begin{abstract}
Importance-weighted least squares (IWLS) is widely used to correct covariate shift in unsupervised domain adaptation, yet most prior work assumes that an appropriate source dataset is already available. In practical internet-scale retrieval settings, the key decision is reversed: given only unlabeled target covariates and a pool of candidate source datasets, which source (or source mixture) should be selected before fitting a weighted regressor? We study this question through a stability-aware bilevel framework with three formal components: a label-free source-ranking surrogate with a uniform regret guarantee, a multi-source mixture objective with linear-rate upper-level optimization under smoothness and strong-convexity assumptions, and a mixed-shift gate for harmful-source rejection before weighted fitting.
We evaluate these components in three settings (distribution-level synthetic shift, semi-synthetic target-sample selection, and a protocol-faithful proxy real-track setting with governance checks). The empirical findings are mixed: symbolic checks and theorem-conditioned diagnostics are consistent, but Holm-corrected comparisons show no statistically significant advantage of the stability-aware selector over MMD-nearest, Wasserstein-nearest, or pooled IWLS in the current iter\_1 run. This outcome clarifies that optimization guarantees and diagnostic structure are not sufficient for global predictive dominance under the present proxy data regime. The study contributes a reproducible no-target-label protocol, explicit failure reporting, and a concrete follow-up agenda for real benchmark ingestion and post-selection-controlled confirmatory analysis. \end{abstract} \section{Introduction} Selecting a source dataset is often treated as a preprocessing detail in unsupervised domain adaptation, but in many real workflows it is the main decision variable. When target labels are unavailable, practitioners must infer source relevance from unlabeled target covariates, source labels, and imperfect shift diagnostics. This setting appears in tabular scientific modeling, forecasting with external archives, and multi-institution time-series transfer, where candidate datasets are plentiful but compatibility and robustness constraints are tight. The statistical consequence is that source selection error can dominate downstream estimator error: even a theoretically valid IWLS model can fail if the selected source induces unstable density ratios or poor effective sample size. 
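To make the effective-sample-size failure mode concrete, the following sketch shows how even exact density ratios collapse the Kish effective sample size as the shift grows; the Gaussian mean-shift setup and the `ess` helper are illustrative assumptions, not the paper's experimental protocol.

```python
import numpy as np

def ess(w):
    """Kish effective sample size of importance weights: (sum w)^2 / sum w^2."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(0.0, 1.0, n)  # source covariates ~ N(0, 1)

def ratio(x, mu_t):
    # Exact density ratio N(mu_t, 1) / N(0, 1) for a pure mean shift.
    return np.exp(mu_t * x - 0.5 * mu_t ** 2)

# A mild shift keeps most of the sample effective; a large shift collapses
# ESS, which is exactly the instability an ESS penalty is meant to detect.
mild = ess(ratio(x, 0.3))    # theory: ESS/n ~ exp(-mu^2) ~ 0.91
severe = ess(ratio(x, 2.5))  # theory: ESS/n ~ exp(-6.25) ~ 0.002
assert mild > 0.8 * n
assert severe < 0.1 * n
```

The point of the sketch is that the weighted estimator's nominal sample size can be wildly optimistic even when the ratio is known exactly, which is why selection scores should penalize low ESS rather than trust raw sample counts.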
The covariate-shift literature established the core identity that enables IWLS, \(R_T(f)=\mathbb{E}_{S}[w(X)\ell(f(X),Y)]\), with \(w(x)=p_T(x)/p_S(x)\), and clarified conditions under which weighted validation and ratio estimation are unbiased or consistent \citep{Sugiyama2007IWCV,Sugiyama2007KLIEP,Huang2006KMM,Sugiyama2012DensityRatioBook}. Adaptation theory then connected target risk to source risk and distribution discrepancy \citep{BenDavid2010Theory,Mansour2009Discrepancy,Mansour2010Renyi,Zhang2019MDD}. However, this body of work only partially answers the internet-scale source retrieval problem because it typically assumes either a single fixed source, clean support overlap, or no explicit penalty for ratio-estimation uncertainty and numerical conditioning. Modern evidence reinforces this gap. Deep alignment methods can reduce mismatch but are predominantly classification-first and may fail under conditional or label-shift contamination \citep{Ganin2016DANN,Long2018CDAN,Sun2016DeepCORAL,Long2015DAN,Tzeng2017ADDA,Saito2018MCD,Combes2020ConditionalShift}. Shift-robustness analyses emphasize that importance weighting is neither uniformly necessary nor uniformly sufficient, especially under misspecification and regularization coupling \citep{Gogolashvili2023NeedIW,Kanagawa2023Regularization,Feng2024Quantile}. Benchmark studies further show that protocol choices can dominate reported gains \citep{Ragab2023AdaTime,Fawaz2023Benchmark,Koh2021WILDS,Zhao2023TTAPitfalls}. Taken together, these results motivate a method that keeps the no-target-label constraint explicit while modeling stability risk as a first-class optimization term. This paper frames source dataset selection as a bilevel decision problem with formal guarantees and reproducible evaluation. The lower level solves weighted ridge regression; the upper level scores candidate sources or mixtures with discrepancy, ratio uncertainty, and stability diagnostics computed without target labels. 
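Both the importance-weighting identity \(R_T(f)=\mathbb{E}_{S}[w(X)\ell(f(X),Y)]\) and the lower-level weighted ridge solve admit a simple numerical check; the Gaussian marginals, exact ratio formula, fixed predictor, and tolerances below are illustrative assumptions rather than the paper's pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 20000, 1, 1e-3
beta_true = 1.5

# Covariate shift: identical conditional y = beta*x + noise, shifted marginals.
x_s = rng.normal(0.0, 1.0, n)           # source p_S = N(0, 1)
x_t = rng.normal(0.7, 1.0, n)           # target p_T = N(0.7, 1)
y_s = beta_true * x_s + rng.normal(0.0, 0.1, n)
y_t = beta_true * x_t + rng.normal(0.0, 0.1, n)

w = np.exp(0.7 * x_s - 0.5 * 0.7 ** 2)  # exact ratio p_T(x) / p_S(x)

f = lambda x: 1.4 * x                   # any fixed predictor
target_risk = np.mean((f(x_t) - y_t) ** 2)
weighted_source_risk = np.mean(w * (f(x_s) - y_s) ** 2)
# The identity says the weighted source risk estimates the target risk.
assert abs(target_risk - weighted_source_risk) < 0.05

# Lower-level weighted ridge in closed form: (X'WX + lam I)^{-1} X'Wy.
X = x_s.reshape(-1, 1)
W = w / n
A = X.T @ (W[:, None] * X) + lam * np.eye(d)
b = X.T @ (W * y_s)
beta_hat = np.linalg.solve(A, b)[0]
assert abs(beta_hat - beta_true) < 0.05
```

Under exact ratios the weighted fit recovers the shared conditional; the selection problem studied here arises precisely because estimated ratios, unlike the exact `w` above, carry source-dependent error.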
The method is designed for three settings: (i) source and target as distributions (setting A), (ii) finite unlabeled target samples with source mixtures (setting B), and (iii) protocol-faithful high-shift proxy runs aligned with real benchmark pipelines (setting C). The perspective we adopt is intentionally data-centric rather than model-centric. Instead of asking only how to optimize a predictor once a source has been chosen, we ask how to choose the training data itself when supervision is asymmetric across domains. This distinction matters because dataset retrieval often precedes architecture choice in modern machine learning operations. In industry and scientific practice, one frequently reuses a small set of stable model families while repeatedly changing candidate source pools as new repositories become available. A principled source-selection objective can therefore yield gains even when downstream model classes remain fixed. A second motivation is cross-domain transferability of the decision rule. The same no-target-label source-selection challenge appears in environmental modeling, clinical time-series forecasting, and operations management, where label acquisition lags behind covariate collection. While our concrete experiments target regression under covariate shift, the methodological structure, namely balancing adaptation utility with uncertainty and numerical stability penalties, applies to broader weak-supervision pipelines. This broader view guides our emphasis on explicit assumptions, equation-level guarantees, and reproducible artifacts. \textbf{Contributions.} \begin{itemize} \item We formalize label-free source ranking with a composite surrogate and prove a uniform two-sided bound that yields a finite regret guarantee relative to the oracle target-risk ordering. 
\item We derive a bilevel multi-source IWLS objective with closed-form lower-level stationarity and prove linear upper-level objective contraction under standard smoothness and strong-convexity assumptions. \item We introduce a mixed-shift pre-selection gate with a deterministic separation guarantee that rejects harmful sources when diagnostic intervals are separable. \item We provide a reproducible evaluation protocol with fixed seeds, paired significance testing, vector-graphics figures, and public experiment artifacts that expose both supportive evidence and unresolved failure modes. \end{itemize} Beyond the immediate IWLS use case, the framework offers a general template for cross-domain data retrieval under weak supervision: combine adaptation benefit estimates with explicit uncertainty and numerical-stability controls, then evaluate with leakage-safe model-selection rules. \section{Related Work} \subsection{Covariate-Shift and IWLS Foundations} Importance weighting under covariate shift is rooted in the assumption that conditionals are invariant while marginals differ. \citet{Sugiyama2007IWCV} showed that ordinary cross-validation is biased under shift, and introduced importance-weighted cross-validation as an unbiased alternative under correct ratios and overlap assumptions. Direct ratio estimation methods such as KLIEP \citep{Sugiyama2007KLIEP} and moment-matching approaches such as KMM \citep{Huang2006KMM} made weighting practically viable in higher dimensions, while later monographs systematized ratio-estimation objectives and failure modes \citep{Sugiyama2012DensityRatioBook}. The key strength of this line is clear probabilistic grounding; its key limitation in internet-scale source retrieval is that ratio error can vary dramatically across candidate sources, and that variation is not typically integrated into selection scores. Recent regression-focused analyses reinforce the need for explicit stability handling. 
Under model misspecification, importance weighting may become crucial, but finite-sample behavior can still degrade when weights are heavy-tailed or regularization is not tuned jointly with shift correction \citep{Gogolashvili2023NeedIW,Kanagawa2023Regularization,Feng2024Quantile}. Technical reports focused on IWLS sample complexity similarly emphasize effective sample size and conditioning as central practical determinants of error \citep{RICAM2021IWLS}. Our method directly operationalizes these observations by incorporating ESS and conditioning penalties in the selection objective. \subsection{Adaptation Bounds, Discrepancy, and Multi-Source Transfer} Generalization bounds for domain adaptation decompose target risk into source risk plus discrepancy-like terms and irreducible joint error \citep{BenDavid2010Theory,Mansour2009Discrepancy}. Multi-source theory then introduced distribution-weighted combinations and divergence-sensitive guarantees \citep{Mansour2010Renyi}. These frameworks justify using unlabeled discrepancy diagnostics, but they do not by themselves specify robust finite-sample source ranking rules for weighted least-squares regression. The discrepancy family itself has multiple strengths and weaknesses. MMD provides a nonparametric two-sample criterion with concentration guarantees \citep{Gretton2012MMD}, and margin-disparity variants can align better with task loss in some settings \citep{Zhang2019MDD}. Yet discrepancy-only ranking can misorder sources when ratio-estimation noise or covariance conditioning dominates downstream regression error. This motivates combining discrepancy with explicit uncertainty and stability diagnostics rather than treating it as a standalone objective. From an optimization viewpoint, discrepancy signals are attractive because they can be estimated from unlabeled covariates and are typically smooth enough for gradient-based tuning. 
From a statistical viewpoint, however, they are surrogates for target risk, and surrogate mismatch can be substantial when source conditionals differ in subtle ways or when ratio estimators amplify noise in low-density target regions. A central design principle in this manuscript is therefore to treat discrepancy as informative but incomplete evidence. The composite objective should include discrepancy, not replace everything with discrepancy. \subsection{Deep Alignment Baselines and Shift-Type Ambiguity} Adversarial and discrepancy-based deep adaptation methods, including DANN, CDAN, DAN, CORAL, ADDA, and MCD, establish strong baselines for representation transfer \citep{Ganin2016DANN,Long2018CDAN,Long2015DAN,Sun2016DeepCORAL,Tzeng2017ADDA,Saito2018MCD}. Semi-supervised and weighted variants such as AdaMatch and importance-weighted adversarial designs further improve robustness in many benchmarks \citep{Berthelot2022AdaMatch,Li2020IWCA}. Their strength is flexibility across high-dimensional domains; their limitation for our setting is protocol mismatch: they often assume end-to-end representation learning and classification-centric evaluation, while internet-scale dataset retrieval for IWLS requires source ranking without target labels and with explicit numerical diagnostics. Another limitation is shift-type ambiguity. Conditional or label shift can invalidate pure covariate-shift corrections and mislead discrepancy objectives \citep{Combes2020ConditionalShift}. We address this with a pre-selection gate that uses disagreement and tail diagnostics before weighted fitting. \subsection{Benchmark Rigor and Practical Infrastructure} Benchmark studies on time-series adaptation and out-of-distribution robustness show that model-selection leakage and inconsistent protocols can overshadow algorithmic differences \citep{Ragab2023AdaTime,Fawaz2023Benchmark,Koh2021WILDS,Zhao2023TTAPitfalls}. 
Multi-source and causal time-series methods suggest promising architectural directions but still provide limited regression-first evidence for source retrieval under strict no-target-label constraints \citep{Lu2024CauDiTS,Stojanov2023CALDA,Wang2024POND,He2023Raincoat,Yang2025SourceFreeTSDA}. Repositories such as AdaTime and UDA-4-TSC offer reproducible infrastructure, yet direct regression-ready source-selection pipelines remain sparse \citep{AdaTimeRepo2022,UDA4TSCRepo2023,WildsRepo2021,Guo2018MoE,Cui2020MultiSourceAttention}. The resulting gap is precise: we need a source-selection objective that is label-free, theoretically interpretable, finite-sample aware, and benchmark-rigorous. The remainder of this manuscript develops such an objective and evaluates it under leakage-safe protocols. \section{Problem Setting and Notation} We introduce symbols in context below and consolidate them in Table~\ref{tab:notation} for cross-section traceability. We study unsupervised domain adaptation for squared-loss regression. Let \((\mathcal{X},\mathcal{Y})\) denote input-output spaces with \(\mathcal{X}\subseteq \mathbb{R}^d\) and \(\mathcal{Y}\subseteq \mathbb{R}\). We observe a target unlabeled sample \(\train_T^X=\{\vx_j^T\}_{j=1}^{n_T}\sim p_T(x)\), and a candidate pool of labeled source datasets \(\{\train_k\}_{k\in\mathcal{K}}\), where \(\train_k=\{(\vx_i^k,y_i^k)\}_{i=1}^{n_k}\sim p_{S_k}(x,y)\). The target risk of predictor \(f\) is \begin{equation} R_T(f) := \mathbb{E}_{(X,Y)\sim P_T}\left[(f(X)-Y)^2\right]. \label{eq:target-risk} \end{equation} The quantity in \eqref{eq:target-risk} is the objective ultimately used to evaluate every method in \secref{sec:results}. Under covariate shift, \(p_{S_k}(y\mid x)=p_T(y\mid x)\) and \(p_{S_k}(x)\neq p_T(x)\), with overlap on target support. The ratio \(w_k(x)=p_T(x)/p_{S_k}(x)\) induces \begin{equation} R_T(f) = \mathbb{E}_{(X,Y)\sim P_{S_k}}\left[w_k(X)(f(X)-Y)^2\right]. 
\label{eq:iw-identity} \end{equation} The identity in \eqref{eq:iw-identity} motivates replacing inaccessible target expectations with weighted source expectations; in practice we use \(\hat w_k\), obtained from unlabeled target covariates and source covariates. For each source \(k\), define the target excess risk relative to the best predictor in the class: \begin{equation} \Delta_k := R_T(\beta_k)-\inf_{\beta}R_T(\beta), \label{eq:delta-def} \end{equation} where \(\beta_k\) is the predictor fitted using source \(k\) with weighted ridge regression, so \eqref{eq:delta-def} defines the latent ranking target. The core source-selection problem is \begin{equation} \hat k\in \arg\min_{k\in\mathcal{K}} \Delta_k, \label{eq:source-selection-ideal} \end{equation} and \eqref{eq:source-selection-ideal} is the ideal but unobservable decision rule because \(\Delta_k\) requires target labels. We therefore optimize an unlabeled surrogate composed of four terms: weighted proxy fit, ratio uncertainty, effective-sample-size penalty, and conditioning penalty. Define nonnegative coefficients \(\alpha,\beta,\gamma\). For each candidate source, we posit the decomposition \begin{equation} \Delta_k = A_k + \alpha B_k + \beta C_k + \gamma D_k, \label{eq:decomposition} \end{equation} where \(A_k\) is the adaptation-benefit proxy, \(B_k\) captures ratio uncertainty, \(C_k\) penalizes low-ESS behavior, and \(D_k\) penalizes ill-conditioned weighted design matrices. \begin{table}[t] \caption{Notation used in the methods section. The table appears after formal definitions so every symbol has contextual meaning. It is included because the methodology uses multiple risk, uncertainty, and optimization objects across nested objectives.} \label{tab:notation} \centering \small \renewcommand{\arraystretch}{1.1} \setlength{\tabcolsep}{4pt} \begin{tabular}{p{0.2\linewidth}p{0.75\linewidth}} \hline Symbol & Meaning \\ \hline \(\mathcal{K}\) & Candidate source index set.
\\ \(\train_k,\train_T^X\) & Labeled source dataset and unlabeled target covariates. \\ \(\hat w_k(x)\) & Estimated importance ratio for source \(k\). \\ \(\Delta_k\) & Target excess risk of source-specific predictor \(\beta_k\). \\ \(A_k,B_k,C_k,D_k\) & Benefit proxy, ratio uncertainty, ESS penalty, conditioning penalty. \\ \(J_k\) & Composite unlabeled ranking score from \eqref{eq:surrogate-score}. \\ \(\pi\in\Delta^{K}\) & Source-mixture weights over \(K=|\mathcal{K}|\) sources. \\ \(U(\pi)\) & Upper-level bilevel objective in \eqref{eq:upper-objective}. \\ \(G_k\) & Mixed-shift gate score for source pre-selection. \\ \hline \end{tabular} \end{table} Table~\ref{tab:notation} summarizes the symbols used across the ranking, gating, and bilevel optimization derivations. Assumptions used throughout are standard in covariate-shift adaptation but made explicit for reproducibility: (i) overlap and finite second moments; (ii) no target labels are used for source selection or hyperparameter choice; (iii) ratio estimators are nonnegative and normalized on source samples; and (iv) ridge parameter \(\lambda>0\) ensures invertible weighted normal matrices. These assumptions are tested indirectly by ESS, condition-number, and mixed-shift diagnostics. \section{Stability-Aware Bilevel Source Selection Method} \label{sec:method} \subsection{Architecture Overview} The proposed pipeline has four modules with distinct responsibilities. Module 1 performs feasibility filtering for schema compatibility and minimum sample-size constraints. Module 2 computes label-free source diagnostics from source covariates, source labels, and unlabeled target covariates; it outputs composite scores \(J_k\) and mixed-shift gate values \(G_k\). Module 3 optimizes source mixtures through a bilevel objective where the lower level solves weighted ridge regression and the upper level balances adaptation proxy risk against stability penalties. 
Module 4 performs final weighted fitting on selected sources and reports evaluation metrics only on held-out target labels. This separation keeps deployment-relevant decisions independent of target supervision and allows direct auditing of each module. The architecture is deliberately modular to support inspection and substitution. For example, the ratio-estimation component can be replaced without changing gate construction, and the upper-level optimizer can be modified without changing the lower-level weighted ridge solver. This compositionality is useful when adapting across domains with different compute budgets: one can choose lightweight estimators for large candidate pools and then refine only shortlisted sources with stronger diagnostics. The module interface also makes errors diagnosable, because one can attribute failures to ranking signals, optimization, or final fitting rather than treating the pipeline as a single black box. \subsection{Composite Ranking and Regret Guarantee} The unlabeled ranking score is \begin{equation} J_k := \widehat{A}_k + \alpha\widehat{B}_k + \beta\widehat{C}_k + \gamma\widehat{D}_k. \label{eq:surrogate-score} \end{equation} Let \begin{equation} \varepsilon_{\mathrm{tot}} := \varepsilon_A + \alpha\varepsilon_B + \beta\varepsilon_C + \gamma\varepsilon_D, \label{eq:eps-tot} \end{equation} where \eqref{eq:eps-tot} is the aggregate estimation-error term that controls the ranking gap in the regret theorem, and component-wise estimation errors satisfy \(|\widehat{A}_k-A_k|\le\varepsilon_A\), \(|\widehat{B}_k-B_k|\le\varepsilon_B\), \(|\widehat{C}_k-C_k|\le\varepsilon_C\), \(|\widehat{D}_k-D_k|\le\varepsilon_D\) for all \(k\in\mathcal{K}\). \begin{theorem}[Uniform Surrogate-Ordering Regret] \label{thm:ordering} Under the assumptions above and nonnegative \(\alpha,\beta,\gamma\), if \(\hat k\in\arg\min_{k\in\mathcal{K}} J_k\), then \begin{equation} \Delta_{\hat k} \le \min_{k\in\mathcal{K}}\Delta_k + 2\varepsilon_{\mathrm{tot}}. 
\label{eq:ordering-bound} \end{equation} \end{theorem} \begin{proof} For any \(k\), absolute-error bounds and nonnegative coefficients imply two-sided sandwiches: \(\Delta_k\le J_k+\varepsilon_{\mathrm{tot}}\) and \(\Delta_k\ge J_k-\varepsilon_{\mathrm{tot}}\). Let \(k^\star\in\arg\min_k\Delta_k\). Then \(\Delta_{\hat k}\le J_{\hat k}+\varepsilon_{\mathrm{tot}}\le J_{k^\star}+\varepsilon_{\mathrm{tot}}\le \Delta_{k^\star}+2\varepsilon_{\mathrm{tot}}\), where the middle inequality uses optimality of \(\hat k\) for \(J_k\). Since \(\Delta_{k^\star}=\min_k\Delta_k\), the claim follows. \end{proof} \Eqref{eq:ordering-bound} translates diagnostic quality into a measurable ranking-regret term: better ratio-uncertainty and stability estimation reduces the gap between surrogate and oracle ordering. In practice, this theorem justifies spending computational budget on improving \(\widehat{B}_k,\widehat{C}_k,\widehat{D}_k\), not only on discrepancy estimation. An immediate practical implication is that clipping and uncertainty-aware ratio estimation are not merely heuristic safeguards; they directly reduce the additive regret constant through \(\varepsilon_{\mathrm{tot}}\). If two candidate scoring systems produce similar discrepancy quality but different ratio-uncertainty quality, the theorem predicts better ranking reliability for the system with the smaller ratio-uncertainty error, even before any target labels are revealed. This is the main reason we center the method on stability-aware terms rather than discrepancy-only nearest-source rules. \subsection{Bilevel Mixture Objective and Convergence} Single-source ranking may discard complementary sources. We therefore optimize mixture weights \(\pi\in\Delta^K\) over retained candidates. The lower-level weighted ridge objective is \begin{equation} \beta^{\star}(\pi)=\arg\min_{\beta}\sum_{k=1}^{K}\pi_k\frac{1}{n_k}\sum_{i=1}^{n_k}\hat w_k(\vx_i^k)\left(y_i^k-(\vx_i^k)^\top\beta\right)^2+\lambda\|\beta\|_2^2.
\label{eq:lower-level} \end{equation} With \(X_k\) and \(y_k\) stacking the source-\(k\) covariates and labels, \(W_k:=\frac{1}{n_k}\operatorname{diag}\big(\hat w_k(\vx_1^k),\dots,\hat w_k(\vx_{n_k}^k)\big)\), \(A(\pi)=\sum_k\pi_kX_k^\top W_kX_k\), and \(b(\pi)=\sum_k\pi_kX_k^\top W_ky_k\), stationarity yields \begin{equation} \beta^{\star}(\pi)=\left(A(\pi)+\lambda I\right)^{-1}b(\pi). \label{eq:closed-form-beta} \end{equation} The optimization problem in \eqref{eq:lower-level} is the decision layer solved at every upper-level iterate, while \eqref{eq:closed-form-beta} is its stationarity solution. The upper-level objective is \begin{equation} U(\pi)=\widehat{R}_{\mathrm{proxy}}(\beta^{\star}(\pi),\pi)+\eta\widehat{D}(P_{S_\pi}^{X},P_T^{X})+\rho\Omega_{\mathrm{stab}}(\pi), \label{eq:upper-objective} \end{equation} with stability regularizer \(\Omega_{\mathrm{stab}}\) combining ESS and conditioning penalties. The upper-level decision variable is \(\pi\in\Delta^K\), and the explicit optimality criterion is \(\pi^\star\in\arg\min_{\pi\in\Delta^K}U(\pi)\) in \eqref{eq:upper-objective}. \begin{lemma}[Unique Lower-Level Minimizer] \label{lem:lower-unique} For any feasible \(\pi\) and \(\lambda>0\), the lower-level objective in \eqref{eq:lower-level} is strictly convex in \(\beta\), and therefore has a unique minimizer given by \eqref{eq:closed-form-beta}. \end{lemma} \begin{proof} The Hessian of the objective in \eqref{eq:lower-level} is \(H(\pi)=2(A(\pi)+\lambda I)\). For any nonzero \(v\in\mathbb{R}^d\), \( v^\top H(\pi)v=2v^\top A(\pi)v+2\lambda\|v\|_2^2\ge 2\lambda\|v\|_2^2>0 \), because \(A(\pi)\) is positive semidefinite when each \(W_k\) has nonnegative diagonal and \(\lambda>0\). Thus \(H(\pi)\) is positive definite, strict convexity holds, and the first-order condition yields \eqref{eq:closed-form-beta} as the unique minimizer. \end{proof} \begin{theorem}[Linear-Rate Upper-Level Descent] \label{thm:convergence} Assume \(U\) is differentiable, \(L\)-smooth, and \(\mu\)-strongly convex on the interior region visited by \begin{equation} \pi_{t+1}=\pi_t-\frac{1}{L}\nabla U(\pi_t).
\label{eq:upper-step} \end{equation} Then \begin{equation} U(\pi_t)-U(\pi^{\star})\le(1-\mu/L)^t\left(U(\pi_0)-U(\pi^{\star})\right). \label{eq:linear-rate} \end{equation} \end{theorem} \begin{proof} By \(L\)-smoothness, setting \(y=x-(1/L)\nabla U(x)\) gives \(U(y)\le U(x)-\frac{1}{2L}\|\nabla U(x)\|_2^2\). For differentiable \(\mu\)-strongly convex \(U\), the Polyak--\L{}ojasiewicz inequality gives \(\|\nabla U(x)\|_2^2\ge2\mu(U(x)-U(\pi^{\star}))\). Combining both inequalities yields \(U(y)-U(\pi^{\star})\le(1-\mu/L)(U(x)-U(\pi^{\star}))\). Applying this recursively with \(x=\pi_t\), \(y=\pi_{t+1}\) proves \eqref{eq:linear-rate}. \end{proof} \Eqref{eq:closed-form-beta} is used to compute deterministic lower-level updates, while \eqref{eq:linear-rate} motivates early stopping and step-size policies in practice. The convergence theorem should be interpreted as an algorithmic guarantee on objective optimization, not as an automatic guarantee on target-risk superiority over all baselines. This distinction is crucial for empirical interpretation in \secref{sec:results}: even when optimization converges rapidly, predictive ranking quality can remain sensitive to surrogate design and ratio uncertainty. We therefore report both optimization-consistent diagnostics and downstream prediction metrics. \subsection{Mixed-Shift Gate Before IWLS Fitting} Discrepancy and ratio estimates can both fail when candidate sources exhibit mixed shifts. We define a gate \begin{equation} G_k=a_1V_k+a_2U_k+a_3T_k, \label{eq:gate-score} \end{equation} where \(V_k\) is cross-discrepancy variance, \(U_k\) is ratio disagreement between estimators, and \(T_k\) is a tail-risk indicator. Feasible candidates are \begin{equation} \mathcal{F}_\tau:=\{k\in\mathcal{K}: G_k\le \tau\}. 
\label{eq:gate-feasible} \end{equation} \begin{theorem}[Deterministic Gate Separation] \label{thm:gate} Assume nonnegative coefficients \(a_1,a_2,a_3\), nonnegative diagnostics, safe-set upper bounds \((\bar V_s,\bar U_s,\bar T_s)\), harmful-set lower bounds \((\underline V_h,\underline U_h,\underline T_h)\), and strict separation \(g_{\mathrm{safe}}<g_{\mathrm{harm}}\), where \(g_{\mathrm{safe}}:=a_1\bar V_s+a_2\bar U_s+a_3\bar T_s\) and \(g_{\mathrm{harm}}:=a_1\underline V_h+a_2\underline U_h+a_3\underline T_h\). Then for any threshold \(\tau\in[g_{\mathrm{safe}},g_{\mathrm{harm}})\), the feasible set \(\mathcal{F}_\tau\) contains every safe source and no harmful source. \end{theorem} \begin{proof} For a safe source \(k\), nonnegativity of the coefficients and the safe-set upper bounds give \(G_k\le a_1\bar V_s+a_2\bar U_s+a_3\bar T_s=g_{\mathrm{safe}}\le\tau\), so \(k\in\mathcal{F}_\tau\). For a harmful source \(k\), the lower bounds give \(G_k\ge a_1\underline V_h+a_2\underline U_h+a_3\underline T_h=g_{\mathrm{harm}}>\tau\), so \(k\notin\mathcal{F}_\tau\). Hence separation is exact. Monotonicity of feasible sets under \(\tau_1\le\tau_2\) follows immediately from \eqref{eq:gate-feasible}. \end{proof} \begin{lemma}[Feasible-Threshold Existence and Nesting] \label{lem:gate-feasible} If \(g_{\mathrm{safe}}