arXivDaily: arXiv daily academic digest, updated Monday through Friday
All subject categories: 2075
2605.07138 2026-05-11 cs.AI cs.LG

Can You Break RLVER? Probing Adversarial Robustness of RL-Trained Empathetic Agents

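Deeraj S K, Sadhana Devarajan, Krishna Mehra, Sudhakar Mishra

Affiliations * Department of Artificial Intelligence, Sardar Vallabhbhai National Institute of Technology

AI summary This paper builds the Adversarial Empathy Benchmark (AEB) and introduces the Emotional Consistency Score (ECS) to evaluate the robustness of RL-trained empathetic agents under adversarial conditions. RLVER-PPO-Think outperforms the baseline on emotional responsiveness, but ECS shows no significant improvement, indicating that RL training strengthens emotional responsiveness without improving observable state tracking.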

Abstract

Reinforcement learning from verifiable emotion rewards (RLVER) has produced language models with strong empathetic performance, evaluated on benchmarks that assume cooperative, honest users. Yet real emotional interactions systematically violate this assumption: users gaslight, escalate, and pressure AI systems for unconditional validation, dynamics that cooperative benchmarks cannot surface. We construct the Adversarial Empathy Benchmark (AEB) and introduce the Emotional Consistency Score (ECS) to evaluate empathetic robustness under adversarial conditions. AEB comprises six psychologically grounded adversarial trajectory types with discriminative reward structures that penalize formulaic responses; ECS formally disentangles a model's capacity to track user emotional states from its capacity to improve them. In a controlled experiment across eight scenario-matched conditions (two RLVER models and two base models, Qwen 1.5B and 7B, each in think and no-think modes; 480 adversarial dialogues in total), RLVER-PPO-Think substantially outperforms the same-scale untuned baseline (0.963 vs. 0.761, \(p<0.001, r=0.688\)), with zero dialogue collapses and 47% higher hidden-intention detection. However, ECS remains nearly flat and is not significantly different for RLVER-PPO-Think versus Base-7B-Think (\(p=0.650\)): RL training improves emotional responsiveness without measurable gains in observable state tracking. We interpret the ECS--FS (Final Score) gap as a behavioral/legibility dissociation inside this simulator family, not as evidence about internal understanding or clinical readiness.

2605.07137 2026-05-11 cs.LG cs.AI

Adaptive Negative Reinforcement for LLM Reasoning: Dynamically Balancing Correction and Diversity in RLVR

Yash Ingle, Jaival Chauhan, Ankit Yadav, Sudhakar Mishra

Affiliations * Sardar Vallabhbhai National Institute of Technology (SVNIT)

AI summary This paper proposes adaptive negative reinforcement, which uses time-dependent scheduling functions to dynamically balance correction and diversity, with the aim of improving LLM reasoning.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a highly effective method for improving the reasoning abilities of Large Language Models (LLMs). Recent research shows that Negative Sample Reinforcement (NSR), which focuses on penalizing incorrect steps rather than simply rewarding correct ones, can match or even exceed the performance of more complex frameworks like PPO and GRPO across the entire Pass@k spectrum. However, current NSR techniques usually apply a fixed penalty throughout the training process and treat every incorrect response with the same weight. To address these limitations, we propose two extensions to the NSR framework. The first is Adaptive Negative Sample Reinforcement (A-NSR). Rather than using a fixed update rule, A-NSR uses time-dependent scheduling functions. In the initial training phases, the system focuses heavily on correcting errors to stabilize the model. As training continues, it shifts toward more subtle and controlled updates. The second is Confidence-Weighted Negative Reinforcement (CW-NSR), which operates on the principle that different mistakes carry different levels of importance. CW-NSR assigns specific penalty weights based on the model's normalized sequence likelihood. If the model is highly confident in a wrong path, it receives a larger penalty, whereas uncertain errors, where the model is effectively exploring, are penalized less strictly. Our formal analysis shows how these mechanisms govern token-level updates, allowing the model to leverage prior-guided probability redistribution while providing a natural defense against overfitting. We evaluated these methods on difficult reasoning datasets, including MATH, AIME 2025, and AMC23, using the Qwen2.5-Math-1.5B architecture.
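
The two ideas in this abstract, a penalty that is scheduled over training time and a penalty that scales with the model's confidence in a wrong answer, can be sketched in a few lines. The abstract does not give the actual schedule or weighting formula, so the cosine schedule, the `start`/`end` values, and the way the two factors are multiplied are all assumptions made for illustration only.

```python
import math

def anneal_penalty(step, total_steps, start=1.0, end=0.2):
    """Cosine schedule from `start` to `end` (hypothetical schedule choice)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * frac))

def confidence_weight(token_logprobs):
    """Length-normalized sequence likelihood in (0, 1]: exp(mean token log-prob)."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def nsr_penalty(step, total_steps, token_logprobs):
    """Penalty coefficient for one incorrect response (assumed product form)."""
    return anneal_penalty(step, total_steps) * confidence_weight(token_logprobs)
```

With this sketch, a confidently wrong response early in training receives the largest coefficient, and an uncertain error late in training receives the smallest.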

2605.07134 2026-05-11 cs.CL cs.AI

Region4Web: Rethinking Observation Space Granularity for Web Agents

Donguk Kwon, Dongha Lee

Affiliations * Yonsei University

AI summary Region4Web reorganizes the AXTree into functional regions, giving web agents a more compact and informative observation basis that improves task success rate.

Abstract

Web agents perceive web pages through an observation space, yet its granularity has remained an underexamined design choice. Existing work treats observation at the same element-level granularity as the action space, leaving the page's functional organization implicit and forcing the agent to infer it from element-level signals at every step. We argue observation should instead operate at the granularity of functional regions, parts of the page that each serve a distinct purpose. We propose Region4Web, a framework that reorganizes the AXTree into functional regions through hierarchical decomposition and semantic abstraction, exposing the page's functional organization as the basis for page state understanding. Moreover, we propose PageDigest, a web-specific inference pipeline that delivers this region-level observation to the actor agent as a compact per-page digest that persists across steps. On the WebArena benchmark, PageDigest substantially reduces observation length while improving overall task success rate across diverse backbone large language models (LLMs) and established agent methods, regardless of backbone capacity. These results show that operating at the granularity of functional regions delivers a more compact and informative basis for the actor agent than element-level processing alone.

2605.07133 2026-05-11 cs.LG cs.AI

GAD in the Wild: Benchmarking Graph Anomaly Detection under Realistic Deployment Challenges

Jingjing Zhou, Shiyu Huang, Qing Qing, Zuquan Yuan, Huafei Huang, Ziqi Xu, Mingliang Hou, Xikun Zhang, Renqiang Luo, Ivan Lee

Affiliations * Zhejiang Gongshang University; Jilin University; Adelaide University; RMIT University; Jinan University

AI summary This paper presents a multi-dimensional benchmark that evaluates graph anomaly detection models under realistic challenges (large-scale graphs, extreme anomaly scarcity, and missing node attributes) and reveals the scalability and robustness shortcomings of existing methods.

Abstract

Graph Anomaly Detection (GAD) is a critical task in graph machine learning with vital applications in financial fraud detection and social platform governance. However, existing GAD benchmarks are often restricted to small-scale, curated graphs with relatively balanced anomaly ratios, leaving a substantial gap between academic evaluation and real-world deployment. To bridge this gap, we present a multi-dimensional benchmark that systematically evaluates GAD models under three deployment-relevant challenges: million-scale graphs, extreme anomaly scarcity, and missing node attributes. We derive a family of controlled benchmark variants from five diverse graphs, including two native industrial-scale datasets with over 3.7 million nodes. Our extensive evaluation of nine representative GAD models reveals three major limitations: (1) most GNN-based methods fail to scale to million-node graphs due to prohibitive memory requirements; (2) detection performance drops sharply under realistic anomaly ratios (e.g., 0.1%), often resulting in zero recall; and (3) reconstruction-based models are highly sensitive to attribute imputation strategies. Our findings suggest that strong performance in laboratory settings does not guarantee robustness in production environments. We release this benchmark and empirical evaluation as a diagnostic testbed to promote the development of robust and scalable GAD systems for large-scale, imperfect graphs encountered in practice. Code is available at https://anonymous.4open.science/r/Benchmark_GAD-E7A3.
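
To make the "extreme anomaly scarcity" setting concrete, the sketch below builds a scarce variant of a labelled node set by subsampling anomalies down to a target ratio, and computes recall on the result. This is not the benchmark's released code: the function names, the 0/1 label encoding, and the choice to keep every normal node are assumptions, and a real graph variant would also have to drop the edges of removed nodes.

```python
import random

def subsample_anomalies(labels, target_ratio, seed=0):
    """Return node indices to keep so anomalies are about `target_ratio` of them."""
    rng = random.Random(seed)
    normal = [i for i, y in enumerate(labels) if y == 0]
    anomalous = [i for i, y in enumerate(labels) if y == 1]
    # Solve a / (a + n) = ratio for a, keeping at least one anomaly.
    keep = max(1, round(target_ratio * len(normal) / (1.0 - target_ratio)))
    keep = min(keep, len(anomalous))
    return sorted(normal + rng.sample(anomalous, keep))

def recall(y_true, y_pred):
    """Fraction of true anomalies that were flagged."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    positives = sum(y_true)
    return true_pos / positives if positives else 0.0
```

At a 0.1% ratio, a detector that flags nothing still has very high accuracy, which is why the abstract reports recall: predicting all zeros gives a recall of exactly zero.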

2605.07130 2026-05-11 cs.LG cs.DS

Simple KNN-Based Outlier Detection Achieves Robust Clustering

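Tianle Jiang, Yufa Zhou

Affiliations * Duke University

AI summary This paper proves that, under a practical assumption on the optimal cluster sizes, simply removing points with large KNN distances yields a constant-factor reduction from robust k-Means to standard k-Means, and shows empirically that the method matches or outperforms more sophisticated algorithms in clustering cost and runtime.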

Comments Code: https://github.com/MasterZhou1/Robust-Clustering

Abstract

Being robust to the presence of outliers is crucial for applying clustering algorithms in practice. In the $\textit{robust $k$-Means}$ problem (i.e., $k$-Means with outliers), the goal is to remove $z$ outliers and minimize the $k$-Means cost on the remaining points. Despite the close connection between robust $k$-Means and outlier detection, both theoretical and empirical understanding of the effectiveness of $\textit{classic outlier detection heuristics}$ for robust $k$-Means remains limited. In this paper, we prove that under a practical assumption on the optimal cluster sizes, simply removing points with large $K$-Nearest-Neighbor distances achieves performance comparable to prior work in terms of approximation guarantees: it yields a constant-factor reduction from robust $k$-Means to standard $k$-Means, without introducing additional centers or discarding extra outliers, as is commonly required by existing approaches. Empirically, experiments on real-world datasets show that our method outperforms or matches several more sophisticated algorithms in terms of clustering cost and runtime. These results demonstrate that simple KNN-based heuristics can be surprisingly effective for robust clustering, highlighting new opportunities to bridge techniques from outlier detection and clustering.
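
The heuristic described here, removing the points with the largest $K$-Nearest-Neighbor distance before clustering, can be sketched with a brute-force distance computation. This is an illustrative sketch, not the authors' implementation (their code is linked in the Comments line above); the function names are hypothetical, and any standard k-Means routine would then be run on the returned points.

```python
import math

def knn_distance(points, i, K):
    """Distance from points[i] to its K-th nearest neighbour (brute force)."""
    dists = sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
    return dists[K - 1]

def remove_knn_outliers(points, z, K):
    """Drop the z points with the largest K-NN distance; return the rest in order."""
    order = sorted(range(len(points)), key=lambda i: knn_distance(points, i, K))
    kept = sorted(order[: len(points) - z])
    return [points[i] for i in kept]
```

The brute-force version costs quadratic time in the number of points; a spatial index would be the usual replacement at scale.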

2605.07127 2026-05-11 cs.LG cs.CL

The Position Curse: LLMs Struggle to Locate the Last Few Items in a List

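Zhanqi Zhang, Hua-Dong Xiong, Robert C. Wilson, Mikio Aoi, Marcelo G. Mattar, Li Ji-An

Affiliations * UC San Diego; Georgia Tech; New York University

AI summary LLMs excel at locating a single relevant item in a very long context but fail to locate the last few items in a short list, a failure the authors call the Position Curse. Fine-tuning on the PosBench dataset improves position-based retrieval, but performance remains far from saturated.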

Abstract

Modern large language models (LLMs) can find a needle in a haystack (locating a single relevant fact buried among hundreds of thousands of irrelevant tokens) with near-saturated accuracy, yet fail to retrieve the last few items in a short list. We call this failure the Position Curse. For instance, even in a two-line code snippet, Claude Opus 4.6 misidentifies the second-to-last line most of the time. To characterize this failure, we evaluated two complementary queries: given a position in a sequence (of letters or words), retrieve the corresponding item; and given an item, return its position. Each position is specified as a forward or backward offset from an anchor, either an endpoint of the list (its start or end) or another item in the list. Across both open-source and frontier closed-source models, backward retrieval substantially lags forward retrieval. To test whether this capability can be rescued by post-training, we constructed PosBench, a position-focused training dataset. LoRA fine-tuning improves both forward and backward retrieval and generalizes to a held-out code-understanding benchmark (PyIndex), yet absolute performance remains far from saturated. As LLM coding agents increasingly operate over large codebases where precise indexing becomes essential for code understanding and editing, position-based retrieval emerges as a key capability for future pretraining objectives and model design.
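
The two complementary queries (position to item, and item to position) with forward or backward offsets from a list endpoint can be written down as ground-truth functions, which is what a model's answers would be scored against. This is a sketch under assumptions: the function names and the 1-based offset convention are hypothetical and are not taken from PosBench, and offsets relative to another item in the list are not covered.

```python
def item_at(items, offset, anchor="start"):
    """Item at a 1-based offset counted forward from the start or backward from the end."""
    if anchor == "start":
        return items[offset - 1]
    if anchor == "end":
        return items[-offset]
    raise ValueError("anchor must be 'start' or 'end'")

def position_of(items, item, anchor="start"):
    """1-based position of `item`, counted from the chosen endpoint."""
    idx = items.index(item)
    return idx + 1 if anchor == "start" else len(items) - idx
```

For the list a, b, c, d, e, the second-to-last item is `item_at(items, 2, "end")`, the kind of backward query on which the abstract reports that models lag.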

2605.07123 2026-05-11 cs.LG

Convergence and Emergence of In-Context Reinforcement Learning with Chain of Thought

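Zixuan Xie, Xinyu Liu, Rohan Chandra, Shangtong Zhang

Affiliations * University of Virginia

AI summary This paper studies how Chain-of-Thought strengthens in-context reinforcement learning. Analyzing a linear Transformer, it proves that CoT generation is equivalent to repeated temporal difference learning updates and that the policy evaluation error decreases geometrically with CoT length.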

Abstract

In-context reinforcement learning (ICRL) refers to the ability of RL agents to adapt to new tasks at inference time without parameter updates by conditioning on additional context. Recent empirical studies further demonstrate that Chain-of-Thought (CoT) generation can amplify this ICRL capability. This paper is the first to provide a theoretical understanding of how CoT interacts with ICRL. We conduct our analysis in a policy evaluation setup with a linear Transformer. We prove that with specific Transformer parameters, the CoT generation process is equivalent to repeatedly executing temporal difference learning updates. Additionally, we provide a finite-sample convergence analysis showing that the policy evaluation error decreases geometrically with CoT length and eventually saturates at a statistical floor determined by the context length. We also prove that the desired Transformer parameters are a global minimizer of the pretraining loss, providing a theoretical understanding of the empirical emergence of those parameters.
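
For readers unfamiliar with the update the abstract refers to, the sketch below shows textbook TD(0) policy evaluation with linear features, applied repeatedly to a fixed set of transitions. How the paper maps one such sweep onto a CoT step inside a linear Transformer is not stated in the abstract, so the correspondence is left out; the step size, discount, and function name are assumptions.

```python
def td0_pass(w, transitions, alpha=0.1, gamma=0.9):
    """One sweep of TD(0) updates over (phi_s, reward, phi_next) tuples."""
    w = list(w)
    for phi, reward, phi_next in transitions:
        value = sum(a * b for a, b in zip(w, phi))
        value_next = sum(a * b for a, b in zip(w, phi_next))
        delta = reward + gamma * value_next - value  # TD error
        w = [wi + alpha * delta * pi for wi, pi in zip(w, phi)]
    return w
```

On a deterministic two-state chain with one-hot features (state 0 moves to state 1 with reward 0, state 1 loops on itself with reward 1), repeated sweeps shrink the error by a constant factor each time, which is the geometric behaviour the abstract describes for CoT length.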

2605.07120 2026-05-11 cs.LG stat.ML

When Symbol Names Should Not Matter: A Logistic Theory of Fresh-Symbol Classification

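Wenjie Guan, Jelena Bradic

Affiliations * Department of Statistics and Data Science, Cornell University, Ithaca, NY 14850

AI summary This paper studies fixed-label classification, asking how a model can learn a decision rule that is invariant to symbol renaming. By analyzing regularized kernel logistic classification, it develops a theoretical framework for fresh-symbol classification.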

Abstract

Template tasks have emerged as a clean testbed for asking whether transformers reason with abstract symbols rather than concrete token names. We study the fixed-label classification version of this problem, where train and test examples share latent templates but may use disjoint vocabularies. Unlike next-token prediction, the model need not emit unseen symbols; it must learn a decision rule invariant to symbol renaming. We analyze regularized kernel logistic classification in the transformer-kernel regime. Our main result decomposes the learned predictor into an ideal template-level classifier and a finite-sample perturbation caused by accidental token overlaps in the training data. We encode these overlaps by a colored collision graph and prove high-probability margin-transfer guarantees for fresh-symbol classification. This perspective extends template-based analyses to logistic classification and refines scalar diversity conditions: vocabulary size controls the average rate of collisions, but collision geometry controls whether the ideal classification margin is preserved. More broadly, the same perturbation framework applies to abstraction-augmented inputs, yielding a general margin-versus-collision criterion for identifying when prompting strategies improve fresh-symbol generalization. Synthetic template experiments illustrate the predicted roles of regularization, sample size, and transformer-kernel structure.

2605.07116 2026-05-11 cs.LG cs.AI cs.NA math.NA math.OC

Stabilized neural Hamilton--Jacobi--Bellman solvers: Error analysis and applications in model-based reinforcement learning

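Minseok Kim, Yeongjong Kim, Namkyeong Cho, Yeoneung Kim

Affiliations * Seoul National University of Science and Technology; POSTECH; Gachon University

AI summary This paper studies stabilized neural HJB solvers, develops an error analysis for a hybrid finite-difference and collocation regime, and applies the solvers to model-based reinforcement learning, examining their stability under model error and policy mismatch.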

Abstract

Physics-informed neural solvers offer a promising route to model-based reinforcement learning in continuous time, where optimal feedback synthesis is governed by Hamilton--Jacobi--Bellman (HJB) equations. Practical implementations often occupy a regime that is neither a classical grid method nor a continuous-PDE PINN: the value function is represented by a neural network, finite-difference HJB policy-evaluation operators are evaluated by network queries at shifted points, and residuals are minimized by random continuous collocation. This regime preserves the stabilized finite-difference policy-evaluation structure while avoiding grid-based value unknowns. We develop an error theory for this hybrid regime. Interpreting finite differences as shift operators acting on neural networks, we prove a population $L^2$ stability estimate for one policy-evaluation step with learned dynamics. The bound separates residual error, initial and exterior-collar mismatch, policy mismatch, and model-identification error, with an explicit gradient amplification factor for learned dynamics, while the underlying linear evaluation stability remains free of hidden inverse-viscosity blow-up. We further give a finite-sample collocation certificate and a conditional multi-step propagation result through greedy policy improvement. Experiments on compact-control LQR up to 64 dimensions, Allen--Cahn control, pendulum, Hopper, and 3D quadrotor benchmarks compare against representative model-based and model-free RL baselines, demonstrating the predicted residual, policy-mismatch, and learned-model error trends.

2605.07115 2026-05-11 cs.LG stat.ML

Conformal-Style Quantile Analyses for Stochastic Bandits

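Chengyu Du, Mengfan Xu

Affiliations * University of Massachusetts Amherst

AI summary This paper proposes ACP-UCB1, which combines an adaptive conformal estimate with a UCB-type optimism bonus to target upper-tail performance in stochastic bandits, and achieves logarithmic upper-quantile regret.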

Abstract

Stochastic bandit algorithms are usually analyzed under a mean-reward criterion, yet many problems favor arms with strong upper-tail performance, which we study herein. For a fixed miscoverage level \(α\), the natural upper-tail target of arm \(j\) is the upper endpoint \(F_j^{-1}(1-α/2)\) of a central prediction interval. This target can rank arms differently from their means, creating a central mismatch with the classical bandit objective. To this end, we propose ACP-UCB1, a conformal-style policy that combines an adaptive conformal estimate of the upper endpoint with a UCB-type optimism bonus. The technical challenge is that the conformity scores used by ACP-UCB1 are recomputed from evolving empirical quantile estimates and evaluated at an adaptive level. We control this endpoint through reward-quantile concentration, a perturbation argument for recomputed score quantiles, and deterministic localization of the adaptive level. ACP-UCB1 achieves logarithmic upper-quantile regret with per-arm contribution \(O(\nicefrac{\log n}{Δ_j^{\mathrm{ACP}}})\). We also provide metric-specific regret decompositions comparing ACP-UCB1 with UCB1 and use numerical experiments to validate performance and improvement.
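
The mismatch between mean ranking and upper-quantile ranking is easy to see in code. The sketch below is a plain quantile index with a UCB1-style bonus; it is not ACP-UCB1, because the adaptive conformal step (recomputed conformity scores and the adaptive level) is not specified in the abstract and is omitted. The bonus form and all function names are assumptions.

```python
import math

def empirical_quantile(samples, q):
    """Order-statistic estimate of the q-quantile (no interpolation)."""
    xs = sorted(samples)
    k = min(len(xs) - 1, max(0, math.ceil(q * len(xs)) - 1))
    return xs[k]

def quantile_ucb_index(samples, t, alpha=0.1):
    """Estimate of F^{-1}(1 - alpha/2) plus a UCB1-style bonus at round t."""
    bonus = math.sqrt(2.0 * math.log(t) / len(samples))
    return empirical_quantile(samples, 1.0 - alpha / 2.0) + bonus

def choose_arm(history, t, alpha=0.1):
    """Index of the arm with the largest optimistic upper-quantile index."""
    return max(range(len(history)), key=lambda j: quantile_ucb_index(history[j], t, alpha))
```

An arm that always pays 0.5 has the higher mean, yet an arm that pays 1.0 one time in five and 0 otherwise has the higher upper endpoint, so this index prefers the second arm.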

2605.07114 2026-05-11 cs.LG

Where to Spend Rollouts: Hit-Utility Optimal Rollout Allocation for Group-Based RLVR

Tao Wang, Shuo Li, Yan Sun, Dongsheng Ding, Edgar Dobriban

Affiliations * University of Pennsylvania; New Jersey Institute of Technology; University of Tennessee Knoxville

AI summary This paper proposes HORA, which allocates rollouts to maximize hit utility and thereby improves the efficiency of RLVR. Experiments show that it improves Pass@K over compute-matched GRPO in most benchmark configurations and is compatible with other group-based estimators.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has emerged as a central paradigm for improving the reasoning capabilities of large language models. Group-based policy optimization methods, such as GRPO, typically allocate a fixed number of rollouts to every prompt. This uniform allocation can be inefficient: it over-allocates compute to prompts whose sampled groups are already saturated while under-exploring prompts for which additional samples may reveal useful correct trajectories. To address this limitation, we introduce hit utility, the posterior probability that at least one rollout in a proposed additional allocation for a prompt will be correct. Building on this notion, we propose Hit-Utility Optimal Rollout Allocation (HORA), a learning-free rollout allocation policy that maximizes total posterior hit utility within each allocation batch. HORA adaptively reallocates rollout budgets while leaving the downstream reward evaluation and group-based advantage estimator unchanged. Across four mathematical reasoning benchmarks and three model scales, HORA preserves comparable Pass@1 and improves Pass@K over compute-matched GRPO in ten of twelve model--benchmark configurations, with one tie and one saturated exception. It is also drop-in compatible with other group-based estimators such as RLOO. Ablation studies indicate that the uniform prior used by HORA is competitive with five prompt-conditioned learned-prior alternatives.
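
Hit utility has a closed form if one assumes a Beta-Bernoulli model for each prompt's success rate. The abstract mentions a uniform prior but not the likelihood model or the optimizer, so the Beta(1+s, 1+f) posterior and the greedy allocation below are assumptions made for illustration, and the function names are hypothetical. Under that model, the chance that all m new rollouts miss is the product of (b+i)/(a+b+i) for i from 0 to m-1.

```python
import heapq

def hit_utility(successes, failures, m):
    """P(at least one of m new rollouts is correct) under a Beta(1+s, 1+f) posterior."""
    a, b = 1 + successes, 1 + failures
    miss = 1.0
    for i in range(m):
        miss *= (b + i) / (a + b + i)
    return 1.0 - miss

def allocate(stats, budget):
    """Greedily hand out `budget` rollouts by largest marginal gain in hit utility."""
    alloc = [0] * len(stats)
    heap = [(-hit_utility(s, f, 1), j) for j, (s, f) in enumerate(stats)]
    heapq.heapify(heap)
    for _ in range(budget):
        _, j = heapq.heappop(heap)
        alloc[j] += 1
        s, f = stats[j]
        gain = hit_utility(s, f, alloc[j] + 1) - hit_utility(s, f, alloc[j])
        heapq.heappush(heap, (-gain, j))
    return alloc
```

Because each extra rollout on the same prompt adds less hit probability than the previous one, greedy selection by marginal gain maximizes the summed utility for a fixed budget in this model. A prompt that is already solved every time needs only one rollout to be nearly certain of a hit, so the remaining budget flows to prompts of intermediate difficulty.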

2605.07113 2026-05-11 cs.LG math.OC

Solving Max-Cut to Global Optimality via Feasibility-Preserving Graph Neural Networks

Hao Chen, Chendi Qian, Christopher Morris, Andrea Lodi, Can Li

Affiliations * Davidson School of Chemical Engineering, Purdue University; Faculty of Computer Science, RWTH Aachen University; Jacobs Technion-Cornell Institute, Cornell Tech

AI summary This paper proposes a Max-Cut-specific graph neural network as a lightweight replacement for SDP solvers, which substantially reduces the computational cost of solving Max-Cut exactly.

Abstract

Exact solution of hard combinatorial optimization problems often relies on strong convex relaxations, but solving these relaxations repeatedly inside a branch-and-bound algorithm can be prohibitively expensive. Hence, we consider this challenge for Max-Cut, where branch and bound commonly uses semidefinite programming (SDP) relaxations to bound subproblems. We propose a Max-Cut-specific graph neural network that serves as a principled, lightweight neural proxy for these SDP solvers and can be plugged directly into an exact branch-and-bound framework. The proposed architecture has update steps of complexity $\mathcal{O}(n^2 + ne)$, and predicts both primal- and dual-feasible SDP solutions. The primal SDP solutions yield feasible Max-Cut solutions via the Goemans--Williamson algorithm. In addition, it is trained in a self-supervised fashion without requiring solved SDP relaxations as labels. Empirically, we show that our architecture can substantially reduce the cost of bounding in exact Max-Cut solving by up to $10.6 \times$ compared with using the state-of-the-art SDP solver Mosek. Our work highlights the potential of learned, validity-preserving surrogates for accelerating exact optimization over structured convex relaxations.
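
The step that turns a primal SDP solution into a feasible cut is the Goemans--Williamson random-hyperplane rounding. The sketch below assumes the SDP solution has already been factored into one vector per node (for example by a Cholesky or eigen-decomposition, not shown); the function name, the number of trials, and the edge format `(i, j, weight)` are illustrative choices rather than details from the paper.

```python
import random

def gw_round(vectors, edges, trials=50, seed=0):
    """Random-hyperplane rounding: return (best cut weight, side assignment)."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    best = (-1.0, None)
    for _ in range(trials):
        normal = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Each node goes to the side of the hyperplane its vector falls on.
        side = [sum(a * b for a, b in zip(v, normal)) >= 0.0 for v in vectors]
        weight = sum(w for i, j, w in edges if side[i] != side[j])
        if weight > best[0]:
            best = (weight, side)
    return best
```

Every rounded assignment is a valid cut, so its weight is a lower bound on the Max-Cut value, while the SDP objective supplies the upper bound used for pruning in branch and bound.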

2605.07112 2026-05-11 cs.AI cs.MA

Switchcraft: AI Model Router for Agentic Tool Calling

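Sharad Agarwal, Pooria Namyar, Alec Wolman, Rahul Ambavat, Ankur Gupta, Qizheng Zhang

Affiliations * Microsoft Research; Stanford

AI summary Switchcraft is, to the authors' knowledge, the first model router optimized for agentic tool calling. By selecting the lowest-cost model that is still correct, it cuts inference cost by 84% while matching or exceeding the accuracy of the best individual model.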

Abstract

Agentic AI systems that invoke external tools are powerful but costly, leading developers to default to large models and overspend inference budgets. Model routing can mitigate this, but existing routers are designed for chat completion rather than tool use. We present Switchcraft, the first (to the best of our knowledge) model router optimized for agentic tool calling. Switchcraft operates inline, selecting the lowest-cost model subject to correctness. We construct an evaluation framework on five function-calling benchmarks and train a DistilBERT-based classifier, deployed under a latency budget. Switchcraft achieves 82.9% accuracy -- matching or exceeding the best individual model -- while reducing inference cost by 84%, saving over $3,600 per million queries. We find that larger models do not consistently outperform smaller ones on tool-use tasks, and that nominally cheaper models can incur higher total cost due to token-intensive reasoning. Our work enables cost-aware agentic AI deployment without sacrificing correctness.
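
The routing rule "lowest-cost model subject to correctness" can be sketched as a selection over per-model correctness predictions. The abstract says a DistilBERT-based classifier supplies the prediction but does not describe its outputs or the decision rule, so the probability threshold, the fallback, and the model names below are all hypothetical.

```python
def route(candidates, p_correct, threshold=0.8):
    """Pick the cheapest model whose predicted correctness clears the threshold.

    candidates: list of (model_name, expected_cost_per_query).
    p_correct: dict mapping model_name to predicted probability of a correct call.
    Falls back to the model with the highest predicted correctness.
    """
    ok = [(cost, name) for name, cost in candidates if p_correct[name] >= threshold]
    if ok:
        return min(ok)[1]
    return max(candidates, key=lambda c: p_correct[c[0]])[0]
```

In line with the abstract's finding that nominally cheaper models can cost more overall, the cost entry should be the expected total cost per query, including reasoning tokens, rather than the list price per token.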

2605.07110 2026-05-11 cs.CL cs.SE

Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability

Zejian Chen, Zhanyuan Liu, Chaozhuo Li, Mengxiang Han, Songyang Liu, Litian Zhang, Feng Gao, Yiming Hei, Xi Zhang

Affiliations * Beijing University of Posts and Telecommunications; China Academy of Information and Communications Technology, Artificial Intelligence Institute

AI summary This paper proposes a deployment-grounded architecture-lifecycle framework for the reliability of computer-use agents. It analyzes the Perception, Decision, and Execution layers and the Creation, Deployment, Operation, and Maintenance stages to connect capability formation, authority exposure, failure manifestation, and control placement.

Abstract

Computer-use agents (CUAs) are moving from bounded benchmarks toward real software environments, where they operate browsers, desktops, mobile applications, filesystems, terminals, and tool backends. In such settings, reliability is no longer captured by task success alone: perception errors, planning drift, memory use, tool mediation, permission scope, and runtime oversight jointly determine whether agent actions remain aligned with user intent. Existing surveys organize the CUA landscape by methods, platforms, benchmarks, or security threats, but less explicitly connect capability formation, authority exposure, failure manifestation, and control placement. To address this gap, the article develops an architecture-lifecycle framework for deployment-grounded reliability in CUAs. The architectural view analyzes Perception, Decision, and Execution as coupled layers that transform software observations into authority-bearing actions. The lifecycle view examines Creation, Deployment, Operation, and Maintenance as stages in which priors are learned, tools and permissions are bound, runtime trajectories are stressed, and assurance must be preserved under drift. Using this lens, the analysis synthesizes representative systems, benchmarks, and security/privacy studies; distinguishes where failures become visible from where their enabling conditions are introduced; and maps recurring intervention surfaces for control, oversight, and assurance. OpenClaw is used only as a public motivating example of an open deployment pattern, not as a verified internal case study. The conclusion highlights open challenges in controllable grounding, long-horizon constraint preservation, safe authority binding, mixed-trust runtime defense, privacy-preserving memory, and continual assurance.

2605.07106 2026-05-11 cs.CL

Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning

Jin Cui, Xinyue Long, Xunyong Zhang, Yadong Zhang, Chuanchang Su, Jingye Gan, Boran Zhao, Pengju Ren

Affiliations * State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, and Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University

AI summary This paper proposes RIS, a spatial-semantic grounded framework that improves visual reasoning in multimodal large language models. It builds a step-wise grounded dataset and introduces short language transition tokens that bridge latent states back to vocabulary-aligned decoding; experiments show gains over existing baselines on several benchmarks.

Comments 19 pages, 8 figures

Abstract

Multimodal Large Language Models (MLLMs) have made remarkable progress on vision-language reasoning, yet most methods still compress visual evidence into discrete textual thoughts, creating an information bottleneck for fine-grained perception. Recent latent visual reasoning methods attempt to reason in continuous hidden states, but we find that they suffer from insufficient manifold compatibility: latent trajectories drift away from pretrained reasoning circuits, collapse into instance-agnostic patterns, and are often bypassed during answer generation. To address these issues, we propose RIS (Retrieve, Integrate, and Synthesize), a spatial-semantic grounded framework that develops latent reasoning as a compatible extension of pretrained MLLM computation. We first construct a step-wise grounded reasoning dataset with bounding boxes and region-specific semantic descriptions. Built on this supervision, RIS anchors latent tokens to both spatial and semantic evidence, enforces their causal role through a progressive attention bottleneck, and introduces short language transition tokens to bridge synthesized latent states back to vocabulary-aligned decoding. Experiments on V*, HRBench4K, HRBench8K, MMVP, and BLINK show consistent improvements over closed/open-source and latent reasoning baselines. Further analyses demonstrate that RIS learns diverse, interpretable, and progressively integrated latent trajectories, offering a practical path toward faithful internal visual reasoning in MLLMs.

2605.07105 2026-05-11 cs.LG cs.CL cs.CY cs.IT math.IT

Theoretical Limits of Language Model Alignment

语言模型对齐的理论极限

Lucas Monteiro Paes, Natalie Mackraz, Barry-John Theobald, Federico Danieli

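Affiliations * Apple

AI summary This paper studies the theoretical limits of language model alignment. By deriving the maximum expected reward gain under a KL-divergence budget, it characterizes the information-theoretic limits of KL-regularized alignment and proves that reward ensembling mitigates reward hacking.

Details
AI Chinese abstract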

语言模型(LM)对齐通过改进模型输出以反映人类偏好,同时保持基础模型的能力。最常见的对齐方法是(i)强化学习,通过KL散度约束最大化期望奖励,以及(ii)best-of-N对齐,从N个独立样本中选择最高奖励输出。尽管广泛应用,但KL预算下的奖励改进基本限制仍不明确。我们通过推导固定KL散度预算下的最大可实现期望奖励增益,刻画了KL正则化对齐的信息论限制。我们的第一个结果提供了最优奖励改进的闭式表达式,由Jeffreys散度项而非之前的√KL主导。我们进一步将此表达式重新表述为基础模型下的协方差,得到一个仅需基础模型样本即可预测对齐增益的实用估计器。我们扩展分析至代理奖励设置,显示理想与代理对齐(奖励黑客)的差距随奖励误差幅度和KL惩罚因子减小而增大。我们证明奖励融合能缓解奖励黑客问题,为实践中使用的该技术提供了理论依据。实证上,我们计算了安全性和摘要任务中LM的KL-奖励帕累托前沿,并显示best-of-N接近理论极限,而PPO和GRPO仍明显不理想。我们的理论结果揭示了对齐文献中若干实证观察到的现象,并表明需要算法改进以在不增加高推理成本的情况下实现最优对齐。

English abstract

Language model (LM) alignment improves model outputs to reflect human preferences while preserving the capabilities of the base model. The most common alignment approaches are (i) reinforcement learning, which maximizes the expected reward under a KL-divergence constraint, and (ii) best-of-$N$ alignment, which selects the highest-reward output among $N$ independent samples. Despite their widespread use, the fundamental limits of reward improvement under a KL budget remain poorly understood. We characterize the information-theoretic limits of KL-regularized alignment by deriving the maximum achievable expected reward gain for a fixed KL-divergence budget. Our first result provides a closed-form expression for the optimal reward improvement, governed by a Jeffreys divergence term rather than the $\sqrt{\texttt{KL}}$ used in prior analyses. We further reformulate this expression as a covariance under the base model, yielding a practical estimator that predicts achievable alignment gains from base model samples alone. We extend our analysis to the proxy reward setting, showing that the gap between ideal and proxy alignment (reward hacking) grows with the magnitude of the reward error and as the KL penalty factor decreases. We then prove that reward ensembling mitigates reward hacking, providing a theoretical justification for this technique used in practice. Empirically, we compute the KL-reward Pareto frontier for two tasks for LMs, safety and summarization, and show that best-of-$N$ closely approaches the theoretical limit, while PPO and GRPO remain substantially suboptimal. Our theoretical results shed light on several empirically observed phenomena in the alignment literature and suggest that algorithmic improvements are needed to achieve optimal alignment without high inference costs.
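
The link between reward gain and a Jeffreys divergence can be illustrated on a toy discrete example. The sketch below is not the paper's result or estimator. It assumes the standard KL-regularized optimum, the exponentially tilted policy pi*(y) proportional to pi0(y) exp(r(y) / beta), for which the expected reward gain equals beta times the Jeffreys divergence KL(pi* || pi0) + KL(pi0 || pi*). The three-outcome base distribution, the rewards, and beta are made-up values, and all function names are hypothetical.

```python
import math


def tilted_policy(base, rewards, beta):
    """Return pi*(y) proportional to base(y) * exp(reward(y) / beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(base, rewards)]
    normalizer = sum(weights)
    return [w / normalizer for w in weights]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def expected_reward(policy, rewards):
    return sum(p * r for p, r in zip(policy, rewards))


# Toy setup (assumed values, not from the paper).
base = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
beta = 0.5

aligned = tilted_policy(base, rewards, beta)
gain = expected_reward(aligned, rewards) - expected_reward(base, rewards)
jeffreys = kl_divergence(aligned, base) + kl_divergence(base, aligned)

# For the tilted policy the two printed quantities agree.
print(f"reward gain = {gain:.6f}, beta * Jeffreys = {beta * jeffreys:.6f}")
```

The identity follows from log(pi*/pi0) = r/beta - log Z: taking expectations under pi* and under pi0 and adding the two KL terms cancels the log-normalizer.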

2605.07104 2026-05-11 cs.LG math.OC stat.ML

Almost Sure Convergence Rates of Stochastic Approximation and Reinforcement Learning via a Poisson-Moreau Drift

随机逼近与强化学习的几乎必然收敛速度 via 皮奥内-莫鲁 drift

Xinyu Liu, Zixuan Xie, Shangtong Zhang

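Affiliations * University of Virginia

AI summary This paper studies almost sure convergence rates of stochastic approximation and reinforcement learning under Markovian noise. It proposes a new Poisson-Moreau drift construction that applies to stochastic approximation algorithms with contractive expected updates, such as Q-learning and linear temporal difference learning.

Details
AI Chinese abstract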

在马尔可夫噪声下建立随机逼近和强化学习的几乎必然收敛速度是一个基本理论挑战。我们针对一类具有收缩性期望更新的随机逼近算法取得了进展,该类算法出现在许多强化学习算法中,如Q学习和线性时间差分学习。具体而言,对于幂律学习率O(n^{-η}) (η∈(1/2, 1)),我们获得了几乎必然收敛速度接近o(n^{1 - 2η})。对于谐波学习率O(n^{-1}),我们获得了几乎必然收敛速度接近o(n^{-1}),这被认为是强结果,因为它接近最优率O(n^{-1}loglogn),由迭代对数定律给出(对于i.i.d.噪声的特殊情况)。关键在于一种新的Lyapunov漂移构造,应用了基于皮奥内方程的修正来处理马尔可夫噪声,以改进已建立的Moreau包络平滑方法。

English abstract

Establishing almost sure convergence rates for stochastic approximation and reinforcement learning under Markovian noise is a fundamental theoretical challenge. We make progress towards this challenge for a class of stochastic approximation algorithms whose expected updates are contractive, a setting that arises in many reinforcement learning algorithms such as $Q$-learning and linear temporal difference learning. Specifically, for a power-law learning rate $O(n^{-η})$ with $η\in (1/2, 1)$, we obtain an almost sure convergence rate arbitrarily close to $o(n^{1 - 2η})$. For a harmonic learning rate $O(n^{-1})$, we obtain an almost sure convergence rate arbitrarily close to $o(n^{-1})$, which we argue is a strong result because it is close to the optimal rate $O(n^{-1}\log\log n)$ given by the law of the iterated logarithm (for a special case of i.i.d. noise). Key to our analysis is a novel Lyapunov drift construction that applies a Poisson-equation based correction for Markovian noise to the well-established Moreau-envelope smoothing for the contractive mapping.
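
A minimal sketch of the setting, under assumptions that are not from the paper: a scalar iterate, an expected update 0.5 * x + 1 (a contraction with fixed point 2), a two-state Markov chain as the noise source, and the power-law step size n^(-eta) with eta in (1/2, 1). It only illustrates the iteration being analyzed and does not reproduce the paper's rates or its Lyapunov construction. The function name is hypothetical.

```python
import random


def run_stochastic_approximation(num_steps, eta, seed=0):
    """Iterate x <- x + n^(-eta) * (F(x, Y_n) - x) under Markovian noise."""
    rng = random.Random(seed)
    x = 0.0
    noise = 1.0  # Markov chain state in {-1, +1}, stationary mean 0
    for n in range(1, num_steps + 1):
        # The chain keeps its state with probability 0.9, so noise is correlated.
        if rng.random() > 0.9:
            noise = -noise
        step_size = n ** (-eta)
        # Noisy evaluation of the contractive map 0.5 * x + 1.
        noisy_target = 0.5 * x + 1.0 + noise
        x += step_size * (noisy_target - x)
    return x


x_final = run_stochastic_approximation(100_000, eta=0.75)
print(f"final iterate: {x_final:.4f} (fixed point of the expected update is 2)")
```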

2605.07103 2026-05-11 cs.AI cs.MA

ARMOR: An Agentic Framework for Reaction Feasibility Prediction via Adaptive Utility-aware Multi-tool Reasoning

ARMOR:一种通过自适应效用感知多工具推理的代理框架用于反应可行性预测

Ye Liu, Botao Yu, Xinyi Ling, Daniel Adu-Ampratwum, Xia Ning

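Affiliations * Department of Biomedical Informatics; Department of Computer Science and Engineering; Division of Medicinal Chemistry and Pharmacognosy; Translational Data Analytics Institute

AI summary ARMOR is an adaptive, utility-aware multi-tool reasoning framework that combines the strengths of multiple tools to improve reaction feasibility prediction, with the largest gains on reactions where tool predictions conflict.

Details
AI Chinese abstract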

反应可行性预测作为计算化学中的基础问题,已从人工智能的进展中受益于多种工具,特别是大语言模型。然而,单个工具在不同反应中的性能差异显著,使任何单一工具难以在所有情况下都表现良好。为此,我们提出了ARMOR,一种代理框架,明确建模工具特定效用,自适应优先级工具,并进一步解决潜在的工具冲突以产生每个反应的最终预测。与现有方法不同,ARMOR将工具组织成层次结构,优先选择表现最佳的工具并在需要时推迟其他工具,通过工具特定的模式刻画其优势,并通过记忆增强推理解决冲突。在公开数据集上的广泛实验表明,ARMOR在各种基准上均优于单工具方法以及各种工具聚合和工具选择方法。进一步分析显示,改进在存在冲突工具预测的反应中尤为显著,突显了ARMOR在利用多个工具互补优势方面的有效性。代码可通过https://anonymous.4open.science/r/ARMOR-E13F获取。

English abstract

Reaction feasibility prediction, as a fundamental problem in computational chemistry, has benefited from diverse tools enabled by recent advances in artificial intelligence, particularly large language models. However, the performance of individual tools varies substantially across reactions, making it difficult for any single tool to consistently perform well across all cases. This raises a critical challenge: how to effectively leverage multiple tools to obtain more accurate feasibility predictions. To address this, we propose ARMOR, an agentic framework that explicitly models tool-specific utilities, adaptively prioritizes tools, and further resolves potential tool conflicts to produce the final prediction for each reaction. Unlike existing approaches that rely on simple aggregation or heuristic assignment over various tools, ARMOR organizes tools into a hierarchy that prioritizes top-performing tools and defers others when needed, characterizes their strengths through tool-specific patterns, and resolves conflicts via memory-augmented reasoning. Extensive experiments on a public dataset demonstrate that ARMOR consistently outperforms strong baselines, including single-tool methods as well as various tool aggregation and tool selection approaches. Further analysis shows that the improvements are particularly significant on reactions with conflicting tool predictions, highlighting the effectiveness of ARMOR in leveraging the complementary strengths of multiple tools. The code is available via https://anonymous.4open.science/r/ARMOR-E13F.

2605.07102 2026-05-11 cs.CL

SAGE: Hierarchical LLM-Based Literary Evaluation through Ontology-Grounded Interpretive Dimensions

SAGE:基于本体的层级LLM文学评价

Tianyu Wang, Nianjun Zhou

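Affiliations * Mercy University, Math & Computer Science Department; IBM T.J. Watson Research Center

AI summary SAGE evaluates literary quality hierarchically along ontology-grounded interpretive dimensions, using structured LLM evaluation with multi-round iterative reflection. It reports 98.8% score convergence and greater than 94% inter-rater agreement, and finds a consistent genre hierarchy in literary quality.

Comments 19 pages, 4 figures

Details
AI Chinese abstract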

评估文学质量需要评估诸如文化表现、情感深度和哲学复杂性等解释性维度,这些维度难以通过简单的计算测量。我们介绍了SAGE,一种分层评估框架,将文学质量分解为基于本体的解释性维度,通过结构化大语言模型评估、多轮迭代反思和独立验证进行评估。我们在100个短篇小说(50部经典作品、30部通俗小说、20个LLM生成的叙事)上验证了该框架,跨越三个分析层(文化、情感-心理、存在-哲学),使用双模式评估。在600次评估中,该框架实现了98.8%的评分收敛和超过94%的评分者一致性,内容基于和元数据基于的评估模式几乎完美不变。统计分析揭示了一致的体裁层级(经典>通俗>LLM,所有p<0.001),具有层特定的区分:文化批评和哲学深度表现出非常大的效应量(Cohen's d>2.4),而情感表现显示出较小的差距(d=1.68),表明情感模式比批判立场或哲学深度更容易从训练数据中学习。跨层相关性(r=0.649-0.683)证实了这三个维度捕捉了经验上可区分的质量方面。这些发现表明,理论驱动的LLM评估可以实现测量级的可靠性,并支持系统识别当前生成模型在人类文学生产中的不足,对可扩展的自动开放式文本生成评估有直接的启示。

English abstract

Evaluating literary quality requires assessing interpretive dimensions such as cultural representation, emotional depth, and philosophical sophistication that resist straightforward computational measurement. We introduce SAGE, a hierarchical evaluation framework that decomposes literary quality into ontology-grounded interpretive dimensions assessed through structured large language model evaluation with multi-round iterative reflection and independent validation. We validate the framework on 100 short stories (50 canonical works, 30 pulp fiction, 20 LLM-generated narratives) across three analytical layers (cultural, emotional-psychological, existential-philosophical) using dual-mode assessment. Across 600 evaluations, the framework achieves 98.8% score convergence and greater than 94% inter-rater agreement, with near-perfect mode invariance between content-based and metadata-based evaluation. Statistical analysis reveals a consistent genre hierarchy (Canonical > Pulp > LLM, all p<0.001) with layer-specific discrimination: cultural critique and philosophical depth exhibit very large effect sizes (Cohen's d>2.4), while emotional representation shows smaller gaps (d=1.68), suggesting that affective patterns are more learnable from training data than critical stance or philosophical depth. Cross-layer correlations (r=0.649-0.683) confirm the three dimensions capture empirically distinguishable quality facets. These findings demonstrate that theory-driven LLM evaluation can achieve measurement-grade reliability and support systematic identification of where current generative models fall short of human literary production, with direct implications for scalable automated evaluation of open-ended text generation.
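
The effect sizes quoted above are Cohen's d values. As a reminder of what that statistic measures, here is a minimal sketch of the common pooled-standard-deviation form of Cohen's d. The abstract does not say which variant the authors use, and the two score lists are invented placeholders, not data from the paper.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d with a pooled sample standard deviation."""
    size_a, size_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_std = math.sqrt(
        ((size_a - 1) * var_a + (size_b - 1) * var_b) / (size_a + size_b - 2)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_std


# Hypothetical layer scores for two groups of stories (not from the paper).
canonical_scores = [2, 4, 6]
generated_scores = [1, 3, 5]
effect_size = cohens_d(canonical_scores, generated_scores)
print(f"Cohen's d = {effect_size:.2f}")
```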

2605.07094 2026-05-11 cs.LG

Actor-Critic with Active Importance Sampling

具有主动重要性采样的Actor-Critic

Majid Molaei, Gabor Paczolay, Matteo Papini, Alberto Maria Metelli, Marcello Restelli

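Affiliations * Politecnico di Milano

AI summary This paper proposes the Active-Importance-Sampling Actor-Critic (AISAC) algorithm, which optimizes the behavior policy to reduce the variance of policy gradient estimates, improving learning efficiency and stability in continuous action spaces.

Details
AI Chinese abstract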

本文介绍了主动重要性采样Actor-Critic(AISAC)算法,该算法是Actor-Critic框架的扩展,用于减少策略梯度估计中的方差。AISAC通过优化行为策略以最小化梯度方差,同时保持无偏的梯度估计。利用重要性采样原理,该算法使行为策略适应于与目标策略梯度一致的数据收集分布。对于连续动作空间,AISAC采用通过交叉熵最小化优化的高斯行为策略。我们提供了理论分析,证明了方差减少和无偏性。在倒立摆和半人马任务中的实验表明,与标准Actor-Critic方法相比,AISAC在学习速度、样本效率和训练稳定性方面有所改进。结果表明,优化行为策略在不同超参数设置下都能提升目标策略更新和批评者估计准确性。AISAC加速了收敛并稳定了强化学习训练,使其在现实应用中具有前景。未来的工作包括将其与Soft Actor-Critic和TD3等先进算法集成,以应对更复杂的环境。

English abstract

This paper introduces the Active-Importance-Sampling Actor-Critic (AISAC) algorithm, an extension of the Actor-Critic framework for reducing variance in policy gradient estimation. AISAC optimizes the behavior policy to minimize gradient variance while preserving unbiased gradient estimates. Using importance sampling principles, the algorithm adapts the behavior policy toward efficient data collection distributions aligned with target policy gradients. For continuous action spaces, AISAC employs Gaussian behavior policies optimized through cross-entropy minimization. We provide theoretical analysis demonstrating variance reduction and unbiasedness. Experiments on Inverted Pendulum and Half Cheetah tasks show improved learning speed, sample efficiency, and training stability compared to standard Actor-Critic methods. Results indicate that optimizing the behavior policy improves both target policy updates and critic estimation accuracy across different hyperparameter settings. AISAC accelerates convergence and stabilizes reinforcement learning training, making it promising for real-world applications. Future work includes integration with advanced algorithms such as Soft Actor-Critic and TD3 for more complex environments.
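
The core idea, that sampling from a well-chosen behavior policy keeps an estimate unbiased while lowering its variance, can be shown with plain importance sampling. The sketch below is not AISAC: it compares two fixed Gaussian behavior policies instead of optimizing one by cross-entropy minimization, and it estimates E[a^2] under a standard normal target as a stand-in for a gradient term. All names and numbers are assumptions.

```python
import math
import random


def normal_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def weighted_terms(behavior_std, num_samples, seed=0):
    """Importance-weighted samples of f(a) = a^2 for a target policy N(0, 1)."""
    rng = random.Random(seed)
    terms = []
    for _ in range(num_samples):
        action = rng.gauss(0.0, behavior_std)  # sample from the behavior policy
        weight = normal_pdf(action, 0.0, 1.0) / normal_pdf(action, 0.0, behavior_std)
        terms.append(weight * action * action)
    return terms


def mean_and_variance(values):
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return mean, variance


# behavior_std = 1.0 is on-policy sampling (all weights equal 1).
on_policy_mean, on_policy_var = mean_and_variance(weighted_terms(1.0, 20_000))
# A wider behavior policy samples the large-|a| region more often.
off_policy_mean, off_policy_var = mean_and_variance(weighted_terms(1.5, 20_000))

print(f"on-policy:  mean={on_policy_mean:.3f} var={on_policy_var:.3f}")
print(f"off-policy: mean={off_policy_mean:.3f} var={off_policy_var:.3f}")
```

Both estimates target the same value, E[a^2] = 1, but the per-sample variance differs with the behavior policy, which is the quantity a method in this family would try to minimize.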

2605.07093 2026-05-11 cs.CL cs.AI cs.LG

The Translation Tax Is Not a Scalar: A Counterfactual Audit of English-Source Cue Inheritance in Chinese Multilingual Benchmarks

翻译税并非标量:对中国多语言基准测试中英语源提示继承的反事实审计

Zezheng Lin, Fengming Liu, Handi Li

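Affiliations * OpenAI NeurIPS 2025 workshop

AI summary Through a counterfactual audit, this paper shows that the Translation Tax is not a single scalar: translated benchmarks carry estimator- and item-dependent validity risks. It releases the supporting evidence and a reporting checklist.

Comments 13 pages, 3 figures. Submitted to NeurIPS 2026

Details
AI Chinese abstract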

翻译税常被视为标量:翻译基准测试被假设通过保留英语源提示来提高分数。我们在英语到中文的设定中审计这一主张。三种代理估计器得出不同结论:回译差距较小且解析器脆弱;提示-分数校准无法预测项目级增益;六模型原生对照显示模型族而非统一基准效应。我们添加了一个相同项目LLM-自然化压力测试,固定答案、选项和内容,仅重写中文表层形式。在修正提示构造错误后,此对比不再支持模型族交互,但保留了残余剂量-反应:高残余项目受益而低残余项目无益。结果并非单一翻译税,而是一组估计器和项目依赖的有效性风险。我们发布每单元证据、自然化协议、人工QC和翻译多语言基准论文的报告清单。

English abstract

The Translation Tax is often treated as a scalar: translated benchmarks are assumed to inflate scores by preserving English-source cues. We audit this claim in an English-to-Chinese setting. Three proxy estimators disagree: back-translation gaps are small and parser-fragile; cue-score calibration does not predict item-level gains; and a six-model native-control comparison shows model-family rather than uniform benchmark effects. We add a same-item LLM-naturalization stress test that holds answer, options, and content fixed while rewriting Chinese surface form. After correcting a prompt-construction bug, this contrast no longer supports a model-family interaction, but it preserves a residue dose-response: high-residue items benefit while low-residue items do not. The result is not a single Translation Tax, but a set of estimator- and item-dependent validity risks. We release per-cell evidence, the naturalization protocol, human QC, and a reporting checklist for translated multilingual benchmark papers.

2605.07086 2026-05-11 cs.CV cs.LG

Task Relevance Is Not Local Replaceability: A Two-Axis View of Channel Information

任务相关性并非局部可替换性:通道信息的双轴视角

Houman Safaai, Andrew T. Landau, Celia C. Beron, Yasin Mazloumi, Bernardo L. Sabatini

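Affiliations * Kempner Institute for the Study of Natural and Artificial Intelligence; Harvard University; Department of Neurobiology, Howard Hughes Medical Institute; Harvard Medical School

AI summary This paper proposes a two-axis view of channel importance that separates task relevance from local replaceability. The two axes behave differently across networks, and local-axis metrics predict removability more reliably than target-axis metrics.

Details
AI Chinese abstract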

视觉网络中通道重要性通常通过单一评分总结,但此总结掩盖了两个问题:通道与任务的相关性以及在移除通道时同层其他通道能否替代其功能。我们引入双轴视角,分别衡量输入捕获和同层重叠(局部轴)以及任务信息和目标超额信息(目标轴)。在ResNet-18、VGG-16和MobileNetV2上训练的CIFAR-100数据集中,两个轴弱相关,导致不同的通道分组,并在训练过程中迅速分离,尽管在随机初始化时强相关。高斯线性分析解释了这种分离如何通过残差梯度方向产生,而病变加同层替代实验显示同层支持能细化可删除性,超越输入捕获和任务相关性。在固定FLOPs匹配剪枝协议下,局部轴指标在三个CIFAR-100架构中比目标轴指标更可靠预测可删除性,且在压力测试中方向一致。这些发现表明轴层面的区别而非通用剪枝评分排名:局部可替换性比任务相关性更可靠指导可删除性,而基于范数的基线在VGG-16等架构中仍具竞争力。基于相关性的评分询问通道对任务的说明,而剪枝询问当其他通道可用时网络是否仍需该通道。

English abstract

Channel importance in vision networks is usually summarized by a single score. That summary hides two different questions: how much a channel is related to the task, and whether its function can be supplied by same-layer peers when the channel is removed. We call the second property local replaceability. We introduce a two-axis view that separates these questions. The local axis measures input capture and peer overlap, while the target axis measures task information and target-excess information. Across ResNet-18, VGG-16, and MobileNetV2 trained on CIFAR-100, the two axes are weakly aligned, induce different channel groupings, and separate rapidly during training despite being strongly coupled at random initialization. A Gaussian linear analysis accounts for how this separation can arise through residualized gradient directions, and lesion plus peer-replacement experiments show that peer support refines removability beyond input capture and task relevance alone. Under the fixed FLOPs-matched pruning protocol, local-axis metrics are more reliable predictors of removability than target-axis metrics across the three CIFAR-100 backbones, with the same direction preserved in stress tests on CIFAR-10, Tiny-ImageNet, ImageNet-100, and a ConvNeXt-T/ImageNet-100 pilot. These findings identify an axis-level distinction rather than a universal ranking of pruning scores: local replaceability is a more reliable guide to removability than target relevance, while norm-based baselines remain competitive in architectures such as VGG-16. Relevance-based scores ask what a channel says about the task; pruning asks whether the network still needs that channel when its peers remain available.

2605.07084 2026-05-11 cs.CL

Beyond Single Ground Truth: Reference Monism as Epistemic Injustice in ASR Evaluation

超越单一真实参考:参考一元论在语音识别评估中的知识不正义

Anna Seo Gyeong Choi, Maria Teleki, James Caverlee, Miguel del Rio, Corey Miller, Hoon Choi

Affiliations * Department of Information Science, Cornell University; Department of Computer Science, Texas A&M University; Rev AI Tundra Technical Solutions; Division of Liberal Studies, Kangwon National University

AI summary This paper argues that evaluating ASR against a single reference standard is unjust, proposes measuring ASR performance across different transcription conventions, and introduces Epistemic Injustice Distance (EID) to quantify the cost.

Details
AI Chinese abstract

自动语音识别(ASR)评估将系统输出与真实转录进行比较,用词错误率(WER)衡量两者之间的距离。但真实转录并非被发现,而是由人类标注者按照编码规范性假设来生成的。不同的惯例(逐字、非逐字、法律)会产生不同的相同语音转录,并对相同的ASR输出做出不同判断。本文论证参考一元论——强制采用单一转录惯例作为真实参考——造成了知识不正义。患有失语症的说话者,其语音包含临床有意义的不流畅现象,在被

English abstract

Automatic speech recognition (ASR) evaluation compares system output to ground truth transcripts, with Word Error Rate (WER) quantifying the distance between them. But ground truth transcripts are not discovered - they are produced by human annotators following conventions that encode normative assumptions about which speech features matter. Different conventions (verbatim, non-verbatim, legal) produce different transcripts of identical speech and judge the same ASR output differently. This paper argues that reference monism - enforcing a single transcription convention as ground truth - commits epistemic injustice. Speakers with aphasia, whose speech includes clinically meaningful disfluencies, are systematically disadvantaged when evaluated against "clean" references that treat those disfluencies as errors. The harm is not merely differential performance, but that evaluative infrastructure lacks interpretive resources to recognize their contributions as legitimate. We develop a philosophical framework introducing the hermeneutical gap, formalize Epistemic Injustice Distance (EID) to measure reference monism's cost, and demonstrate empirically using AphasiaBank that WER varies depending on which convention defines ground truth. We propose WER-Range: reporting performance across legitimate conventions rather than assuming a single correct answer.
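
A minimal sketch of the WER-Range idea: score one ASR hypothesis against several reference transcripts of the same speech and report the spread. WER here is the usual word-level edit distance divided by the reference length. The function names, the convention labels, and the example utterance are all assumptions made for illustration, and the sketch does not implement EID, which the abstract does not define.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion of a reference word
                    current[j - 1] + 1,  # insertion of a hypothesis word
                    previous[j - 1] + substitution_cost,
                )
            )
        previous = current
    return previous[-1] / len(ref_words)


def wer_range(references, hypothesis):
    """Return (lowest WER, highest WER, per-convention scores)."""
    scores = {
        convention: word_error_rate(text, hypothesis)
        for convention, text in references.items()
    }
    return min(scores.values()), max(scores.values()), scores


# Two hypothetical transcripts of the same utterance.
references = {
    "verbatim": "i um i want the the book",
    "non-verbatim": "i want the book",
}
lowest, highest, per_convention = wer_range(references, "i want the book")
print(per_convention)
```

The same hypothesis is error-free under one convention and has three deletions out of seven reference words under the other, which is the dependence on the reference that the abstract describes.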

2605.07082 2026-05-11 cs.CV

ImplantMamba: Long-range Sequential Modeling Mamba For Dental Implant Position Prediction

ImplantMamba: 长距离序列建模Mamba用于牙科种植体位置预测

Xinquan Yang, Congmin Wang, Xuguang Li, Yulei Li, Linlin Shen, Yongqiang Deng, He Meng

Affiliations * School of Artificial Intelligence, Shenzhen University, Shenzhen, China; National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, China; Huangpu People's Hospital, Zhongshan City, China; Department of Stomatology, Shenzhen University General Hospital, Shenzhen, China

AI summary This paper proposes ImplantMamba, which combines CNN and Mamba layers to jointly predict implant position and slope, improving the accuracy and reliability of implant position prediction.

Details
AI Chinese abstract

在设计种植体放置手术导板时,确定精确的种植体位置是一个关键步骤。然而,种植体区域本身在医学图像中常缺乏显著的纹理特征。因此,人工智能(AI)模型必须通过分析周围牙齿的纹理来推断正确的种植体位置和角度,这提出了重大挑战。为此,我们提出ImplantMamba,一种用于长距离序列建模的网络架构,旨在整合相邻牙齿的纹理信息。我们的方法明确地将种植体位置的回归与角度回归耦合在一起。ImplantMamba的核心是一个混合编码器,结合卷积神经网络(CNNs)与Mamba层。这种设计使网络能够通过CNNs分层提取局部解剖特征,同时通过Mamba的选择性扫描操作建模整个扫描体积中的全局上下文依赖性,从而更全面地理解种植体位置。此外,我们引入了Slope-Coupled Prediction Branch(SCP)。该分支旨在将种植体位置的预测与角度连接起来,通过强制预测的种植体位置和角度之间存在一致的关系,确保内部一致性和解剖合理性。在大规模牙科种植体数据集上的大量实验表明,所提出的ImplantMamba相比现有方法在性能上更优。

English abstract

In the design of surgical guides for implant placement, determining the precise implant position is a critical step. However, the implant region itself often lacks distinctive texture in medical images. Consequently, artificial intelligence (AI) models must infer the correct implant position and angulation (slope) primarily by analyzing the texture of the surrounding teeth, which poses a significant challenge. To address this, we propose ImplantMamba, a network architecture designed for long-range sequential modeling to integrate texture information from adjacent teeth. Our approach explicitly couples the regression of the implant position with its slope. The core of ImplantMamba is a hybrid encoder that combines Convolutional Neural Networks (CNNs) with Mamba layers. This design enables the network to hierarchically extract local anatomical features through CNNs while simultaneously modeling global contextual dependencies across the entire scan volume via Mamba's selective scan operations, leading to a more comprehensive understanding of the implant site. Furthermore, we introduce a Slope-Coupled Prediction Branch (SCP). This branch is designed to connect the prediction of implant position with the slope, ensuring internal consistency and anatomical plausibility by enforcing a coherent relationship between the predicted implant location and its angulation. Extensive experiments on a large-scale dental implant dataset demonstrate that the proposed ImplantMamba achieves superior performance compared to existing methods.

2605.07080 2026-05-11 cs.AI cs.DS

Online Allocation with Unknown Shared Supply

在线分配与未知共享供应

Tzeh Yuan Neoh, Davin Choo, Mengchu Yue, Milind Tambe

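Affiliations * Harvard University

AI summary This paper introduces the OSSA problem of allocating resources online under unknown supply. It proposes the GPA policy, proves that it achieves a 4/3-approximation to the offline optimum up to an additive term, and develops a learning-augmented extension. Experiments show that GPA outperforms natural baselines when supply is scarce.

Details
AI Chinese abstract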

许多现实资源分配系统,如人道主义物流和疫苗分配,必须在需求确定前在多个地点预置有限供应,而缺货会导致不可逆的服务损失。为此,我们引入了在线共享供应分配(OSSA)问题,这是一个状态化的在线模型,在固定运输成本和缺货惩罚下,中央枢纽将有限的未知供应分配给多个面临顺序需求的地点。不同于传统的make-to-stock或make-to-order库存模型,OSSA不允许背诵和补货,仅能对未来的市场需求进行对冲。为解决OSSA,我们提出了一种确定性的阈值比例策略GPA,并证明其在4/3倍近似最优,至多一个与总供应无关的加法项。我们进一步通过匹配的下界证明,4/3比率是紧的,且即使对于知道总供应的随机算法,加法误差依赖也是不可避免的。最后,我们开发了GPA的学习增强扩展,主要整合不完美的预测(如来自人类专家或ML模型的预测),使我们能够利用高质量的建议,同时对任意差的建议具有鲁棒性。合成和现实世界实验表明,GPA在稀缺供应下优于自然基线。

English abstract

Many real-world resource allocation systems, such as humanitarian logistics and vaccine distribution, must preposition limited supply across multiple locations before demand is realized, while stockouts incur irreversible service losses. To study this, we introduce the Online Shared Supply Allocation (OSSA) problem, a stateful online model in which a central hub allocates a finite, unknown supply to multiple sites facing sequential demand under fixed-charge transportation costs and lost-sales penalties. Unlike classical make-to-stock or make-to-order inventory models, OSSA precludes backlogging, and replenishment only hedges against future demand. To tackle OSSA, we propose a deterministic threshold-proportional policy GPA and prove that it achieves a $4/3$-approximation to the offline optimum up to an additive term independent of the total supply. We complement this with matching lower bounds showing that the $4/3$ ratio is tight and that the additive-error dependence is unavoidable, even for randomized algorithms that know the total supply upfront. Finally, we develop a learning-augmented extension to GPA that incorporates, in a principled way, the imperfect forecasts (e.g., from human experts or ML models) commonly available in practice, enabling us to exploit high-quality advice while being robust against arbitrarily bad advice. Synthetic and real-world experiments show that GPA outperforms natural baselines when global supply is scarce.

2605.07079 2026-05-11 cs.CV cs.AI cs.LG cs.RO

Learning Visual Feature-Based World Models via Residual Latent Action

通过残差潜在动作学习基于视觉特征的世界模型

Xinyu Zhang, Zhengtong Xu, Yutian Tao, Yeping Wang, Yu She, Abdeslam Boularias

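Affiliations * Rutgers University; Purdue University; University of Wisconsin-Madison

AI summary This paper proposes the RLA-WM world model, which predicts residual latent actions via flow matching. It outperforms existing feature-based and video-diffusion world models while being much faster than video diffusion, and the authors build two robot learning techniques on top of it.

Details
AI Chinese abstract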

世界模型通过观察和动作预测未来转变。现有研究主要关注图像生成,而基于视觉特征的世界模型预测未来视觉特征,更高效且减少幻觉。然而,现有特征方法依赖直接回归,在复杂交互中预测模糊或坍塌。本文发现残差潜在动作(RLA)可从DINO残差中学习,RLA具有预测性、可推广性和时间进展性。基于RLA,提出RLA-WM模型,通过流匹配预测RLA值,在模拟和现实数据集上优于现有方法,且比视频扩散快多个数量级。此外,开发了两项机器人学习技术,利用RLA-WM提升策略学习。

English abstract

World models predict future transitions from observations and actions. Existing work predominantly focuses on image generation. Visual feature-based world models, on the other hand, predict future visual features instead of raw video pixels, offering a promising alternative that is more efficient and less prone to hallucination. However, current feature-based approaches rely on direct regression, which leads to blurry or collapsed predictions in complex interactions, while generative modeling in high-dimensional feature spaces remains challenging. In this work, we discover that a new type of latent action representation, which we refer to as *Residual Latent Action* (RLA), can be easily learned from DINO residuals. We also show that RLA is predictive, generalizable, and encodes temporal progression. Building on RLA, we propose *RLA World Model* (RLA-WM), which predicts RLA values via flow matching. RLA-WM outperforms both state-of-the-art feature-based and video-diffusion world models on simulation and real-world datasets, while being orders of magnitude faster than video diffusion. Furthermore, we develop two robot learning techniques that use RLA-WM to improve policy learning. The first one is a minimalist world action model with RLA that learns from actionless demonstration videos. The second one is the first visual RL framework trained entirely inside a world model learned from offline videos only, using a video-aligned reward and no online interactions or handcrafted rewards. Project page: https://mlzxy.github.io/rla-wm

2605.07078 2026-05-11 cs.LG

Test-Time Compositional Generalization in Diffusion Models via Concept Discovery

通过概念发现实现扩散模型中的测试时间组合泛化

Zekun Wang, Anant Gupta, Tianyi Zhu, Christopher J. MacLellan

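Affiliations * Georgia Institute of Technology; University of Virginia

AI summary This paper shows that concepts discovered from the time-indexed score geometry of a diffusion model enable compositional generation at test time, outperforming query-only and nearest trained-class baselines.

Comments 9 pages

Details
AI Chinese abstract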

组合泛化要求模型能从熟悉部分生成新配置。在扩散模型中,先前方法通常假设相关概念或条件信号已可用。我们探讨预训练扩散模型是否能从时间索引的噪声边缘分布$p_t(x_t)$学习的分数中发现查询特定的概念,并在测试时组合。给定一个分布外查询,我们的方法在多个噪声步骤上对$s_θ(x_t,t) \approx \nabla_{x_t}\log p_t(x_t)$进行梯度上升以恢复局部密度模式,将这些模式映射为干净空间高斯分布,通过子模假设目标贪婪选择相关原型,生成产品专家(PoE)教师模型,具有解析分数。该教师模型可通过分类自由引导直接采样,或用于生成样本池以训练新类别嵌入和低秩适配器。在基于ColorMNIST和CelebA构建的测试组合基准上,解析的PoE采样器和低秩适配模型均优于仅查询和最近训练类基线。这些结果表明,扩散模型的时间索引分数几何包含可重用的密度模式概念,支持在无预定义概念库的情况下实现测试时间组合生成。

English abstract

Compositional generalization requires models to produce novel configurations from familiar parts. In diffusion models, prior compositional generation methods typically assume that the relevant concepts or conditioning signals are already available. We instead ask whether a pretrained diffusion model can discover query-specific concepts from the time-indexed scores it learns for the noisy marginals $p_t(x_t)$ and compose them at test time. Given a single out-of-distribution query, our method performs gradient ascent on $s_θ(x_t,t) \approx \nabla_{x_t}\log p_t(x_t)$ at multiple noising timesteps to recover local density modes, maps these modes into clean-space Gaussians, greedily selects relevant prototypes with a submodular likelihood objective, and combines them into a product-of-experts (PoE) teacher model with an analytic score. This teacher model can be sampled directly through classifier-free guidance or used to generate a sample pool for training a new class embedding and low-rank adapter. On held-out composition benchmarks built from ColorMNIST and CelebA, both the analytic PoE sampler and the low-rank adapted model outperform query-only and nearest trained-class baselines. These results suggest that the time-indexed score geometry of the diffusion model contains reusable density-mode concepts that support test-time compositional generation without a predefined concept library.
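
Two of the steps named above, ascending a score function to a local density mode and combining Gaussians into a product of experts, can be sketched in one dimension. This is a toy under stated assumptions, not the paper's pipeline: the score is the analytic score of a two-component Gaussian mixture rather than a learned $s_θ(x_t,t)$, there is a single noise level, and all names and numbers are hypothetical.

```python
import math


def mixture_score(x, means, weights, std=1.0):
    """Analytic d/dx log p(x) for a 1-D Gaussian mixture with a shared std."""
    densities = [
        w * math.exp(-0.5 * ((x - m) / std) ** 2) for m, w in zip(means, weights)
    ]
    total = sum(densities)
    return sum(d * (m - x) / std**2 for d, m in zip(densities, means)) / total


def ascend_to_mode(x, means, weights, step=0.1, num_steps=500):
    """Gradient ascent on log-density: follow the score to a local mode."""
    for _ in range(num_steps):
        x += step * mixture_score(x, means, weights)
    return x


def product_of_gaussians(components):
    """Mean and variance of the normalized product of 1-D Gaussians.

    Each component is a (mean, variance) pair; precisions add in the product.
    """
    precision = sum(1.0 / variance for _, variance in components)
    mean = sum(m / variance for m, variance in components) / precision
    return mean, 1.0 / precision


means, weights = [-3.0, 3.0], [0.5, 0.5]
right_mode = ascend_to_mode(2.0, means, weights)
left_mode = ascend_to_mode(-1.0, means, weights)
poe_mean, poe_variance = product_of_gaussians([(0.0, 1.0), (2.0, 1.0)])
print(right_mode, left_mode, poe_mean, poe_variance)
```

Because a product of Gaussians is again Gaussian, its score is available in closed form, which is consistent with the abstract's description of a teacher with an analytic score.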

2605.07075 2026-05-11 cs.LG

ModelLens: Finding the Best for Your Task from Myriads of Models

ModelLens: 从众多模型中为你的任务找到最佳模型

Rui Cai, Weijie Jacky Mo, Xiaofei Wen, Qiyao Ma, Wenhui Zhu, Xiwen Chen, Muhao Chen, Zhe Zhao

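Affiliations * University of California, Davis; Arizona State University; Morgan Stanley

AI summary ModelLens learns a performance-aware latent space over model-dataset-metric tuples and ranks models on new datasets without running any candidate, outperforming baselines that rely on metadata alone or require running every candidate model.

Details
AI Chinese abstract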

当前开源模型生态系统包含数十万种预训练模型,但为新数据集选择最佳模型变得越来越不可行:新模型和未基准测试的数据集持续出现,使从业者在双方都缺乏先验记录。现有方法仅处理野外环境中的碎片:AutoML和转移性估计仅从小预定义池中选择模型或需要昂贵的每个模型前向传递,而模型路由假设有一个给定的候选池。我们引入ModelLens,一个统一的野外模型推荐框架。我们的关键见解是,尽管公共排行榜互动零散且嘈杂,但它们总体上揭示了模型能力在异构评估设置中的隐含地图,这种信号足够直接学习。通过学习模型-数据集-指标元组的性能感知潜在空间,ModelLens在不运行候选模型的情况下对未见数据集排序。在包含162万条评估记录、覆盖47,000个模型和9,600个数据集的新基准上,ModelLens超越了仅依赖元数据或需要在目标数据集上运行每个候选模型的基线方法。其推荐的Top-K池进一步在多个代表性路由方法上提高了81%的性能,适用于多样化的问答基准。对最近发布的基准的案例研究进一步确认了其在文本和视觉-语言任务上的泛化能力。

English abstract

The open-source model ecosystem now contains hundreds of thousands of pretrained models, yet picking the best model for a new dataset is increasingly infeasible: new models and unbenchmarked datasets emerge continuously, leaving practitioners with no prior records on either side. Existing approaches handle only fragments of this in-the-wild setting: AutoML and transferability estimation select models from small predefined pools or require expensive per-model forward passes on the target dataset, while model routing presupposes a given candidate pool. We introduce ModelLens, a unified framework for model recommendation in the wild. Our key insight is that public leaderboard interactions, though scattered and noisy, collectively trace out an implicit atlas of model capabilities across heterogeneous evaluation settings, a signal rich enough to learn from directly. By learning a performance-aware latent space over model--dataset--metric tuples, ModelLens ranks unseen models on unseen datasets without running candidates on the target dataset. On a new benchmark of 1.62M evaluation records spanning 47K models and 9.6K datasets, ModelLens surpasses baselines that either rely on metadata alone or require running each candidate on the target dataset. Its recommended Top-K pools further improve multiple representative routing methods by up to 81% across diverse QA benchmarks. Case studies on recently released benchmarks further confirm generalization to both text and vision-language tasks.

2605.07073 2026-05-11 cs.AI

TeamBench: Evaluating Agent Coordination under Enforced Role Separation

TeamBench: 在强制角色分离下评估代理协调

Yubin Kim, Chanwoo Park, Taehan Kim, Eugene Park, Samuel Schmidgall, Salman Rahman, Chunjong Park, Cynthia Breazeal, Xin Liu, Hamid Palangi, Hae Won Park, Daniel McDuff

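Affiliations * MIT; Google Research; Google DeepMind; Independent Researcher

AI summary TeamBench uses 851 task templates and 931 seeded instances to evaluate agent coordination under operating-system-enforced role separation. Prompt-only and sandbox-enforced teams reach statistically indistinguishable pass rates, but prompt-only runs produce more cases where the verifier attempts to edit the executor's code.

Details
AI Chinese abstract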

代理系统通常将任务分解到多个角色,但这些角色通常由提示指定而非通过访问控制强制执行。没有强制执行,团队通过率可能掩盖代理是否实际协调或一个角色是否有效完成另一个角色的工作。我们提出了TeamBench基准,包含851个任务模板和931个预设实例,用于评估在操作系统强制角色分离下代理协调的能力。TeamBench将规格访问、工作区编辑和最终认证分别分配给规划者、执行者和验证者角色,使得任何角色都不能读取完整要求、修改工作区或认证最终答案。提示仅和沙盒强制团队在统计上无法区分通过率,但提示仅运行产生3.6倍的验证者篡改执行者代码的情况。验证者批准49%的失败确定性评分提交,移除验证者会改善消融实验的平均部分分数。团队价值也具有条件性。当单个代理挣扎时,团队受益,但当单个代理表现良好时,团队受损。40次人类研究在相同角色分离下显示,我们的基准揭示了通过率遗漏的交互模式。单人参与者直接完成任务,人类参与者与代理配对时往往迅速批准,而人类团队则更努力协调角色间缺失的信息。

English abstract

Agent systems often decompose a task across multiple roles, but these roles are typically specified by prompts rather than enforced by access controls. Without enforcement, a team pass rate can mask whether agents actually coordinated or whether one role effectively did another role's work. We present TeamBench, a benchmark with 851 task templates and 931 seeded instances for evaluating agent coordination under operating system-enforced role separation. TeamBench separates specification access, workspace editing, and final certification across Planner, Executor, and Verifier roles, so that no role can read the full requirements, modify the workspace, and certify the final answer. Prompt-only and sandbox-enforced teams reach statistically indistinguishable pass rates, but prompt-only runs produce 3.6 times more cases where the verifier attempts to edit the executor's code. Verifiers approve 49% of submissions that fail the deterministic grader, and removing the verifier improves mean partial score in the ablation. Team value is also conditional. Teams benefit when single agents struggle, but hurt when single agents already perform well. A 40-session human study under the same role separation shows that our benchmark exposes interaction patterns that pass rate misses. Solo participants work through the task directly, human participants paired with agents often collapse into quick approval, and human teams spend more effort coordinating missing information across roles.

2605.07072 2026-05-11 cs.LG cs.CR stat.ML

Less Random, More Private: What is the Optimal Subsampling Scheme for DP-SGD?

Andy Dong, Ayfer Özgür

Affiliation: Stanford University

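AI summary: This paper studies the optimal choice of subsampling scheme for DP-SGD. It proves that Balanced Iteration Subsampling (BIS) achieves stronger privacy amplification than Poisson subsampling and is optimal at both extremes of the noise spectrum.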

Comments 17 pages, 1 table. Submitted to NeurIPS 2026

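Details
AI-generated abstract (translated from Chinese)

Poisson subsampling is the default subsampling scheme in differentially private machine learning, mainly because its unstructured randomness permits tractable privacy amplification analyses. However, this randomness introduces substantial participation variance: each sample appears in a very different number of training iterations. In this paper, we show that this variance is not merely a practical artifact to be tolerated but a fundamental source of suboptimal privacy amplification. We prove that Balanced Iteration Subsampling (BIS), a structured scheme in which each sample participates in exactly a fixed number of iterations, achieves stronger privacy amplification than Poisson subsampling and is optimal at both extremes of the noise spectrum (σ→0 and σ→∞). Our analysis reveals that the privacy-noise tradeoff is governed not by maximizing randomness but by eliminating participation variance while preserving uniform marginal participation across iterations. To turn this asymptotic theory into finite-noise guarantees, we introduce a practical near-exact Monte Carlo accountant for BIS, which removes the analytical slack of existing RDP and composition-based PLD analyses. Evaluations on more than 60 practical DP-SGD configurations show that BIS consistently outperforms Poisson subsampling in the low-noise regimes most relevant to high-utility private training, reducing the required noise multiplier by up to 9.6%. These results overturn the common intuition that more sampling randomness necessarily yields stronger privacy amplification: in DP-SGD, structured participation can be both more practical and more private. Our implementation is available at https://github.com/dong-xin-ao-andy/bis-mc-accountant.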

English abstract

Poisson subsampling is the default sampling scheme in differentially private machine learning, largely because its unstructured randomness yields tractable privacy amplification analyses. Yet this same randomness introduces substantial participation variance: each sample appears in very different numbers of training iterations. In this work, we show that this variance is not merely a practical artifact to be tolerated, but a fundamental source of suboptimal privacy amplification. We prove that Balanced Iteration Subsampling (BIS), a structured scheme in which each sample participates in exactly a fixed number of iterations, achieves stronger privacy amplification than Poisson subsampling and is optimal at both extremes of the noise spectrum ($σ\to 0$ and $σ\to \infty$). Our analysis reveals that the privacy-noise tradeoff is governed not by maximizing randomness, but by eliminating participation variance while preserving uniform marginal participation across iterations. To translate this asymptotic theory into finite-noise guarantees, we introduce a practical near-exact Monte Carlo accountant for BIS, which removes the analytical slack of existing RDP and composition-based PLD analyses. Evaluations across more than 60 practical DP-SGD configurations show that BIS consistently outperforms Poisson subsampling in the low-noise regimes most relevant for high-utility private training, reducing the required noise multiplier by up to $9.6\%$. These results overturn the common intuition that more sampling randomness necessarily yields stronger privacy amplification: in DP-SGD, structured participation can be both more practical and more private. Our implementation is available at https://github.com/dong-xin-ao-andy/bis-mc-accountant.
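
The notion of participation variance can be made concrete with a small simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: `poisson_schedule`, `balanced_schedule`, and `participation_counts` are hypothetical helpers, and the balanced schedule here simply draws, for each sample, a uniformly random set of exactly `k` out of `T` iterations. That is one plausible way to get a fixed participation count with a uniform marginal inclusion rate of `k / T`; the paper's BIS construction may differ, for example in how batch sizes are handled. No privacy accounting is performed.

```python
import random
from statistics import pvariance


def poisson_schedule(n, T, q, rng):
    """Each sample joins each iteration independently with probability q."""
    return [{i for i in range(n) if rng.random() < q} for _ in range(T)]


def balanced_schedule(n, T, k, rng):
    """Each sample joins exactly k iterations, chosen uniformly at random.

    Illustrative assumption only; batch sizes are not fixed here.
    """
    batches = [set() for _ in range(T)]
    for i in range(n):
        for t in rng.sample(range(T), k):
            batches[t].add(i)
    return batches


def participation_counts(batches, n):
    """Number of iterations in which each sample appears."""
    counts = [0] * n
    for batch in batches:
        for i in batch:
            counts[i] += 1
    return counts


rng = random.Random(0)
n, T, k = 500, 100, 10  # samples, iterations, participations per sample
poisson_counts = participation_counts(poisson_schedule(n, T, k / T, rng), n)
balanced_counts = participation_counts(balanced_schedule(n, T, k, rng), n)

print("Poisson participation variance: ", pvariance(poisson_counts))
print("Balanced participation variance:", pvariance(balanced_counts))
```

Under Poisson subsampling the per-sample count is Binomial(T, q), so its variance is T·q·(1−q), which is 9 for these settings; the balanced schedule has zero variance by construction while keeping the same expected participation of `k` per sample.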