arXivDaily arXiv daily academic digest, updated Monday through Friday
2605.06717 2026-05-11 cs.SE cs.AI

Agentic Coding Needs Proactivity, Not Just Autonomy

Nghi D. Q. Bui, Georgios Evangelopoulos

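Affiliations GitHub

AI summary This paper examines how proactivity differs from autonomy in agentic coding for software development, and proposes a taxonomy of proactivity and evaluation criteria aimed at improving coding agents' insight and adaptability.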

Comments Position Paper

English abstract

Coding agents are rapidly changing the landscape of software development, moving from inline completion to autonomous systems that edit repositories, open pull requests, respond to issues, and run scheduled or webhook-triggered routines across the development life cycle. The next generation is increasingly described as proactive and long-horizon: agents should notice relevant changes before the developer asks, connect signals across tools, decide when to interrupt, and carry preferences across sessions. Yet the field still lacks a clear account of what proactivity means for software development, how it differs from autonomy, what acceptance criteria proactive long-horizon tasks should satisfy, and which metrics determine whether unsolicited agent behavior is useful rather than merely active. Proactive coding agents should be evaluated by the quality and improvement of their insight policy: the policy that decides what matters next, what evidence supports it, whether to show it, and how to adapt after feedback. This view is grounded in the principles of mixed-initiative interaction. We propose a three-level taxonomy of proactivity (Reactive, Scheduled, and Situation Aware), compare contemporary coding agents against five practical criteria, and sketch an active user simulation protocol with three evaluation targets: Insight Decision Quality (IDQ), Context Grounding Score (CGS), and Learning Lift.

2605.06713 2026-05-11 cs.CR cs.AI cs.HC

Agentic AI and the Industrialization of Cyber Offense: Forecast, Consequences, and Defensive Priorities for Enterprises and the Mittelstand

Christopher Koch

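Affiliations Independent Researcher

AI summary This paper examines how agentic AI affects the cyber attack lifecycle, introduces a Three Channel Agentic Cyber Risk Model and an Agentic Attack Compression Model, uses the 2026 Linux kernel Copy Fail incident as a case study, forecasts 2026 to 2028 defensive needs for enterprises and the German Mittelstand, and emphasizes defensive priorities such as identity and patch velocity.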

Comments 7 pages

English abstract

Agentic AI systems can plan, call tools, inspect code, interact with web applications, and coordinate multi-step workflows. These same capabilities change the economics of cyber offense. The central near-term risk is not that every low-skill criminal immediately becomes a frontier exploit researcher; it is that agentic AI compresses the attack lifecycle by lowering the cost of reconnaissance, phishing, credential abuse, vulnerability triage, exploit adaptation, and post-compromise decision support. This paper synthesizes current public evidence from national cybersecurity agencies, industry threat reports, agent security guidance, and research on LLM agents' cyber capabilities. It introduces a Three Channel Agentic Cyber Risk Model and an Agentic Attack Compression Model, uses the 2026 Linux kernel Copy Fail incident as a case study for foothold-to-root acceleration, and develops a 2026 to 2028 forecast for large enterprises and the German and European Mittelstand. The paper concludes with a prioritized defense roadmap. Organizations should treat agentic AI security as an immediate operational problem: identity, phishing-resistant authentication, patch velocity, CI/CD and Linux/container hardening, agent governance, telemetry, and recovery readiness must be strengthened now.

2605.06710 2026-05-11 cs.IT cs.LG math.IT math.ST stat.TH

Information-theoretic Limits of Learning and Estimation

Abbas El Gamal, Maxim Raginsky

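Affiliations Stanford University; UIUC

AI summary This paper discusses the role of information theory in the limits of learning and estimation algorithms, introduces tools such as concentration inequalities, covering and packing in metric spaces, and metric entropy, and derives upper bounds on generalization error and lower bounds on minimax risk.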

English abstract

Information theory plays a central role in establishing fundamental limits on what any learning or estimation algorithm can -- and cannot -- achieve, regardless of computational power. In this chapter, we provide an introduction to these connections. End-of-chapter exercises make the material suitable for both classroom use and self-study. We begin by introducing concentration inequalities along with the notions of covering and packing in metric spaces, and the associated concept of metric entropy. These tools are essential for our analysis. We then introduce the learning-theoretic framework and derive upper bounds on generalization error in terms of metric entropy, Rademacher complexity, and the VC dimension, as well as mutual information and relative entropy. Finally, we discuss the minimax estimation framework and establish lower bounds on minimax risk using Fano's inequality, yielding bounds in terms of relative entropy and covering and packing numbers. This manuscript contains a preprint of a chapter under consideration for inclusion in the forthcoming third edition of Cover and Thomas's Elements of Information Theory, posted with permission from Wiley. It would follow the chapter posted at arXiv:2605.02989 . The table of contents of the new edition can be found at: https://docs.google.com/document/d/1L-m4oQEJw1PJhoxBeMwrrBD8S_HmvzMEkPbYvS24980/edit?usp=sharing . For feedback, please contact abbas@ee.stanford.edu.
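
For readers unfamiliar with the lower-bound technique named in this abstract, the standard textbook form of Fano's inequality and the minimax bound it yields can be written as follows. The notation ($V$, $M$, $\delta$, $d$, $\theta_i$) is assumed here for illustration and is not taken from the chapter, whose exact statements may differ.

```latex
% Fano's inequality: V is uniform on {1,...,M}, X is drawn from P_{\theta_V},
% and \hat{V} is any estimate of V computed from X (Markov chain V -> X -> \hat{V}).
\[
  \Pr\bigl[\hat{V} \neq V\bigr] \;\ge\; 1 - \frac{I(V;X) + \log 2}{\log M}.
\]
% Minimax consequence: if \theta_1,\dots,\theta_M form a packing with
% d(\theta_i,\theta_j) \ge 2\delta for all i \ne j, then
\[
  \inf_{\hat{\theta}} \sup_{\theta \in \Theta}
  \mathbb{E}_\theta\bigl[d(\hat{\theta},\theta)\bigr]
  \;\ge\; \delta \left(1 - \frac{I(V;X) + \log 2}{\log M}\right),
  \qquad
  I(V;X) \;\le\; \max_{i \neq j} D\bigl(P_{\theta_i} \,\|\, P_{\theta_j}\bigr).
\]
```

The second display shows why the abstract mentions both relative entropy and packing numbers: a larger packing raises $\log M$, while the relative entropy between packed hypotheses controls $I(V;X)$.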

2605.06699 2026-05-11 eess.IV cs.AI cs.CV cs.LG

Multimodal synthesis of MRI and tabular data with diffusion in a joint latent space via cross-attention

Daniel Mensing, Jan Kapar, Jochen G. Hirsch, Matthias Günther, Horst Hahn, Marvin N. Wright

Affiliations Fraunhofer Institute for Digital Medicine MEVIS; Leibniz Institute for Prevention Research and Epidemiology – BIPS; Faculty of Mathematics and Computer Science, University of Bremen; Faculty of Physics and Electrical Engineering, University of Bremen

AI summary This paper proposes a multimodal latent diffusion model that jointly generates MRI and clinical tabular data in a shared latent space via cross-attention, demonstrating the feasibility of jointly modeling MRI and mixed-type tabular data in a single diffusion framework.

Journal ref Proc. SPIE 13925, Medical Imaging 2026: Image Processing, 139252D (April 03, 2026)

English abstract

We propose a multimodal latent diffusion model that jointly synthesizes volumetric magnetic resonance imaging (MRI) and tabular clinical data within a shared latent space via cross-attention. This approach enables coherent joint representation learning of MRI and tabular modalities for generative modeling. Our model utilizes a variational autoencoder to fuse the two modalities before diffusion-based synthesis, allowing modality-appropriate reconstruction with separate decoders for MRI and tabular data. We evaluated the framework on data from the German National Cohort (NAKO Gesundheitsstudie), comprising over 10,000 participants with MRI scans and clinical tabular features such as age, sex, body measurements, and ethnicity. The generated MRI volumes exhibited anatomical plausibility and body composition consistent with the synthesized tabular attributes. Quantitative evaluation using Fréchet distance and precision-recall metrics confirmed high-fidelity image generation. In the tabular modality, our model outperformed CTGAN across standard evaluation metrics and achieved results comparable to TVAE, demonstrating competitive performance relative to established unimodal baselines. This work is, to our knowledge, the first to demonstrate the feasibility of jointly modeling MRI and mixed-type tabular data in a single latent diffusion framework, offering a proof-of-concept for generating coherent synthetic multimodal patient data and aligning with the broader goal of developing digital twins in healthcare.

2605.06055 2026-05-11 cs.DC cs.LG

Relay Buffer Independent Communication over Pooled HBM for Efficient MoE Inference on Ascend

Tianlun Hu, Tiancheng Hu, Shengsheng Litang, Sheng Wang, Xiaoming Bao, Yuxing Li, Wei Wang, Zhongzhe Hu, Lijun Li, Hongwei Sun, Jingbin Zhou

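Affiliations Huawei Technologies

AI summary This paper proposes a relay-buffer-free communication design for MoE inference that places tokens directly into, and reads them directly from, expert windows, reducing intermediate relay and reordering buffers, improving throughput, and lowering latency; experiments show it is effective on the Ascend platform.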

English abstract

Mixture-of-Experts (MoE) inference requires large-scale token exchange across devices, making dispatch and combine major bottlenecks in both prefill and decode. Beyond network transfer, routing-driven layout transformation, temporary relay, and output restoration can add substantial overhead. Existing MoE communication paths are often buffer-centric, using explicit inter-process relay and reordering buffers around collective transfer. This report presents a relay-buffer-free communication design for MoE inference acceleration on Ascend systems. The design reorganizes dispatch and combine around direct placement into destination expert windows and direct reading from remote expert windows. Built on globally pooled high-bandwidth memory and symmetric-memory allocation, it removes most intermediate relay and reordering buffers while retaining only lightweight control state, including counts, offsets, and synchronization metadata. We instantiate the design as two schedules for the main phases of MoE inference: a prefill schedule with richer planning state for throughput-oriented execution, and a compact decode schedule for latency-sensitive execution. Experiments on Ascend-based MoE workloads show reduced dispatch and combine latency in both settings. At the serving level, the implementation improves time to first token (TTFT), preserves competitive time per output token (TPOT), and enlarges the feasible scheduling space under practical latency constraints. These results indicate that, on platforms with globally addressable device memory, reducing intermediate buffering and output restoration around expert execution is an effective direction for accelerating MoE inference.
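
The idea of dispatching by direct placement while keeping only counts and offsets as control state can be illustrated with a single-process toy. This is a minimal sketch under loud assumptions: a plain Python list stands in for pooled HBM, routing is top-1, and the names `dispatch`, `run_experts`, and `combine` are hypothetical, not the report's API or any Ascend interface.

```python
from typing import Callable, List, Sequence, Tuple


def dispatch(
    tokens: Sequence[float], expert_ids: Sequence[int], num_experts: int
) -> Tuple[List[float], List[int], List[int], List[int]]:
    """Place each token directly into its expert's window of one pooled buffer."""
    # Control state 1: how many tokens each expert receives.
    counts = [0] * num_experts
    for e in expert_ids:
        counts[e] += 1
    # Control state 2: start of each expert window (exclusive prefix sum).
    offsets = [0] * num_experts
    for e in range(1, num_experts):
        offsets[e] = offsets[e - 1] + counts[e - 1]
    pooled = [0.0] * len(tokens)   # stand-in for globally pooled memory (assumption)
    cursor = list(offsets)         # next free slot inside each window
    slots = []                     # where each original token landed
    for tok, e in zip(tokens, expert_ids):
        pooled[cursor[e]] = tok    # direct placement, no intermediate relay buffer
        slots.append(cursor[e])
        cursor[e] += 1
    return pooled, counts, offsets, slots


def run_experts(
    pooled: Sequence[float],
    counts: Sequence[int],
    offsets: Sequence[int],
    experts: Sequence[Callable[[float], float]],
) -> List[float]:
    """Each expert processes only its own contiguous window."""
    out = list(pooled)
    for e, fn in enumerate(experts):
        for i in range(offsets[e], offsets[e] + counts[e]):
            out[i] = fn(pooled[i])
    return out


def combine(expert_out: Sequence[float], slots: Sequence[int]) -> List[float]:
    """Read results straight from the expert windows, restoring token order."""
    return [expert_out[s] for s in slots]
```

In this toy, the original token order is recovered from the recorded slots alone, which is the role the abstract assigns to lightweight metadata; a real multi-device implementation would additionally need synchronization, which is omitted here.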

2605.05995 2026-05-11 cs.CR cs.AI cs.CL

Safety Anchor: Defending Harmful Fine-tuning via Geometric Bottlenecks

Guoxin Lu, Letian Sha, Qing Wang, Peijie Sun, Hao Zhou, Hua Dai, Fu Xiao

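Affiliations Nanjing University of Posts and Telecommunications

AI summary This paper proposes Safety Bottleneck Regularization (SBR), which constrains the hidden states of harmful queries at a geometric bottleneck layer to counter harmful fine-tuning attacks; experiments show that a single safety anchor substantially lowers the Harmful Score.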

Comments Accepted to ICML 2026

English abstract

The safety alignment of Large Language Models (LLMs) remains vulnerable to Harmful Fine-tuning (HFT). While existing defenses impose constraints on parameters, gradients, or internal representations, we observe that they can be effectively circumvented under persistent HFT. Our analysis traces this failure to the inherent redundancy of the high-dimensional parameter space: attackers exploit optimization trajectories that are orthogonal to defense constraints to restore harmful capabilities while deceptively adhering to safety restrictions. To address this, we propose Safety Bottleneck Regularization (SBR). SBR shifts the defensive focus from the redundant parameter space to the unembedding layer, which serves as a geometric bottleneck. By anchoring the final hidden states of harmful queries to those of the safety-aligned model, SBR enables the model to maintain safe responses even under persistent HFT. Extensive experiments confirm SBR's effectiveness, demonstrating that utilizing just a single safety anchor is sufficient to reduce the Harmful Score to below 10 while preserving competitive performance on benign downstream tasks.

2605.05703 2026-05-11 cs.MA cs.AI cs.LG

Active Learning for Communication Structure Optimization in LLM-Based Multi-Agent Systems

Huchen Yang, Xinghao Dong, Dan Negrut, Jin-Long Wu

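Affiliations University of Wisconsin–Madison

AI summary This paper proposes an information-theoretic task selection framework that optimizes the communication structure of multi-agent systems by estimating task informativeness, improving performance and reducing token usage under a limited budget.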

English abstract

Optimizing the communication structure of large language model based multi-agent systems (LLM-MAS) has been shown to improve downstream performance and reduce token usage. Existing methods typically rely on randomly sampled training tasks. However, tasks may differ substantially in difficulty and domain, and thus they are not equally informative for updating communication structure, making optimization under limited training budgets often unstable and highly sensitive to the particular training set. To actively identify the most valuable tasks for communication-structure optimization, we propose an ensemble-based information-theoretic task selection framework. The proposed method estimates task informativeness by how much a candidate task changes the distribution over graph parameters, using ensemble Kalman inversion as an efficient and derivative-free approximation of the corresponding Bayesian update. The resulting estimator is especially suitable for black-box and noisy multi-agent systems. To enhance scalability, we construct a compact candidate pool through embedding-based representative selection and combine the informative selection with surrogate modeling and batch Thompson sampling. We validate our method in both benign settings and settings with agent attacks, demonstrating its effectiveness for communication-structure optimization under constrained computational budgets.
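
The ensemble Kalman inversion step mentioned in this abstract can be sketched for a single scalar parameter. This is a toy illustration, not the authors' estimator: the deterministic (unperturbed) update, the scalar setting, and the `mean_shift` informativeness proxy are all assumptions made here for brevity.

```python
from typing import Callable, List, Sequence


def eki_update(
    ensemble: Sequence[float],
    forward: Callable[[float], float],
    y: float,
    noise_var: float,
) -> List[float]:
    """One derivative-free EKI step: move each member toward explaining y.

    Only forward evaluations are needed, so `forward` can be a black box.
    Requires at least two ensemble members.
    """
    preds = [forward(t) for t in ensemble]
    n = len(ensemble)
    t_mean = sum(ensemble) / n
    g_mean = sum(preds) / n
    # Sample cross-covariance between parameters and predictions.
    c_tg = sum((t - t_mean) * (g - g_mean) for t, g in zip(ensemble, preds)) / (n - 1)
    # Sample variance of the predictions.
    c_gg = sum((g - g_mean) ** 2 for g in preds) / (n - 1)
    gain = c_tg / (c_gg + noise_var)  # scalar Kalman gain
    return [t + gain * (y - g) for t, g in zip(ensemble, preds)]


def mean_shift(before: Sequence[float], after: Sequence[float]) -> float:
    """Hypothetical informativeness proxy: how far the ensemble mean moved."""
    return abs(sum(after) / len(after) - sum(before) / len(before))
```

Under this reading, a candidate task whose simulated update barely moves the ensemble would score as uninformative, while one that shifts it substantially would be prioritized; the paper's actual measure of distributional change may be different.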

2605.05340 2026-05-11 cs.CR cs.AI

How Far Are VLMs from Privacy Awareness in the Physical World? An Empirical Study

Junran Wang, Xinjie Shen, Zehao Jin, Pan Li

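Affiliations Georgia Institute of Technology

AI summary Through an empirical study, this paper exposes the limited privacy awareness of VLMs in physical environments. It proposes the ImmersedPrivacy framework to evaluate models' privacy awareness in complex scenes and finds significant deficits in perception and in handling privacy conflicts.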

English abstract

As Vision-Language Models (VLMs) are increasingly deployed as autonomous cognitive cores for embodied assistants, evaluating their privacy awareness in physical environments becomes critical. Unlike digital chatbots, these agents operate in intimate spaces, such as homes and hospitals, where they possess the physical agency to observe and manipulate privacy-sensitive information and artifacts. However, current benchmarks remain limited to unimodal, text-based representations that cannot capture the demands of real-world settings. To bridge this gap, we present ImmersedPrivacy, an interactive audio-visual evaluation framework that simulates realistic physical environments using a Unity-based simulator. ImmersedPrivacy evaluates physically grounded privacy awareness across three progressive tiers that test a model's ability to identify sensitive items in cluttered scenes, adapt to shifting social contexts, and resolve conflicts between explicit commands and inferred privacy constraints. Our evaluation of 12 state-of-the-art models reveals consistent deficits. In cluttered scenes, all models exhibit monotonic performance decay as scene complexity grows, due to perceptual deficits. When social context shifts, no model exceeds 65% selection accuracy. Under conflicting commands, the best model, gemini-3.1-pro, perfectly balances task completion and privacy preservation in only 51% of cases. These findings reveal that current VLMs in the physical world suffer from perceptual fragility and fail to let their knowledge of privacy cues govern their situated behavior. Our code and data are available at https://github.com/immersed-privacy/immersed-privacy .

2605.04615 2026-05-11 cs.SE cs.AI

Beyond Retrieval: A Multitask Benchmark and Model for Code Search

Siqiao Xue, Zihan Liao, Jin Qin, Ziyin Zhang, Yixiang Mu, Fan Zhou, Hang Yu

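Affiliations Ant Group, Hangzhou, China

AI summary This paper presents the CoREB benchmark and a code reranking model to evaluate the full code search pipeline. It finds that code-specialized embeddings outperform general-purpose models, short keyword queries perform poorly, and off-the-shelf rerankers are task-asymmetric, whereas CoREB-Reranker performs well on all three tasks.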

Comments project site: https://hq-bench.github.io/coreb-page/

English abstract

Code search has usually been evaluated as first-stage retrieval, even though production systems rely on broader pipelines with reranking and developer-style queries. Existing benchmarks also suffer from data contamination, label noise, and degenerate binary relevance. In this paper, we introduce CoREB, a contamination-limited, multitask code retrieval and reranking benchmark, together with a fine-tuned code reranker, that goes beyond retrieval to cover the full code search pipeline. CoREB is built from counterfactually rewritten LiveCodeBench problems in five programming languages and delivered as timed releases with graded relevance judgments. We benchmark eleven embedding models and five rerankers across three tasks: text-to-code, code-to-text, and code-to-code. Our experiments reveal that: (1) code-specialised embeddings dominate code-to-code retrieval (roughly 2x over general encoders), yet no single model wins all three tasks; (2) short keyword queries, the format closest to real developer search, collapse every model to near-zero nDCG@10; (3) off-the-shelf rerankers are task-asymmetric, with a 12-point swing on code-to-code and no baseline net-positive across all tasks; (4) our fine-tuned CoREB-Reranker is the first to achieve consistent gains across all three tasks. The data and model are released.
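
Since the headline metric here is nDCG@10 over graded relevance judgments, a short reference computation may help. This is a generic sketch, not the benchmark's evaluation harness: it assumes the linear-gain variant of DCG (the exponential-gain variant is also common), and the function names are illustrative.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain of the first k results (linear gain, log2 discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(
    ranked_relevances: Sequence[float],
    judged_relevances: Sequence[float],
    k: int = 10,
) -> float:
    """nDCG@k: DCG of the system ranking divided by the DCG of the ideal ranking.

    ranked_relevances: graded labels of the returned items, in ranked order.
    judged_relevances: graded labels of every judged item for the query.
    """
    ideal_dcg = dcg_at_k(sorted(judged_relevances, reverse=True), k)
    if ideal_dcg == 0.0:
        return 0.0  # no relevant item exists, so the score is defined as 0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg
```

A ranking that returns only irrelevant items scores 0 under this definition, which is the regime the abstract describes for short keyword queries.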

2604.15533 2026-05-11 cs.PL cs.LG cs.LO cs.SE

Verification Modulo Tested Library Contracts

Abhishek Uppar, Omar Muhammad, Sumanth Prabhu, Deepak D'Souza, Madhusudan P, Adithya Murali

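Affiliations Indian Institute of Science; University of Illinois Urbana-Champaign, Department of Computer Science; University of Wisconsin

AI summary This paper proposes a verification approach based on tested library contracts, which synthesizes modular and contextual contracts to establish the correctness of client programs, using a counterexample-guided learning framework.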

Comments Removed LaTeX formatting from abstract text

English abstract

We consider the problem of verification modulo tested library contracts as a step towards automating the verification of client programs that use complex libraries. We formulate this problem as the synthesis of modular contracts for the library methods used by the client that are adequate to prove the client correct, and that also pass the scrutiny of a testing engine that tests the library against these contracts. We also consider a new form of method contracts called contextual contracts that arise in this setting that hold in the context of the client program, and can often be simpler and easier to infer than classical modular contracts. We provide a counterexample-guided learning framework to solve this problem, in which the synthesizer interacts with a constraint solver as well as the testing engine in order to infer adequate modular/contextual method contracts and inductive invariants for the client. The main synthesis engines we use are generalizing CHC solvers that are realized using ICE learning algorithms. We realize this framework in a tool called DUALIS and show its efficacy on benchmarks where clients call large libraries.

2604.15439 2026-05-11 stat.ML cs.LG math.PR

One-Shot Generative Flows: Existence and Obstructions

Panos Tsimpos, Daniel Sharp, Youssef Marzouk

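Affiliations Operations Research Center; Center for Computational Science & Engineering; Laboratory of Information and Decision Systems

AI summary This paper studies dynamic measure transport in generative modeling, considers transport maps connecting a source measure to a target measure, analyzes when they yield straight-line flows with zero acceleration, and proves a sharp dichotomy between existence and non-existence under endpoint independence.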

English abstract

We study dynamic measure transport for generative modeling, focusing on transport maps that connect a source measure $P_0$ to a target measure $P_1$ by integrating a velocity field of the form $v_t(x) = \mathbb{E}[\dot X_t \mid X_t = x]$, where $X_\bullet = (X_t)_t$ is a stochastic process satisfying $(X_0,X_1)\sim{P_0}\otimes{P_1}$ and $\dot X_t$ is its time derivative. We investigate when $X_\bullet$ induces a \emph{straight-line flow}: a flow whose pointwise acceleration vanishes and is therefore exactly integrable by any first-order method. First, we develop multiple characterizations of straight-line flows in terms of PDEs involving the conditional statistics of the process. Then, we prove that straight-line flows under endpoint independence exhibit a sharp dichotomy. On the one hand, we construct explicit, computable straight-line processes for arbitrary Gaussian endpoints. On the other hand, we show that straight-line processes do not exist for targets with sufficiently well-separated modes. We demonstrate this obstruction through a sequence of increasingly general impossibility theorems that uncover a fundamental relationship between the sample-path behavior of a process with independent endpoints and the space-time geometry of this process' flow map. Taken together, these results provide a structural theory of when straight-line generative flows can, and cannot, exist.
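
The link between vanishing acceleration and exact one-step integration can be spelled out with a short derivation. The flow-map symbol $\phi_t$ is notation assumed here, and this is one natural reading of "pointwise acceleration vanishes" rather than the paper's own formulation.

```latex
% Flow map \phi_t generated by the velocity field v_t:
\[
  \partial_t \phi_t(x) = v_t\bigl(\phi_t(x)\bigr), \qquad \phi_0(x) = x .
\]
% Differentiating once more in t (chain rule) gives the acceleration along a trajectory:
\[
  \partial_t^2 \phi_t(x)
  = \bigl(\partial_t v_t + (\nabla v_t)\, v_t\bigr)\bigl(\phi_t(x)\bigr).
\]
% If this vanishes for all x and t, every trajectory is a straight line
% traversed at constant speed:
\[
  \phi_t(x) = x + t\, v_0(x),
\]
% so a single explicit Euler step of size 1 already returns \phi_1(x) = x + v_0(x).
```

This is why such flows are described as one-shot: the transport map at time 1 is available from a single evaluation of the initial velocity field.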

2604.06738 2026-05-11 cs.GT cs.LG

Beyond Pessimism: Offline Learning in KL-regularized Games

Yuheng Zhang, Claire Chen, Nan Jiang

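Affiliations University of Illinois Urbana-Champaign; California Institute of Technology

AI summary This paper studies offline learning in KL-regularized two-player zero-sum games, proposes an algorithm that needs no pessimistic estimation and achieves a faster sample complexity bound, and proposes an efficient self-play policy optimization algorithm.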

English abstract

We study offline learning in KL-regularized two-player zero-sum games, where policies are optimized with respect to a fixed reference policy through KL regularization. Prior work relies on pessimistic value estimation to handle distribution shift, yielding only $\widetilde{\mathcal{O}}(1/\sqrt n)$ statistical rates. We develop a new pessimism-free algorithm and analytical framework for KL-regularized games, built on the smoothness of KL-regularized best responses and a stability property of the Nash equilibrium induced by skew symmetry. This yields, to our knowledge, the first pessimism-free offline learning guarantee for KL-regularized games, with a fast $\widetilde{\mathcal{O}}(1/n)$ sample complexity bound. We further propose an efficient self-play policy optimization algorithm that replaces exact equilibrium computation with iterative KL-regularized policy updates, and prove that its last iterate preserves the same pessimism-free statistical guarantee up to a controlled optimization error.
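
The smoothness of KL-regularized best responses that this abstract relies on comes from their closed form: maximizing expected payoff minus `beta` times the KL divergence to a reference policy gives a policy proportional to the reference policy times `exp(payoff / beta)`. The sketch below illustrates that standard fact for a matrix game; the names `beta`, `row_payoffs`, and `kl_best_response` are assumptions for illustration, not the paper's algorithm.

```python
import math
from typing import List, Sequence


def row_payoffs(matrix: Sequence[Sequence[float]], opponent: Sequence[float]) -> List[float]:
    """Expected payoff of each row action against a mixed column strategy."""
    return [sum(a * y for a, y in zip(row, opponent)) for row in matrix]


def kl_best_response(
    payoffs: Sequence[float], ref_policy: Sequence[float], beta: float
) -> List[float]:
    """argmax_pi  E_pi[payoff] - beta * KL(pi || ref_policy).

    Closed form: pi(a) is proportional to ref_policy(a) * exp(payoff(a) / beta).
    """
    scaled = [q / beta for q in payoffs]
    shift = max(scaled)  # subtract the max before exponentiating, for numerical stability
    weights = [p * math.exp(s - shift) for p, s in zip(ref_policy, scaled)]
    total = sum(weights)
    return [w / total for w in weights]
```

For large `beta` the response stays close to the reference policy, and for small `beta` it concentrates on the highest-payoff action; the response varies smoothly with the payoffs in between, unlike an unregularized argmax.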

2603.24946 2026-05-11 cs.SE cs.LG

MobileDev-Bench: A Benchmark for Issue Resolution in Mobile Application Development

Moshood A. Fakorede, Krishna Upadhyay, A. B. Siddique, Umar Farooq

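Affiliations Louisiana State University; University of Kentucky

AI summary This paper presents MobileDev-Bench, comprising 407 real-world issue-resolution tasks spanning Android Native, React Native, and Flutter. It pairs verified developer-reported issues with executable test patches and evaluates the end-to-end resolution rates of four frontier LLMs in mobile build environments.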

Comments 30 pages, 14 figures, 12 tables

English abstract

Large language models (LLMs) have shown strong performance on automated software engineering tasks, yet existing benchmarks focus primarily on library-style repositories, leaving mobile application development largely unexplored despite its framework-specific build systems, heterogeneous artifact types, and coordinated multi-file fix requirements. We introduce MobileDev-Bench, a benchmark comprising 407 real-world issue-resolution tasks collected from 19 production mobile applications spanning Android Native (Java/Kotlin), React Native (TypeScript), and Flutter (Dart). Each task pairs a verified developer-reported issue with executable test patches, enabling fully automated validation of model-generated fixes within mobile build environments. The benchmark exhibits substantially greater patch complexity than prior benchmarks: fixes modify 12.9 files and 334.6 lines on average, and 41% of instances require coordinated changes across multiple artifact types, such as source, build configuration, and resource files. Evaluation of four frontier LLMs (Claude Sonnet 4.5, Qwen3-Coder, GPT-5.2, and Gemini 2.5 Flash) yields end-to-end resolution rates of only 3.23% - 4.23% under automated retrieval and at most 5.69% under oracle retrieval, well below resolution rates reported on existing benchmarks. We release MobileDev-Bench with task instances, an evaluation harness, and containerized environments to support reproducible research on AI-assisted mobile application development.

2603.24755 2026-05-11 cs.SE cs.AI cs.CL

SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks

Gabriel Orlanski, Devjeet Roy, Alexander Yun, Changho Shin, Alex Gu, Albert Ge, Dyah Adila, Nicholas Roberts, Frederic Sala, Aws Albarghouthi

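Affiliations University of Wisconsin–Madison; Washington State University; MIT

AI summary This paper presents SlopCodeBench, which uses 36 problems and 196 checkpoints to evaluate how coding agents degrade over long-horizon iterative tasks. It finds that agent code erodes structurally and accumulates redundant code over time, while human code degrades more slowly.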

Comments Code and Leaderboards are located at https://www.scbench.ai

English abstract

Software development is iterative, yet agentic coding benchmarks hide design issues through their single-shot setup. Recent iterative benchmarks attempt to remedy this but heavily constrain an agent's design decision space, making it impossible to faithfully measure how their decisions shape future extensions. We introduce SlopCodeBench, a benchmark of 36 problems and 196 checkpoints where agents repeatedly extend their own solutions. Unlike prior iterative benchmarks, our evolving specifications demand architectural decisions but leave internal structure to the agent. We measure two forms of degradation: structural erosion (concentrated complexity) and verbosity (redundant code). Evaluating 15 coding agents across open and closed models, we find that no agent fully solves any problem end-to-end, and the best agent passes 14.8% of checkpoints. Quality degrades across checkpoints, with structural erosion rising in 77% of trajectories and verbosity in 75.5%. Compared to 473 open-source Python repositories, agent code is 2.3x more verbose and 2.0x more eroded, and the human repositories degrade less often and by smaller margins across their git histories. Explicit quality guidance reduces initial verbosity and erosion by up to a third, without affecting degradation rates. SlopCodeBench provides the first measurement of code degradation under iterative extension, revealing that agents pass checkpoints while producing code that erodes and bloats with each turn.

2603.16025 2026-05-11 cond-mat.mes-hall cs.CV quant-ph

3D tomography of exchange phase in a Si/SiGe quantum dot device

Dylan Albrecht, Sarah Thompson, N. Tobias Jacobson, Ryan Jock

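Affiliations Sandia National Laboratories

AI summary This paper presents a method that extracts a 3D phase volume from a sequence of 2D measurements, for use in determining the exchange interaction coefficient J(V) in quantum dot devices.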

Comments 11 pages, 6 figures; updated acknowledgements

英文摘要

The exchange interaction is a foundational building block for the operation of spin-based quantum processors. Extracting the exchange interaction coefficient $J(\mathbf{V})$, as a function of gate electrode voltages, is important for understanding disorder, faithfully simulating device performance, and operating spin qubits with high fidelity. Typical coherent measurements of exchange in spin qubit devices yield a modulated cosine of an accumulated phase, which in turn is the time integral of exchange. As such, extracting $J(\mathbf{V})$ from experimental data is difficult due to the ambiguity of inverting a cosine, the sensitivity to noise when unwrapping phase, as well as the problem of inverting the integral. As a step toward obtaining $J(\mathbf{V})$, we tackle the first two challenges to reveal the accumulated phase, $\phi(\mathbf{V})$. We incorporate techniques from a wide range of fields to robustly extract and model a 3D phase volume for spin qubit devices from a sequence of 2D measurements. In particular, we present a measurement technique to obtain the wrapped phase, as done in phase-shifting digital holography, and utilize the max-flow/min-cut phase unwrapping method (PUMA) to unwrap the phase in 3D voltage space. We show this method is robust to the minimal observed drift in the device, which we confirm by increasing scan resolution. Upon building a model of the extracted phase, we optimize over the model to locate a minimal-gradient $\pi$ exchange pulse point in voltage space. Our measurement protocol may provide detailed information useful for understanding the origins of device variability governing device yield, enable calibrating device models to specific devices during operation for more sophisticated error attribution, and enable a systematic optimization of qubit control. We anticipate that the methods presented here may be applicable to other qubit platforms.
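
As a rough illustration of the wrapped-phase step, the following Python sketch recovers a wrapped phase from four phase-shifted cosine measurements using the standard four-step formula from phase-shifting holography. The signal model `offset + amplitude * cos(phi + k*pi/2)`, the function names, and the simple 1D unwrapping helper are assumptions made for illustration; the paper's actual pulse sequence and its 3D PUMA unwrapping are not reproduced here.

```python
import math

def measure(phi, k, offset=0.5, amplitude=0.4):
    """Hypothetical signal model: cosine of the phase plus a k*pi/2 shift."""
    return offset + amplitude * math.cos(phi + k * math.pi / 2)

def wrapped_phase(i0, i1, i2, i3):
    """Four-step phase-shifting formula; the result lies in [-pi, pi]."""
    # i3 - i1 = 2*B*sin(phi) and i0 - i2 = 2*B*cos(phi),
    # so the unknown offset and amplitude cancel out.
    return math.atan2(i3 - i1, i0 - i2)

def unwrap_1d(phases):
    """Simple 1D unwrapping: remove jumps larger than pi between neighbours."""
    out = [phases[0]]
    for p in phases[1:]:
        step = p - out[-1]
        step -= 2 * math.pi * round(step / (2 * math.pi))
        out.append(out[-1] + step)
    return out

# A phase ramp that grows past several multiples of 2*pi.
true_phase = [0.3 * n for n in range(40)]
wrapped = [wrapped_phase(*(measure(p, k) for k in range(4))) for p in true_phase]
recovered = unwrap_1d(wrapped)
```

The 1D helper only works when neighbouring samples differ by less than pi and the data are clean; noisy multi-dimensional data is the setting where a graph-cut method such as PUMA is motivated.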

2603.03096 2026-05-11 eess.AS cs.CL

Interpreting Speaker Characteristics in the Dimensions of Self-Supervised Speech Features

解读自监督语音特征的维度中说话人特性

Kyle Janse van Rensburg, Benjamin van Niekerk, Herman Kamper

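Affiliations * Department of Electrical and Electronic Engineering, Stellenbosch University; Concordia University

AI summary The paper applies PCA to self-supervised speech features and shows how speaker characteristics such as pitch and gender are distributed across principal dimensions, and that these dimensions are largely isolated from one another and can be manipulated.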

Comments 5 pages, 7 figures, submitted to IEEE Signal Processing Letters

详情
AI中文摘要

语音模型通过自监督学习训练如何构建其表示?先前研究关注不同层中信息如何编码在特征向量中。但很少有研究考虑语音特性是否在SSL特征的个体维度中被捕捉。本文通过PCA分析 utterance-averaged 表示,针对多种SSL模型发现,解释最大方差的主维度编码音高及相关特性如性别。其他个体主维度与强度、噪声水平、第二共振峰及高频特性相关。我们进一步通过合成分析表明,大多数特征维度彼此影响较小。此外,特征可通过操控相应维度进行改变。

英文摘要

How do speech models trained through self-supervised learning structure their representations? Previous studies have looked at how information is encoded in feature vectors across different layers. But few studies have considered whether speech characteristics are captured within individual dimensions of SSL features. In this paper we specifically look at speaker information using PCA on utterance-averaged representations. For a range of SSL models, we find that the principal dimension that explains most variance encodes pitch and associated characteristics like gender. Other individual principal dimensions correlate with intensity, noise levels, the second formant, and higher frequency characteristics. We then use synthesis analyses to show that the dimensions for most characteristics are isolated from each other's influence. We further show that characteristics can be changed by manipulating the corresponding dimensions.

2602.16928 2026-05-11 cs.GT cs.AI cs.MA

Discovering Multiagent Learning Algorithms with Large Language Models

用大型语言模型发现多智能体学习算法

Zun Li, John Schultz, Daniel Hennes, Marc Lanctot

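Affiliations * Google DeepMind

AI summary The paper uses the AlphaEvolve framework to automatically discover CFR and PSRO variants, VAD-CFR and SHOR-PSRO, then distills them to their core mechanisms to obtain the simpler WOP-CFR and PM-PSRO, which generalize better.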

Comments More experiments and analysis on algorithmic distillation

详情
AI中文摘要

多智能体强化学习(MARL)在不完全信息游戏中的进展通常依赖于手动迭代优化算法基线。最近,由大型语言模型(LLM)驱动的进化编码代理成为自动化发现过程的有力工具。本文部署AlphaEvolve框架,探索两种博弈论范式:反事实遗憾最小化(CFR)和策略空间响应预言机(PSRO)。自动化搜索得到两种算法:波动适应折扣(VAD-CFR)和平滑混合乐观遗憾(SHOR-PSRO),在18种游戏评估套件中与最先进的手工设计基线一致竞争。然而,由于LLM在特定训练集上优化适应性,它常构建高度协同、复杂的机制。通过系统消融研究,我们证明尽管这些机制紧密耦合,真正推动泛化的是一个最小的算法核心。通过将LLM的发现简化到最根本原则,我们产生两种最小求解器:预热乐观预测(WOP-CFR)和投影匹配(PM-PSRO)。这些简化版本在泛化上表现优异,结构复杂性大幅降低,为使用LLM进行算法发现提供了清晰方法。

英文摘要

Much of the advancement in Multi-Agent Reinforcement Learning (MARL) for imperfect-information games has historically depended on the manual, iterative refinement of algorithmic baselines. Recently, evolutionary coding agents powered by Large Language Models (LLMs) have emerged as powerful tools to automate this discovery process. In this work, we deploy one such agentic framework, AlphaEvolve, to navigate the design spaces of two distinct game-theoretic paradigms: counterfactual regret minimization (CFR) and policy-space response oracles (PSRO). This automated search yielded two algorithms: Volatility-Adaptive Discounted (VAD-) CFR and Smoothed Hybrid Optimistic Regret (SHOR-) PSRO, which are consistently competitive with state-of-the-art human-designed baselines across an 18-game evaluation suite spanning Poker, Goofspiel, Liar's Dice, Blotto, and Battleship variants. However, because the LLM optimizes for fitness on a specific training set, it often constructs highly synergistic, complex mechanisms tailored to those environments. Through systematic ablation studies, we demonstrate that while these mechanisms are tightly coupled, the true driver of generalization lies in a minimal algorithmic core. By distilling the LLM's discoveries down to their most fundamental principles, we produce two minimal solvers: Warm-started Optimistic Predictive (WOP-)CFR and Projection Matching (PM-)PSRO. These distilled versions achieve superior performance on generalization with greatly reduced structural complexity, providing a clear methodology for using LLMs in algorithmic discovery.

2602.15189 2026-05-11 cs.IR cs.AI cs.CL

ScrapeGraphAI-100k: Dataset for Schema-Constrained LLM Generation

ScrapeGraphAI-100k:用于模式约束LLM生成的数据集

William Brach, Francesco Zuppichini, Marco Vinciguerra, Lorenzo Padoan

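Affiliations * Slovak University of Technology; ScrapeGraphAI

AI summary The paper introduces the ScrapeGraphAI-100k dataset of 93,695 schema-constrained extraction events, built from real user data with structural conformance labels, for evaluating schema-constrained LLM generation.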

详情
AI中文摘要

在现代大语言模型中,生成符合指定JSON模式的输出是工具使用、结构化提取和知识库构建的基础。尽管其重要性,公开数据集仍然小、合成或仅文本,且很少将真实页面内容与实际使用的提示和模式配对。我们介绍了ScrapeGraphAI-100k,93,695个模式约束提取事件通过opt-in ScrapeGraphAI telemetry在2025年Q2-Q3收集,经过去重和平衡后,来自9M原始事件。该语料库涵盖18,000多个唯一模式,覆盖15种命名语言及长尾Other类别,英语和繁体中文占检测内容的88%,每个实例配对Markdown转换的页面内容、提示、模式、LLM响应及每例jsonschema-rs结构符合标签(语义正确性不在此范围内,原始HTML推迟到v1.0之后提供)。我们分析了语料库中的结构多样性,并识别出随着模式复杂性增加而出现的尖锐失败阈值。作为案例研究,一个17亿参数(1.7B)的学生模型在该数据上微调后,其输出分布接近其GPT-5-nano教师模型,尽管仍落后于30B-A3B参考模型(3.3B活跃参数)在模式符合性上的表现。我们提供此提炼结果作为初步证据,表明在大规模真实从业者工作负载中将模式约束生成接地,能够实现训练和基准测试,这之前合成或纯文本语料库无法支持。

英文摘要

Producing output that conforms to a specified JSON schema underlies tool use, structured extraction, and knowledge base construction in modern large language models. Despite this centrality, public datasets for the task remain small, synthetic, or text-only, and rarely pair real page content with the prompts and schemas used in practice. We introduce ScrapeGraphAI-100k, 93,695 schema-constrained extraction events collected via opt-in ScrapeGraphAI telemetry in Q2--Q3 2025, deduplicated and balanced by schema from 9M raw events. The corpus spans 18,000+ unique schemas across 15 named languages plus a long-tail Other category, with English and Traditional Chinese covering 88% of detected content. Each instance pairs Markdown-converted page content with a prompt, schema, LLM response, and per-example jsonschema-rs structural conformance labels (semantic correctness is out of scope, and raw HTML is deferred beyond v1.0). We characterize structural diversity across the corpus and identify sharp failure thresholds as schema complexity grows. As a case study, a 1.7B student fine-tuned on this data closely tracks the output distribution of its GPT-5-nano teacher, though it still trails a 30B-A3B reference (3.3B active parameters) on schema compliance. We offer this distillation result as preliminary evidence that grounding schema-constrained generation in real practitioner workloads at scale enables training and benchmarking that prior synthetic or text-only corpora could not support.

2602.10024 2026-05-11 cs.IR cs.CL

Overview of the TREC 2025 RAGTIME Track

TREC 2025 RAGTIME 跟踪任务概述

Dawn Lawrie, Sean MacAvaney, James Mayfield, Luca Soldaini, Eugene Yang, Andrew Yates

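Affiliations * Johns Hopkins University Human Language Technology Center of Excellence; University of Glasgow; Allen Institute for AI

AI summary The TREC 2025 RAGTIME track studies report generation from multilingual source documents. It provides a collection of Arabic, Chinese, English, and Russian news stories and covers three tasks (multilingual report generation, English report generation, and multilingual information retrieval); 13 teams submitted 125 runs.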

Comments 14 pages, 3 figures, final version of the RAGTIME 2025 overview paper

详情
AI中文摘要

RAG TREC Instrument for Multilingual Evaluation (RAGTIME) 跟踪任务的主要目标是研究从多语言源文档生成报告。该跟踪任务创建了一个包含阿拉伯语、中文、英语和俄语新闻故事的文档集。RAGTIME 包含三种任务类型:多语言报告生成、英语报告生成和多语言信息检索(MLIR)。总共125次运行由13支参与团队(以及跟踪协调员作为基线)提交给三个任务。本概述描述了这三个任务并呈现了可用的结果。

英文摘要

The principal goal of the RAG TREC Instrument for Multilingual Evaluation (RAGTIME) track at TREC is to study report generation from multilingual source documents. The track has created a document collection containing Arabic, Chinese, English, and Russian news stories. RAGTIME includes three task types: Multilingual Report Generation, English Report Generation, and Multilingual Information Retrieval (MLIR). A total of 125 runs were submitted by 13 participating teams (and as baselines by the track coordinators) for three tasks. This overview describes these three tasks and presents the available results.

2602.09457 2026-05-11 stat.ML cs.DS cs.LG

From Average Sensitivity to Small-Loss Regret Bounds under Random-Order Model

从平均敏感度到随机顺序模型下的小损失遗憾界

Shinsaku Sakaue, Yuichi Yoshida

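Affiliations * CyberAgent; National Institute of Informatics; Center for Advanced Intelligence Project; RIKEN

AI summary Working in the random-order model and extending the batch-to-online transformation of Dong and Yoshida, the paper shows that offline algorithms satisfying certain conditions yield small-loss regret bounds. The result covers a broad class of problems, including online k-means clustering and low-rank approximation, with applications to submodular function minimization and ℓ₁ regression.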

详情
AI中文摘要

我们研究了随机顺序模型下的在线学习,其中损失函数的多重集由对手选择但以均匀随机顺序揭示。通过扩展Dong和Yoshida(2023)的批处理到在线转换,我们证明若一个离线算法具有(1+ε)近似保证、由函数φ(ε)控制的平均敏感度界以及对ε的稳定性,则可以得到通常为O~(φ*(OPT_T))的小损失遗憾界,其中φ*( )是φ的凹共轭,OPT_T是T轮的离线最优解,O~隐藏了T的多项对数因子。我们的结果改进了原始(1+ε)近似遗憾保证,并适用于广泛的问题,包括在线k均值聚类和在线低秩逼近。我们进一步将我们的方法应用于在线子模函数最小化,使用(1±ε)-切分稀疏化子模超图,得到小损失遗憾界为O~(n³ +n^{3/4}OPT_T^{3/4}),其中n是地面集大小;我们还展示了其在在线ℓ₁回归中的适用性。我们的工作揭示了稀疏化和相关算法技术在随机顺序模型中实现小损失遗憾界的能力,无需对损失函数的结构假设,如线性或光滑性。

英文摘要

We study online learning in the random-order model, where the multiset of loss functions is chosen adversarially but revealed in a uniformly random order. By extending the batch-to-online transformation of Dong and Yoshida (2023), we show that if an offline algorithm enjoys a $(1+\varepsilon)$-approximation guarantee, an average sensitivity bound controlled by a function $φ(\varepsilon)$, and stability with respect to $\varepsilon$, then we can obtain a small-loss regret bound typically of order $\tilde O(φ^{\star}(\mathrm{OPT}_T))$, where $φ^{\star}$ is the concave conjugate of $φ$, $\mathrm{OPT}_T$ is the offline optimum over $T$ rounds, and $\tilde O$ hides polylogarithmic factors in $T$. Our result refines their original $(1+\varepsilon)$-approximate regret guarantee and applies to a broad class of problems, including online $k$-means clustering and online low-rank approximation. We further apply our approach to online submodular function minimization using $(1\pm\varepsilon)$-cut sparsifiers of submodular hypergraphs, obtaining a small-loss regret bound of $\tilde O(n^3 + n^{3/4}\mathrm{OPT}_T^{3/4})$, where $n$ is the ground-set size; we also demonstrate its applicability to online $\ell_1$ regression. Our work sheds light on the power of sparsification and related algorithmic techniques in achieving small-loss regret bounds in the random-order model, without requiring structural assumptions on loss functions, such as linearity or smoothness.

2602.09034 2026-05-11 q-bio.NC cs.AI

Latent-Space Causal Discovery from Indirect Neuroimaging Observations

从间接神经影像观测中发现潜在空间因果关系

Sangyoon Bae, Miruna Oprescu, David Keetae Park, Shinjae Yoo, Jiook Cha

Affiliations * Interdisciplinary Program in Artificial Intelligence; Seoul National University; Computer Science; Cornell Tech, Cornell University; Brookhaven National Laboratory; Department of Psychology

AI summary The paper proposes INCAMA, which couples physics-aware inversion with a delay-aware Mamba encoder to improve recovery of directed structure from indirect neuroimaging observations, evaluated on TVB simulations and HCP motor-task fMRI.

Comments 9 pages, 2 figures

详情
AI中文摘要

神经影像不直接观测因果变量:血流动力学和体积传导会扭曲信号,使得统计依赖不反映潜在神经影响。在估计图之前,必须明确在何种假设下可以从此类间接观测中研究延迟的定向结构。我们正式化一个条件设置——在模态物理和非平稳潜在动态下可恢复的逆向,并在显式假设下推导出逆向误差传播界。基于此框架,我们提出INCAMA(INdirect CAusal MAmba):物理感知逆向与延迟感知Mamba编码器,利用机制位移作为信息丰富的变化来评分定向图。我们使用受控模拟进行定量验证,并使用HCP运动任务fMRI作为零样本外部迁移检查,基于解剖学和任务网络一致性。在TVB模拟中,INCAMA在F1指标上比观测空间和两阶段基线提高了2-3倍,在HCP运动任务fMRI中产生稀疏的定向估计,集中在经典视觉-运动通路中。

英文摘要

Neuroimaging does not observe causal variables directly: hemodynamics and volume conduction distort signals so that statistical dependence need not reflect latent neural influence. Before estimating graphs, one must specify under what assumptions delayed directed structure can be studied from such indirect observations. We formalize a conditional setting - recoverable inversion under modality physics together with nonstationary latent dynamics - and derive an inversion-error propagation bound under explicit assumptions. Building on this framing, we propose INCAMA (INdirect CAusal MAmba): physics-aware inversion coupled with a delay-aware Mamba encoder that uses mechanism shifts as informative variation for directed graph scoring. We use controlled simulations for quantitative validation and HCP motor-task fMRI as a zero-shot external transfer check based on anatomical and task-network consistency. Across TVB simulations, INCAMA improves directed-structure recovery by 2-3x in F1 over observation-space and two-stage baselines, and on HCP motor-task fMRI it produces sparse directed estimates concentrated in canonical visuo-motor pathways.

2602.01621 2026-05-11 cs.CR cs.LG

CGF-Softmax: A Cumulant-Based Softmax Reformulation for Efficient Inference under Homomorphic Encryption

CGF-Softmax: 一种基于累积生成函数的softmax重参数化方法,用于在同态加密下高效推理

Hanjun Park, Byeongseo Min, Jiheon Woo, Min-Wook Jeong, Jongho Shin, Yongwoo Lee, Young-Sik Kim, Yongjune Kim

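Affiliations * Pohang University of Science and Technology (POSTECH); LG Electronics R&D Center; Inha University; Daegu Gyeongbuk Institute of Science and Technology (DGIST)

AI summary The paper proposes CGF-softmax, which reformulates the softmax denominator through the cumulant generating function, eliminating homomorphic division and explicit maximum subtraction. This lowers multiplicative depth while preserving key softmax properties, giving efficient and accurate encrypted inference on Vision Transformers and large language models.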

详情
AI中文摘要

同态加密(HE)是保护隐私的机器学习框架,允许在加密数据上直接进行推断。然而,评估softmax,作为Transformer架构的核心组件,在HE中尤为具有挑战性,由于其多变量结构、指数函数引起的动态范围大以及昂贵的除法操作。在本文中,我们提出了CGF-softmax,通过累积生成函数(CGF)重参数化softmax分母。通过消除同态除法和显式最大减法,这种重参数化显著减少了乘法深度,同时保持softmax的关键属性。在视觉Transformer和大语言模型上的广泛实验表明,CGF-softmax在加密推断中提供了高效且准确的softmax近似。特别是,它在推断准确性上接近高深度精确方法,同时通过减少乘法深度显著降低了计算成本。

英文摘要

Homomorphic encryption (HE) is a prominent framework for privacy-preserving machine learning, enabling inference directly on encrypted data. However, evaluating softmax, a core component of transformer architectures, remains particularly challenging in HE due to its multivariate structure, the large dynamic range induced by exponential functions, and the costly division operation. In this paper, we propose CGF-softmax, which reformulates the softmax denominator through the cumulant generating function (CGF). By eliminating both homomorphic division and explicit maximum subtraction, this reformulation substantially reduces multiplicative depth while preserving key properties of softmax. Extensive experiments on Vision Transformers and large language models show that CGF-softmax provides an efficient and accurate approximation of softmax in encrypted inference. In particular, it achieves inference accuracy close to that of high-depth exact methods, while requiring substantially lower computational cost through reduced multiplicative depth.
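
To make the idea of rewriting the softmax denominator concrete, here is a minimal plaintext Python sketch. It relies only on the textbook identity softmax(x)_i = exp(x_i - log Σ_j exp(x_j)), where log Σ_j exp(x_j) = log n + K(1) and K(t) = log E[exp(tX)] is the cumulant generating function of a value X drawn uniformly from the n logits. The second-order truncation K(1) ≈ mean + variance/2 is an assumption chosen for illustration and is not necessarily the approximation used in the paper; nothing here is encrypted, and all function names are hypothetical.

```python
import math

def softmax_reference(x):
    """Standard softmax with max subtraction and a final division."""
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]

def cgf_exact(x):
    """K(1) = log E[exp(X)] for X uniform over the entries of x."""
    return math.log(sum(math.exp(v) for v in x) / len(x))

def cgf_second_order(x):
    """Truncated cumulant expansion: K(1) is roughly mean + variance / 2."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    return mean + var / 2

def softmax_via_cgf(x, cgf=cgf_exact):
    """softmax_i = exp(x_i - log n - K(1)); the output needs no division by a sum."""
    shift = math.log(len(x)) + cgf(x)
    return [math.exp(v - shift) for v in x]

logits = [0.2, -0.1, 0.4, 0.0]
exact = softmax_via_cgf(logits)
approx = softmax_via_cgf(logits, cgf_second_order)
```

With the exact CGF the result coincides with the usual softmax; with the truncated expansion the outputs are only approximately normalized, and the error grows as the logits spread out.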

2602.00716 2026-05-11 stat.ML cond-mat.dis-nn cs.LG

Emergence of Distortions in High-Dimensional Guided Diffusion Models

高维引导扩散模型中扭曲的出现

Enrico Ventura, Beatrice Achilli, Luca Ambrogioni, Carlo Lucibello

Affiliations * Department of Computing Sciences, Bocconi University; Donders Institute for Brain, Cognition and Behaviour, Radboud University; Bocconi Institute for Data Science and Analytics, Bocconi University

AI summary The paper studies the generative distortions that classifier-free guidance induces in guided diffusion models, analyzes how they depend on data dimensionality and the number of classes, and proposes a modified guidance schedule that improves sample diversity.

Comments 41 pages, 21 figures

详情
AI中文摘要

分类器自由引导(CFG)是扩散模型条件采样中的标准方法,但往往降低样本多样性。利用统计物理工具,我们分析了CFG诱导的生成扭曲,即CFG采样分布与真实条件分布之间的不匹配。我们研究了具有精确分数函数的可解析设置,刻画了其依赖于数据维度和类别数的特性。对于高维高斯混合物,我们使用动态均场理论表明,当类别数随数据维度指数增长时会出现扭曲,而在亚指数范围内由于动态相变而消失。我们进一步证明,在无限类别极限下,无论维度如何,扭曲都是不可避免的,因为类别密度增加。最后,我们表明标准CFG调度无法防止方差收缩,并提出了一种理论指导的调度方案,结合负引导窗口,提高了真实世界潜在扩散模型中的类别分离度和样本多样性。

英文摘要

Classifier-free guidance (CFG) is the de facto standard for conditional sampling in diffusion models, yet it often reduces sample diversity. Using tools from statistical physics, we analyze the emergence of generative distortions induced by CFG, namely the mismatch between the CFG sampling distribution and the true conditional distribution. We study this phenomenon in analytically tractable settings with exact score functions, characterizing its dependence on data dimensionality and the number of classes. For high-dimensional Gaussian mixtures, we use dynamic mean-field theory to show that distortions arise when the number of classes scales exponentially with the data dimension, whereas they vanish in the sub-exponential regime due to a dynamical phase transition. We further prove that, in the infinite-class limit, distortions remain unavoidable regardless of dimensionality because of the increasing density of classes. Finally, we show that standard CFG schedules cannot prevent variance shrinkage, and we propose a theoretically grounded guidance schedule incorporating a negative-guidance window that improves both class separability and sample diversity in real-world latent diffusion models.
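
For readers unfamiliar with CFG, the guided score is commonly written as s_w = s_uncond + w (s_cond - s_uncond). The Python sketch below evaluates it in a hypothetical 1D setting with two equally weighted unit-variance Gaussian classes centred at +mu and -mu, with no diffusion noise schedule. This toy setup and all names are assumptions for illustration, and it is far simpler than the high-dimensional mixtures analyzed in the paper.

```python
import math

def score_cond(x, mu):
    """Score of the class-conditional density N(mu, 1)."""
    return -(x - mu)

def score_uncond(x, mu):
    """Score of the equal-weight mixture 0.5*N(mu, 1) + 0.5*N(-mu, 1)."""
    return -x + mu * math.tanh(mu * x)

def score_cfg(x, mu, w):
    """Classifier-free guidance: unconditional score plus w times the difference."""
    return score_uncond(x, mu) + w * (score_cond(x, mu) - score_uncond(x, mu))

def guided_mode(mu, w, lo=-10.0, hi=10.0):
    """Find the zero of the guided score by bisection (it is decreasing for w >= 1)."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if score_cfg(mid, mu, w) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

mu = 1.0
mode_w1 = guided_mode(mu, w=1.0)  # w = 1 recovers plain conditional sampling
mode_w3 = guided_mode(mu, w=3.0)  # stronger guidance moves the target mode
```

In this toy case the guided score at w = 3 vanishes near x ≈ 1.28 rather than at the class mean 1, and its slope is steeper than -1 everywhere, which gives a simple picture of the mismatch and variance shrinkage that the abstract refers to.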

2601.22246 2026-05-11 cs.CR cs.AI

MirrorMark: Generalizable Mirrored Sampling for Multi-bit LLM Watermarking

MirrorMark: 多位LLM水印的通用镜像采样

Ya Jiang, Massieh Kordi Boroujeny, Surender Suresh Kumar, Kai Zeng

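Affiliations * George Mason University; Department of Computer Science; Wireless Cyber Center; Department of Electrical and Computer Engineering

AI summary MirrorMark implements multi-bit LLM watermarking through mirroring transformations, separating the symbol mapping rule from the base watermarking sampler and using the CABS scheduler to balance token assignments. Experiments show strong detectability and bit accuracy while maintaining text quality.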

详情
AI中文摘要

随着大型语言模型(LLMs)在问答和内容创作等应用中的重要性增加,可靠的内容归属变得越来越重要。水印是一种有前景的方法,但现有方法要么只提供二进制信号,要么通过扭曲生成分布实现多比特嵌入。我们提出MirrorMark,一种通用的以映射为中心的方法用于多比特LLM水印。MirrorMark将符号映射规则与基础水印采样器分离,并将每个符号映射到一个模1的镜像变换的检测者可重现的伪随机对象,如采样值或排列排名。二进制标记分析显示,互补映射产生的匹配-不匹配得分差距大于独立键或位移基映射。当与无扭曲的基础采样器结合时,MirrorMark通过设计保持token概率分布,并在实践中维持文本质量。为了支持实际的载荷嵌入,我们引入了上下文锚定平衡调度器(CABS),该调度器在消息位置上平衡token分配的同时局部化编辑效果。我们进一步为两种代表性的采样器实例提供了理论EER分析。实验表明,MirrorMark在保持文本质量与非水印生成相当的同时,实现了强检测性和比特准确性。

英文摘要

As large language models (LLMs) become integral to applications such as question answering and content creation, reliable content attribution has become increasingly important. Watermarking is a promising approach, but most existing methods either provide only binary signals or achieve multi-bit embedding by distorting the generation distribution. We propose MirrorMark, a generalizable mapping-centric approach for multi-bit LLM watermarking. MirrorMark separates the symbol mapping rule from the base watermarking sampler and maps each symbol to a mod-1 mirroring transformation of a detector-reproducible pseudorandom object, such as sampling values or permutation ranks. A binary-tokenizer analysis shows that complementary mappings yield larger matched--mismatched score gaps than independent-key or shift-based mappings. When composed with a distortion-free base sampler, MirrorMark preserves the token probability distribution by design and maintains text quality in practice. To support practical payload embedding, we introduce a Context-Anchored Balanced Scheduler (CABS), which balances token assignments across message positions while localizing edit effects. We further provide theoretical EER analyses for two representative sampler instantiations. Experiments show that MirrorMark achieves strong detectability and bit accuracy while maintaining text quality comparable to non-watermarked generation.

2601.21839 2026-05-11 cs.CY cs.AI cs.GT cs.LG

Test-Time Compute Games

测试时计算博弈

Ander Artola Velasco, Dimitrios Rontogiannis, Stratis Tsirtsis, Manuel Gomez-Rodriguez

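Affiliations * Max Planck Institute for Software Systems; Hasso Plattner Institute

AI summary The paper studies the social inefficiency of the LLM-as-a-service market and proposes a reverse second-price auction mechanism to curb wasteful test-time compute, complemented by experiments.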

详情
AI中文摘要

测试时计算已成为增强大语言模型推理能力的有前景策略。然而,这一策略增加了用户使用云服务提供商的费用,因为提供商按测试时计算量收费。本文表明,LLM-as-a-service市场存在社会无效率:提供商有财务激励增加计算量,即使这增加对输出质量贡献很小。为解决这一无效率,我们引入反第二种价格拍卖机制,其中提供商投标其提供的价格和预期质量以获得服务用户的机会,用户支付比例于获胜提供者相对于第二高投标者生成的边际价值。为补充理论结果,我们对多个指令模型以及从DeepSeek-R1蒸馏出的推理模型进行了实验,测试集包括数学和科学基准数据集。

英文摘要

Test-time compute has emerged as a promising strategy to enhance the reasoning abilities of large language models (LLMs). However, this strategy has in turn increased how much users pay cloud-based providers offering LLM-as-a-service, since providers charge users for the amount of test-time compute they use to generate an output. In our work, we show that the market of LLM-as-a-service is socially inefficient: providers have a financial incentive to increase the amount of test-time compute, even if this increase contributes little to the quality of the outputs. To address this inefficiency, we introduce a reverse second-price auction mechanism where providers bid their offered price and (expected) quality for the opportunity to serve a user, and users pay proportionally to the marginal value generated by the winning provider relative to the second-highest bidder. To illustrate and complement our theoretical results, we conduct experiments with multiple instruct models from the $\texttt{Llama}$ and $\texttt{Qwen}$ families, as well as reasoning models distilled from $\texttt{DeepSeek-R1}$, on math and science benchmark datasets.

2601.16130 2026-05-11 cs.HC cs.AI

Replicating Human Motivated Reasoning Studies with LLMs

用LLMs复制人类动机性推理研究

Neeley Pate, Adiba Mahbub Proma, Hangfeng He, James N. Druckman, Daniel C. Molden, Gourab Ghoshal, Ehsan Hoque

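Affiliations * University of Rochester; Northwestern University

AI summary Replicating four political motivated reasoning studies, the authors find that base LLM behavior does not match expected human behavior, and that different models behave similarly when abstaining from answering and when incorporating provided arguments, suggesting that base LLMs may not emulate human motivated reasoning.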

详情
AI中文摘要

动机性推理——个体处理信息时可能被驱动去获得准确信念或得出期望结论——已作为人类现象被深入探讨。然而,基础LLM是否受动机操纵影响仍不明确。通过复制四项政治动机性推理研究,发现基础LLM行为与预期人类行为不一致。此外,不同模型在回避回答和整合提供的论点方面存在相似性。结果表明基础LLM可能不模拟人类动机性推理过程。我们强调这些发现对研究人员使用LLM进行意见复制和论点评估等任务的重要性。

英文摘要

Motivated reasoning - the idea that individuals processing information may be motivated to either arrive at accurate beliefs or arrive at desired conclusions - has been well-explored as a human phenomenon. However, it remains unclear whether base LLMs are affected by motivational manipulations. Replicating 4 prior political motivated reasoning studies, we find that base LLM behavior does not align with expected human behavior. Furthermore, base LLM behavior across models shares some similarities, such as when selecting to abstain from question answering and incorporating provided arguments into opinions. The results suggest that base LLMs may not emulate human motivated reasoning processes. We emphasize the importance of these findings for researchers using LLMs for certain tasks such as opinion replication and argument assessment.

2601.15356 2026-05-11 eess.IV cs.AI

Q-Probe: Scaling Image Quality Assessment to High Resolution via Context-Aware Agentic Probing

Q-Probe:通过上下文感知代理探测扩展图像质量评估至高分辨率

Xiang Li, Xueheng Li, Yu Wang, Xuanhua He, Zhangchi Hu, Weiwei Yu, Chengjun Xie

Affiliations * University of Science and Technology of China; Hefei University of Technology; The Hong Kong University of Science and Technology; Institute of Intelligent Machines, Chinese Academy of Sciences

AI summary Q-Probe uses context-aware probing to capture local degradations in high-resolution image quality assessment, and introduces the Vista-Bench benchmark and a three-stage training paradigm, achieving state-of-the-art performance in high-resolution settings.

详情
AI中文摘要

强化学习(RL)已使多模态大语言模型(MLLMs)在图像质量评估(IQA)中实现优于人类偏好的对齐。然而,现有基于RL的IQA模型通常依赖于粗粒度的全局视图,无法在高分辨率场景中捕捉细微的局部退化。虽然新兴的

英文摘要

Reinforcement Learning (RL) has empowered Multimodal Large Language Models (MLLMs) to achieve superior human preference alignment in Image Quality Assessment (IQA). However, existing RL-based IQA models typically rely on coarse-grained global views, failing to capture subtle local degradations in high-resolution scenarios. While emerging "Thinking with Images" paradigms enable multi-scale visual perception via zoom-in mechanisms, their direct adaptation to IQA induces spurious "cropping-implies-degradation" biases and misinterprets natural depth-of-field as artifacts. To address these challenges, we propose Q-Probe, the first agentic IQA framework designed to scale IQA to high resolution via context-aware probing. First, we construct Vista-Bench, a pioneering benchmark tailored for fine-grained local degradation analysis in high-resolution IQA settings. Furthermore, we propose a three-stage training paradigm that progressively aligns the model with human preferences, while simultaneously eliminating causal bias through a novel context-aware cropping strategy. Extensive experiments demonstrate that Q-Probe achieves state-of-the-art performance in high-resolution settings while maintaining superior efficacy across resolution scales.

2601.02602 2026-05-11 cs.CR cs.LG

SWaRL: Safeguard Code Watermarking via Reinforcement Learning

通过强化学习保障代码水印:SWaRL

Neusha Javidnia, Ruisi Zhang, Ashish Kundu, Farinaz Koushanfar

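Affiliations * ECE Department, UC San Diego; Cisco Research

AI summary SWaRL uses a reinforcement learning framework to produce robust, fidelity-preserving code watermarks that protect the intellectual property of code LLMs. It embeds unique, verifiable signatures in generated programs, resisting removal attacks while preserving functional correctness.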

Comments Preprint

详情
AI中文摘要

我们提出了SWaRL,一种稳健且保真度保持的水印框架,旨在通过在生成程序中嵌入唯一且可验证的签名来保护代码大语言模型的知识产权。现有水印方法要么依赖手工编码转换,要么在推理时操纵token生成概率,使其易受移除攻击且可能破坏功能正确性。为解决这些问题,SWaRL采用基于强化学习的联合训练框架,利用编译器反馈维持功能正确性,并使用联合训练的保密验证器作为奖励信号以保持水印可检测性。此外,SWaRL在微调过程中采用低秩适应(LoRA),使水印行为能够高效集成并跨模型更新保持可转移性。大量实验表明,SWaRL在水印检测准确性方面优于先前方法,同时完全保持水印代码的功能性。此外,SWaRL对重构和对抗性转换攻击具有强鲁棒性,能可靠地进行归因,且无需显著计算开销。

英文摘要

We present SWaRL, a robust and fidelity-preserving watermarking framework designed to protect the intellectual property of code LLMs by embedding unique and verifiable signatures in the generated program. Existing watermarking approaches either rely on handcrafted code transformations or manipulate token generation probabilities at inference time, making them vulnerable to removal attacks or prone to breaking functional correctness. To address these challenges, SWaRL employs a reinforcement learning-based co-training framework that uses compiler feedback for functional correctness and a jointly trained confidential verifier as a reward signal to maintain watermark detectability. Furthermore, SWaRL employs low-rank adaptation (LoRA) during fine-tuning, enabling efficient integration of watermarking behavior and transferability across model updates. Extensive experiments show that SWaRL achieves strong watermark detection accuracy compared to prior methods while fully maintaining watermarked code functionality. Moreover, SWaRL exhibits strong resilience against refactoring and adversarial transformation attacks, maintaining reliable attribution without substantial computational overhead.

2512.23927 2026-05-11 stat.ML cs.LG

Stationary Reweighting Yields Local Convergence of Soft Fitted Q-Iteration

静态重加权实现软拟合Q迭代的局部收敛性

Lars van der Laan, Nathan Kallus

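Affiliations * Department of Statistics, University of Washington

AI summary The paper analyzes the stability mechanism of soft fitted Q-iteration without Bellman completeness and proposes stationary-reweighted soft FQI, proving finite-sample local linear convergence under approximate realizability and controlled weighting error.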

详情
AI中文摘要

拟合Q迭代(FQI)和软FQI是离线强化学习中广泛使用的基于价值的方法,但其标准稳定性保证通常依赖于Bellman完备性,这是一种强闭包条件,可能在函数逼近下失效。我们分析了不依赖Bellman完备性的软FQI,并识别出替代其的稳定性机制:局部静态范数对齐。接近软最优固定点时,软Bellman算子与软最优策略的策略评估算子具有相同的一阶行为。此算子在策略的静态状态-动作范数下是收缩的,而标准拟合回归在行为范数下投影Bellman目标。这种不匹配解释了在分布偏移下的不稳定性。我们利用这一见解开发了静态重加权软FQI,该方法将每个回归步骤的权重向当前softmax策略的静态分布重新加权。在近似可实现性和受控加权误差下,我们证明了有限样本局部线性收敛到投影固定点,将统计误差与几何衰减的权重估计误差分离。我们的结果还表明,普通软FQI在on-policy静态采样下局部稳定,即使没有Bellman完备性,且解释温度退火作为达到收缩区域的延续策略。

英文摘要

Fitted $Q$-iteration (FQI) and soft FQI are widely used value-based methods for offline reinforcement learning, but their standard stability guarantees often depend on Bellman completeness, a strong closure condition that can fail under function approximation. We analyze soft FQI without Bellman completeness and identify the stability mechanism that replaces it: local stationary norm alignment. Near the soft-optimal fixed point, the soft Bellman operator has the same first-order behavior as the policy-evaluation operator for the soft-optimal policy. This operator contracts in the policy's stationary state-action norm, whereas standard fitted regression projects Bellman targets in the behavior norm. This mismatch explains instability under distribution shift. We use this insight to develop stationary-reweighted soft FQI, which reweights each regression step toward the stationary distribution of the current softmax policy. Under approximate realizability and controlled weighting error, we prove finite-sample local linear convergence to the projected fixed point, separating statistical error from geometrically damped weight-estimation error. Our results also show that ordinary soft FQI is locally stable under on-policy stationary sampling, even without Bellman completeness, and explain temperature annealing as a continuation strategy for reaching a contraction region.

2512.23805 2026-05-11 stat.ML cs.LG

Fitted $Q$ Evaluation Without Bellman Completeness via Stationary Weighting

无需贝尔曼完备性而通过稳态加权的拟Q评估

Lars van der Laan, Nathan Kallus

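Affiliations * Department of Statistics, University of Washington

AI summary The paper proposes a fitted Q-evaluation method that does not require Bellman completeness: stationary weighting of the regression step gives finite-sample linear convergence and can reduce value error.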

详情
AI中文摘要

拟Q评估(FQE)是一种基于回归的-off-policy评估标准工具,但现有稳定性保证通常依赖于贝尔曼完备性,这是一种强闭包条件,可能在函数逼近下失效。本文研究了一种替代方法:改变回归步骤中使用的范数。策略评估贝尔曼算子在由目标策略的稳态状态-动作分布诱导的L²范数下是收缩的,而标准的off-policy FQE在行为分布范数下投影贝尔曼目标。我们提出稳态加权FQE,通过稳态目标到行为密度比重新加权每个贝尔曼回归。该方法保持FQE的模块化监督学习形式,同时将拟合投影对齐到该收缩范数。我们证明在模型不准确的情况下,有限样本线性收敛到稳态投影贝尔曼固定点,而无需要求贝尔曼完备性。该界将有限迭代、统计、近似和权重估计误差分开,并表明当固有贝尔曼误差较小时,比率估计误差会减弱。受控实验表明,稳态加权可以稳定FQE并减少价值误差,当行为分布回归过度强调目标策略很少访问的区域时。

英文摘要

Fitted $Q$-evaluation (FQE) is a standard regression-based tool for off-policy evaluation, but existing stability guarantees often rely on Bellman completeness, a strong closure condition that can fail under function approximation. We study an alternative route: changing the norm used in the regression step. The policy-evaluation Bellman operator is contractive in the $L^2$ norm induced by the target policy's stationary state-action distribution, whereas standard off-policy FQE projects Bellman targets in the behavior-distribution norm. We propose stationary-weighted FQE, which reweights each Bellman regression by the stationary target-to-behavior density ratio. The method preserves FQE's modular supervised-learning form while aligning the fitted projection with that contractive norm. We prove finite-sample linear convergence to the stationary projected Bellman fixed point under misspecification, without requiring Bellman completeness. The bound separates finite-iteration, statistical, approximation, and weight-estimation errors, and shows that ratio-estimation error is attenuated when the inherent Bellman error is small. Controlled experiments show that stationary weighting can stabilize FQE and reduce value error when behavior-norm regression overemphasizes regions rarely visited by the target policy.
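
The reweighted regression step described in this abstract can be sketched in a few lines of Python. The example assumes a single scalar feature x(s, a), a linear model Q(s, a) = theta * x(s, a), and density-ratio weights that are already given. The data, the function name `weighted_fqe`, and the realizable toy problem are hypothetical, chosen only so that the answer is known in closed form; this is not the authors' implementation.

```python
def weighted_fqe(data, weights, gamma, iterations=60):
    """Iterate weighted least-squares regressions onto Bellman targets.

    data: list of (x, r, x_next), where x is the feature of (s, a) and
          x_next is the feature of (s', pi(s')) under the target policy.
    weights: per-sample weights, e.g. estimated stationary density ratios.
    """
    theta = 0.0
    for _ in range(iterations):
        # Bellman targets built from the previous iterate.
        targets = [r + gamma * theta * x_next for _, r, x_next in data]
        # Closed-form weighted least squares for a single scalar feature.
        num = sum(w * x * y for w, (x, _, _), y in zip(weights, data, targets))
        den = sum(w * x * x for w, (x, _, _) in zip(weights, data))
        theta = num / den
    return theta

# Toy transitions consistent with Q(s, a) = 2 * x(s, a) at gamma = 0.5.
data = [(1.0, 1.0, 1.0), (2.0, 2.0, 2.0)]
theta_uniform = weighted_fqe(data, weights=[1.0, 1.0], gamma=0.5)
theta_weighted = weighted_fqe(data, weights=[0.2, 5.0], gamma=0.5)
```

Because this toy problem is realizable, both weightings converge to theta = 2. The choice of weights matters under misspecification, when behavior-norm regression overemphasizes regions the target policy rarely visits.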